Paper Close Reading: MISSRec

A homework assignment

three sections:

conclusion
discussion
thinking

conclusion

I summarize this work from two aspects:

motivation: what the problem is (based on the introduction and related work)
method: how to solve it (based on the method section)

(Actually, I originally intended to write three parts, the third being how the efficacy is demonstrated, based on the method and experiments. After some hesitation, I moved that to the discussion.)

part1

Traditional sequential recommenders based on ID features have several defects:

underperformance with sparse IDs and difficulty with the cold-start problem
inconsistent ID mappings hinder the model’s transferability
a bias toward popular IDs, which raises a fairness issue

So we need to leverage multi-modal data, which is a promising remedy for the above shortcomings.

However, challenges still exist:

the multi-modal synergy within each item is user-dependent and dynamic, making it difficult to design effective multi-modal fusion
information redundancy can overwhelm essential user interests, i.e., homogeneous items result in overemphasis on specific kinds of items

Fortunately, MISSRec tackles both of them; details are in part2.

part2

We divide the method into three parts for discussion: model design, the goal of sequential recommendation (SR), and training strategy.

Regarding model design, in my view it is mainly based on two design philosophies.

Two design philosophies:

Multi-modal learning: typically, there are 1) encoders (i.e., BERT for text and ViT for images); 2) feature processing before fusion (i.e., dropout for feature augmentation, text/visual adapters, and concatenation); 3) feature fusion (the MID module)
Transformer: typically, there are 1) an encoder (i.e., the Text encoder, which receives the token sequence after concatenation); 2) a decoder (i.e., the Interest-aware Decoder, which receives the fused features from the MID module); 3) Q-K-V interaction between the encoder and the decoder (i.e., K/V: the tokens encoded by the Text encoder; Q: the interest tokens from the MID module)
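The Q-K-V interaction in the last item can be sketched as single-head scaled dot-product attention. This is only a minimal illustration, not MISSRec's implementation: learned projection matrices and multiple heads are left out, the toy vectors are made up, and the assignment of interest tokens to Q and encoded tokens to K/V simply follows the description in this note.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head attention: softmax(Q K^T / sqrt(d)) V, no learned projections."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy data (hypothetical): 2 interest tokens as Q, 3 encoded tokens as K and V.
interest_tokens = [[1.0, 0.0], [0.0, 1.0]]
encoded_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoded = cross_attention(interest_tokens, encoded_tokens, encoded_tokens)
```

Each row of `decoded` is a convex combination of the encoded tokens, so every query produces one decoded embedding that leans toward the encoded tokens it is most similar to.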

We now have many modules with many functions to analyze, so we need to think about how to organize them and why each one is there (given that I already know the details of MISSRec).

It’s all about the goal of SR:

the representation of the user: the mean-pooled aggregation of the decoded embeddings (the output of the Interest-aware Decoder, which contains rich information about user interests)
the representation of the item: a lightweight fusion module generates user-adaptive item representations (based on the user representation and the outputs of the text/visual adapters)
the matching score between user and item: a factorizable formula over the user representation and the item representation
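The three pieces above can be sketched end to end. This is a minimal sketch under stated assumptions: `fuse_item` is a hypothetical fusion rule (softmax weights over each modality's similarity to the user) chosen only to illustrate "user-adaptive", and the score is a plain dot product; neither is claimed to be the exact formula in the paper. All vectors are toy values.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(vectors):
    """User representation: element-wise mean of the decoded embeddings."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def modality_weights(user, text_feat, image_feat):
    """Hypothetical rule: weight each modality by its similarity to the user."""
    return softmax([dot(user, text_feat), dot(user, image_feat)])

def fuse_item(user, text_feat, image_feat):
    """User-adaptive item representation as a weighted sum of modality features."""
    w_text, w_image = modality_weights(user, text_feat, image_feat)
    return [w_text * t + w_image * v for t, v in zip(text_feat, image_feat)]

def match_score(user, text_feat, image_feat):
    """Matching score as the dot product of user and fused item representations."""
    return dot(user, fuse_item(user, text_feat, image_feat))

# Toy data (hypothetical): three decoded embeddings and one candidate item.
decoded_embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
text_feat, image_feat = [1.0, 0.0], [0.0, 1.0]

user = mean_pool(decoded_embeddings)
score = match_score(user, text_feat, image_feat)

# Because the dot product is linear, the score factorizes per modality.
w_text, w_image = modality_weights(user, text_feat, image_feat)
factorized = w_text * dot(user, text_feat) + w_image * dot(user, image_feat)
```

In this sketch `score` and `factorized` agree, which is one way to read "factorizable": the per-modality similarities can be computed separately and combined afterwards, without materializing a fused item vector for every user.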

Finally, we discuss the training:

Pre-training, with two main loss functions and a regularization: 1) the loss in Eq.5, which captures the correspondence between interaction sequences and candidate items; 2) the loss in Eq.6, which is computed on two representations with the same original semantics to make the model’s representation generation more robust; 3) an orthogonal regularization to diversify the interest-aware decoded results
Fine-tuning, which aims to transfer the knowledge: 1) Inductive transfer: fine-tune only the modality-specific adapters with the loss in Eq.5 and the orthogonal regularization; 2) Transductive transfer: inductive transfer plus ID embeddings
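To make the training objectives more concrete, here is a rough sketch of a contrastive loss and an orthogonality penalty. It assumes an InfoNCE-style form with in-batch negatives for the sequence-item loss, and a squared Frobenius penalty on the Gram matrix for the regularizer; this note does not spell out Eq.5 or the regularizer, so both forms, the function names, and the temperature value are assumptions, not the paper's exact definitions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(seq_reprs, item_reprs, temperature=0.1):
    """Assumed contrastive form: row i of each list is a positive pair,
    and every other item in the batch serves as a negative."""
    seqs = [normalize(s) for s in seq_reprs]
    items = [normalize(t) for t in item_reprs]
    losses = []
    for i, s in enumerate(seqs):
        logits = [dot(s, t) / temperature for t in items]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive item.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

def orthogonal_penalty(vectors):
    """Assumed regularizer: ||G G^T - I||_F^2 for L2-normalized rows G.
    It is zero when the vectors are mutually orthogonal."""
    rows = [normalize(v) for v in vectors]
    penalty = 0.0
    for i, a in enumerate(rows):
        for j, b in enumerate(rows):
            target = 1.0 if i == j else 0.0
            penalty += (dot(a, b) - target) ** 2
    return penalty

# Toy batch (hypothetical): sequence representations and their target items.
seq_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(seq_batch, [[1.0, 0.0], [0.0, 1.0]])
misaligned_loss = info_nce(seq_batch, [[0.0, 1.0], [1.0, 0.0]])

diverse = orthogonal_penalty([[1.0, 0.0], [0.0, 1.0]])
collapsed = orthogonal_penalty([[1.0, 0.0], [1.0, 0.0]])
```

The same `info_nce` shape could also stand in for a sequence-sequence loss by passing two views of the same sequence in place of sequences and items. The penalty grows as the decoded results collapse onto one direction, which is the behavior a diversity regularizer is meant to discourage.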

So far, we have built MISSRec, which has effective multi-modal fusion and clear transfer ability, so it can tackle the cold-start problem, handle challenging data distributions, and perform well. Some details remain, so let’s start a discussion.