Mitigating Spurious Correlations for Self-supervised Recommendation

Recent years have witnessed the great success of self-supervised learning (SSL) in recommendation systems. However, SSL recommender models are likely to suffer from spurious correlations, leading to poor generalization. To mitigate spurious correlations, existing work usually pursues ID-based SSL recommendation or utilizes feature engineering to identify spurious features. Nevertheless, ID-based SSL approaches sacrifice the positive impact of invariant features, while feature engineering methods require high-cost human labeling. To address these problems, we aim to automatically mitigate the effect of spurious correlations. This objective requires us to 1) automatically mask spurious features without supervision, and 2) block the negative effect transmission from spurious features to other features during SSL. To handle the two challenges, we propose an invariant feature learning framework, which first divides user-item interactions into multiple environments with distribution shifts and then learns a feature mask mechanism to capture invariant features across environments. Based on the mask mechanism, we can remove the spurious features for robust predictions and block the negative effect transmission via mask-guided feature augmentation. Extensive experiments on two datasets demonstrate the effectiveness of the proposed framework in mitigating spurious correlations and improving the generalization abilities of SSL models.


Introduction
Self-supervised learning (SSL) approaches have recently become state-of-the-art (SOTA) for personalized recommendation [1,2]. The core idea of SSL in recommendation is to learn better user and item representations via an additional self-discrimination task [3,4], which contrasts the augmentations over user-item features [5] or user-item interaction graphs [1,6] to discover the correlations among features and interactions [2]. Despite the great success, SSL-based recommender models are vulnerable to spurious correlations because they fit the correlations from the input features to interactions. Because of the selection bias in the data collection process [7], spurious correlations inevitably exist in the training data, where some spurious features show strong correlations with users' positive interactions (e.g., clicks). As illustrated in Fig. 1, users with 4 or 6 years of work experience are likely to have positive interactions with full-time jobs, while those with 5 or 7 years of experience tend to have negative interactions. Such correlations between user experience and interactions are not reliable because users' preference for full-time jobs should not change significantly with a gap of only one year of experience. Through the self-discrimination task, SSL models tend to capture these spurious correlations, resulting in poor generalization ability.
To alleviate the harmful effect of spurious correlations on SSL models, existing solutions mainly fall into three categories:
• ID-based SSL methods [1], which only utilize the IDs of users and items for collaborative filtering and thus can avoid the harmful influence of some spurious features. However, user and item features are still useful in recommendation, especially for users with sparse interactions [2]. It is necessary to consider some invariant features that causally affect the interactions. For instance, accounting students usually prefer accountancy-related jobs.
• Feature engineering methods, which identify a set of spurious features manually or using human-machine hybrid approaches [8,9]. Thereafter, the SSL recommender models can be trained after discarding the identified features. Nevertheless, feature engineering methods require extensive human-labeling work and thus are not applicable to large-scale recommendation with extensive user and item features.
• Informative feature selection methods, which automatically recognize the informative cross features and remove the redundant ones in the training process [10,11]. For instance, [11] proposes a two-stage training strategy that identifies informative feature interactions with a regularized optimizer and retrains the model after removing all the redundant features. Nevertheless, spurious features might be very informative for interaction prediction in the training data and thus be retained, which degrades the generalization ability.
To solve these problems, we require the SSL models to automatically mitigate the effect of spurious correlations. Achieving this objective involves two essential challenges:
• It is non-trivial to mask spurious features without supervision. SSL recommender models are expected to automatically identify the spurious features and drop them for robust predictions. As such, we should extract signals from the correlated data to guide the identification of spurious features.
• Blocking the effect transmission from spurious features to other features is of vital importance. SSL models usually maximize the features' mutual information via feature augmentation [12] (e.g., correlated feature masking [2]) and contrastive learning, and thus the spurious features and other correlated features might have similar representations, transferring the detrimental effect of spurious features to other features. For example, users' experience in Fig. 1 might be correlated with the users' age, and SSL models are likely to learn similar representations for experience and age via self-discrimination.
To address the two challenges, we consider learning a feature mask mechanism from multiple environments to estimate the probabilities of spurious features and then adopt the mask mechanism to guide the feature augmentation in SSL models. Specifically, 1) we can cluster the interactions into multiple environments, where each environment has similar feature distributions, but the distributions shift between environments. The distribution shifts will guide the mask mechanism to capture invariant features across environments and exclude spurious features [13]. 2) Besides, we can utilize the mask mechanism to drop the spurious features as the augmented sample and then maximize the mutual information between the invariant features in the augmented sample and all the input features in the factual sample, pushing SSL models to ignore the spurious features and cut off the negative effect transmission from spurious features to invariant features.
To this end, we propose an Invariant Feature Learning (IFL) framework for SSL recommender models to mitigate spurious correlations. In particular, IFL clusters the training interactions into multiple environments and leverages a masking mechanism with learnable parameters in [0, 1] to shield spurious correlations. To optimize the mask parameters, IFL adopts a variance loss to identify invariant features and achieve robust predictions across environments. As for the self-discrimination task, we drop the spurious features based on the mask parameters to form the augmented sample, and then maximize the mutual information between the factual and augmented samples via a contrastive loss, which pushes the SSL model to ignore spurious features. We instantiate IFL on a SOTA SSL model [2], and extensive experiments on two real-world datasets validate the effectiveness of the proposed IFL in mitigating spurious correlations.
In summary, our contributions are as follows:
• We point out the spurious correlations in SSL recommendation and consider learning invariant features from multiple environments.
• We propose a model-agnostic IFL framework, which leverages a feature mask mechanism and mask-guided contrastive learning to reduce spurious correlations for SSL models.
• Empirical results on two public datasets verify the superiority of our proposed IFL in masking spurious features and enhancing the generalization ability of SSL models.

Method
In this section, we first introduce the recommendation task, SSL recommendation, and spurious correlations in Section 2.1. We then present our IFL framework for SSL recommendation in Section 2.2, including feature mask learning and mask-guided contrastive learning.

Recommender Formulation
The general idea of the recommendation task is to learn user preference from collected interactions between the users $\mathcal{U}$ and items $\mathcal{I}$, where each user $u \in \mathcal{U}$ has $N$ features such as ID, experience year, and country, denoted by $X_u = \{x_u^1, x_u^2, \dots, x_u^N\}$. Each entry in $X_u$ is a one-hot vector indicating a specific feature value (e.g., experience year = 5). Likewise, an item $i \in \mathcal{I}$ has $M$ features, denoted by $X_i = \{x_i^1, x_i^2, \dots, x_i^M\}$. Given a user-item interaction dataset $\mathcal{D} = \{(X_u, X_i, y_{ui})\}$ with $y_{ui} \in \{0, 1\}$ indicating whether $u$ interacts with $i$, the recommender model aims to learn a function $f(u, i \mid \theta)$ to capture user preference. Here, $\theta$ denotes the learnable parameters optimized over the dataset $\mathcal{D}$ via a collaborative filtering (CF) loss, such as the BPR loss [14].

Self-supervised Recommendation
SSL recommendation introduces an extra self-discrimination task to learn better user and item representations. The self-discrimination task includes two key steps: data augmentation and contrastive learning. SSL first augments factual samples by randomly dropping user and item features [2], or by conducting edge and node dropout in the user-item interaction graph [1]. Then, based on the augmented samples, the positive and negative pairs are constructed for contrastive learning.
In the self-discrimination task, SSL models essentially explore the relationships among the user features, the item features, and the interactions [2]. As such, they are likely to fit spurious correlations from the spurious features to interactions.

Spurious correlations
Spurious correlations broadly exist in the training dataset $\mathcal{D}$ due to the selection bias in data collection [7]. Intuitively, some spurious features do not causally affect the interactions but have a strong correlation with interactions purely because of the selection bias (see the example in Fig. 1). Through the normal recommender training over $\mathcal{D}$ and the additional self-discrimination task, SSL models will easily capture these shortcut correlations and suffer from poor generalization when the data distribution shifts. Moreover, the self-discrimination task via feature augmentation will maximize the mutual information between user and item features [2], transferring the detrimental effect from spurious features to other invariant features. Such effect transmission in SSL models will further intensify the negative influence of spurious correlations.

Invariant Feature Learning
To mitigate the spurious correlations, we propose an IFL framework that can automatically identify spurious features via the feature mask mechanism and block the negative effect transmission by mask-guided contrastive learning.
The overall IFL framework is illustrated in Fig. 2. Given a pair of user and item features $(X_u, X_i)$, we first look up their embeddings via the feature embedding layer, and then utilize the feature mask mechanism to mask the embeddings of spurious features for predictions. We utilize the CF loss to optimize the feature embeddings while adopting a variance loss to supervise the learning of the mask mechanism. Regarding self-discrimination, we feed the user and item features into the mask-guided augmentation layer and conduct contrastive learning over the factual and augmented samples.

Feature Mask Learning
In order to identify spurious features and remove their harmful influence, we introduce a feature mask mechanism.
Specifically, we define two feature masks $m_u \in \mathbb{R}^N$ and $m_i \in \mathbb{R}^M$, which are shared across all users and items, respectively. The entries of $m_u$ and $m_i$ lie in the range $[0, 1]$ and denote the probability of being invariant features; thus, a feature with a smaller mask value is more likely to be a spurious feature. To estimate the two masks, we draw $m_u$ and $m_i$ from clipped Gaussian distributions [15] parameterized by $\gamma_1 \in \mathbb{R}^N$ and $\gamma_2 \in \mathbb{R}^M$, respectively. Formally, for each mask we have
$$m_u = \gamma_1 + \epsilon, \qquad m_i = \gamma_2 + \epsilon,$$
where the noise $\epsilon$ is drawn from $\mathcal{N}(0, \sigma_\epsilon^2)$. To ensure that the two masks represent probabilities, we clip the values of $m_u$ and $m_i$ into $[0, 1]$. Thereafter, we apply the masks to shield the spurious features before feeding the feature embeddings into the Deep Neural Network (DNN) encoders. In detail, we represent the user and item features via the embeddings $X_u \in \mathbb{R}^{N \times D}$ and $X_i \in \mathbb{R}^{M \times D}$, where $D$ is the embedding dimension. For each user-item pair, we have
$$X'_u = m_u \odot X_u, \qquad X'_i = m_i \odot X_i,$$
where $\odot$ is the element-wise multiplication, applied so that each feature embedding is scaled by its mask value. As such, the feature embeddings are masked via $m_u$ and $m_i$. Next, we concatenate the $N$ feature embeddings in $X'_u$ (and, likewise, the $M$ feature embeddings in $X'_i$) into a vector and then feed it into the DNN encoders to obtain the user representation $z_u$ and item representation $z_i$.
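The mask sampling and masking steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function names (`sample_mask`, `apply_mask`, `flatten`), the example values of the mask parameters, and the noise scale `sigma` are all assumptions made here, and the DNN encoder is omitted.

```python
import random

def sample_mask(gamma, sigma=0.1, rng=None):
    """Draw one mask value per feature: gamma_n + eps with eps ~ N(0, sigma^2),
    then clip the result into [0, 1]."""
    if rng is None:
        rng = random.Random()
    return [min(max(g + rng.gauss(0.0, sigma), 0.0), 1.0) for g in gamma]

def apply_mask(mask, embeddings):
    """Scale the n-th feature embedding (a list of D floats) by the n-th mask value."""
    return [[m * v for v in emb] for m, emb in zip(mask, embeddings)]

def flatten(embeddings):
    """Concatenate the masked feature embeddings into one encoder input vector."""
    return [v for emb in embeddings for v in emb]

rng = random.Random(0)
gamma_u = [0.9, 0.05, 0.5]                    # hypothetical parameters for N = 3 user features
X_u = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # N x D embeddings with D = 2
m_u = sample_mask(gamma_u, sigma=0.1, rng=rng)
encoder_input = flatten(apply_mask(m_u, X_u))  # vector of length N * D fed to the encoder
```

A feature whose mask value is clipped to 0 contributes a zero embedding, so the encoder cannot use it for prediction.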
• Environment Division. The key challenge of automatically recognizing spurious features is to extract the supervision signals from the correlated data. To address this challenge, we cluster the interactions into multiple environments with distribution shifts and utilize the shifts to discover invariant features. Specifically, we obtain the representation $z_{ui}$ of each user-item pair by concatenating the user and item representations:
$$z_{ui} = \mathrm{concat}(z_u, z_i).$$
Thereafter, we adopt K-means to cluster the interaction representations $z_{ui}$ into $C$ environments:
$$\{z_{ui}^{c}\}_{c=1}^{C} = \text{K-means}(\{z_{ui}\}),$$
where $z_{ui}^{c}$ denotes an interaction representation that belongs to environment $c$. By clustering, similar features will be divided into the same environment while the feature distribution shifts across environments. The interactions with the same spurious features are similar and thus tend to be clustered into the same environment. As illustrated in Fig. 3, the spurious features show different distributions across environments, and only invariant features have consistent distributions. As such, pursuing robust predictions across environments will push the mask mechanism to discover the invariant features and exclude spurious features.
• CF and Variance Loss. To push the mask mechanism to exclude spurious features, we incorporate a variance loss. Before introducing the variance loss, we first detail the CF loss for the normal recommender training. Following [2], we adopt the batch-softmax loss for a batch of interactions, i.e.,
$$\mathcal{L}_{CF} = -\frac{1}{B} \sum_{k=1}^{B} \log \frac{\exp\left(s(z_{u_k}, z_{i_k})/\tau\right)}{\sum_{j=1}^{B} \exp\left(s(z_{u_k}, z_{i_j})/\tau\right)},$$
where $B$ is the batch size, $\tau$ refers to a temperature hyper-parameter, and $s(\cdot)$ denotes the cosine similarity function.
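As to the variance loss, we separately calculate $\mathcal{L}_{CF}$ within each environment, i.e., $\mathcal{L}_c$ with $c$ varying from 1 to $C$. Formally,
$$\mathcal{L}_c = -\frac{1}{B_c} \sum_{k=1}^{B_c} \log \frac{\exp\left(s(z^c_{u_k}, z^c_{i_k})/\tau\right)}{\sum_{j=1}^{B_c} \exp\left(s(z^c_{u_k}, z^c_{i_j})/\tau\right)},$$
where $z^c_{u_k}$ and $z^c_{i_k}$ represent the $k$-th user and item representations of environment $c$, and $B_c$ is the number of interactions that belong to environment $c$. We then regulate the mask mechanism by minimizing the gradient variance $\mathcal{L}_v = \mathcal{L}_{vu} + \mathcal{L}_{vi}$, where $\mathcal{L}_{vu}$ and $\mathcal{L}_{vi}$ are as follows:
$$\mathcal{L}_{vu} = \frac{1}{C} \sum_{c=1}^{C} \left\| \nabla_{m_u} \mathcal{L}_c - \nabla_{m_u} \mathcal{L}_{CF} \right\|^2, \qquad \mathcal{L}_{vi} = \frac{1}{C} \sum_{c=1}^{C} \left\| \nabla_{m_i} \mathcal{L}_c - \nabla_{m_i} \mathcal{L}_{CF} \right\|^2,$$
where $\nabla_{m_u} \mathcal{L}_c$ is the gradient w.r.t. the mask $m_u$ in environment $c$ and $\nabla_{m_u} \mathcal{L}_{CF}$ denotes the average gradient w.r.t. $m_u$ across the $C$ environments. $\mathcal{L}_{vu}$ thus reflects the gradient variance over the $C$ environments. Here, the use of gradient variance follows the idea of invariant learning in [16], where the gradient variance is simplified from the variance penalty regularizer proposed in [17] for feature selection scenarios. Minimizing the gradient variance will regulate $m_u$ to have similar gradients and close performance in multiple environments [18]. $\mathcal{L}_{vi}$ is defined analogously to $\mathcal{L}_{vu}$ for the regularization of $m_i$. As such, optimizing the variance loss avoids the situation in which most environments are well predicted by capturing spurious correlations, while a few environments have inferior performance because of the spurious correlations.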
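To make the per-environment loss and the gradient variance concrete, the following is a minimal sketch in plain Python. Several things here are assumptions rather than details from the paper: the environment assignment is taken as given (for example, labels produced by K-means), the encoder is replaced by a toy model in which the user representation is simply the mask times the user features, gradients are approximated by central finite differences, and the variance is taken as the average squared distance to the mean gradient. All function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity s(a, b), with a small constant to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def batch_softmax_loss(users, items, tau=0.5):
    """Mean over k of -log softmax_j(s(z_u_k, z_i_j) / tau) evaluated at j = k."""
    total = 0.0
    for k, zu in enumerate(users):
        logits = [cosine(zu, zi) / tau for zi in items]
        total += math.log(sum(math.exp(l) for l in logits)) - logits[k]
    return total / len(users)

def env_loss(mask, users, items, tau=0.5):
    """Toy model (assumption): user representation = mask * user features, no DNN."""
    masked = [[m * x for m, x in zip(mask, u)] for u in users]
    return batch_softmax_loss(masked, items, tau)

def grad_wrt_mask(mask, users, items, h=1e-5):
    """Central finite-difference gradient of one environment's loss w.r.t. the mask."""
    grad = []
    for n in range(len(mask)):
        up = mask[:n] + [mask[n] + h] + mask[n + 1:]
        down = mask[:n] + [mask[n] - h] + mask[n + 1:]
        grad.append((env_loss(up, users, items) - env_loss(down, users, items)) / (2 * h))
    return grad

def gradient_variance(env_grads):
    """Average squared distance between each environment's gradient and their mean."""
    C = len(env_grads)
    mean = [sum(g[n] for g in env_grads) / C for n in range(len(env_grads[0]))]
    return sum(sum((gn - mn) ** 2 for gn, mn in zip(g, mean)) for g in env_grads) / C

mask = [0.8, 0.6]
# Two hypothetical environments, each a (user features, item representations) pair.
env_a = ([[1.0, 0.2], [0.1, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
env_b = ([[1.0, 0.9], [0.8, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
grads = [grad_wrt_mask(mask, u, i) for u, i in (env_a, env_b)]
L_v = gradient_variance(grads)
```

When every environment yields the same gradient, the penalty is zero; it grows as the environments pull the mask in different directions, which is the signal used to separate invariant from spurious features.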

Mask-guided Contrastive Learning
By incorporating the feature mask mechanism into the self-discrimination task, we require the model to ignore the spurious features and cut off the negative effect transmission from spurious features to invariant features.
• Data Augmentation. Instead of random feature dropout for augmentation, we use a dropping probability to drop the spurious features. The dropping probability of each feature is proportional to its probability of being a spurious feature in the mask mechanism. As shown in Fig. 4, we then utilize contrastive learning to maximize the mutual information between the factual sample and the augmented samples.
• Contrastive Loss. Fig. 4(c) illustrates the contrastive loss in a batch of samples. Formally,
$$\mathcal{L}_{CL} = -\frac{1}{B} \sum_{k=1}^{B} \log \frac{\exp\left(s(z_k^{fac}, z_k^{inv})/\tau\right)}{\sum_{j=1}^{B} \exp\left(s(z_k^{fac}, z_j^{inv})/\tau\right)},$$
where $z_k^{fac}$ stands for the representation of a factual sample with all input features, and $z_k^{inv}$ is the representation of the augmented sample that drops spurious features. By treating $z_k^{fac}$ and $z_k^{inv}$ as a positive pair, the SSL model will ignore the spurious features recognized by the mask mechanism, blocking the effect transmission from spurious features to other invariant features.
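• Optimization. In our proposed IFL, the overall objective function combines the CF loss $\mathcal{L}_{CF}$, the contrastive loss $\mathcal{L}_{CL}$, and the variance loss $\mathcal{L}_v$:
$$\mathcal{L} = \mathcal{L}_{CF} + \alpha \mathcal{L}_{CL} + \beta \mathcal{L}_v + \lambda \|\theta\|_2^2,$$
where $\theta$ denotes the model's learnable parameters, and $\|\theta\|_2^2$ is the L2 regularization term to avoid over-fitting. Besides, $\alpha$, $\beta$, and $\lambda$ are the hyper-parameters that adjust the effect of the contrastive loss, the variance loss, and the regularization, respectively. Finally, we utilize this overall loss to optimize the parameters by gradient descent.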
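The mask-guided augmentation and the contrastive loss between factual and augmented samples can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the dropping probability is taken to be exactly one minus the mask value (the text only states proportionality), dropped features are represented by zero vectors, the encoder is replaced by simple concatenation, and the negatives are the augmented samples of the other interactions in the batch. All function names are hypothetical.

```python
import math
import random

def mask_guided_dropout(embeddings, mask, rng):
    """Keep feature n with probability mask[n]; otherwise replace its embedding by zeros."""
    return [list(e) if rng.random() < m else [0.0] * len(e)
            for e, m in zip(embeddings, mask)]

def flatten(embeddings):
    """Toy encoder (assumption): concatenate the feature embeddings."""
    return [v for e in embeddings for v in e]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def contrastive_loss(z_fac, z_inv, tau=0.5):
    """Batch-softmax contrast: (z_fac[k], z_inv[k]) is the positive pair for sample k."""
    total = 0.0
    for k, zf in enumerate(z_fac):
        logits = [cosine(zf, zi) / tau for zi in z_inv]
        total += math.log(sum(math.exp(l) for l in logits)) - logits[k]
    return total / len(z_fac)

rng = random.Random(0)
batch = [                                    # two samples, three features, D = 2
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]],
]
mask = [1.0, 0.0, 1.0]                       # hypothetical mask: feature 2 is spurious
z_fac = [flatten(x) for x in batch]          # factual samples keep all features
z_inv = [flatten(mask_guided_dropout(x, mask, rng)) for x in batch]
loss = contrastive_loss(z_fac, z_inv, tau=0.5)
```

With a mask value of 0 the feature is always dropped and with a mask value of 1 it is always kept, so the augmented view in this example differs from the factual view only in the feature flagged as spurious.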

Experiment
We conduct extensive experiments to answer the following research questions:
• RQ1: Can the IFL framework outperform baselines on the recommendation performance when spurious features exist?
• RQ2: How do the components of IFL affect its effectiveness?
• RQ3: How do the hyper-parameters affect the validity of the proposed method?

Experimental Settings
Datasets. We conduct experiments on two real-world datasets, Meituan and XING. Their statistics are shown in Table 1.
• Meituan is a food recommendation dataset with rich user consumption of food. For each sample, we keep the following important attributes: user ID, user income, item ID, and item price.
• XING is a job recommendation dataset, including several types of interactions and abundant features of users and jobs. For simplicity, we delete some unimportant features. We merge three types of interactions (types 1-3) to reflect user interests, making a sample negative only if all three types of interaction are negative.
We apply the 5-core setting [19] to filter the datasets and randomly split them into training, validation, and testing sets with the ratio of 8:1:1, where the testing set is used for the IID testing setting, i.e., the setting in which the training and testing data are independent and identically distributed. Besides, to better evaluate the models' ability to deal with spurious correlations, we also construct an OOD testing set, in which the spurious correlations are manually controlled. For example, we identify work experience as the spurious feature for the XING dataset. We then filter out some interactions such that the OOD set and the training set have significantly different conditional distributions of the interaction given this spurious feature, as shown in Fig. 5.
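The filtering and splitting steps can be sketched as below. This assumes the common reading of the 5-core setting, in which users and items with fewer than five interactions are removed repeatedly until every remaining user and item has at least five; the function names and the toy data are hypothetical, and the construction of the OOD set is not covered.

```python
import random
from collections import Counter

def k_core(interactions, k=5):
    """Repeatedly drop users and items with fewer than k interactions until stable."""
    data = list(interactions)
    while True:
        users = Counter(u for u, _ in data)
        items = Counter(i for _, i in data)
        kept = [(u, i) for u, i in data if users[u] >= k and items[i] >= k]
        if len(kept) == len(data):
            return kept
        data = kept

def split_811(interactions, seed=0):
    """Shuffle and split into training, validation, and testing sets with ratio 8:1:1."""
    data = list(interactions)
    random.Random(seed).shuffle(data)
    n_train = len(data) * 8 // 10
    n_val = len(data) // 10
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

# Toy data: 10 users x 10 items, plus one user with a single interaction.
raw = [(u, i) for u in range(10) for i in range(10)] + [(99, 0)]
filtered = k_core(raw, k=5)
train, val, test = split_811(filtered, seed=0)
```

The loop is needed because removing a sparse user can push an item below the threshold, and vice versa.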
Baselines. We compare the proposed IFL framework with the following representative recommender models: FM [20], NFM [21], DeepFM [22], SSL-DNN [2], SGL [1], SGL+Attr-A, SGL+Attr-B, AutoDeepFM [11], and AutoFIS+CFM.

Table 2 Performance comparisons between baselines and IFL on Meituan and XING. The best results are highlighted in bold and the sub-optimal results are underlined. For Meituan, the columns report Recall@50, Recall@100, NDCG@50, and NDCG@100 under the OOD and IID testing settings.

Figure 5 The conditional distribution of the interaction given spurious features for the IID and OOD sets of XING, where the spurious feature is the user work experience and the interaction is about interacting with the full-time job.

Evaluation. Following the common setting of existing work, we adopt the full-ranking protocol to evaluate the top-K recommendation performance, with the averaged Recall@K and NDCG@K as the metrics. We set K ∈ {50, 100} for Meituan and K ∈ {10, 20} for XING. We compute the metrics on both the IID and OOD testing sets.
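For reference, Recall@K and NDCG@K for a single user can be computed as in the sketch below, which assumes the usual binary-relevance definitions (an item is relevant if the user interacted with it in the testing set). The function names and the toy ranking are hypothetical; the reported metrics would be the averages of these values over all test users.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the user's relevant items that appear in the top-k ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """DCG of the top-k ranking divided by the DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

ranked = ["a", "b", "c", "d"]     # items sorted by predicted score for one user
relevant = {"a", "d"}             # items the user interacted with in the testing set
r = recall_at_k(ranked, relevant, k=2)
n = ndcg_at_k(ranked, relevant, k=2)
```

In this toy case only one of the two relevant items is ranked in the top 2, so Recall@2 is 0.5, and NDCG@2 is below 1 because the ideal ranking would place both relevant items first.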
Hyper-parameter Settings. We fix the embedding size to 64 for all methods. Then, we tune the hyper-parameters on the validation sets with the following strategies:

Overall Performance (RQ1)
We compare the recommendation performance in both IID and OOD settings between IFL and the baselines. The results are shown in Table 2, where we have the following observations:
• In the OOD testing setting, the proposed IFL outperforms all baselines, including SSL-based methods, on both datasets. To achieve good OOD generalization, a model should eliminate the impact of spurious correlations in the training set. Thus, this result verifies the effectiveness of our proposal (spurious feature mask learning and mask-guided data augmentation) in identifying the spurious features and blocking the harmful effect of the corresponding spurious correlations for SSL.
• In the IID testing setting, IFL achieves at least comparable performance to the SOTA baselines. Together with its good OOD performance, this result shows that IFL can achieve good OOD generalization ability without sacrificing IID prediction.
• When moving from the IID testing setting to the OOD testing setting, all methods show sharp performance drops. This is expected, since there are huge distribution shifts from the training set to the OOD testing set. Although IFL is expected to achieve good OOD performance by removing the spurious correlations in training, we cannot expect it to have no performance drop since 1) the OOD set is not an unbiased set and 2) the data distribution itself also influences the model learning and further affects model performance.
• Among all baselines, SGL usually achieves better performance. This can be attributed to its good ability to model high-order interactions via graph learning and its effective graph-based data augmentation. Meanwhile, because it uses only ID information, it is less likely to be affected by spurious features, achieving better OOD performance. However, SGL+Attr-A and SGL+Attr-B perform worse than SGL, which may be attributed to the noise introduced by spurious features. Therefore, it verifies that the superior performance of our proposed IFL comes from the invariant features learned by the feature mask mechanism. Moreover, the better performance of SSL-DNN compared to AutoFIS+CFM and the relatively poor performance of SSL-DNN compared to IFL and SGL again indicate the necessity of dealing with the spurious correlation issue for SSL and the effectiveness of IFL.

Ablation Study (RQ2)
Feature mask learning is key to identifying the spurious features and further helping to remove these features. In this section, to verify the importance of its key components, we conduct the following ablation studies:
• The variance loss in feature mask learning is the key to pushing the feature mask mechanism to identify the invariant features and discard the spurious features. Therefore, we study its importance. Specifically, we compare IFL with the variance loss and a variant of IFL without the loss. Fig. 6 shows the results. We find that 1) in the IID testing setting, the two models show similar performance, but 2) in the OOD testing setting, the performance of IFL without the variance loss decreases markedly compared to IFL with the loss. This shows that the variance loss is necessary for helping the feature masking mechanism identify and discard the spurious features.
• To verify the importance of the proposed spurious feature mask mechanism, we compare IFL with (w/) and without (w/o) the feature mask mechanism. The results are shown in Fig. 7. From the figure, we find that IFL without the feature mask shows poorer performance in both the IID testing and the OOD testing. The results of the OOD testing indicate that it is necessary to remove the effects of spurious features for OOD generalization, and that the proposed feature mask mechanism effectively discards these effects. Meanwhile, the results of the IID testing show that removing spurious features could also benefit IID prediction. A possible reason is that the spurious features are unnecessary for IID prediction, and such unnecessary features could introduce noise to the model learning, as discussed in [11]. Note that directly removing such features is not helpful for OOD generalization, as evidenced by the fact that IFL without the variance loss also applies the feature mask mechanism but shows poor performance (cf. Fig. 6).

Hyper-parameter Analysis (RQ3)
To analyze the influences of different hyper-parameters of IFL on the recommendation performance, we conduct a series of experiments on Meituan by varying these hyper-parameters. Note that when studying one hyper-parameter, we fix the other hyper-parameters at the optimal values in Table 2.
For the hyper-parameters β, which controls the influence of the variance loss, α, which adjusts the weight of the SSL loss, and τ, the temperature in the SSL loss, we vary them in the ranges of {0, 1e-4, 1e-3, 1e-2, 0.1, 1}, {0, 0.1, 0.3, 0.5, 0.7}, and {0.1, 0.3, 0.5, 0.7}, respectively. Fig. 8 and Fig. 9 show the results for them in the OOD and IID testing settings, respectively. For the hyper-parameter C, we vary it in the range of {1, 2, 4}, and Table 3 shows the corresponding results. From the figures and the table, we have the following observations:
• For β, according to the sub-figures on the left of Fig. 8 and Fig. 9, we find that β has only a slight influence on the IID testing performance, while in the OOD testing setting, the performance roughly shows a decreasing trend when β decreases. When β decreases to 0, i.e., there is no variance loss to push the feature mask mechanism to discard the spurious features, IFL cannot outperform the normal SSL method SGL (cf. Table 2). Thus, when tuning β for IFL, we do not need to pay much attention to the IID performance, and we need to set a relatively large β to achieve better OOD performance.
• For α, as the middle sub-figures of Fig. 8 and Fig. 9 show, the OOD and IID testing performances show similar increasing trends when α increases. Thus, when tuning it, we do not need to consider a trade-off between the OOD and IID performance. Besides, we note that when α = 0, i.e., the SSL part is disabled for IFL, the OOD testing performance of IFL decreases sharply, but IFL can still outperform other non-SSL baselines in the OOD testing setting (cf. Table 2). This shows that SSL is important for IFL but the proposed feature mask learning of IFL is also useful for non-SSL models.
• Regarding τ, according to the sub-figures on the right of Fig. 8 and Fig. 9, we find that both the IID and OOD testing performances are sensitive to τ, which is similar to previous work [1]. Fortunately, the OOD and IID testing performances show very similar fluctuating trends, making hyper-parameter tuning relatively easy.
• Regarding C, as Table 3 shows, the IID performance remains relatively stable as C changes, while the OOD performance is hugely affected by C. Since the quality and the number of the environments usually have huge influences on the effectiveness of invariant learning [16], we need to tune it carefully.

Case Study
To illustrate how IFL blocks the negative effect of spurious features, we analyze the feature masks and examine the interaction distributions.From Fig. 10, we find that some user features are recognized as spurious features with the feature mask value as zero.Thereafter, we investigate the interaction distributions on XING to verify the effectiveness of the feature mask.
Here, we demonstrate two examples of the discovered spurious user features (i.e., experience years and experience years in the current job). According to the figure on the left in Fig. 11, among all full-time jobs, users with 4 and 6 years of work experience are more likely to have positive interactions, while those with 5 and 7 years are more likely to have negative interactions. However, based on expert knowledge, the preference for full-time jobs of users whose work experience differs by only one year should not be significantly different, which makes users' number of years of experience a spurious feature. Similarly, from the graph on the right of Fig. 11, user preference for jobs in Austria is drastically influenced by the work experience in the current job. This is implausible; thus, the years of experience of users in their current job is also a spurious feature. In prediction, these identified spurious features are well shielded by the feature mask, which intuitively explains the superior generalization ability of IFL.
Related Work

Neural Recommendation
With the rise of deep learning in machine learning, neural recommendation has flourished [24][25][26] and can usually surpass traditional recommenders [24]. Technically, we can categorize the existing work into two lines. The first line develops neural recommenders based on various deep neural networks [24], including work based on MLP [27,28], convolutional neural networks [29], self-attention [30], etc. Another line of efforts models recommendation data with different graphs, e.g., bipartite graph [31], knowledge graph [32], and hypergraph [33], and then designs recommenders with graph neural networks. However, these methods are trained in a supervised-learning manner, and the supervision signals (i.e., interaction data) are extremely sparse compared to the overall sample space, limiting the effectiveness of neural recommenders. In contrast, we utilize self-supervised learning, which has the potential to overcome this drawback.

Self-supervised Learning
Regarding SSL, contrastive models are the most related, which learn to compare samples through a Noise Contrastive Estimation (NCE) objective [1]. Some work focuses on modeling the contrast between the local part of a sample and its global context [34], and other work performs comparisons between different views of samples [35,36]. Benefiting from its relatively lower dependency on labeled data, SSL also receives huge attention in recommendation [37]. Many attempts have been made for various neural recommenders to overcome the data sparsity challenge [1,2,37]. For example, SGL [1] applies SSL to the graph recommender model based on node and edge dropout. Although SSL has become popular in many areas and has become the SOTA for personalized recommendation [1,2], SSL might capture spurious correlations, since it blindly discovers correlations with the self-discrimination task, resulting in poor generalization ability. In this work, we try to overcome this drawback by incorporating invariant learning.

Causal Recommendation
Data-driven recommender systems achieve great success in large-scale recommendation scenarios [24]. However, recent work finds that they face various biases [38][39][40], unfairness [41], and low OOD generalization ability issues [42]. These issues are commonly attributed to the lack of causal modeling, without which models capture spurious correlations [38,42]. Many efforts try to incorporate causality into neural recommendation to overcome these drawbacks [38,39,42]. There are mainly two types of work. The first line of research is based on the potential outcome framework [39,43], where IPS [39] and doubly robust [44] are utilized to achieve unbiased recommendation. Another line of research is based on the structural causal model [7,38,42,45]. The existing efforts usually analyze the causal relationships with causal graphs and then estimate the target causal effect with the intervention [38] or counterfactual inference [42,45] for debiasing, fairness, or OOD generalization. Nevertheless, none of the previous methods considers dealing with the spurious correlation issue for SSL. Besides, to the best of our knowledge, existing work in recommendation does not adopt invariant learning [16,18] to remove spurious correlations.

Conclusion
In this work, we inspected spurious correlations in SSL recommendation. To improve the generalization ability of SSL recommender models, we proposed the IFL framework to exclude spurious features and leverage invariant features for recommendation. Specifically, we considered a feature mask mechanism to automatically recognize spurious features and utilized mask-guided contrastive learning to block the harmful effect transmission from spurious features to invariant features. Extensive experiments validate the superiority of IFL in mitigating spurious correlations and enhancing the generalization ability of SSL models.
This work makes an initial attempt to mitigate the effect of spurious correlations on SSL recommendation. In future work, one promising direction is to learn a personalized feature mask mechanism to discover user-specific invariant features. Additionally, to achieve better generalization performance, it is worth improving the approaches that automatically divide environments with distribution shifts.

Figure 1
Figure 1 An example of spurious correlations in job recommendation, where the positive interactions with the full-time jobs show strong correlations with the users' 4 and 6 years of experience.

Figure 2
Figure 2 Illustration of the IFL framework. The two-tower encoders on the left are used for normal recommender training with the CF loss and variance loss. On the right side, we do mask-guided feature augmentation over the user and item features, i.e., drop the spurious features and then conduct contrastive learning. The augmentations on users and items are the same, and we only show that on items to save space.

Figure 3
Figure 3 Illustration of the environment division in feature mask learning. The interaction representations z_ui are clustered into C = 2 environments. The distributions of spurious features shift across the two environments, while those of invariant features are stable.
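The environment division described in this caption can be sketched with a plain k-means clustering over the interaction representations. This is only an assumption for illustration: the caption states that the representations are clustered into C environments but does not name the algorithm here, and the function name divide_environments, the farthest-point initialization, and the toy representations are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def divide_environments(reps, num_envs=2, iters=20):
    """Assign each interaction representation to one environment (k-means)."""
    # Farthest-point initialization keeps the sketch deterministic.
    centroids = [reps[0]]
    while len(centroids) < num_envs:
        centroids.append(
            max(reps, key=lambda r: min(sq_dist(r, c) for c in centroids))
        )
    assign = [0] * len(reps)
    for _ in range(iters):
        # Assignment step: nearest centroid for every representation.
        assign = [
            min(range(num_envs), key=lambda e: sq_dist(r, centroids[e]))
            for r in reps
        ]
        # Update step: recompute each centroid from its members.
        for e in range(num_envs):
            members = [r for r, a in zip(reps, assign) if a == e]
            if members:  # keep the old centroid if an environment is empty
                centroids[e] = mean(members)
    return assign


# Toy interaction representations (assumed values) forming two groups.
reps = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
envs = divide_environments(reps, num_envs=2)
```

Each resulting environment index can then be used to group interactions, so that feature distributions can be compared across environments.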

Figure 4
Figure 4 Illustration of mask-guided contrastive learning in IFL. (a) shows the augmentation by dropping spurious features, (b) demonstrates the contrastive pairs in a sample batch, and (c) presents the contrastive loss between two items. Only two samples' contrastive pairs are shown in (b) for neatness.
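As a rough illustration of panel (a), the sketch below drops spurious features under two assumptions that the caption does not spell out: the mask holds one binary value per feature field (1 for invariant, 0 for spurious), and "dropping" a feature means zeroing its embedding. The function name drop_spurious and the toy embeddings are hypothetical.

```python
def drop_spurious(feature_embeddings, mask):
    """Zero out the embedding of every feature field whose mask value is 0."""
    return [
        [x if m == 1 else 0.0 for x in emb]
        for emb, m in zip(feature_embeddings, mask)
    ]


# Two toy items (assumed values) with three feature fields each.
# They share the invariant fields 0 and 2 and differ only in field 1.
item_a = [[0.5, 0.1], [0.9, -0.3], [0.2, 0.4]]
item_b = [[0.5, 0.1], [-0.7, 0.8], [0.2, 0.4]]
mask = [1, 0, 1]  # field 1 is treated as spurious

aug_a = drop_spurious(item_a, mask)
aug_b = drop_spurious(item_b, mask)
```

After masking, the two items have identical augmented views, so a contrastive loss computed on such views cannot rely on the dropped field.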
[21], and DeepFM [22] are powerful feature-based recommenders, which blindly utilize all features by feature interaction modeling to generate recommendations.
• SSL-DNN [2] is a multi-task-based self-supervised learning recommendation framework with two towers of DNN. It adopts a feature dropout strategy for data augmentation. Here, we utilize correlated feature masking (CFM) for feature dropout due to its better performance compared to random feature masking [2], and SSL-DNN shares the same base neural network and learnable parameters as IFL.
• SGL [1] is a self-supervised learning method for graph-based collaborative filtering [23], which only utilizes the features of the user ID and item ID. It conducts data augmentation by node dropout and edge dropout.
• SGL+Attr-A and SGL+Attr-B are two variants of SGL that utilize side information. We add the features for SGL+Attr-A and SGL+Attr-B before and after the graph propagation, respectively.
• AutoDeepFM [11] is a method that can automatically discover and remove redundant feature interactions based on informative feature selection. It identifies the informative cross features by assigning learnable weights to feature interactions.
• AutoFIS+CFM adopts the informative feature selection into SSL-DNN. Learnable weights for features are used by SSL-DNN to discover the redundant features, and the model will be re-trained with only informative features.

Figure 7
Figure 7 The performance of IFL with (w/) or without (w/o) performing the feature mask mechanism under the IID and OOD testing settings on Meituan.

Figure 8
Figure 8 Hyper-parameter influences on the OOD testing performances for Meituan, regarding the hyper-parameters β to control the influence of variance loss, α to adjust the weight of SSL loss and τ in SSL loss for smoothing.

Figure 9
Figure 9 The influences of hyper-parameters on the IID testing performances for Meituan. There are three hyper-parameters: β to control the influence of variance loss, α to adjust the weight of SSL loss, and τ in SSL loss for smoothing.

Figure 10
Figure 10 User feature mask learned by IFL in XING.The mask value of 1 indicates the invariant feature, while that of 0 indicates the spurious feature.

Figure 11
Figure 11 The figure to the left presents the distribution of positive/negative interactions over users' years of work experience among all full-time jobs.The figure to the right demonstrates the distribution of positive/negative interactions over users' years of work experience in their current job among all jobs in Austria.

Table 1
The statistics of the two datasets. Here, 'Int.' denotes 'Interactions'.

Table 3
The influences of the hyper-parameter C to adjust the number of environments on Meituan.