Disentangled Variational Auto-encoder Enhanced by Counterfactual Data for Debiasing Recommendation

Abstract
Recommender systems suffer from various recommendation biases, which seriously hinder their development. In this light, a series of debiasing methods have been proposed, especially for the two most common biases, i.e., popularity bias and amplified subjective bias. However, existing debiasing methods usually concentrate on correcting a single bias. Such single-functionality methods neglect the bias-coupling issue, in which recommended items are collectively attributed to multiple biases. Besides, previous work cannot cope with the lack of supervised signals caused by sparse data, which has become commonplace in recommender systems. In this work, we introduce a disentangled debias variational auto-encoder framework (DB-VAE) to address the single-functionality issue, as well as a counterfactual data enhancement method to mitigate the adverse effect of data sparsity. Specifically, DB-VAE first extracts two types of extreme items, each affected by only a single bias, based on the collider theory; these are respectively employed to learn the latent representations of the corresponding biases, thereby realizing bias decoupling. In this way, the unbiased user representation can be learned with the help of these decoupled bias representations. Furthermore, the data generation module employs Pearl's framework to produce massive counterfactual data, making up for the supervised signals that are lacking due to sparse data. Extensive experiments on

Introduction
Recommender systems (RS) help users find new items of personal preference from huge amounts of data, and are extensively employed in a myriad of on-the-fly applications, such as e-commerce [42,8], social networks [22] and advertisement [13]. In realistic applications, recommender systems suffer severely from some widespread biases [41,46,43] (e.g., the popularity bias [46,39] and the amplified subjective bias [36]), which make them incapable of accurately grasping users' personal preferences.
Fig. 1(a) provides an example of the popularity bias in the real-world ML-20M dataset [30]. As the left y-axis shows, unpopular items (fewer than 50 interactions) account for the vast majority of the items, yet they are rarely recommended by traditional recommendation models (e.g., the variational auto-encoder (VAE) based model [14], represented by the orange line on the right y-axis). Influenced by the popularity bias, the Matthew effect [35] and the cold-start problem [40] are becoming increasingly commonplace in recommender systems. On the other hand, the amplified subjective bias [32] over-recommends the historical preferences of users, misinterpreting their current needs. As shown in Fig. 1(b), given a user's interaction history with 70% comedies and 30% action movies, the positive feedback loop [6] in traditional recommenders exaggerates the user's comedy preference, raising its share to 90% of the recommended films. Even worse, the amplified subjective bias also triggers further issues in recommenders (e.g., filter bubbles [21] and echo chambers [9]), causing an ever-shrinking range of recommended items.
Recently, a line of debiasing models has been proposed to alleviate these biases. For the popularity bias, the most common strategy is item reweighting, which either highlights unpopular items [16] or reduces the influence of popular items [15,38]. For the amplified subjective bias, existing models generally concentrate on adjusting the recommendation distribution from different aspects, including fairness [20,31], diversity [5] and calibration [32]. Furthermore, some methods disentangle these biases by causal learning, e.g., the model-agnostic counterfactual reasoning framework (MACR) [39] or the deconfounded recommendation system (DecRS) [36], whose cause-effect strategy makes the debiasing process more reasonable and efficient.
Despite much progress, the above debiasing methods still exhibit some defects.
The first is single functionality. Existing methods are designed either for the popularity bias or for the amplified subjective bias. Such single-bias methods neglect the possibility of bias coupling, i.e., that an item is recommended because of both biases collectively. As a result, adjusting either bias alone is accompanied by excessive expression of the other bias. Further, these bias-coupled recommended items cannot represent any pure bias, which easily causes a debiasing shift due to the coupled supervised signals. Besides, previous debiasing work does not consider the data-sparsity problem in recommender systems, in which large numbers of users have few historical interactions (see the sparsity metric in Table 1). Insufficient interactions aggravate the difficulty of recognizing users' preferences and recommendation biases.
To tackle the coupled supervision issue brought by single functionality, we propose a disentangled debias variational auto-encoder framework (DB-VAE) that decouples these two biases to help obtain a debiased representation of user preferences. In particular, we first analyze the reasons behind the interactions between users and items, from which two direct causes of user click behavior are summarized: the item is popular, and the item matches the user's preference.
From this analysis, we design a causal graph of the user click behavior, which helps disentangle the popularity bias and the amplified subjective bias from the user's interaction profile [24]. By defining a data split criterion for the disentangled biases, two extreme types of items, each affected by only one bias, can be extracted to further train the debiasing process, thereby mitigating the adverse effect of coupled supervised signals. In addition, to address the data sparsity issue, we design a counterfactual data generation strategy that produces massive counterfactual data, making up for the supervised signals that are lacking due to sparse interactions. In this generation process, Pearl's causal inference framework [26] is employed to answer the question of how a user would behave if he or she were not affected by biases. Following the 'Abduction-Action-Prediction' process, we answer the counterfactual questions and generate the corresponding counterfactual data.
We evaluate DB-VAE on three real-world RS datasets to verify its effectiveness. Extensive experimental results show that DB-VAE outperforms state-of-the-art baselines with an average improvement of 5% in terms of NDCG and Recall.
Furthermore, generating counterfactual data can further enhance DB-VAE, especially on sparser datasets. Overall, the contributions of this work are threefold: 1. a disentangled debias variational auto-encoder that simultaneously eliminates the popularity bias and the amplified subjective bias by means of two extreme types of items; 2. a counterfactual data generation method that makes up for the supervised signals lacking due to data sparsity; 3. evaluations on three real-world datasets demonstrating the effectiveness and rationality of our proposed models.

Related work
In this section, we review existing work on recommendation debiasing, mainly concentrating on the popularity bias and the amplified subjective bias.

Popularity Bias in Recommendation
In recommender systems, it is a common problem that users' feedback is easily influenced by popular items. Recently, eliminating the popularity bias has received much attention in recommendation research. Specifically, methods targeting the popularity bias can be roughly divided into the following three types. The first line of debiasing research aims at reweighting the popular items during training. For instance, Rosenbaum and Rubin [29] first propose inverse propensity weighting (IPW) to adjust the importance of items according to their popularity. Inspired by this weighting strategy, Liang et al. [15] propose an improved method that imposes lower weights on popular items. The second line of debiasing research takes advantage of ranking adjustment to calibrate the popularity bias. For example, Abdollahpouri et al. [1] introduce a regularization-based method to promote the rank of unpopular items. In addition, Abdollahpouri et al. [2] adopt a re-ranking strategy to adjust the output ranking of the recommender system. The third line of debiasing research employs causal theory to alleviate the popularity bias, including causal representation learning [18,45] and causal adjustment [43,39].
Unlike the existing work, we argue that the popularity bias should not be targeted in isolation, because there is a complex coupled relationship between the popularity bias and the other biases. Hence, we introduce two extreme types of items to further train the debiasing process, thereby overcoming the blindness of debiasing brought by the coupled biases.

Amplified subjective bias
During training, the positive feedback loop forces the recommender system to amplify the user's historical preferences [6]. To correct such an amplified subjective bias, existing debiasing methods mainly explore the following three aspects, i.e., fairness, diversity and causal recommendation. For fairness in RS, Biega et al. [3] propose an amortized equity of attention, which ensures that similar individuals receive similar treatment. Morik et al. [20] argue that all user groups are supposed to be treated fairly, and thereby introduce a more granular way to maintain group fairness. Besides, Steck [32] re-ranks the items so that the distribution of the recommended item groups matches the user's interaction history. For diversity in RS, existing work pursues the dissimilarity of the recommended items [7,34], where similarity can be measured by item categories and embeddings [5,12]. Furthermore, causal recommendation aims to find the root cause of the amplified bias. For example, Wang et al. [36] first scrutinize the cause-effect factors of bias amplification, and then contribute an approximation operator to eliminate the amplified subjective bias by back-door adjustment.
However, modeling the amplified subjective bias alone easily causes serious filter bubbles [21] and echo chambers [9]. Hence, we consider the amplified subjective bias together with the other bias (mainly the popularity bias) during the debiasing process. In this light, our proposed model eliminates the two types of bias simultaneously to disentangle the complex coupled relationship between them.

Method
In this section, we first present the bias extraction in §3.1 to construct two extreme types of items, each affected by only a single bias. Then, we elaborate on the proposed DB-VAE model in §3.2, which helps accurately characterize the user's debiased latent representation and gives debiased predictions of the user's preferences.
Finally, we introduce a counterfactual data generation method in §3.3 that enhances DB-VAE to tackle the data sparsity issue.

Bias extraction
According to the causal theory [24], we design a causal graph on users' interactions in Fig. 2. Clearly, there exist two direct causes (Node B and Node M) of the users' click behavior (Node C), i.e., the popularity of a specific item and the matching degree between this item and the target user. Although these two causes are independent in essence (Node B and Node M are not directly connected), their collider (Node C) makes them correlated in the observed recommended items. We therefore aim to obtain click-behavior data that can be attributed to a sole cause. According to the colliding effect [25,26], two extreme conditions can be extracted from user interactions, which reflect the popularity bias and the amplified subjective bias, respectively. In particular, if a user clicks on an item that hardly matches his/her historical preferences, this click stems from the fact that the recommended item is very popular and the user is affected by conformity. Conversely, an unpopular item gets clicked owing to its compatibility with the user's personal preferences. To measure the item popularity and the matching degree between users and items, the popularity score $S_p(u, i)$ and the matching score $S_m(u, i)$ between the target user $u$ and the recommended item $i$ are respectively introduced (Eq. 1), where $X$ denotes the interaction profile containing all items the target user $u$ has clicked before and $num(i)$ is the click number of the item $i$ in the recommender system.
The popularity score $S_p(u, i)$ is defined as the ratio between the click number of the item $i$ and the sum of the click numbers of the items the user $u$ has clicked before, i.e., $S_p(u, i) = num(i) / \sum_{j \in X} num(j)$. Besides, $d_i$ and $d_u$ denote the category representations of item $i$ and user $u$ in the one-hot format, respectively. Specifically, given that there are five categories of movie tags in a movie dataset, namely 'Comedy', 'Action', 'Crime', 'Adventure' and 'Thriller', the dimension of the category embeddings can be set to 5. In this way, when a movie such as 'spider man' has the 'Comedy' and 'Action' tags, its category embedding can be represented as [1, 1, 0, 0, 0].
The user category embeddings are obtained by summing up the category embeddings of all items the user has clicked on. Next, we elaborate on the two extreme conditions in user interactions that yield the sole-cause items. The first consists of the items that are clicked due to conformity, which causes the popularity bias. So we select items that are very popular but rarely match the target user as $X_p$ (Eq. 3), where $rank(S_p(u, i))$ and $rank(S_m(u, i))$ represent the rankings of the popularity score and the matching score of the item $i$ among all items clicked by the target user $u$, respectively, and $k \in [0, 1]$ denotes the debias degree: a smaller $k$ means that fewer items are extracted from the user's interactions to construct $X_p$.
The second consists of the items that are clicked due to the user's interests, which amplifies the subjective bias during the training process and probably makes the recommender system fall into information cocoons [11]. Similarly, we select items that match the user's interests well, yet are rarely clicked by other users, as $X_m$ (Eq. 4).
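The extraction of the two extreme item sets can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the cosine form of the matching score, the rank-threshold rule built from $k$, and all function names (`popularity_scores`, `matching_scores`, `ranks_desc`, `extract_bias_items`) are hypothetical, since the exact expressions of Eq. 1, Eq. 3 and Eq. 4 are not recoverable from the text. Only the popularity ratio and the summed user category vector follow the prose directly.

```python
import numpy as np

def popularity_scores(clicked, num_clicks):
    """S_p(u, i): clicks on item i over the total clicks of the user's clicked items."""
    counts = np.array([num_clicks[i] for i in clicked], dtype=float)
    return counts / counts.sum()

def matching_scores(clicked, item_cats):
    """S_m(u, i): cosine similarity between d_i and d_u (ASSUMED form)."""
    d_items = np.array([item_cats[i] for i in clicked], dtype=float)
    d_user = d_items.sum(axis=0)  # d_u: sum of the clicked items' category vectors
    norms = np.linalg.norm(d_items, axis=1) * np.linalg.norm(d_user)
    return d_items @ d_user / np.maximum(norms, 1e-12)

def ranks_desc(scores):
    """Rank 0 is assigned to the highest score."""
    order = np.argsort(-scores, kind="stable")
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(len(scores))
    return ranks

def extract_bias_items(clicked, num_clicks, item_cats, k=0.25):
    """Return (X_p, X_m) with a rank-threshold rule (ASSUMED form of Eq. 3 and Eq. 4)."""
    n = len(clicked)
    top = max(1, int(np.ceil(k * n)))  # how many ranks count as "extreme"
    rank_p = ranks_desc(popularity_scores(clicked, num_clicks))
    rank_m = ranks_desc(matching_scores(clicked, item_cats))
    # X_p: among the most popular and the worst matched items of this user
    x_p = [clicked[j] for j in range(n) if rank_p[j] < top and rank_m[j] >= n - top]
    # X_m: among the best matched and the least popular items of this user
    x_m = [clicked[j] for j in range(n) if rank_m[j] < top and rank_p[j] >= n - top]
    return x_p, x_m

# Toy profile over the five example categories:
# Comedy, Action, Crime, Adventure, Thriller
clicked = ["a", "b", "c", "d"]
num_clicks = {"a": 1000, "b": 10, "c": 200, "d": 300}
item_cats = {
    "a": [0, 0, 0, 0, 1],
    "b": [1, 0, 0, 0, 0],
    "c": [1, 1, 0, 0, 0],
    "d": [1, 0, 1, 1, 0],
}
x_p, x_m = extract_bias_items(clicked, num_clicks, item_cats, k=0.25)
print(x_p, x_m)
```

In this toy profile, item "a" is by far the most popular but shares almost no category with the user, while item "b" is rarely clicked overall yet matches the user's dominant category, so they land in $X_p$ and $X_m$ respectively.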

Debias variational auto-encoder
Variational auto-encoder (VAE) is a deep latent variable model [14,28]. Through its encoder-decoder structure, VAE can effectively learn latent representations from real-world data. Different from the conventional auto-encoder (AE), VAE tries to ensure that the latent space is regular enough by introducing a distribution regularization during training. In this way, VAE can alleviate the over-fitting problem of the conventional AE. In view of these merits, we select VAE as the backbone of DB-VAE in the hope that a standardized feature representation of users can be learned during the debiasing process.
We present the framework of DB-VAE in Fig. 3. As shown there, all interacted items of the target user $X$, the items $X_p$ attributed to the popularity bias and the items $X_m$ attributed to the amplified subjective bias are taken as the input of VAE. We then employ the one-hot embedding to encode $X$, $X_m$ and $X_p$, obtaining the corresponding embeddings $x$, $x_m$ and $x_p$. These one-hot embeddings are further fed into the VAE encoder to generate their latent representations $z_x$, $z_m$ and $z_p$, where $\phi$ denotes the parameter set of the VAE encoder. Here, we select a 5-layer neural network as the VAE encoder. Furthermore, $z_p$ and $z_m$ can be treated as the user representations influenced by the popularity bias and the amplified subjective bias, respectively. To reconstruct a user latent representation vector that is not affected by these two biases, we obtain the unbiased user representation by subtracting $z_p$ and $z_m$ from $z_x$, i.e., $z = z_x - z_p - z_m$ (Eq. 6).
Different from the conventional auto-encoder, VAE assumes that the latent representation follows a prior standard normal distribution, from which a random sample can be decoded to real-world data $x$. However, the random sampling operation is not differentiable, which makes the model untrainable. In this light, we adopt a normal distribution $N(\mu, \sigma^2)$ as the posterior distribution generated by the encoder, and pull it toward the standard normal prior through the KL divergence constraint during model training.
Formally, the output of the VAE encoder is subject to $p_\phi(z_x \mid x) = N(\mu_x, \sigma_x^2)$, where $\mu_x$ and $\sigma_x$ are computed from $x$ as in Eq. 8, in which $f_\phi(\cdot)$ is a multilayer perceptron in the VAE encoder, and $g_1(\cdot)$ and $g_2(\cdot)$ are two different fully connected layers of the VAE encoder. Analogously, the posterior distributions $p_\phi(z_m \mid x_m)$ and $p_\phi(z_p \mid x_p)$ are generated in the same way. VAE takes the representation vectors $z_x$, $z_m$ and $z_p$ as samples from the generated distributions $N(\mu_x, \sigma_x^2)$, $N(\mu_m, \sigma_m^2)$ and $N(\mu_p, \sigma_p^2)$. To keep this operation differentiable, the re-parameterization trick is adopted, which can be formulated as $z_x = \mu_x + \sigma_x \odot \epsilon$ (and analogously for $z_m$ and $z_p$), where $\epsilon$ is a vector sampled from the multivariate standard normal distribution $N(0, I)$ and $\odot$ is the element-wise product. In this way, sampling from $N(\mu_x, \sigma_x^2)$ is transformed into sampling from $N(0, I)$. Hence, gradient descent involves only the result of sampling rather than the sampling operation itself, which makes the whole model trainable. Under this sampling, the debiased user latent representation $z$ in Eq. 6 can be rewritten accordingly. With these latent representations, the final prediction score $\hat{x}$ and the prediction scores of the two biases ($\hat{x}_p$ and $\hat{x}_m$) are decoded by the VAE decoder, where $\theta$ denotes the parameter set of the decoder.
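The re-parameterization step and the subtraction of the bias representations can be illustrated with a small NumPy sketch. It is only an illustration: the function names `reparameterize` and `debiased_latent` are hypothetical, the toy means and standard deviations stand in for encoder outputs, and drawing an independent noise vector for each of the three representations is an assumption that the text does not settle.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)  # the only stochastic node
    return mu + sigma * eps              # element-wise product

def debiased_latent(z_x, z_p, z_m):
    """Subtract both bias representations from the full user representation."""
    return z_x - z_p - z_m

rng = np.random.default_rng(0)
dim = 4  # toy latent size; the experiments in the paper use 200

# Toy encoder outputs (stand-ins for the mean and standard deviation heads)
mu_x, sigma_x = np.ones(dim), np.full(dim, 0.1)
mu_p, sigma_p = np.full(dim, 0.3), np.full(dim, 0.1)
mu_m, sigma_m = np.full(dim, 0.2), np.full(dim, 0.1)

z_x = reparameterize(mu_x, sigma_x, rng)
z_p = reparameterize(mu_p, sigma_p, rng)
z_m = reparameterize(mu_m, sigma_m, rng)
z = debiased_latent(z_x, z_p, z_m)
print(z.shape)
```

Because the noise enters only through `eps`, the sampled vectors are deterministic functions of the mean and standard deviation for a fixed draw, which is what allows gradients to reach the encoder parameters in an automatic-differentiation framework.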
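Following the VAE training, the encoder parameter set $\phi$ and the decoder parameter set $\theta$ can be learned by maximizing the evidence lower bound (ELBO) [4], $\mathbb{E}_{p_\phi}[\log q_\theta(\hat{x} \mid z_x)] - KL(p_\phi(z_x \mid x, x_p, x_m) \,\|\, p(z_x))$, whose first term is a reconstruction expectation and whose second term is a KL divergence. In the additional loss terms built on the bias-extracted data, $one\text{-}hot(\cdot)$ is the one-hot embedding. All in all, the final loss jointly combines these terms, where $\omega_m$ and $\omega_p$ are weight parameters.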
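A minimal numerical sketch of such a joint objective is given below. It should be read as one plausible instantiation rather than the paper's exact loss: the softmax cross entropy for reconstruction, the additive combination with the weights `w_p` and `w_m`, and the function names are assumptions. The closed-form KL divergence between a diagonal Gaussian and $N(0, I)$ is standard.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    var = sigma ** 2
    return 0.5 * np.sum(mu ** 2 + var - 1.0 - np.log(var))

def cross_entropy(logits, target):
    """Negative log-likelihood of a multi-hot target under softmax(logits)."""
    shifted = logits - logits.max()  # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -np.sum(target * log_probs)

def db_vae_loss(logits, x, logits_p, x_p, logits_m, x_m, mu, sigma,
                w_p=1.0, w_m=1.0):
    """Negative ELBO on x plus weighted bias terms (ASSUMED combination)."""
    neg_elbo = cross_entropy(logits, x) + kl_to_standard_normal(mu, sigma)
    bias_terms = (w_p * cross_entropy(logits_p, x_p)
                  + w_m * cross_entropy(logits_m, x_m))
    return neg_elbo + bias_terms

# Toy example with four items and a two-dimensional latent space
x = np.array([1.0, 1.0, 0.0, 1.0])    # all clicked items
x_p = np.array([1.0, 0.0, 0.0, 0.0])  # item attributed to popularity
x_m = np.array([0.0, 0.0, 0.0, 1.0])  # item attributed to matching
logits = np.zeros(4)                  # uninformative decoder output
loss = db_vae_loss(logits, x, logits, x_p, logits, x_m,
                   mu=np.zeros(2), sigma=np.ones(2))
print(loss)
```

Minimizing this quantity corresponds to maximizing the ELBO while also asking the decoder to reproduce the two bias-only item sets.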

Counterfactual data generation
Intuitively, the direct way to mitigate the adverse effect of sparse data is to increase the data volume. To this end, we employ Pearl's causal inference framework [26] to generate counterfactual data; this framework consists of a three-step procedure, namely abduction, action (intervention) and prediction.
Following this procedure, we first construct a basic model $F$ that abides by the causal graph proposed in Fig. 2. Different from the causal graph, two exogenous variables $\alpha$ and $\beta$ are introduced to make the counterfactual prediction work.
Specifically, the exogenous variables $\alpha$ and $\beta$ describe uncertainties that affect the matching degree in node M and the popularity attribute in node B, respectively.
For example, Santa movies are probably more popular around Christmas, revealing that the extra factor of date potentially influences the popularity. This indicates that some uncertainties affect the data generation process. Hence, the basic model $F$ (as shown in Fig. 4) is defined in a stochastic manner to account for randomness and possibly noisy data.

Figure 4: The basic model F over the nodes U, I, M, B and C.
Formally, the basic model $F$ is formulated according to the inference process in Fig. 4, and its distributions $p_M$, $p_B$ and $p_C$ are detailed in Eq. (16). There, $E_u$ and $E_i$ are the embeddings of user $u$ and item $i$, respectively; $w^B_i$ and $w^M_i$ are weighting parameters learned during the training process; and $S_m(u, i)$ and $S_p(u, i)$ are defined in Eq. (1).

With the observed dataset $O$, the basic model $F$ can be learned by the cross entropy loss on the click prediction module $C(u, i)$, where $x(u, i)$ is the ground truth indicating whether the user $u$ interacted with the item $i$. To ensure randomness during the model training, $\alpha$ and $\beta$ are subject to the multivariate standard normal distribution $N(0, I)$. Since the distribution of the observed data is highly influenced by $\alpha$ and $\beta$, the posterior distributions differ across datasets. Once we have learned $F$, we can follow Pearl's abduction-action-prediction framework [10] to generate counterfactual data, further strengthening the DB-VAE training.
In particular, the main objective of the abduction process is to estimate the posteriors of $\alpha$ and $\beta$ from the observed dataset $O$. Taking $\alpha$ as an example, its posterior $p(\alpha \mid O)$ can be computed by the Bayes rule. Unfortunately, the exact expression of this posterior is unknown and too complex to be sampled from, and variational inference [4] offers a good way to approximate it. Specifically, $\alpha$ is assumed to follow a Gaussian distribution $q_\phi(\alpha) \sim N(\mu, \sigma)$, where $\mu$ and $\sigma$ are learnable parameters. By minimizing the KL divergence between $q_\phi(\alpha)$ and $p(\alpha \mid O)$, the optimal $\mu$ and $\sigma$ can be obtained. This training process can be formulated as maximizing the evidence lower bound (ELBO) [4]. Analogously, the distribution $p(\beta \mid O)$ can be learned in a similar way.
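For reference, the abduction objective can be written out in the standard variational form. This is a sketch of the textbook derivation rather than the paper's exact equations: the likelihood $p(O \mid \alpha)$ and the use of the standard normal prior $p(\alpha) = N(0, I)$ inside the KL term are assumptions consistent with the surrounding text.

```latex
\begin{align*}
p(\alpha \mid O) &= \frac{p(O \mid \alpha)\, p(\alpha)}{p(O)}, \\
\log p(O) &= \underbrace{\mathbb{E}_{q_\phi(\alpha)}\big[\log p(O \mid \alpha)\big]
  - \mathrm{KL}\big(q_\phi(\alpha)\,\|\,p(\alpha)\big)}_{\text{ELBO}}
  + \mathrm{KL}\big(q_\phi(\alpha)\,\|\,p(\alpha \mid O)\big).
\end{align*}
```

Since $\log p(O)$ does not depend on the variational parameters $\mu$ and $\sigma$, maximizing the ELBO is equivalent to minimizing $\mathrm{KL}(q_\phi(\alpha)\,\|\,p(\alpha \mid O))$, which is why the two formulations in the text coincide.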
In the action step, we aim to figure out three counterfactual distributions $p_C(C \mid do(M), B)$, $p_C(C \mid M, do(B))$ and $p_C(C \mid do(M), do(B))$. These three distributions answer three counterfactual questions, respectively: 1) which items would a user interact with if he/she were not affected by the amplified subjective bias? 2) which items would a user like if he/she were not affected by the popularity bias? and 3) what would the user's interaction behavior be if he/she were affected by neither bias? Besides, $S_m(u, i)$ is assumed to follow the standard normal distribution $N(0, 1)$, and the $do(M)$ operation is realized by sampling $S_m(u, i)$ from $N(0, 1)$. Similarly, the operation $do(B)$ can be achieved by sampling $S_p(u, i)$ from $N(0, 1)$.
In the prediction step, we employ the definition in Eq. (16) to generate the three types of counterfactual distributions $p_C(C \mid do(M), B)$, $p_C(C \mid M, do(B))$ and $p_C(C \mid do(M), do(B))$. By selecting the top-$N$ items from these counterfactual distributions, we can construct the counterfactual data $X^{counter}$, $X^{counter}_m$ and $X^{counter}_p$.
The enhanced data can be obtained by combining the counterfactual data with the factual data. By retraining DB-VAE with these enhanced data, the debiasing difficulty caused by data sparsity can be mitigated. In this work, we set $N = 100$.
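The action and prediction steps can be sketched as follows. This is a deliberately simplified illustration: the sigmoid click model in `click_scores` is hypothetical and stands in for the learned $p_C$ of Eq. (16), the abducted exogenous variables $\alpha$ and $\beta$ are omitted, and taking the set union in `enhance` is an assumption about how factual and counterfactual data are combined. Only the $do(\cdot)$ operations (sampling the scores from $N(0, 1)$) and the top-$N$ selection follow the text.

```python
import numpy as np

def click_scores(s_m, s_p, w_m, w_b):
    """HYPOTHETICAL click model: sigmoid of a weighted sum of the two causes."""
    return 1.0 / (1.0 + np.exp(-(w_m * s_m + w_b * s_p)))

def counterfactual_top_n(s_m, s_p, w_m, w_b, n, rng, do_m=False, do_b=False):
    """Top-n item indices after optionally intervening on M and/or B."""
    if do_m:  # do(M): replace the matching scores by N(0, 1) samples
        s_m = rng.standard_normal(s_m.shape)
    if do_b:  # do(B): replace the popularity scores by N(0, 1) samples
        s_p = rng.standard_normal(s_p.shape)
    scores = click_scores(s_m, s_p, w_m, w_b)
    return np.argsort(-scores, kind="stable")[:n]

def enhance(factual, counterfactual):
    """Combine factual and counterfactual items (set union is an ASSUMPTION)."""
    return sorted(set(int(i) for i in factual) | set(int(i) for i in counterfactual))

rng = np.random.default_rng(0)
s_m = np.array([0.9, 0.1, 0.5, 0.3])  # toy matching scores of four items
s_p = np.array([0.1, 0.8, 0.2, 0.4])  # toy popularity scores of four items

factual = counterfactual_top_n(s_m, s_p, 1.0, 1.0, n=2, rng=rng)
cf_both = counterfactual_top_n(s_m, s_p, 1.0, 1.0, n=2, rng=rng,
                               do_m=True, do_b=True)
print(enhance(factual, cf_both))
```

Setting only `do_m=True` or only `do_b=True` gives the other two counterfactual variants; the paper selects the top 100 items, whereas the toy call above uses `n=2`.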

Experiments
In this section, we conduct experiments to evaluate the effectiveness of our proposed DB-VAE. To better guide our experimental analyses, we introduce four research questions (RQs), which are shown as follows:
• RQ1 Does our DB-VAE framework outperform other debiasing methods?
• RQ2 How do different debiasing components contribute to the performance? Can counterfactual data help improve the performance of DB-VAE?
• RQ3 How does DB-VAE perform in the item-user groups with different data sparsities?
• RQ4 How does the debiasing threshold value k affect the performance of DB-VAE?
We first introduce the experimental settings in §4.1, which include the information about datasets, experimental setup, evaluation metrics and baseline models. Then, we elaborate on the experimental results and present some analyses in §4.2, answering the proposed research questions in turn.
In this paper, we conduct experiments on three real-world recommendation datasets, including MovieLens [27,44], AliShop-7C [19] and Amazon-book [33]. Specifically, MovieLens is a widely used dataset collected from the MovieLens website, which includes movie category information and user attribute information. Given the different data sizes in MovieLens, we select ML-1M and ML-20M as our research datasets in this paper. AliShop-7C is collected from Taobao (Alibaba's e-commerce platform). Amazon-book is one of the Amazon product datasets [33], which records how users rate different books on Amazon.
All datasets used in our experiments involve enough item features, e.g., the movie genre or the book category. More specifically, in each dataset we filter out items with fewer than 5 interactions and users with fewer than 2 interactions to ensure data quality. In addition, we list the information of all the datasets after pre-processing in Table 1. From this table, we can find that Amazon-book is the sparsest dataset but has the largest number of categories, which may pose great difficulty for model debiasing.

Experimental setup
To evaluate the model's ability to construct the user representation from the observed interactions, we take all interactions of a specific user as a data instance. Specifically, we split all users into training/validation/test sets, and the entire click histories of the training users are employed to train the models.
The '#held-out users' entry of Table 1 indicates the number of validation/test users.
For each validation or test user, 80% of the interactions are used as the input of the different models, while the remaining 20% are used to evaluate the model. For the encoder of all models (i.e., $f_\phi(\cdot)$ in Eq. 8), we choose a three-layer perceptron. The decoder structure is the same as that of the encoder for the sake of symmetry. Following the dimension setting in previous work [30], we set the dimensions of the latent representation and of any hidden layer to 200 and 600, respectively. During model training, we employ the Adam [13] optimizer with a batch size of 500 users and apply a weight decay of 0.01. We retain the models with the best NDCG@100 on the validation set and evaluate them on the test set.
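The held-out protocol for a single validation or test user can be sketched as follows. The function name `split_heldout_user` is hypothetical, and choosing the 80% input part uniformly at random is an assumption, since the text does not state how the two parts are selected.

```python
import numpy as np

def split_heldout_user(items, rng, input_frac=0.8):
    """Split one held-out user's items into a model-input part and an evaluation part."""
    items = np.asarray(items)
    perm = rng.permutation(len(items))          # random order (ASSUMED)
    cut = int(round(input_frac * len(items)))   # size of the input part
    return items[perm[:cut]], items[perm[cut:]]

rng = np.random.default_rng(0)
history = np.arange(100, 110)                   # ten toy item ids
fold_in, held_out = split_heldout_user(history, rng)
print(len(fold_in), len(held_out))
```

The model sees only `fold_in` when building the user representation, and its ranked predictions are then scored against `held_out`.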

Metrics
We adopt two classical ranking metrics in recommender systems as the evaluation metrics, namely Recall@K and NDCG@K [23,37], which are well suited to user data that never appeared in the training set [36]. By comparing the top-K predictions with the ground-truth items of user $u$ in the test set, $X^t_u$, Recall@K and NDCG@K can be obtained.
In particular, we first sort the items in descending order of the predicted likelihood scores, and then select the top-K items with the highest scores to form a recommendation list $R_u$. Recall@K(u) is then defined with the help of $I[\cdot]$, an indicator function that returns 1 if the condition is satisfied.
NDCG@K(u) is defined as DCG@K(u) divided by its maximum theoretically possible value [30].
Generally, Recall@K is larger than NDCG@K under the same K, because Recall@K treats all items in the top-K recommendation list equally, while NDCG@K assigns larger weights to the top-ranked items. In this light, we select a larger K for NDCG@K than for Recall@K, as previous work does [30,17]. Specifically, Recall@20 and NDCG@100 are adopted as evaluation metrics in our experiments.
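Both metrics can be computed for a single user as in the sketch below. The details are assumptions based on common practice in the VAE-based recommendation literature rather than the paper's lost equations: Recall@K is normalized by min(K, number of held-out items), gains are binary, and the discount is logarithmic in the rank. The held-out set is assumed to be non-empty, and the function names are hypothetical.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k highest-scored items, best first."""
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable")[:k]

def recall_at_k(scores, relevant, k):
    """Hits in the top-k list divided by min(k, |relevant|) (ASSUMED normalization)."""
    hits = sum(1 for i in top_k(scores, k) if int(i) in relevant)
    return hits / min(k, len(relevant))

def ndcg_at_k(scores, relevant, k):
    """DCG@k with binary gains divided by the best achievable DCG@k."""
    ranked = top_k(scores, k)
    discounts = 1.0 / np.log2(np.arange(2, len(ranked) + 2))
    gains = np.array([1.0 if int(i) in relevant else 0.0 for i in ranked])
    dcg = float(np.sum(gains * discounts))
    ideal_hits = min(k, len(relevant))
    idcg = float(np.sum(1.0 / np.log2(np.arange(2, ideal_hits + 2))))
    return dcg / idcg

scores = [0.1, 0.8, 0.3, 0.9]  # predicted scores for items 0..3
held_out = {0, 1}              # ground-truth items of the test user
print(recall_at_k(scores, held_out, k=2), ndcg_at_k(scores, held_out, k=2))
```

In the toy example the top-2 list is [3, 1], so only one of the two held-out items is retrieved, and it sits at the second position, where its gain is discounted.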

Baselines
To verify the effectiveness of DB-VAE, we select several recent competitive recommendation models as the baselines, which can be divided into two groups. The first group consists of well-designed VAE models [30,17] without any debiasing operation. The second group consists of model-agnostic debiasing models [36,39], which aim at eliminating either the popularity bias or the amplified subjective bias. More specifically:
• Mult-VAE [17] extends the variational autoencoder to learning from implicit feedback in top-k RS. By assuming that the user representation follows the multivariate normal distribution, Mult-VAE introduces a different regularization parameter for the learning objective, achieving considerable performance.
• RecVAE [30] improves Mult-VAE with several optimization methods, including a novel composite prior distribution for the latent representation, a better approach to setting the weight in the evidence lower bound, and a method of alternately updating parameters.
• MACR [39] is a model-agnostic framework that aims at eliminating the popularity bias in RS. MACR analyzes the cause-effect relations and introduces a multi-task learning method to answer the counterfactual question of what the ranking score would be if the model only used item properties. To compare the model performance fairly, we keep the backbone model of MACR the same as that of DB-VAE.
• DecRS [36] contributes an approximation operator for back-door adjustment, which concentrates on removing the obstacles in causal reasoning theory and thereby eliminates the amplified subjective bias. Similar to MACR, we keep the backbone model of DecRS the same as that of DB-VAE.

Experimental results and analysis
This subsection answers the proposed research questions in turn through recommendation experiments with all discussed models on the selected datasets.

Overall performance.
For RQ1, we summarize the overall performance of our proposed DB-VAE as well as the selected baselines in terms of Recall@20 and NDCG@100 in Table 2.
Generally, for any evaluation metric on any dataset, DB-VAE consistently shows state-of-the-art performance. The detailed observations are presented as follows:
• In all cases, our disentangled debias framework boosts the performance of VAE models most obviously, suggesting that eliminating biases in the recommender system can be a good way to improve the recommendation performance.
• Generally, all discussed models perform worse on the datasets with a smaller sparsity value, except on ML-1M. This finding reflects that sparser data makes wrong predictions more likely. On the other hand, the exception ML-1M has denser data than ML-20M, yet the models do not achieve better performance on ML-1M. Such a contrast may be attributed to the fact that the data volume is a more influential factor in the recommendation performance than the data sparsity. In other words, for the same data source, the data volume of ML-20M is 20 times that of ML-1M, which substantially makes up for the supervised signals missing due to data sparsity. This analysis is also applicable to the comparison between Alishop-7c and ML-20M. Specifically, despite the same data sparsity, the data volume of ML-20M is greater than that of Alishop-7c, and hence all discussed models perform better on ML-20M than on Alishop-7c.
In Table 2, R@20 and N@100 are short for Recall@20 and NDCG@100, respectively.
Further, the generated counterfactual data can improve the performance of DB-VAE on all datasets except ML-1M, especially on amazonbook, where DB-VAE(CD) achieves 6% and 10% improvements over the original DB-VAE in terms of Recall@20 and NDCG@100, respectively. Such performance gains indicate that counterfactual data is well suited to the data-sparse scenario, yet has an adverse effect in the data-intensive scenario. These phenomena can be attributed to the fact that counterfactual data helps supplement the lacking supervised signals when data is sparse; however, for data with sufficient interactions, adding counterfactual data probably changes the distribution of the original data, which harms the model performance.

Performance on different sparsity groups
To further evaluate the models' performance under different sparsity degrees (i.e., RQ3), we sort the test users in ascending order of sparsity degree (i.e., the number of interacted items) and then divide them equally into eight groups. In addition, we only test the debiasing models in this subsection, including MACR, DecRS, DB-VAE and DB-VAE(CD). Their group performance is summarized in Fig. 6. Clearly, with the increase of the sparsity value, all discussed models present an overall upward trend, indicating that data sparsity is an inherent problem that affects the recommendation performance and the smaller
The debias degree value $k$ in Eq. (3) and Eq. (4) controls the number of items extracted from the user's profile. The smaller $k$ is, the fewer items we extract for debiasing. In this section, to answer RQ4, we evaluate how $k$ affects

Conclusion
In this work, we explain the two most common biases in recommender systems, the popularity bias and the amplified subjective bias. To alleviate these biases, we propose a disentangled debias variational auto-encoder framework, which overcomes the shortcomings of other debiasing methods, namely that they target a single bias and have no debiased supervisory signals. In the debiasing process, we make use of the relevant theory of causal inference, which helps us find the user interactions that may be biased. Extensive experiments validate that our proposed DB-VAE outperforms other debiasing methods. We also design a data enhancement method to help model training when data is sparse.
In future work, we will try to find the relationships among other types of biases and construct a unified and effective debiasing framework.

Figure 2: Causal graph on users' interactions. Node C denotes the click behavior of the user. Node B and Node M represent the popularity of the specific item and the matching degree between the item (Node I) and the user (Node U), respectively.
In the ELBO, the first term $\mathbb{E}_{p_\phi}[\log q_\theta(\hat{x} \mid z_x)]$ narrows the gap between $\hat{x}$ and $x$, and hence the cross entropy is employed to achieve this goal; the second term $KL(p_\phi(z_x \mid x, x_p, x_m) \,\|\, p(z_x))$ represents the KL divergence between the posterior distribution $p_\phi(z_x \mid x, x_p, x_m)$ and the prior distribution $p(z_x)$, with the assumption $p(z_x) \sim N(0, I)$. By optimizing the negative KL divergence, DB-VAE pushes the debiased user feature representation $z_x$ to adhere to the standard normal distribution. To overcome the issue of coupled supervised signals in the previous VAE training, we employ the bias-extracted data $X_p$ and $X_m$ to further train the VAE decoder, ensuring the accurate generation of $z_m$ and $z_p$.

Figure 5: Model performance on different sparsity groups in terms of Recall@20.

Figure 6: Model performance on different sparsity groups in terms of Recall@20.

Table 1: Statistics of datasets, where the sparsity metric is calculated by dividing the number of actual interactions by the product of the user and item numbers. The held-out users entry is the number of validation/test users out of the total number of users in the first row.

Table 2: Performance of DB-VAE as well as the selected baselines in terms of Recall@20 and NDCG@100. The bold font indicates the best performer in each column.

Table 3: Performance of DB-VAE variants, including replacing/removing a specific component and adding a counterfactual data enhancement. The bold font indicates the best performer