
A bias detection tree approach for detecting disparities in a recommendation model’s errors


Many current recommendation systems are considered to be black boxes that are tuned to optimize some global objective function. However, their error distribution may differ dramatically among different combinations of attributes, and such algorithms may propagate hidden data biases. Identifying potential disparities in an algorithm’s functioning is essential for building recommendation systems in a fair and responsible way. In this work, we propose a model-agnostic technique to automatically detect the combinations of user and item attributes correlated with unequal treatment by the recommendation model. We refer to this technique as the Bias Detection Tree. In contrast to existing work in this field, our method automatically detects disparities related to combinations of attributes without any a priori knowledge about protected attributes, assuming that relevant metadata is available. Our results on five public recommendation datasets show that the proposed technique can identify hidden biases in terms of four kinds of metrics for multiple collaborative filtering models. Moreover, we adapt a minimax model selection technique to control the trade-off between the global and the worst-case optimizations and improve the recommendation model’s performance for biased attributes.

Author information


Corresponding author

Correspondence to Joanna Misztal-Radecka.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Computational complexity

In this appendix, we analyze the computational complexity of the proposed algorithm and measure its execution time experimentally.

Appendix A.1: Theoretical analysis

In the complexity analysis, let L denote the number of attributes, N the number of ratings, and D the maximum depth of the tree (\(1 \le D \le L\)). Next, let \(n_k\) be the number of subsets of \(T_k\) with a significant difference in the error metric distribution (\(p<\alpha _{{\textit{merge}}}\)). Without loss of generality, let us assume that the number of attribute values \(|T_k|\) and \(n_k\) are equal for all attributes k:

$$\begin{aligned} |T_{k_1}|=|T_{k_2}|=\cdots =|T_{k_L}|=K, \quad n_{k_1}=n_{k_2}= \cdots = n_{k_L}=n_k \end{aligned} \tag{23}$$

Then \(K-n_k\) is the number of iterations in the merge phase for an attribute k. Since the complexity of the split phase is linear, O(L), we consider only the merge phase in the analysis. Assuming that the complexity of operations in each node is equal (Eq. 23), the overall complexity of the BDT algorithm can be calculated as:

$$\begin{aligned} A(L,K) = \sum _{p \in P}F(p)=|P|F(p) \end{aligned}$$

where F(p) denotes the complexity of the merge phase for a non-leaf tree node p that is split on attribute \(x_k\) with values \(T_k=\{k_1, \ldots , k_K\}\), and P denotes the set of all parent nodes in the tree.

In each iteration of the merge phase, the significance test is computed for all pairs of attribute values from \(T_k\) for each k. Then the pair \(k_m, k_n\) with the least significant difference is merged (if \(p>\alpha _{{\textit{merge}}}\)). Hence, the algorithm compares K single values in the first step (\(i=1\)), and after the pair \(k_{m,n}\) is merged, the number of values processed in step \(i=2\) is \(|T_k^1|=|T_k|-1=K-1\). Consequently, \(K_i=K-(i-1)\) values are compared in the i-th iteration, until no further merges can be performed (\(p<\alpha _{{\textit{merge}}}\) for all combinations). The difference in error distribution is tested for each pair of attribute values, resulting in \(K_i^2\) operations in each iteration. Accordingly, the complexity of the merge stage for one attribute in a node is given by:

$$\begin{aligned} \sum _{i=1}^{K-n_{k}}(K-(i-1))^2 \end{aligned} \tag{25}$$
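
The merge loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the significance test is a stand-in (a Welch-type statistic on group means with a normal approximation), and the names `p_value`, `merge_phase`, and `errors_by_value` are hypothetical. Each group needs at least two error values for the variance to be defined.

```python
import math
from itertools import combinations
from statistics import mean, variance


def p_value(a, b):
    """Two-sided p-value for a difference in means (normal approximation, assumed test)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        return 1.0 if mean(a) == mean(b) else 0.0
    z = abs(mean(a) - mean(b)) / se
    return math.erfc(z / math.sqrt(2))


def merge_phase(errors_by_value, alpha_merge=0.05):
    """Greedily merge attribute values whose errors do not differ significantly.

    errors_by_value maps each attribute value to a list of per-rating errors.
    Returns the merged groups and the number of pairwise tests performed.
    """
    groups = {(value,): list(errs) for value, errs in errors_by_value.items()}
    tests = 0
    while len(groups) > 1:
        # Test every pair of current groups; remember the least significant pair.
        best_pair, best_p = None, -1.0
        for g1, g2 in combinations(groups, 2):
            p = p_value(groups[g1], groups[g2])
            tests += 1
            if p > best_p:
                best_pair, best_p = (g1, g2), p
        if best_p <= alpha_merge:
            break  # all remaining pairs differ significantly
        g1, g2 = best_pair
        groups[g1 + g2] = groups.pop(g1) + groups.pop(g2)
    return groups, tests


errors = {
    "a": [0.10, 0.20, 0.15, 0.12],
    "b": [0.11, 0.19, 0.16, 0.13],
    "c": [0.90, 1.10, 1.00, 0.95],
}
groups, tests = merge_phase(errors)
print(sorted(groups), tests)
```

The sketch tests \(K_i(K_i-1)/2\) unordered pairs in iteration i, which is bounded by the \(K_i^2\) operations used in the analysis.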

The number of nodes on tree level \(d, 1\le d \le D\) is equal to \(n_k^{d-1}\) and the number of non-leaf nodes in the tree of depth D can be calculated as:

$$\begin{aligned} |P|= \sum ^{D}_{d=1}n_k^{d-1} = n_k^0+n_k^1+\cdots + n_k^{D-1} \end{aligned} \tag{26}$$

For a simple mean-based significance test, the comparison of the error metric \(\sigma _k^2\) between two attribute values \(k_i, k_j \in T_k\) requires N operations, where N is the number of samples.

Since the number of attributes available for a node on level d is \(L-d+1\), the number of operations for the merge stage in one node on level d is given by (from Formula 25):

$$\begin{aligned} F(p)=N(L-d+1)\sum _{i=1}^{K-n_{k}}(K-(i-1))^2 \end{aligned} \tag{27}$$
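
As a short worked step, the cubic dependence on K can be seen by bounding the sum in F(p). Substituting \(j=K-(i-1)\) (an index introduced here only for illustration) gives:

$$\begin{aligned} \sum _{i=1}^{K-n_{k}}(K-(i-1))^2 = \sum _{j=n_k+1}^{K} j^2 \le \sum _{j=1}^{K} j^2 = \frac{K(K+1)(2K+1)}{6} = O(K^3) \end{aligned}$$

Since \(L-d+1 \le L\), it follows that the merge stage in a single node costs at most \(O(NLK^3)\) operations.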

Hence, the overall complexity can be bounded by (based on Formulas 26 and 27):

$$\begin{aligned} A(L,K)=\sum ^{D}_{d=1}n_k^{d-1} F(p)=O(n_k^{D-2}LK^3N) \end{aligned}$$

To be more precise, let us consider the computational complexity of the BDT method in the two extreme cases:

  1. Optimistic case

    In the optimistic case, there are no significant differences in the error metric distribution among the parameter values for any attribute (as in the first simulated example):

    $$\begin{aligned} \forall _{k \in 1 \ldots L, k_i, k_j \in T_k}: p_{k_i, k_j}>\alpha _{{\textit{merge}}}, n_k=1 \end{aligned}$$

    In this case, the values \(k_i\) are merged successively during the merge phase until all values form one set \(T^*_k=\{k_1, \ldots , k_K\}\). Since there are no significant differences between the attribute values, no further splits are made, the algorithm stops after the merge phase (\(D=1\)), and the overall complexity is (Eq. 27):

    $$\begin{aligned} A(L,K)=LN\sum _{i=1}^{K-1}(K-(i-1))^2=O(LNK^3) \end{aligned}$$

  2. Pessimistic case

    In the worst case, all pairs of parameter values have significant differences in error metric distribution:

    $$\begin{aligned} \forall _{k \in 1 \ldots L, k_i, k_j \in T_k}: p_{k_i, k_j}<\alpha _{{\textit{merge}}} \end{aligned}$$

    In this case, there is only one iteration in the merge phase (\(n_k=K\)) because no values are merged, but the depth of the tree is equal to the number of attributes (\(D=L\)). Then, the complexity of the merge phase of one node is (Eq. 27):

    $$\begin{aligned} F(p)=N(L-d+1)K^2 \end{aligned}$$

    The overall complexity can then be bounded by:

    $$\begin{aligned} A(L,K)&=N\sum ^{L}_{d=1}K^{d-1}(L-d+1)K^2\nonumber \\&=N\sum ^{L}_{d=1}K^{d+1}(L-d+1)=O(NLK^{L+2}) \end{aligned}$$
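
To get a feel for the gap between the two extreme cases, the operation counts can be evaluated by direct summation. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the rule that a merge phase always runs at least one round of tests is an assumption made so that the pessimistic case (\(n_k=K\)) costs \(K^2\) per attribute, as in the text. The values of N, L, and K are chosen for illustration only.

```python
def merge_cost(K, n_k):
    """Cost of one merge phase: sum of (K - (i - 1))^2 over its iterations."""
    # Assumption: at least one round of pairwise tests is always run, so the
    # pessimistic case (n_k = K) costs K^2, as stated in the text.
    iterations = max(1, K - n_k)
    return sum((K - (i - 1)) ** 2 for i in range(1, iterations + 1))


def total_cost(N, L, K, n_k, D):
    """Sum n_k^(d-1) * N * (L - d + 1) * merge_cost over tree levels d = 1..D."""
    return sum(
        n_k ** (d - 1) * N * (L - d + 1) * merge_cost(K, n_k)
        for d in range(1, D + 1)
    )


# Illustrative values only.
N, L, K = 10_000, 5, 4
optimistic = total_cost(N, L, K, n_k=1, D=1)   # no significant differences
pessimistic = total_cost(N, L, K, n_k=K, D=L)  # all pairs differ, full depth
print(optimistic, pessimistic)
```

With these values, the optimistic count is 1,450,000 operations and the pessimistic count is 72,480,000, which illustrates how quickly the cost grows once the tree is allowed to reach full depth.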

Appendix A.2: Experimental computation time

To verify the practical complexity of the proposed algorithm, we performed an experiment on synthetic data generated analogously to Sect. 4.3. However, instead of using a pre-defined set of attributes, we compared the results for varying data characteristics (with \(\alpha =0.01\)):

  • N: the number of examples, varied between 100 and 100,000 (default 10,000),

  • L: the number of attributes, between 1 and 50 (default 5),

  • K: the number of categories per attribute, between 2 and 20 (default 4),

  • \(n_k\): the ratio of categories with a different error metric distribution, 0–1 (default 0.5),

  • D: the maximum depth of the tree (3 or 20).

The computation times were measured on a single machine (CPU Intel Core i5-4210U, 1.7 GHz), and the results were averaged over ten runs.
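
A timing measurement of this kind could be set up roughly as follows. This is only a sketch under stated assumptions: `make_synthetic` is a hypothetical generator that does not reproduce the procedure of Sect. 4.3, and `placeholder_workload` is a stand-in for fitting the tree, so the printed times say nothing about the BDT itself.

```python
import random
import time
from statistics import mean


def make_synthetic(n_examples, n_attributes, n_categories, seed=0):
    """Hypothetical generator: random categorical attributes plus a random error."""
    rng = random.Random(seed)
    return [
        ([rng.randrange(n_categories) for _ in range(n_attributes)], rng.random())
        for _ in range(n_examples)
    ]


def average_runtime(func, data, repeats=10):
    """Mean wall-clock time of func(data) over `repeats` runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(data)
        timings.append(time.perf_counter() - start)
    return mean(timings)


def placeholder_workload(data):
    # Stand-in for fitting a tree: mean error per category of the first attribute.
    by_value = {}
    for attrs, err in data:
        by_value.setdefault(attrs[0], []).append(err)
    return {value: mean(errs) for value, errs in by_value.items()}


for n in (100, 1_000, 10_000):
    data = make_synthetic(n, n_attributes=5, n_categories=4)
    print(n, average_runtime(placeholder_workload, data))
```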

Fig. 9 Execution time of the BDT algorithm depending on different parameters: (a) number of examples; (b) number of attributes; (c) number of categories in each attribute; (d) ratio of categories with a significant difference in the error metric distribution. For each case, we compared the time for a maximum depth limited to 3 and 20 levels

Results. The results for each parameter are presented in Fig. 9 for maximum depths of 3 and 20 levels. The computation time grows linearly with the number of examples and does not depend on D (Fig. 9a), which shows that the proposed method scales well with dataset size. Figure 9b shows that, for the two tree depths investigated, the computation time as a function of the number of attributes L does not differ significantly for small L, but the difference starts increasing for \(L>20\). This observation indicates that restricting the depth of the tree may be important, especially when the number of attributes is large. A similar observation holds for the number of attribute categories K (Fig. 9c) and the ratio of categories with a significant difference in the error metric distribution (Fig. 9d). Moreover, as shown in Fig. 9d, the computation time is lowest in the optimistic case, when no categories have a significantly different error metric distribution. Importantly, limiting the maximum depth of the tree reduces the computation time significantly. Since a highly complex tree may also make the results difficult to interpret when detecting disparities in practice, this parameter should be limited.

Appendix A.3: Limiting computation time

In practice, the number of splits for each node is usually significantly lower than K unless there are significant differences between the error metric distributions in all nodes. However, since the pessimistic computational complexity can be large, several steps can be taken to limit the complexity of the algorithm in practical applications:

  • Limiting L: to reduce the number of attributes, standard feature extraction techniques may be applied to remove the correlations and group attributes into fewer categories. For instance, the movie tags could be grouped into several categories with a topic modeling technique. Additional optimization steps may include subsampling techniques and training multiple trees on subsets of the features or examples.

  • Limiting K: this may be achieved, for instance, by bucketizing the numerical attributes (such as the user’s age or movie production year), removing the outliers, or filtering the less frequent values.

  • Limiting D: to reduce the complexity and avoid selecting overly specific rules that would be hard to interpret, the tree can be regularized by limiting the maximum depth. Then, parameter D will be constant.

  • Limiting \(n_k\): the number of splits in each node can be controlled by the parameters \(\alpha \) that define the statistical significance threshold for the merge or split stages, or by setting the minimum number of samples in a node.
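
Two of the steps for limiting K can be sketched in a few lines of Python. The helper names, bucket edges, and labels below are hypothetical, and rare values are grouped into a catch-all label rather than dropped, which is one possible reading of "filtering the less frequent values".

```python
from collections import Counter


def bucketize(values, edges, labels):
    """Map each number to the label of the first bucket whose upper edge exceeds it.

    `labels` must have one more entry than `edges`; the last label covers
    values at or above the final edge.
    """
    out = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v < edge:
                out.append(label)
                break
        else:
            out.append(labels[-1])
    return out


def group_rare(values, min_count, other="other"):
    """Replace categories seen fewer than min_count times with one catch-all label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


ages = [17, 25, 34, 52, 70]
print(bucketize(ages, edges=[18, 35, 50], labels=["<18", "18-34", "35-49", "50+"]))

genres = ["drama", "drama", "comedy", "noir"]
print(group_rare(genres, min_count=2))
```

Both helpers reduce the number of distinct values per attribute before the tree is fitted, which directly lowers K in the bounds above.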

Cite this article

Misztal-Radecka, J., Indurkhya, B. A bias detection tree approach for detecting disparities in a recommendation model’s errors. User Model User-Adap Inter 33, 43–79 (2023).

  • Recommender systems
  • System fairness
  • Bias detection