Abstract
Many of the current recommendation systems are considered to be black boxes that are tuned to optimize some global objective function. However, their error distribution may differ dramatically among different combinations of attributes, and such algorithms may end up propagating hidden data biases. Identifying potential disparities in an algorithm's functioning is essential for building recommendation systems in a fair and responsible way. In this work, we propose a model-agnostic technique to automatically detect the combinations of user and item attributes correlated with unequal treatment by the recommendation model. We refer to this technique as the Bias Detection Tree. In contrast to existing works in this field, our method automatically detects disparities related to combinations of attributes without any a priori knowledge about protected attributes, assuming that relevant metadata is available. Our results on five public recommendation datasets show that the proposed technique can identify hidden biases in terms of four kinds of metrics for multiple collaborative filtering models. Moreover, we adapt a minimax model selection technique to control the trade-off between the global and the worst-case optimizations and improve the recommendation model's performance for biased attributes.
Appendix A: Computational complexity
In this section, we analyze the computational complexity of the proposed algorithm and measure its execution time on experimental data.
1.1 Appendix A.1: Theoretical analysis
In the complexity analysis, let L denote the number of attributes, N the number of ratings, and D the maximum depth of the tree (\(1 \le D \le L\)). Next, let \(n_k\) be the number of subsets of \(T_k\) with a significant difference in the error metric distribution (\(p<\alpha _{{\textit{merge}}}\)). Without loss of generality, let us assume that the number of attribute values in \(T_k\) and the value of \(n_k\) are equal for all attributes k:
Then \(K-n_k\) is the number of iterations in the merge phase for an attribute k. Since the complexity of the split phase is linear, O(L), we only consider the complexity of the merge phase in the analysis. Assuming that the complexity of operations in each node is equal (Eq. 23), the overall complexity of the BDT algorithm can be calculated as:
where F(p) denotes the complexity of the merge phase for a non-leaf tree node p whose split is performed based on attribute \(x_k\) with values \(T_k=\{k_1, \ldots , k_K\}\), and P denotes the set of all parent nodes in the tree.
In each iteration of the merge phase, the significance test is calculated for all pairs of attribute values from \(T_k\) for each k. Next, the pair \(k_m, k_n\) with the least significant difference (if \(p>\alpha _{{\textit{merge}}}\)) is merged. Hence, the algorithm makes K comparisons of single values in the first step (\(i=1\)), and the pair \(k_{m,n}\) is merged so that the number of values processed at \(i=2\) is \(T_k^1=T_k-1=K-1\). Consequently, \(K_i=K-(i-1)\) values are compared in the ith iteration, until no further merges can be performed (\(p<\alpha _{{\textit{merge}}}\) for all combinations). Since the difference in error distribution is tested for each pair of attribute values, each iteration requires \(K_i^2\) operations. Accordingly, the complexity of the merge stage for one attribute in a node is given by:
The number of nodes on tree level \(d\), \(1\le d \le D\), is equal to \(n_k^{d-1}\), and the number of non-leaf nodes in a tree of depth D can be calculated as:
For a simple mean-based significance test, the comparison of the error metric \(\sigma _k^2\) between two attribute values \(k_i, k_j \in T_k\) requires N operations, where N is the number of samples.
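As an illustration, the merge loop described above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the `merge_phase` helper and its `p_value(a, b)` interface are hypothetical stand-ins for the actual significance test, and the sketch also counts the pairwise comparisons performed per iteration (\(O(K_i^2)\) for \(K_i\) remaining groups).

```python
from itertools import combinations

def merge_phase(groups, p_value, alpha_merge=0.05):
    """CHAID-style merge phase: repeatedly merge the pair of value groups
    with the least significant difference in error distribution, until all
    remaining pairs differ significantly (p < alpha_merge).

    `groups` is a list of sets of attribute values; `p_value(a, b)` returns
    the significance of the difference between two groups (hypothetical
    interface). Returns the merged groups and the number of pairwise
    comparisons performed."""
    groups = [set(g) for g in groups]
    comparisons = 0
    while len(groups) > 1:
        # K_i = K - (i - 1) groups remain; test all pairs -> O(K_i^2) tests
        pairs = list(combinations(range(len(groups)), 2))
        comparisons += len(pairs)
        # pick the least significant pair (largest p-value)
        i, j = max(pairs, key=lambda ij: p_value(groups[ij[0]], groups[ij[1]]))
        if p_value(groups[i], groups[j]) <= alpha_merge:
            break  # every remaining pair differs significantly
        groups[i] |= groups[j]
        del groups[j]  # j > i, so index i stays valid
    return groups, comparisons

# Optimistic case: no pair differs (p = 1 everywhere), so everything merges
merged, n_ops = merge_phase([{v} for v in range(5)], lambda a, b: 1.0)
```

With constant p = 1, five singleton groups collapse into one set, matching the optimistic case where the algorithm stops after the merge phase.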
Since the number of attributes available for a node on level d is \(L-d+1\), the number of operations for the merge stage in one node on level d is given by (from Formula 25):
Hence, the overall complexity can be bounded by (based on Formulas 26 and 27):
To be more precise, let us consider the computational complexity of the BDT method in the two extreme cases:

1. Optimistic case
In the optimistic case, there are no significant differences in the error metric distribution among the parameter values for any attribute (as in the first simulated example):
$$\begin{aligned} \forall _{k \in 1 \ldots L,\; k_i, k_j \in T_k}: p_{k_i, k_j}>\alpha _{{\textit{merge}}}, \quad n_k=1 \end{aligned}$$

In this case, the values \(k_i\) are merged successively in the merge phase until all values are merged into one set \(T^*_k=\{k_1, \ldots , k_K\}\). Since there are no significant differences between the attribute values, no further splits are made, the algorithm stops after the merge phase (\(D=1\)), and the overall complexity is (Eq. 27):
$$\begin{aligned} A(L,K)=LN\sum _{i=1}^{K-1}(K-(i-1))^2=O(LNK^3) \end{aligned}$$ (29)
2. Pessimistic case
In the worst case, all pairs of attribute values have significant differences in the error metric distribution:
$$\begin{aligned} \forall _{k \in 1 \ldots L,\; k_i, k_j \in T_k}: p_{k_i, k_j}<\alpha _{{\textit{merge}}} \end{aligned}$$

In this case, there is only one iteration in the merge phase (\(n_k=K\)), as no values are merged together, but the depth of the tree equals the number of attributes (\(D=L\)). Then, the complexity of the merge phase for one node is (Eq. 27):
$$\begin{aligned} F(p)=N(L-d+1)K^2 \end{aligned}$$ (30)

And the overall complexity can be bounded by:
$$\begin{aligned} A(L,K)&=N\sum ^{L}_{d=1}K^{d-1}(L-d+1)K^2\nonumber \\&=N\sum ^{L}_{d=1}K^{d+1}(L-d+1)=O(NLK^{L+2}) \end{aligned}$$ (31)
1.2 Appendix A.2: Experimental computation time
To verify the practical complexity of the proposed algorithm, we performed an experiment on synthetic data generated analogously to Sect. 4.3; however, instead of a predefined set of attributes, we compared the results for varying parameters of the data characteristics (with \(\alpha =0.01\)):

- N: the number of examples, varying between 100 and 100,000 (default 10,000),
- L: the number of attributes, between 1 and 50 (default 5),
- K: the number of categories per attribute, between 2 and 20 (default 4),
- \(n_k\): the ratio of categories with a different error metric distribution, between 0 and 1 (default 0.5),
- D: the maximum depth of the tree (3 or 20).
The computation times were measured on a single machine (Intel Core i5-4210U CPU, 1.7 GHz), and the results were averaged over ten iterations.
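The averaging procedure can be sketched with a small timing harness like the following (a hypothetical helper, not the code used in the experiments; the dummy workload merely stands in for one BDT fit):

```python
import statistics
import time

def time_avg(fn, runs=10):
    """Average wall-clock time of `fn` over several runs, mirroring the
    ten-iteration averaging used in the experiment."""
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples)

# Dummy workload standing in for a single tree-fitting call
elapsed = time_avg(lambda: sum(x * x for x in range(50_000)))
```

Averaging over repeated runs smooths out scheduler noise, which matters when comparing timings across parameter settings as in Fig. 9.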
Results. The results for each parameter are presented in Fig. 9 for maximum depth limited to 3 and 20 levels. The relationship between the computation time and the number of examples is linear and does not depend on D (Fig. 9a). This shows that the proposed method scales well in terms of dataset size. Figure 9b shows that, for small L, the relationship between the time and the number of attributes L does not differ significantly between the two tree depths (D) we investigated, but the difference starts increasing for \(L>20\). This observation indicates that restricting the depth of the tree may be important, especially when the number of attributes is large. A similar observation holds for the number of attribute categories K (Fig. 9c) and for the ratio of categories with a significant difference in the error metric distribution (Fig. 9d). Moreover, as shown in Fig. 9d, the computation time is lowest in the optimistic case, when no categories have a significantly different error metric distribution. Importantly, limiting the maximum depth of the tree reduces the computation time significantly; since a deep tree is also harder to interpret when detecting disparities in practice, this parameter should be limited both for interpretability and for performance.
1.3 Appendix A.3: Limiting computation time
In practice, the number of splits for each node is usually significantly lower than K, unless there are significant differences between the error metric distributions in all nodes. However, since the pessimistic computational complexity can be large, several steps can be taken to limit the complexity of the algorithm in practical applications:

- Limiting L: to reduce the number of attributes, standard feature extraction techniques may be applied to remove correlations and group the attributes into fewer categories. For instance, movie tags could be grouped into several categories with a topic modeling technique. Additional optimization steps may include subsampling techniques and training multiple trees on subsets of the features or examples.
- Limiting K: this may be achieved, for instance, by bucketizing numerical attributes (such as the user's age or movie production year), removing outliers, or filtering out the less frequent values.
- Limiting D: to reduce the complexity and avoid selecting overly specific rules that would be hard to interpret, the tree can be regularized by limiting its maximum depth, making parameter D constant.
- Limiting \(n_k\): the number of splits in each node can be controlled by the \(\alpha \) parameters that define the statistical significance thresholds for the merge and split stages, or by setting a minimum number of samples per node.
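For instance, limiting K by bucketizing a numerical attribute such as the user's age could be sketched as follows (a minimal illustration; the bucket boundaries and labels are our own assumptions, not values from the paper):

```python
import bisect

# Illustrative bucket boundaries: raw ages collapse into K = 6 categories,
# so the merge phase compares 6 groups instead of one per distinct age.
AGE_EDGES = [18, 25, 35, 50, 65]
AGE_LABELS = ["<18", "18-24", "25-34", "35-49", "50-64", "65+"]

def bucketize_age(age):
    """Map a raw age to its coarse category label."""
    return AGE_LABELS[bisect.bisect_right(AGE_EDGES, age)]
```

Replacing dozens of distinct age values with a handful of buckets shrinks the \(K_i^2\) pairwise tests per merge iteration at the cost of coarser disparity rules.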
Misztal-Radecka, J., Indurkhya, B.: A bias detection tree approach for detecting disparities in a recommendation model's errors. User Model. User-Adap. Inter. 33, 43–79 (2023). https://doi.org/10.1007/s11257-022-09334-x
Keywords: Recommender systems · System fairness · Bias detection