
Co-clustering for Fair Recommendation

Conference paper. In: Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2021).
Abstract

Collaborative filtering relies on a sparse rating matrix, where each user rates a few products, to propose recommendations. The approach consists of approximating this sparse rating matrix with a simple model whose regularities make it possible to fill in the missing entries. The latent block model is a generative co-clustering model that can provide such an approximation. In this paper, we show that exogenous sensitive attributes can be incorporated into this model to make fair recommendations. Since users are characterized only by their ratings and their sensitive attribute, fairness is measured here by a parity criterion. We propose a definition of fairness specific to recommender systems, requiring item rankings to be independent of the users' sensitive attribute. We show that our model ensures approximately fair recommendations provided that the classification of users approximately respects statistical parity.


Notes

  1. \(\gamma = (\boldsymbol{\tau }^{\left( U\right) }, \boldsymbol{\tau }^{\left( V\right) },\boldsymbol{\nu }^{\left( A\right) },\boldsymbol{\rho }^{\left( A\right) }, \boldsymbol{\nu }^{\left( B\right) }, \boldsymbol{\rho }^{\left( B\right) }, \boldsymbol{\nu }^{\left( C\right) }, \boldsymbol{\rho }^{\left( C\right) })\).

  2. \(\gamma = (\boldsymbol{\tau }^{\left( U\right) }, \boldsymbol{\tau }^{\left( V\right) },\boldsymbol{\nu }^{\left( A\right) },\boldsymbol{\rho }^{\left( A\right) }, \boldsymbol{\nu }^{\left( B\right) }, \boldsymbol{\rho }^{\left( B\right) }, \boldsymbol{\nu }^{\left( C\right) }, \boldsymbol{\rho }^{\left( C\right) })\).



Appendices

Co-clustering for Fair Recommendation. Supplementary Material

A Computation of the Variational Log-Likelihood Criterion

The criterion we want to optimize is:

$$\begin{aligned} \mathcal {J}{\left( q_{\gamma }, \theta \right) } = \mathcal {H}(q_{\gamma }) + \mathbb {E}_{q_{\gamma }}\left[ \mathcal {L}{\left( \boldsymbol{R}, \boldsymbol{U}, \boldsymbol{V}, \boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}; \theta \right) }\right] . \end{aligned}$$
(S1)

We restrict the variational distribution \(q_{\gamma }\) to a fully factorized form:

$$\begin{aligned} q_{\gamma }&=\textstyle \prod _{i=1}^{{n_1}}{\mathcal {M}{\left( 1;\tau ^{\left( U\right) }_i\right) }}\;\times \;\; \prod _{j=1}^{{n_2}}{\mathcal {M}{\left( 1;\tau ^{\left( V\right) }_j\right) }} \\&\textstyle \quad \times \prod _{i=1}^{{n_1}}{\mathcal {N}{\left( \nu ^{\left( A\right) }_i,\rho ^{\left( A\right) }_i\right) }}\times \prod _{j=1}^{{n_2}}{\mathcal {N}{\left( \nu ^{\left( B\right) }_j,\rho ^{\left( B\right) }_j\right) }} \nonumber \\ {}&\quad \times \textstyle \prod _{j=1}^{{n_2}}{\mathcal {N}{\left( \nu ^{\left( C\right) }_j,\rho ^{\left( C\right) }_j\right) }}\nonumber \end{aligned}$$
(S2)

where \(\gamma \) denotes the concatenation of the parameters of the variational distribution \(q_{\gamma }\) (Footnote 2). The entropy is additive across independent variables, so we get:

$$\begin{aligned} \mathcal {H}{\left( q_{\gamma }\right) }= \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{U}\right) }\right) } + \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{V}\right) }\right) } + \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{A}\right) }\right) } + \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{B}\right) }\right) } + \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{C}\right) }\right) } , \end{aligned}$$

with the following terms:

$$\begin{aligned} \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{U}\right) }\right) }&= - \sum _{iq}{ \tau ^{\left( U\right) }_{iq} \log \tau ^{\left( U\right) }_{iq}} \\ \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{V}\right) }\right) }&= - \sum _{jl}{ \tau ^{\left( V\right) }_{jl} \log \tau ^{\left( V\right) }_{jl}} \\ \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{A}\right) }\right) }&= \frac{1}{2} \sum _{i}\log \rho ^{\left( A\right) }_i+ \frac{{n_1}}{2}{\left( \log 2\pi +1\right) } \\ \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{B}\right) }\right) }&= \frac{1}{2} \sum _{j}\log \rho ^{\left( B\right) }_j+ \frac{{n_2}}{2}{\left( \log 2\pi +1\right) } \\ \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{C}\right) }\right) }&= \frac{1}{2} \sum _{j}\log \rho ^{\left( C\right) }_j+ \frac{{n_2}}{2}{\left( \log 2\pi +1\right) } \\ \end{aligned}$$
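For concreteness, here is a minimal NumPy sketch of these entropy terms; the array names (tau_U, tau_V, rho_A, ...) are our own conventions and not taken from the paper's implementation.

```python
import numpy as np

def entropy_factorized_q(tau_U, tau_V, rho_A, rho_B, rho_C, eps=1e-12):
    """Entropy of the fully factorized variational distribution q_gamma.

    tau_U: (n1, k1) user-cluster membership probabilities
    tau_V: (n2, k2) item-cluster membership probabilities
    rho_A: (n1,) variational variances of the user effects A_i
    rho_B, rho_C: (n2,) variational variances of the item effects B_j, C_j
    """
    n1, n2 = tau_U.shape[0], tau_V.shape[0]
    # Entropy of the categorical (multinomial) factors
    h_U = -np.sum(tau_U * np.log(tau_U + eps))
    h_V = -np.sum(tau_V * np.log(tau_V + eps))
    # Entropy of the Gaussian factors: (1/2) sum log rho + (n/2)(log 2*pi + 1)
    h_A = 0.5 * np.sum(np.log(rho_A)) + 0.5 * n1 * (np.log(2 * np.pi) + 1)
    h_B = 0.5 * np.sum(np.log(rho_B)) + 0.5 * n2 * (np.log(2 * np.pi) + 1)
    h_C = 0.5 * np.sum(np.log(rho_C)) + 0.5 * n2 * (np.log(2 * np.pi) + 1)
    return h_U + h_V + h_A + h_B + h_C
```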

The independence of the latent variables allows us to rewrite the expectation of the complete log-likelihood as:

$$\begin{aligned} \mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( \boldsymbol{R}, \boldsymbol{U}, \boldsymbol{V}, \boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}\right) }\right] } =\;&\mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( \boldsymbol{U}\right) }\right] } + \mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( \boldsymbol{V}\right) }\right] }\\&+ \mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( \boldsymbol{A}\right) }\right] } + \mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( \boldsymbol{B}\right) }\right] } + \mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( \boldsymbol{C}\right) }\right] }\\&+ \mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( \left. \boldsymbol{R}\right| \boldsymbol{U}, \boldsymbol{V}, \boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}\right) }\right] } , \end{aligned}$$

with the following terms:

$$\begin{aligned} \mathbb {E}_{q_{\gamma }} \mathcal {L}{\left( \boldsymbol{U}\right) }&= \mathbb {E}_{q_{\gamma }} {\left[ \sum _{iq}{U_{iq} \log \alpha _q}\right] } = \sum _{iq} { \tau ^{\left( U\right) }_{iq} \log \alpha _q} \\ \mathbb {E}_{q_{\gamma }} \mathcal {L}{\left( \boldsymbol{V}\right) }&= \mathbb {E}_{q_{\gamma }} {\left[ \sum _{jl}{V_{jl} \log \beta _l}\right] } = \sum _{jl} { \tau ^{\left( V\right) }_{jl} \log \beta _l}\\ \end{aligned}$$
$$\begin{aligned} \mathbb {E}_{q_{\gamma }} \mathcal {L}{\left( \boldsymbol{A}\right) }&= - \frac{{n_1}}{2} \log 2\pi - \frac{{n_1}}{2} \log \sigma ^2_{A}- \frac{1}{2\sigma ^2_{A}} \sum _{i}{\mathbb {E}_{q_{\gamma }}A_i^2} \\&= - \frac{{n_1}}{2} \log 2\pi - \frac{{n_1}}{2} \log \sigma ^2_{A}- \frac{1}{2\sigma ^2_{A}} \sum _{i}{\left( {\left( \nu ^{\left( A\right) }_i\right) }^2 + \rho ^{\left( A\right) }_i\right) } \\ \mathbb {E}_{q_{\gamma }} \mathcal {L}{\left( \boldsymbol{B}\right) }&= - \frac{{n_2}}{2} \log 2\pi - \frac{{n_2}}{2} \log \sigma ^2_{B}- \frac{1}{2\sigma ^2_{B}} \sum _{j}{\left( {\left( \nu ^{\left( B\right) }_j\right) }^2 + \rho ^{\left( B\right) }_j\right) } \\ \mathbb {E}_{q_{\gamma }} \mathcal {L}{\left( \boldsymbol{C}\right) }&= - \frac{{n_2}}{2} \log 2\pi - \frac{{n_2}}{2} \log \sigma ^2_{C}- \frac{1}{2\sigma ^2_{C}} \sum _{j}{\left( {\left( \nu ^{\left( C\right) }_j\right) }^2 + \rho ^{\left( C\right) }_j\right) }\\ \end{aligned}$$

and as the entries of the data matrix \(\boldsymbol{R}\) are conditionally independent given the latent variables:

$$\begin{aligned}&\mathbb {E}_{q_{\gamma }} \mathcal {L}{\left( \left. \boldsymbol{R}\right| \boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}, \boldsymbol{U}, \boldsymbol{V}\right) } = \mathbb {E}_{q_{\gamma }} \mathcal {L}{\left( \left. \boldsymbol{R}^{(\text {o})}\right| \boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}, \boldsymbol{U}, \boldsymbol{V}\right) } + \mathcal {L}{\left( \boldsymbol{R}^{{\left( \lnot o\right) }}\right) } \end{aligned}$$
(S3)

where \(\boldsymbol{R}^{(\text {o})}\) denotes the set of observed ratings and \(\boldsymbol{R}^{{\left( \lnot o\right) }}\) the set of non-observed ratings, for which \(R_{ij}=\text {NA}\). From Eq. S3, it is clear that maximizing \(\mathbb {E}_{q_{\gamma }} \mathcal {L}(\boldsymbol{R}^{(\lnot o)})\) is not necessary to infer the model parameters used for prediction, so ignoring the non-observed data is legitimate. The expectation of the conditional log-likelihood (the first term on the right-hand side of Eq. S3) is numerically estimated by sampling from \(q_{\gamma }\).
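As an illustration of this sampling estimate, the sketch below assumes a Gaussian emission \(R_{ij} \mid \cdot \sim \mathcal {N}(\mu _{U_i V_j} + A_i + B_j, \sigma ^2)\) and omits the sensitive-effect term \(C_j\) for brevity; the paper's actual emission model may differ.

```python
import numpy as np

def expected_cond_loglik(R_obs, mu, tau_U, tau_V, nu_A, rho_A, nu_B, rho_B,
                         sigma2=1.0, n_samples=10, rng=None):
    """Monte Carlo estimate of E_q[log p(R^(o) | U, V, A, B)].

    R_obs is a list of observed triplets (i, j, r_ij); mu is the (k1, k2)
    matrix of block means; nu_* and rho_* are NumPy arrays of variational
    means and variances. A Gaussian emission is assumed for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    k1, k2 = mu.shape
    total = 0.0
    for _ in range(n_samples):
        # Draw one configuration of the latent variables from q_gamma
        U = np.array([rng.choice(k1, p=t) for t in tau_U])
        V = np.array([rng.choice(k2, p=t) for t in tau_V])
        A = nu_A + np.sqrt(rho_A) * rng.standard_normal(nu_A.shape)
        B = nu_B + np.sqrt(rho_B) * rng.standard_normal(nu_B.shape)
        for i, j, r in R_obs:
            m = mu[U[i], V[j]] + A[i] + B[j]
            total += -0.5 * (np.log(2 * np.pi * sigma2) + (r - m) ** 2 / sigma2)
    return total / n_samples
```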

Stochastic Gradient Optimization. To optimize the criterion with stochastic gradient descent, we express the variational log-likelihood criterion on a single rating:

$$\begin{aligned} \mathcal {J}{\left( R_{ij};q_{\gamma }, \theta \right) }&= \mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( \left. R^{(\text {o})}_{ij}\right| \boldsymbol{U}_i, \boldsymbol{V}_j, A_i, B_j, C_j\right) }\right] } \\&\quad +\frac{1}{{n_2}}{\left( \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{U}_i\right) }\right) } +\mathcal {H}{\left( q_{\gamma }{\left( A_i\right) }\right) } +\mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( \boldsymbol{U}_i\right) }\right] } + \mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( A_i\right) }\right] } \right) } \\&\quad + \frac{1}{{n_1}} {\left( \mathcal {H}{\left( q_{\gamma }{\left( \boldsymbol{V}_j\right) }\right) } +\mathcal {H}{\left( q_{\gamma }{\left( B_j\right) }\right) } + \mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( \boldsymbol{V}_j\right) }\right] } +\mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( B_j\right) }\right] } \right) } \\&\quad + \frac{1}{{n_1}} {\left( \mathcal {H}{\left( q_{\gamma }{\left( C_j\right) }\right) } + \mathbb {E}_{q_{\gamma }}{\left[ \mathcal {L}{\left( C_j\right) }\right] } \right) } \end{aligned}$$

so that summing \(\mathcal {J}{\left( R_{ij};q_{\gamma }, \theta \right) }\) over all entries recovers the full criterion: the terms attached to user \(i\) appear in the \(n_2\) ratings of that user and are weighted by \(1/n_2\), while the terms attached to item \(j\) appear in the \(n_1\) ratings of that item and are weighted by \(1/n_1\).

A batch of data, \(\boldsymbol{R}_{(i:i+n),(j:j+n)}\), consists of an \((n\times n)\) sub-matrix randomly sampled from the original matrix \(\boldsymbol{R}\).
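A minimal sketch of such a batch sampler follows; drawing row and column subsets uniformly without replacement is our assumption, since the paper does not detail the sampling scheme.

```python
import numpy as np

def sample_batch(R, n, rng=None):
    """Sample an (n x n) sub-matrix batch from the rating matrix R by
    drawing n random rows and n random columns without replacement."""
    rng = np.random.default_rng() if rng is None else rng
    rows = rng.choice(R.shape[0], size=n, replace=False)
    cols = rng.choice(R.shape[1], size=n, replace=False)
    return rows, cols, R[np.ix_(rows, cols)]
```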

B Clustering \(\varepsilon \)-parity and \(\varepsilon \)-fair Recommendation for Arbitrary Discrete Sensitive Attribute

Definition S1

(Clustering \(\varepsilon \)-parity, arbitrary discrete sensitive attribute). The clustering of users is said to respect \(\varepsilon \)-parity with respect to the discrete attribute \(s\in {\mathcal S}\) iff:

$$\begin{aligned} \forall (t,t') \in {\mathcal S}^2,\ \forall q,\left|\frac{\#\left\{ i|s_i= t\wedge u_{iq} = 1\right\} }{\#\left\{ i|s_i= t\right\} } - \frac{\#\left\{ i|s_i= t'\wedge u_{iq} = 1\right\} }{\#\left\{ i|s_i= t'\right\} } \right|\le \varepsilon , \end{aligned}$$
(S4)

where \(\varepsilon \in \mathbb {R}_+\) measures the gap to exact parity, \(u_{iq}\) is the (hard) membership of user \(i\) to cluster \(q\), and \(\#\left\{ i|\varOmega \right\} \) denotes the number of users satisfying condition \(\varOmega \), that is, the cardinality of the corresponding set.
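The smallest \(\varepsilon \) satisfying Definition S1 can be computed directly from hard cluster assignments, as in the following sketch (the function name and interface are ours):

```python
import numpy as np

def clustering_parity_gap(s, d, k1):
    """Smallest epsilon for which the clustering respects epsilon-parity
    (Definition S1). s: (n1,) sensitive attribute values; d: (n1,) hard
    cluster assignments in {0, ..., k1 - 1}."""
    s, d = np.asarray(s), np.asarray(d)
    # props[t, q] = #{i : s_i = t and d_i = q} / #{i : s_i = t}
    props = np.array([[np.mean(d[s == t] == q) for q in range(k1)]
                      for t in np.unique(s)])
    # Largest gap over clusters q and pairs of attribute values (t, t')
    return float(np.max(props.max(axis=0) - props.min(axis=0)))
```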

Definition S2

(\(\varepsilon \)-fair recommendation, arbitrary discrete sensitive attribute). A recommender system is said to be \(\varepsilon \)-fair with respect to the discrete attribute \(s\in {\mathcal S}\) if for any two items \(j\) and \(j'\):

$$\begin{aligned} \forall (t,t') \in {\mathcal S}^2,\left|\frac{\#\left\{ i|s_i= t\wedge (\hat{R}_{ij}>\hat{R}_{ij'})\right\} }{\#\left\{ i|s_i= t\right\} } - \frac{\#\left\{ i|s_i= t'\wedge (\hat{R}_{ij}>\hat{R}_{ij'})\right\} }{\#\left\{ i|s_i= t'\right\} } \right|\le \varepsilon , \end{aligned}$$
(S5)

where \(\varepsilon \in \mathbb {R}_+\) measures the gap to exact fairness.
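Analogously, the gap of Definition S2 for a given item pair can be computed from the predicted scores; a complete check takes the maximum over all item pairs (again a sketch with our own naming):

```python
import numpy as np

def fairness_gap(s, R_hat, j, j_prime):
    """Largest proportion gap of Definition S2 for the item pair (j, j').
    s: (n1,) sensitive attribute values; R_hat: (n1, n2) predicted scores."""
    s = np.asarray(s)
    prefer = R_hat[:, j] > R_hat[:, j_prime]      # users preferring j to j'
    rates = [np.mean(prefer[s == t]) for t in np.unique(s)]
    return float(max(rates) - min(rates))
```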

C Proof of Theorem 1

Theorem 1

(Fair recommendation from clustering parity). If the clustering of users in \({k_1}\) groups respects \(\varepsilon \)-parity (Definition 3 or Definition S1) then the recommender system relying on the relevance score defined in Eq. (7) is \(({k_1}\varepsilon )\)-fair (Definition 1 or Definition S2).

Proof

Suppose that \(\boldsymbol{\tau }^{\left( U\right) }\), the maximum a posteriori of \(\boldsymbol{U}\), is a binary matrix; \(\boldsymbol{\tau }^{\left( U\right) }\) is thus an \({n_1}\times {k_1}\) indicator matrix of row-class memberships. Then, given user \(i\), item \(j\) is said to be preferred to item \(j'\) if \(\hat{R}_{ij} > \hat{R}_{ij'}\), that is:

$$\begin{aligned} \hat{R}_{ij}> \hat{R}_{ij'}&\iff \boldsymbol{\tau }^{\left( U\right) }_{i}\hat{\boldsymbol{\mu }}{\boldsymbol{\tau }^{\left( V\right) }_{{j}}}^T + \nu ^{\left( A\right) }_i+ \nu ^{\left( B\right) }_{j}> \boldsymbol{\tau }^{\left( U\right) }_{i}\hat{\boldsymbol{\mu }}{\boldsymbol{\tau }^{\left( V\right) }_{{j'}}}^T + \nu ^{\left( A\right) }_i+ \nu ^{\left( B\right) }_{j'} \nonumber \\&\iff \boldsymbol{\tau }^{\left( U\right) }_i\hat{\boldsymbol{\mu }} {\left( \boldsymbol{\tau }^{\left( V\right) }_{j} - \boldsymbol{\tau }^{\left( V\right) }_{j'}\right) }^T> \nu ^{\left( B\right) }_{j'} - \nu ^{\left( B\right) }_{j} \nonumber \\&\iff \boldsymbol{\tau }^{\left( U\right) }_i\boldsymbol{a}> b\nonumber \\&\iff \boldsymbol{a}_{d_{i}} > b , \end{aligned}$$
(S6)

with \(\boldsymbol{a} \in \mathbb {R}^{{k_1}}\) defined by \(\boldsymbol{a}=\hat{\boldsymbol{\mu }} {\left( \boldsymbol{\tau }^{\left( V\right) }_{j} - \boldsymbol{\tau }^{\left( V\right) }_{j'}\right) }^T\), \(b \in \mathbb {R}\) defined by \(b = \nu ^{\left( B\right) }_{j'} - \nu ^{\left( B\right) }_{j}\) and \(d_{i}\in \{1,\cdots ,{k_1}\}\) being the group indicator of user \(i\): \(\tau ^{\left( U\right) }_{i, d_{i}} = 1\).

Assume \(\varepsilon \)-parity; from Definition S1 (Definition 3 is a particular case of Definition S1), we have

$$\begin{aligned}&\forall (t,t'),\qquad \forall q,\quad \left|\frac{\#\left\{ i|s_i= t\wedge d_{i}= q\right\} }{\#\left\{ i|s_i= t\right\} } - \frac{\#\left\{ i|s_i= t'\wedge d_{i}= q\right\} }{\#\left\{ i|s_i= t'\right\} } \right|\le \varepsilon \end{aligned}$$

therefore,

$$\begin{aligned}&\forall (t,t'),\;\forall q,\;\left|\mathbbm {1}_{\boldsymbol{a}_{q}> b} \frac{\#\left\{ i|s_i= t\wedge d_{i}= q\right\} }{\#\left\{ i|s_i= t\right\} } - \mathbbm {1}_{\boldsymbol{a}_{q}> b}\frac{\#\left\{ i|s_i= t'\wedge d_{i}= q\right\} }{\#\left\{ i|s_i= t'\right\} } \right|\le \varepsilon \mathbbm {1}_{\boldsymbol{a}_{q} > b} \end{aligned}$$

By summing over all groups, we get:

$$\begin{aligned} \forall (t,t'),\;\sum _q\left|\frac{\mathbbm {1}_{\boldsymbol{a}_{q}> b} \#\left\{ i|s_i= t\wedge d_{i}= q\right\} }{\#\left\{ i|s_i= t\right\} } - \frac{\mathbbm {1}_{\boldsymbol{a}_{q}> b} \#\left\{ i|s_i= t'\wedge d_{i}= q\right\} }{\#\left\{ i|s_i= t'\right\} } \right|\!\le \! \varepsilon \sum _q\mathbbm {1}_{\boldsymbol{a}_{q} > b} \end{aligned}$$

and from the triangle inequality,

$$\begin{aligned} \forall (t,t'), \left|\frac{\sum _q\mathbbm {1}_{\boldsymbol{a}_{q}> b} \#\left\{ i|s_i= t\wedge d_{i}= q\right\} }{\#\left\{ i|s_i= t\right\} } - \frac{\sum _q\mathbbm {1}_{\boldsymbol{a}_{q}> b} \#\left\{ i|s_i= t'\wedge d_{i}= q\right\} }{\#\left\{ i|s_i= t'\right\} } \right|&\le \varepsilon \sum _q\mathbbm {1}_{\boldsymbol{a}_{q}> b} \\ \forall (t,t'), \qquad \qquad \left|\frac{ \#\left\{ i|s_i= t\wedge \boldsymbol{a}_{d_{i}}> b\right\} }{\#\left\{ i|s_i= t\right\} } - \frac{ \#\left\{ i|s_i= t'\wedge \boldsymbol{a}_{d_{i}} > b\right\} }{\#\left\{ i|s_i= t'\right\} } \right|&\le \varepsilon {k_1}\\ \end{aligned}$$

Applying (S6), we obtain the result:

$$\begin{aligned} \forall (t,t'),\quad \left|\frac{\#\left\{ i|s_i= t\wedge (\hat{R}_{ij}>\hat{R}_{ij'})\right\} }{\#\left\{ i|s_i= t\right\} } - \frac{\#\left\{ i|s_i= t'\wedge (\hat{R}_{ij}>\hat{R}_{ij'})\right\} }{\#\left\{ i|s_i= t'\right\} } \right|&\le \varepsilon {k_1}\end{aligned}$$

   \(\square \)
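The bound of Theorem 1 can also be checked numerically. The toy simulation below reuses the two sketches given after Definitions S1 and S2, draws cluster assignments independently of the sensitive attribute (so the parity gap \(\varepsilon \) is small), scores items with the relevance model of Eq. (S6), omitting the user effect \(\nu ^{\left( A\right) }_i\), which cancels in pairwise comparisons, and verifies the \({k_1}\varepsilon \) bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, k1, k2 = 2000, 30, 5, 4
s = rng.integers(0, 2, size=n1)        # binary sensitive attribute
d = rng.integers(0, k1, size=n1)       # clusters drawn independently of s
mu_hat = rng.standard_normal((k1, k2)) # estimated block-mean matrix
V = rng.integers(0, k2, size=n2)       # hard item-cluster assignments
nu_B = rng.standard_normal(n2)         # item effects nu^(B)_j
R_hat = mu_hat[d][:, V] + nu_B         # scores mu_{d_i, V_j} + nu^(B)_j

eps = clustering_parity_gap(s, d, k1)  # sketch after Definition S1
gaps = [fairness_gap(s, R_hat, j, jp)  # sketch after Definition S2
        for j in range(n2) for jp in range(n2) if j != jp]
assert max(gaps) <= k1 * eps + 1e-12   # Theorem 1: (k1 * eps)-fairness
print(f"parity eps = {eps:.4f}, max fairness gap = {max(gaps):.4f}")
```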

D Supplemental Results for MovieLens 1M

1.1 D.1 Gender as Sensitive Attribute

Supplemental Analysis of the Model. We list in Tables 2 and 3 the most extreme movies according to the inferred values of their latent variable \(C_j\). The variable \(C_j\) encodes the difference in opinion between the sensitive groups, not the overall opinion: a movie may well be liked by most people but liked even more by males. Table 2 lists movies for which females have a better opinion than males, and Table 3 lists movies for which males have a better opinion than females.

Table 2. List of movies with the largest gap in opinion between females and males for which females have a better opinion than males

Higher Number of Groups. We did not optimize the hyper-parameters of the compared models; we present here additional experiments to illustrate that the conclusions of Sect. 4 carry over to other hyper-parameter settings. Using a substantially larger number of groups (\({k_1}=50\) user groups and \({k_2}=50\) item groups) or a larger dimension of latent factors for SVD (also 50), the statistical gender parity measures given in Table 4 and the recommendation performance given in Fig. 7 are qualitatively similar to those given in Table 1 and Fig. 5.

Table 3. List of movies with the largest gap in opinion between females and males for which males have a better opinion than females
Fig. 7. Normalized Discounted Cumulative Gain estimated on MovieLens-1M with \({k_1}={k_2}=50\) groups for the clustering methods and 50 factors for the SVD.

Table 4. Measures of gender statistical parity. The number of user groups is \({k_1}=50\). The \(\chi ^2\) statistic (with 49 degrees of freedom) is averaged over the five replicates of the experiment. A high value of the \(\chi ^2\) statistic (or a low p-value) leads to the rejection of the statistical parity hypothesis.
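The \(\chi ^2\) statistics reported in Tables 4 and 5 test the independence of the sensitive attribute and the hard cluster assignment; a sketch using SciPy's chi2_contingency (our choice of implementation) is given below.

```python
import numpy as np
from scipy.stats import chi2_contingency

def parity_chi2(s, d):
    """Chi-squared test of independence between the sensitive attribute s
    and the hard cluster assignment d. For two genders and k1 = 50 clusters
    the degrees of freedom are (2 - 1) * (50 - 1) = 49, as in Table 4."""
    groups, clusters = np.unique(s), np.unique(d)
    # Contingency table of user counts per (attribute value, cluster) pair
    table = np.array([[np.sum((s == t) & (d == q)) for q in clusters]
                      for t in groups])
    stat, p_value, dof, _ = chi2_contingency(table)
    return stat, p_value, dof
```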

1.2 D.2 Age as Sensitive Attribute

The users' ages are recorded in the following intervals: 'Under 18', '18–24', '25–34', '35–44', '45–49', '50–55' and '56+'. The counts of users in each age category are displayed in Fig. 8.

Fig. 8. Count of users in each age category.

User age is treated as sensitive: the seven age categories are one-hot encoded into seven binary sensitive attributes \(s^{1}_i, \cdots , s^{7}_i\), with associated item latent variables \(C^{1}_j, \cdots , C^{7}_j\). We use the protocol described in Sect. 4, with the exception that our Parity-LBM is initialized from estimates obtained with the Standard-LBM. Table 5 presents the \(\chi ^2\) statistics computed from the contingency table of user age counts in each group. The methods that do not incorporate the sensitive attribute in the model produce groups that depend on age, whereas statistical parity is a reasonable assumption for the groups produced by our Parity-LBM.
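A minimal sketch of this one-hot encoding (the category labels are copied from the MovieLens metadata; the function name is ours):

```python
import numpy as np

AGE_CATEGORIES = ['Under 18', '18-24', '25-34', '35-44', '45-49', '50-55', '56+']

def one_hot_age(ages):
    """Encode a list of age-category labels into an (n1, 7) binary matrix
    whose columns are the sensitive attributes s^1_i, ..., s^7_i."""
    index = {c: k for k, c in enumerate(AGE_CATEGORIES)}
    S = np.zeros((len(ages), len(AGE_CATEGORIES)), dtype=int)
    for i, a in enumerate(ages):
        S[i, index[a]] = 1
    return S
```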

Table 5. Measures of statistical parity with respect to age category. The number of user groups is \({k_1}=15\). A high value of the \(\chi ^2\) statistic (or a low p-value) leads to the rejection of the statistical parity hypothesis. The \(\chi ^2\) statistic (with 14 degrees of freedom) is averaged over the five folds of the cross-validation.

Finally, we illustrate the interpretability of the estimates of the movie latent variables \(C^{1}_j, \cdots ,C^{7}_j\). For each age category \(k\), we select the thirty movies with the largest values of \(C^{k}_j\); these are the movies with the largest positive opinion bias for users in that age category. Figure 9 displays boxplots of the release years of these films for all user age categories. The greater spread in the distribution for older users indicates that they have a comparatively higher opinion of older movies than younger users do. With user age as the sensitive attribute, the recommendations do not account for these differences.

Fig. 9. Release years of the thirty most extreme movies according to the inferred positive values of the latent variables \(C^{1}_j, \cdots , C^{7}_j\). Each latent variable \(C^{k}_j\) is matched with its corresponding user age category.


Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Frisch, G., Leger, JB., Grandvalet, Y. (2021). Co-clustering for Fair Recommendation. In: Kamp, M., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2021. Communications in Computer and Information Science, vol 1524. Springer, Cham. https://doi.org/10.1007/978-3-030-93736-2_44

  • DOI: https://doi.org/10.1007/978-3-030-93736-2_44

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-93735-5

  • Online ISBN: 978-3-030-93736-2
