
Making individually fair predictions with causal pathways


Abstract

Machine learning is increasingly used to make algorithmic decisions that strongly affect people's lives. Because of this impact, such decisions need to be both accurate and fair with respect to sensitive features, including race, gender, religion, and sexual orientation. To achieve a good balance between prediction accuracy and fairness, causality-based methods have been proposed that utilize a causal graph with unfair pathways. However, none of these methods can ensure fairness for each individual without making restrictive functional assumptions about the data-generating processes, which are not satisfied in many cases. In this paper, we propose a far more practical causality-based framework for learning an individually fair classifier. To avoid impractical functional assumptions, we introduce a new criterion, the probability of individual unfairness, and derive an upper bound on it that can be estimated from data. We then train a classifier by solving an optimization problem that forces this upper bound value to be close to zero, and we elucidate why solving such an optimization problem can guarantee fairness for each individual. Moreover, we provide two extensions for dealing with challenging real-world scenarios where there are unobserved variables called latent confounders and where the true causal graph is uncertain. Experimental results show that our method can learn an individually fair classifier at a slight cost in prediction accuracy.


Notes

  1. Here \(A=0\) can be regarded as a baseline for measuring an unfair effect, and this baseline can be switched to \(A=1\), which yields potential outcomes \(Y_{A \Leftarrow 1}\) and \(Y_{A \Leftarrow 0 \parallel \pi }\).

  2. When classifier \(h_{\theta }\) is not deterministic, potential outcomes are formulated in the same way using random noise that is employed in the classifier.

  3. Since variable H influences mediator M and outcome Y, it is also called a mediator-outcome confounder (VanderWeele 2015, Section 5).

  4. Note that the FIO method infers such conditional distributions not by learning statistical models beforehand but by simultaneously learning them with the predictive model of Y (Nabi and Shpitser 2018). This is because unlike our method, it addresses not only training a classifier but also learning a generative model of joint distribution \({{\,\mathrm{\textrm{P}}\,}}({{\textbf {X}}}, Y)\).

  5. Obviously, this region is wider than the red subregions \(({\hat{p}}^{A \Leftarrow 0}_{\theta }, {\hat{p}}^{A \Leftarrow 1 \parallel \pi }_{\theta }) \approx (0, 0)\) and \(({\hat{p}}^{A \Leftarrow 0}_{\theta }, {\hat{p}}^{A \Leftarrow 1 \parallel \pi }_{\theta }) \approx (1, 1)\). This indicates that, compared with such naive constraints, ours accepts a wider range of values of classifier parameter \(\theta \) and thus leaves more room to achieve high prediction accuracy.

  6. More precisely, R is an exposure-induced confounder (VanderWeele 2015, Chapter 5), i.e., an observed variable that is affected by sensitive feature A and that influences multiple observed variables. An exposure-induced confounder is also a mediator; unlike an ordinary mediator, however, it yields a spurious correlation among the observed variables.

  7. According to Miles et al. (2017), the lower and upper bounds coincide when the potential outcome and the potential mediator are degenerate.

  8. We used the modified COMPAS dataset included in R package "fairness" (Kozodoi and V Varga 2021).

References

  • Agrawal S, Ding Y, Saberi A, Ye Y (2010) Correlation robust stochastic optimization. In: SODA, pp 1087–1096

  • Andrews RM, Didelez V (2020) Insights into the cross-world independence assumption of causal mediation analysis. Epidemiology 32(2):209–219

  • Angwin J, Larson J, Mattu S, Kirchner L (2016) Machine bias. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing

  • Avin C, Shpitser I, Pearl J (2005) Identifiability of path-specific effects. In: IJCAI, pp 357–363

  • Bache K, Lichman M (2013) UCI machine learning repository: datasets. http://archive.ics.uci.edu/ml/datasets

  • Burke JV, Lewis AS, Overton ML (2005) A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J Optim 15(3):751–779

  • Chiappa S, Gillam TP (2019) Path-specific counterfactual fairness. In: AAAI, pp 7801–7808

  • Chikahara Y, Fujino A (2018) Causal inference in time series via supervised learning. In: IJCAI, pp 2042–2048

  • Chikahara Y, Sakaue S, Fujino A, Kashima H (2021) Learning individually fair classifier with path-specific causal-effect constraint. In: AISTATS, pp 145–153

  • Chouldechova A, Benavides-Prado D, Fialko O, Vaithianathan R (2018) A case study of algorithm-assisted decision making in child maltreatment hotline screening decisions. In: FAT, pp 134–148

  • Dwork C, Hardt M, Pitassi T, Reingold O, Zemel R (2012) Fairness through awareness. In: ITCS, pp 214–226

  • Fan Y, Guerre E, Zhu D (2017) Partial identification of functionals of the joint distribution of “potential outcomes’’. J Econom 197(1):42–59

  • Feldman M, Friedler SA, Moeller J, Scheidegger C, Venkatasubramanian S (2015) Certifying and removing disparate impact. In: KDD, pp 259–268

  • Ferrera J (2013) An introduction to nonsmooth analysis. Academic Press

  • Firpo S, Ridder G (2019) Partial identification of the treatment effect distribution and its functionals. J Econom 213(1):210–234

  • Glymour C, Zhang K, Spirtes P (2019) Review of causal discovery methods based on graphical models. Front Genet 10:524

  • Hardt M, Price E, Srebro N, et al. (2016) Equality of opportunity in supervised learning. In: NeurIPS, pp 3315–3323

  • Houser KA (2019) Can AI solve the diversity problem in the tech industry: mitigating noise and bias in employment decision-making. Stan Tech L Rev 22:290

  • Hoyer PO, Janzing D, Mooij JM, Peters J, Schölkopf B (2009) Nonlinear causal discovery with additive noise models. In: NeurIPS, pp 689–696

  • Huber M (2014) Identifying causal mechanisms (primarily) based on inverse probability weighting. J Appl Econom 29(6):920–943

  • Khandani AE, Kim AJ, Lo AW (2010) Consumer credit-risk models via machine-learning algorithms. J Bank Financ 34(11):2767–2787

  • Kilbertus N, Carulla MR, Parascandolo G, Hardt M, Janzing D, Schölkopf B (2017) Avoiding discrimination through causal reasoning. In: NeurIPS, pp 656–666

  • Kozodoi N, V Varga T (2021) Fairness: algorithmic fairness metrics. R package version 1.2.1; https://CRAN.R-project.org/package=fairness

  • Kusner M, Russell C, Loftus J, Silva R (2019) Making decisions that reduce discriminatory impacts. In: ICML, pp 3591–3600

  • Kusner MJ, Loftus J, Russell C, Silva R (2017) Counterfactual fairness. In: NeurIPS, pp 4066–4076

  • Makhlouf K, Zhioua S, Palamidessi C (2020) Survey on causal-based machine learning fairness notions. arXiv

  • Miles C, Kanki P, Meloni S, Tchetgen ET (2017) On partial identification of the natural indirect effect. J Causal Inference 5(2):20160004. https://doi.org/10.1515/jci-2016-0004

  • Nabi R, Shpitser I (2018) Fair inference on outcomes. In: AAAI, pp 1931–1940, https://github.com/raziehna/fair-inference-on-outcomes

  • Nabi R, Malinsky D, Shpitser I (2019) Learning optimal fair policies. In: ICML, pp 4674–4682

  • Pearl J (2001) Direct and indirect effects. In: UAI, pp 411–420

  • Pearl J (2009) Causality: models, reasoning, and inference. Cambridge University Press

  • Robins JM, Richardson TS (2010) Alternative graphical causal models and the identification of direct effects. In: Causality and psychopathology: finding the determinants of disorders and their cures, pp 103–158

  • Rosenbaum PR, Rubin DB (1983) The central role of the propensity score in observational studies for causal effects. Biometrika 70(1):41–55

  • Rubinstein A, Singla S (2017) Combinatorial prophet inequalities. In: SODA, pp 1671–1687

  • Russell C, Kusner MJ, Loftus J, Silva R (2017) When worlds collide: integrating different counterfactual assumptions in fairness. In: NeurIPS, pp 6414–6423

  • Salimi B, Rodriguez L, Howe B, Suciu D (2019) Interventional fairness: causal database repair for algorithmic fairness. In: SIGMOD, pp 793–810

  • Sutskever I, Martens J, Dahl G, Hinton G (2013) On the importance of initialization and momentum in deep learning. In: ICML, pp 1139–1147

  • Tchetgen EJT, Phiri K (2014) Bounds for pure direct effect. Epidemiology 25(5):775

  • VanderWeele T (2015) Explanation in causal inference: methods for mediation and interaction. Oxford University Press

  • Wu Y, Zhang L, Wu X (2018) On discrimination discovery and removal in ranked data using causal graph. In: KDD, pp 2536–2544

  • Wu Y, Zhang L, Wu X (2019a) Counterfactual fairness: unidentification, bound and algorithm. In: IJCAI, pp 1438–1444

  • Wu Y, Zhang L, Wu X, Tong H (2019b) PC-fairness: a unified framework for measuring causality-based fairness. In: NeurIPS, pp 3399–3409

  • Xu D, Wu Y, Yuan S, Zhang L, Wu X (2019) Achieving causal fairness through generative adversarial networks. In: IJCAI, pp 1452–1458

  • Zhang J, Bareinboim E (2018a) Equality of opportunity in classification: a causal approach. In: NeurIPS, pp 3675–3685

  • Zhang J, Bareinboim E (2018b) Fairness in decision-making: the causal explanation formula. In: AAAI, pp 2037–2045

  • Zhang L, Wu X (2017) Anti-discrimination learning: a causal modeling-based framework. Int J Data Sci Anal 4(1):1–16

  • Zhang L, Wu Y, Wu X (2017) A causal framework for discovering and removing direct and indirect discrimination. In: IJCAI, pp 3929–3935

  • Zhang L, Wu Y, Wu X (2018) Causal modeling-based discrimination discovery and removal: criteria, bounds, and algorithms. IEEE Trans Knowl Data Eng 31(11):2035–2050


Acknowledgements

We are sincerely grateful to the anonymous reviewers for providing invaluable feedback. SS is supported by JST ERATO Grant Number JPMJER1903.

Author information

Corresponding author

Correspondence to Yoichi Chikahara.

Additional information

Responsible editor: Toon Calders, Salvatore Ruggieri, Bodo Rosenhahn, Mykola Pechenizkiy and Eirini Ntoutsi.


A preliminary version of this work appeared in Proceedings of AISTATS’21 (Chikahara et al. 2021).

Appendices

Differences from previous conference publication

This article is an expanded version of a previous conference publication (Chikahara et al. 2021). The following are the main differences from the conference publication:

  • We extended the framework described in the conference publication to address real-world scenarios where the true causal graph is uncertain (Sect. 6.2). To test this extended framework, we performed synthetic data experiments (Sect. 7.3.2) and confirmed that the performance of our extended framework is comparable to the original one with a true causal graph and better than the original one with a misspecified causal graph.

  • We improved the clarity of the paper by presenting a more detailed explanation of our fairness measure (Sect. 2). We first introduced two example real-world scenarios (i.e., Example 1 and Example 2) in Sect. 2.2. Then, based on these scenarios, we defined the SEM (Definition 2.1) in Sect. 2.3.1 and provided a formulation of potential outcomes and path-specific causal effects for each scenario in Sect. 2.3.2.

  • We enhanced the reliability of our experiments by conducting additional, extensive experiments (Sect. 7.2 and Sect. 7.3). In the conference publication (Chikahara et al. 2021), to confirm the robustness of our method, we used 10 randomly generated synthetic datasets and computed the mean and the standard deviation of the test accuracy and unfair effects; however, we did not report these means and standard deviations for the real-world data experiments. To improve trustworthiness, for all the experiments, we performed 20 runs by randomly splitting each dataset into training and test data and computed the mean and the standard deviation of the test accuracy and unfair effects.

Definition of path-specific causal effects

In this section, we provide a formal definition of path-specific causal effects (Avin et al. 2005).

Using interventional SEM \({\mathscr {M}}^p_{A=a}\) (\(a \in \{0, 1\}\)) defined in Sect. 2.3.1, we define path-specific causal effects (see the original paper (Avin et al. 2005) for details). As already described in Sect. 2.3.2, a path-specific causal effect is formulated as the difference between two potential outcomes, \(Y_{A \Leftarrow 0}\) and \(Y_{A \Leftarrow 1 \parallel \pi }\), i.e., \(Y_{A \Leftarrow 1 \parallel \pi }- Y_{A \Leftarrow 0}\).

To define potential outcome \(Y_{A \Leftarrow 0}\), we consider interventional SEM \({\mathscr {M}}^p_{A=0}\), which is obtained by simply performing intervention \(do(A=0)\). Suppose that this SEM expresses each variable \(V \in \{{{\varvec{X}}}, Y\}\) by the following structural equation:

$$\begin{aligned} V = f_V ({{\varvec{pa}}}(V)_{A=0}, {{\varvec{U}}}_V), \end{aligned}$$
(37)

where \({{\varvec{pa}}}(V)_{A=0}\) denotes variables \({{\varvec{pa}}}(V)\) (i.e., parents of variable V), whose values are determined by interventional SEM \({\mathscr {M}}^p_{A=0}\). Then potential outcome \(Y_{A \Leftarrow 0}\) is defined as prediction Y, whose structural equation is expressed by (37) where function \(f_Y\) is given by classifier \(h_{\theta }\).

By contrast, to define potential outcome \(Y_{A \Leftarrow 1 \parallel \pi }\), we need an SEM that is modified using interventional SEMs \({\mathscr {M}}^p_{A=0}\) and \({\mathscr {M}}^p_{A=1}\). To formulate this modified SEM, for each variable \(V \in \{{{\varvec{X}}}, Y\}\), we partition its parents \({{\varvec{pa}}}(V)\) into two subsets, \({{\varvec{pa}}}(V) = \{{{\varvec{pa}}}(V)^{\pi }, {{\varvec{pa}}}(V)^{{\overline{\pi }}}\}\), where \({{\varvec{pa}}}(V)^{\pi }\) is the set of members of \({{\varvec{pa}}}(V)\) that are connected to V along the unfair pathways \(\pi \), and \({{\varvec{pa}}}(V)^{{\overline{\pi }}}\) is its complement (i.e., \({{\varvec{pa}}}(V)^{{\overline{\pi }}} = {{\varvec{pa}}}(V) \backslash {{\varvec{pa}}}(V)^{\pi }\)). Based on these two subsets, we consider the following structural equation over \(V \in \{{{\varvec{X}}}, Y\}\):

$$\begin{aligned} V = f_V ({{\varvec{pa}}}(V)^{\pi }_{A=1}, {{\varvec{pa}}}(V)^{{\overline{\pi }}}_{A=0}, {{\varvec{U}}}_V), \end{aligned}$$
(38)

where \({{\varvec{pa}}}(V)^{\pi }_{A=1}\) is a set of the variables in \({{\varvec{pa}}}(V)^{\pi }\) whose values are determined by interventional model \({\mathscr {M}}^p_{A=1}\), and \({{\varvec{pa}}}(V)^{{\overline{\pi }}}_{A=0}\) is a set of the variables in \({{\varvec{pa}}}(V)^{{\overline{\pi }}}\) whose values are provided by \({\mathscr {M}}^p_{A=0}\). Then potential outcome \(Y_{A \Leftarrow 1 \parallel \pi }\) is defined as prediction Y, whose structural equation is represented by (38).
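
To make the construction concrete, the following is a minimal Python sketch of how (37) and (38) route the value of A differently along fair and unfair pathways, modeled loosely on the hiring scenario of Example 2 with unfair pathways \(\pi = \{A \rightarrow Y, A \rightarrow D \rightarrow Y\}\). The structural functions, noise distributions, and stand-in classifier are illustrative assumptions of this sketch, not the models used in the experiments.

```python
import numpy as np

# Illustration of Eqs. (37)-(38) on the hiring graph of Example 2
# (A: gender, Q: qualification, D: number of children, M: physical strength, Y: decision).
# f_D, f_M, h_theta, and the noise laws below are placeholders; only the way A is
# routed along fair vs. unfair pathways follows the definitions above.

rng = np.random.default_rng(0)

def f_D(a, q, u_d):          # parent A lies on an unfair pathway A -> D -> Y
    return a + np.floor(0.5 * q * u_d)

def f_M(a, q, u_m):          # parent A lies on the fair pathway A -> M -> Y
    return 3 * a + 0.4 * q * u_m

def h_theta(a, q, d, m):     # deterministic stand-in classifier
    return (-10 + 5 * a + q + d + m > 0).astype(int)

n = 5
Q = np.floor(rng.normal(2, 5, n))
U_D, U_M = rng.uniform(0.1, 3.0, n), rng.uniform(0.1, 3.0, n)

# Y_{A<=0}: every occurrence of A is fixed to the baseline 0 (Eq. (37)).
y_a0 = h_theta(0, Q, f_D(0, Q, U_D), f_M(0, Q, U_M))

# Y_{A<=1||pi}: A is set to 1 only along the unfair pathways pi = {A->Y, A->D->Y},
# while the fair pathway A -> M -> Y keeps the baseline value 0 (Eq. (38)).
y_a1_pi = h_theta(1, Q, f_D(1, Q, U_D), f_M(0, Q, U_M))

print(np.mean(y_a0 != y_a1_pi))  # empirical probability of individual unfairness
```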

Derivation of Eqs. (24) and (25)

Following the original paper (Huber 2014), we derived the following formulation of the existing estimators of the marginal potential outcome probabilities:

$$\begin{aligned} \begin{aligned} {\hat{p}}^{A \Leftarrow 0}_{\theta }&= \frac{1}{n} \sum _{i=1}^n {{\,\mathrm{{\textbf{1}}}\,}}(a_i = 0) \, {\hat{w}}^{A \Leftarrow 0}_i \, c_{\theta }(a_i, q_i, d_i, m_i), \quad \\ {\hat{p}}^{A \Leftarrow 1 \parallel \pi }_{\theta }&= \frac{1}{n} \sum _{i=1}^n {{\,\mathrm{{\textbf{1}}}\,}}(a_i = 1) \, {\hat{w}}^{A \Leftarrow 1 \parallel \pi }_i \, c_{\theta }(a_i, q_i, d_i, m_i), \end{aligned} \end{aligned}$$
(24)

where \(c_{\theta }({{\varvec{X}}}) = {{\,\mathrm{\textrm{P}}\,}}(Y=1 | {{\varvec{X}}})\) is the conditional distribution given by classifier \(h_{\theta }\), \({{\,\mathrm{{\textbf{1}}}\,}}(\cdot )\) is an indicator function, and \({\hat{w}}^{A \Leftarrow 0}_i\) and \({\hat{w}}^{A \Leftarrow 1 \parallel \pi }_i\) are the following weights:

$$\begin{aligned}&{\hat{w}}^{A \Leftarrow 0}_i= \frac{1}{{\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=0 | q_i)}, \\&{\hat{w}}^{A \Leftarrow 1 \parallel \pi }_i= \frac{{\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=1 | q_i, d_i) {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=0 | q_i, d_i, m_i)}{{\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=1 | q_i) {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=0 | q_i, d_i) {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=1 | q_i, d_i, m_i)}, \end{aligned}$$
(25)

where \({\hat{{{\,\mathrm{\textrm{P}}\,}}}}\) denotes a conditional distribution that is estimated beforehand by fitting statistical models (e.g., neural networks) to the training data.

Following the notation in the original paper (Huber 2014), let the potential outcomes be denoted by \(Y_{A \Leftarrow 0}= Y(0, D(0), M(0))\) and \(Y_{A \Leftarrow 1 \parallel \pi }= Y(1, D(1), M(0))\).

Then with the causal graph in Fig. 2c, marginal probability \({{\,\mathrm{\textrm{P}}\,}}(Y_{A \Leftarrow 1 \parallel \pi }= 1)\) can be written as

$$\begin{aligned}&{{\,\mathrm{\textrm{P}}\,}}(Y_{A \Leftarrow 1 \parallel \pi }= 1) \\&\quad = {{\,\mathrm{\textrm{P}}\,}}(Y(1, D(1), M(0) ) = 1) \\&\quad = {{\,\mathrm{{\mathbb {E}}}\,}}_{Q}[ {{\,\mathrm{{\mathbb {E}}}\,}}_{D(1) | Q}[ {{\,\mathrm{{\mathbb {E}}}\,}}_{M(0) | Q, D(1)} [ {{\,\mathrm{\textrm{P}}\,}}( Y(1, d, m) = 1 | A = 1, Q = q, D(1) = d, \\&\qquad M(0) = m ) ] ] ] . \end{aligned}$$

Using Assumption 2, this can be rewritten as

$$\begin{aligned} {{\,\mathrm{\textrm{P}}\,}}(Y_{A \Leftarrow 1 \parallel \pi }= 1)&= {{\,\mathrm{{\mathbb {E}}}\,}}_{Q}[ {{\,\mathrm{{\mathbb {E}}}\,}}_{D | A=1, Q}[ {{\,\mathrm{{\mathbb {E}}}\,}}_{M | A=0, Q, D} [ {{\,\mathrm{\textrm{P}}\,}}( Y(1, d, m) = 1 | A = 1,\\&\qquad Q = q, D = d, M = m ) ] ] ]. \end{aligned}$$

With Bayes’ theorem, this can be expressed as

$$\begin{aligned}&{{\,\mathrm{\textrm{P}}\,}}(Y_{A \Leftarrow 1 \parallel \pi }= 1) \\&\quad = {{\,\mathrm{{\mathbb {E}}}\,}}_{Q}[ {{\,\mathrm{{\mathbb {E}}}\,}}_{D | Q}[ {{\,\mathrm{{\mathbb {E}}}\,}}_{M | Q, D} [ {\omega }^{A \Leftarrow 1 \parallel \pi }{{\,\mathrm{\textrm{P}}\,}}( Y = 1 | A = 1, Q = q, D = d, M = m ) ] ] ], \end{aligned}$$

where \({\omega }^{A \Leftarrow 1 \parallel \pi }\) is expressed as follows:

$$\begin{aligned}&{\omega }^{A \Leftarrow 1 \parallel \pi }= \frac{{{\,\mathrm{\textrm{P}}\,}}(A=1 | Q=q, D=d){{\,\mathrm{\textrm{P}}\,}}(A=0 | Q=q, D=d, M=m)}{{{\,\mathrm{\textrm{P}}\,}}(A=1 | Q=q) {{\,\mathrm{\textrm{P}}\,}}(A=0 | Q=q, D=d)}. \end{aligned}$$

With indicator function \({{\,\mathrm{{\textbf{1}}}\,}}(\cdot )\), this can be formulated as

$$\begin{aligned}&{{\,\mathrm{\textrm{P}}\,}}(Y_{A \Leftarrow 1 \parallel \pi }= 1) = {{\,\mathrm{{\mathbb {E}}}\,}}[ {{\,\mathrm{{\textbf{1}}}\,}}(A=1) w^{A \Leftarrow 1 \parallel \pi }{{\,\mathrm{\textrm{P}}\,}}( Y = 1 | A = 1, q, d, m ) ] , \end{aligned}$$
(39)

where weight \(w^{A \Leftarrow 1 \parallel \pi }\) is expressed as

$$\begin{aligned} w^{A \Leftarrow 1 \parallel \pi }= \frac{1}{{{\,\mathrm{\textrm{P}}\,}}(A=1 | Q=q, D=d, M=m)} {\omega }^{A \Leftarrow 1 \parallel \pi }. \end{aligned}$$

In a similar manner, marginal probability \({{\,\mathrm{\textrm{P}}\,}}(Y_{A \Leftarrow 0}= 1)\) can be represented as

$$\begin{aligned}&{{\,\mathrm{\textrm{P}}\,}}(Y_{A \Leftarrow 0}= 1) = {{\,\mathrm{{\mathbb {E}}}\,}}[ {{\,\mathrm{{\textbf{1}}}\,}}(A=0) w^{A \Leftarrow 0}{{\,\mathrm{\textrm{P}}\,}}( Y = 1 | A = 0, q, d, m ) ] , \end{aligned}$$
(40)

where weight \(w^{A \Leftarrow 0}\) is formulated as

$$\begin{aligned} w^{A \Leftarrow 0}= \frac{1}{{{\,\mathrm{\textrm{P}}\,}}(A=0 | Q=q)}. \end{aligned}$$

Given the empirical distribution, by plugging conditional distribution \(c_{\theta }\) into \({{\,\mathrm{\textrm{P}}\,}}(Y=1 | A, Q=q, D=d, M=m)\), we can estimate (39) and (40) as (24).
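
As a minimal sketch, the plug-in estimation of (24) with the weights (25) can be written as follows, assuming the conditional probabilities of A have already been estimated. The callables p_a1_given_q, p_a1_given_qd, p_a1_given_qdm, and c_theta are hypothetical stand-ins for those fitted models and for the classifier's conditional distribution; they are assumptions of this sketch, not part of the method's interface.

```python
import numpy as np

# Plug-in estimators of Eqs. (24)-(25), given arrays a, q, d, m of length n and
# pre-fitted estimates of P(A=1 | .) plus the classifier's c_theta = P(Y=1 | X).

def estimate_marginals(a, q, d, m, p_a1_given_q, p_a1_given_qd, p_a1_given_qdm, c_theta):
    pa1_q   = p_a1_given_q(q)           # P-hat(A=1 | Q)
    pa1_qd  = p_a1_given_qd(q, d)       # P-hat(A=1 | Q, D)
    pa1_qdm = p_a1_given_qdm(q, d, m)   # P-hat(A=1 | Q, D, M)

    # Weights of Eq. (25); P-hat(A=0 | .) = 1 - P-hat(A=1 | .).
    w_a0 = 1.0 / (1.0 - pa1_q)
    w_a1_pi = (pa1_qd * (1.0 - pa1_qdm)) / (pa1_q * (1.0 - pa1_qd) * pa1_qdm)

    c = c_theta(a, q, d, m)             # classifier's P(Y=1 | A, Q, D, M)
    n = len(a)

    # Weighted averages of Eq. (24).
    p_hat_a0    = np.sum((a == 0) * w_a0 * c) / n
    p_hat_a1_pi = np.sum((a == 1) * w_a1_pi * c) / n
    return p_hat_a0, p_hat_a1_pi
```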

Computation time and convergence guarantee

We minimize the objective function (1) with the stochastic gradient descent method (Sutskever et al. 2013).

To do so, we need to compute the penalty term and its gradient over the samples in each minibatch. However, adding the penalty function to the loss function only slightly increases the cost of computing the objective function value and its gradient.

Whether the gradient descent method is guaranteed to converge depends on the choice of classifier \(h_{\theta }\). For instance, if we choose a neural network classifier, we cannot guarantee that the stochastic gradient descent method (Sutskever et al. 2013) converges: the objective function (1) becomes nonconvex, and its gradient is not Lipschitz continuous; that is, the maximum rate of change of the gradient is not bounded. However, if the neural network contains only activation functions whose gradients are Lipschitz continuous (e.g., the sigmoid function), the gradient of the objective function becomes locally Lipschitz continuous (Ferrera 2013, Chapter 2), and we can optimize the objective function with convergence guarantees using, e.g., the gradient sampling method (Burke et al. 2005).
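
For concreteness, a minimal PyTorch sketch of this penalized training loop is shown below. The exact penalty term of objective (1) is not reproduced here, so fairness_penalty is a placeholder; the data loader, learning rate, and penalty weight lam are likewise illustrative assumptions.

```python
import torch

# Sketch of minimizing loss + penalty with momentum SGD (Sutskever et al. 2013),
# assuming `model` outputs P(Y=1 | X) and `loader` yields (features, float labels).
# `fairness_penalty` stands in for the upper-bound term of objective (1).

def train(model, loader, fairness_penalty, lam=1.0, epochs=10, lr=0.01):
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    bce = torch.nn.BCELoss()
    for _ in range(epochs):
        for x, y in loader:                                   # one minibatch
            opt.zero_grad()
            loss = bce(model(x), y) + lam * fairness_penalty(model, x)
            loss.backward()                                   # gradient of loss + penalty
            opt.step()
    return model
```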

Proof of (29)

Since we have already proved the upper bound in (14), below we derive the lower bound. Since \(\alpha \) and \(\beta \) are marginal probabilities, we have

$$\begin{aligned} \begin{aligned} p_{10}+ p_{11}= \alpha , \quad p_{01}+ p_{11}= \beta , \end{aligned} \end{aligned}$$

which are equivalent to

$$\begin{aligned} \begin{aligned} p_{10}= \alpha - p_{11}, \quad p_{01}= \beta - p_{11}, \end{aligned} \end{aligned}$$

respectively. By summing up both, we have

$$\begin{aligned} p_{01}+ p_{10}= \alpha + \beta - 2 p_{11}. \end{aligned}$$

Since joint probability \(p_{11}\) can exceed neither marginal probability \(\alpha \) nor \(\beta \), we have \(p_{11}\le \min \{\alpha , \beta \}\). Therefore,

$$\begin{aligned}&p_{01}+ p_{10}\ge \alpha + \beta - 2 \min \{\alpha , \beta \} = |\alpha - \beta |. \end{aligned}$$
(41)

Combined with the upper bound on \(p_{01}+ p_{10}\) in (14), we obtain (29).
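
For instance, with the illustrative values \(\alpha = 0.7\) and \(\beta = 0.4\), the lower bound (41) reads

$$\begin{aligned} p_{11}\le \min \{0.7, 0.4\} = 0.4, \qquad p_{01}+ p_{10}= \alpha + \beta - 2 p_{11}\ge 0.7 + 0.4 - 2 \cdot 0.4 = 0.3 = |\alpha - \beta |. \end{aligned}$$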

Experimental settings

This section details the experimental settings presented in Sect. 7.

1.1 Settings in synthetic data experiments

1.1.1 Data

In synthetic data experiments, we used four datasets: Synth1, Synth2, Synth3, and Synth4.

Synth1 and Synth2 datasets were generated based on a scenario of hiring decisions for physically demanding jobs, described in Example 2 in Sect. 2.2, whose causal graph is shown in Fig. 2c. As described below, while the Synth2 dataset follows the functional assumption of the PSCF method, the Synth1 dataset does not.

To prepare the Synth1 dataset, we sampled gender \(A \in \{0, 1\}\), qualification Q, number of children D, physical strength M, and hiring decision outcome \(Y \in \{0, 1\}\) from the following SEM:

$$\begin{aligned} \begin{aligned}&A = U_A, \quad U_A \sim \textrm{Bernoulli}(0.6), \\&Q = \lfloor U_Q \rfloor , \quad U_Q \sim {\mathcal {N}}(2, 5^2), \\&D = A + \lfloor 0.5 Q U_D \rfloor , \quad U_D \sim \textrm{Tr}{\mathcal {N}}(1, 0.5^2, 0.1, 3.0),\\&M = 3 A + 0.4 Q U_M, \quad U_M \sim \textrm{Tr}{\mathcal {N}}(1.5, 0.5^2, 0.1, 3.0), \\&Y = h(A, Q, D, M), \end{aligned} \end{aligned}$$
(42)

where \(\textrm{Bernoulli}\), \({\mathcal {N}}\), and \(\textrm{Tr}{\mathcal {N}}\) represent the Bernoulli, Gaussian, and truncated Gaussian distributions, respectively, and \(\lfloor \cdot \rfloor \) is the floor function, which rounds its argument down to the nearest integer. To output hiring decision outcome Y, we used function h, a logistic regression model that provides the following conditional distribution:

$$\begin{aligned}&{{\,\mathrm{\textrm{P}}\,}}(Y=1|A, Q, D, M) = \textrm{Bernoulli}(\varsigma ( -10 + 5A + Q + D + M) ), \end{aligned}$$

where \(\varsigma (x)\) \(=\) \(1 / (1 + \textrm{exp}(-x))\) is a standard sigmoid function. Note that SEM (42) does not satisfy the functional assumption of the PSCF method because the structural equations over D and M are not expressed by additive noise models (Hoyer et al. 2009) due to multiplicative noises \(U_D\) and \(U_M\).
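
As a minimal sketch, the Synth1 dataset can be generated from SEM (42) as follows; the use of numpy/scipy and the fixed random seed are implementation assumptions of this sketch.

```python
import numpy as np
from scipy.stats import truncnorm

def trunc_normal(mean, std, low, high, size, rng):
    # Truncated Gaussian TrN(mean, std^2, low, high).
    a, b = (low - mean) / std, (high - mean) / std
    return truncnorm.rvs(a, b, loc=mean, scale=std, size=size, random_state=rng)

def generate_synth1(n, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.binomial(1, 0.6, n)                           # gender
    Q = np.floor(rng.normal(2, 5, n))                     # qualification
    U_D = trunc_normal(1.0, 0.5, 0.1, 3.0, n, rng)
    D = A + np.floor(0.5 * Q * U_D)                       # number of children
    U_M = trunc_normal(1.5, 0.5, 0.1, 3.0, n, rng)
    M = 3 * A + 0.4 * Q * U_M                             # physical strength
    p = 1.0 / (1.0 + np.exp(-(-10 + 5 * A + Q + D + M)))  # sigmoid of the logit
    Y = rng.binomial(1, p)                                # hiring decision outcome
    return A, Q, D, M, Y
```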

By contrast, the Synth2 dataset is generated using the following SEM, which follows the functional assumption of the PSCF method:

$$\begin{aligned} \begin{aligned}&A = U_A, \quad U_A \sim \textrm{Bernoulli}(0.6), \\&Q = \lfloor U_Q \rfloor , \quad U_Q \sim {\mathcal {N}}(5, 2.5^2), \\&D = A + \lfloor 0.1 Q + U_D \rfloor , \quad U_D \sim \lfloor {\mathcal {N}}(1, 0.5^2) \rfloor ,\\&M = 3 A + 0.4 Q + U_M, \quad U_M \sim \lfloor {\mathcal {N}}(1, 0.5^2) \rfloor , \\&Y = h(A, Q, D, M), \end{aligned} \end{aligned}$$
(43)

where function h is given by the following conditional distribution:

$$\begin{aligned}&{{\,\mathrm{\textrm{P}}\,}}(Y=1|A, Q, D, M)\\&\quad = \textrm{Bernoulli}(\varsigma ( -10 + 2A + 2Q + 2D + 2M) ). \end{aligned}$$

The Synth3 dataset is associated with the causal graph in Fig. 8, which contains latent confounder H. To prepare this dataset, we considered the following SEM:

$$\begin{aligned} \begin{aligned}&A = U_A, \quad U_A \sim \textrm{Bernoulli}(0.6), \\&R = 3 A + \lfloor 10 H \rfloor + \lfloor U_R \rfloor , \quad U_R \sim {\mathcal {N}}(1, 0.5^2),\\&M = A + R + \lfloor U_M \rfloor , \quad U_M \sim {\mathcal {N}}(1, 0.5^2), \\&Y = h(A, R, M, H), \end{aligned} \end{aligned}$$
(44)

where H denotes a latent confounder, which is sampled by \(H \sim {\mathcal {N}}(1, 0.5^2)\), and function h is expressed by the following conditional distribution:

$$\begin{aligned} {{\,\mathrm{\textrm{P}}\,}}(Y=1|A, R, M, H) = \textrm{Bernoulli}(\varsigma ( -15 + 3A + R + M + 5 H) ). \end{aligned}$$

We generated the Synth4 dataset using the following SEM:

$$\begin{aligned} \begin{aligned}&A = U_A, \quad U_A \sim \textrm{Bernoulli}(0.6), \\&Q_1 = \lfloor U_{Q_1} \rfloor , \quad U_{Q_1} \sim {\mathcal {N}}(2, 1^2), \\&Q_2 = \lfloor U_{Q_2} \rfloor , \quad U_{Q_2} \sim {\mathcal {N}}(2, 1^2), \\&Q_3 = A + \lfloor U_{Q_3} \rfloor , \quad U_{Q_3} \sim {\mathcal {N}}(0, 1^2), \\&D = A + \lfloor 0.1 (Q_1 + Q_2) U_D \rfloor , \quad U_D \sim {\mathcal {N}}(1, 0.5^2), \\&M = 2A + \lfloor 0.01 \textrm{exp}(Q_1) + 0.2 (Q_2 + Q_3) \rfloor + \lfloor U_{M} \rfloor , \quad U_M \sim {\mathcal {N}}(1, 1^2), \\&Y = h(A, Q_1, Q_2, Q_3, D, M), \end{aligned} \end{aligned}$$
(45)

where function h is given by the following conditional distribution:

$$\begin{aligned} {{\,\mathrm{\textrm{P}}\,}}(Y=1|A, Q_1, Q_2, Q_3, D, M)&= \textrm{Bernoulli}(\varsigma ( -5 + 2A + 0.5(Q_1 + Q_2 + Q_3) \\ {}&\quad + 0.5D + 2M) ). \end{aligned}$$

1.1.2 Computing unfair effects

With such synthetic data, we computed the four statistics of unfair effects for Proposed, FIO, and Unconstrained as follows.

To compute (i) the mean unfair effect and (iii) the upper bound on PIU, we estimated marginal potential outcome probabilities \({\hat{p}}^{A \Leftarrow 0}_{\theta }\) and \({\hat{p}}^{A \Leftarrow 1 \parallel \pi }_{\theta }\) with estimators (24).

To obtain (ii) the standard deviation in the conditional mean unfair effects and (iv) the PIU value, we sampled potential outcomes \(Y_{A \Leftarrow 0}\) and \(Y_{A \Leftarrow 1 \parallel \pi }\) based on the SEM. For instance, in case of SEM (42), we sampled \(Y_{A \Leftarrow 0}\) from the following (interventional) SEM:

$$\begin{aligned} \begin{aligned}&A = 0, \\&Q = \lfloor U_Q \rfloor , \quad U_Q \sim {\mathcal {N}}(2, 5^2), \\&D(0) = \lfloor 0.5 Q U_D \rfloor , \quad U_D \sim \textrm{Tr}{\mathcal {N}}(2, 1^2, 0.1, 3.0),\\&M(0) = 0.4 Q U_M, \quad U_M \sim \textrm{Tr}{\mathcal {N}}(3, 2^2, 0.1, 3.0), \\&Y_{A \Leftarrow 0}= h_{\theta }(0, Q, D(0), M(0)), \end{aligned} \end{aligned}$$

where \(h_{\theta }\) is the classifier. We sampled \(Y_{A \Leftarrow 1 \parallel \pi }\) from

$$\begin{aligned} \begin{aligned}&A = 1, \\&Q = \lfloor U_Q \rfloor , \quad U_Q \sim {\mathcal {N}}(2, 5^2), \\&D(1) = 1 + \lfloor 0.5 Q U_D \rfloor , \quad U_D \sim \textrm{Tr}{\mathcal {N}}(2, 1^2, 0.1, 3.0),\\&M(0) = 0.4 Q U_M, \quad U_M \sim \textrm{Tr}{\mathcal {N}}(3, 2^2, 0.1, 3.0), \\&Y_{A \Leftarrow 1 \parallel \pi }= h_{\theta }(1, Q, D(1), M(0)). \end{aligned} \end{aligned}$$

Then using n pairs of these samples \(\{(y_{A\Leftarrow 0, i}, y_{A \Leftarrow 1 \parallel \pi , i})\}_{i=1}^n\), we evaluated the PIU value by

$$\begin{aligned} {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(Y_{A \Leftarrow 0}\ne Y_{A \Leftarrow 1 \parallel \pi }) = \frac{1}{n} \sum _{i=1}^n {{\,\mathrm{{\textbf{1}}}\,}}( y_{A\Leftarrow 0, i} \ne y_{A \Leftarrow 1 \parallel \pi , i} ), \end{aligned}$$

where \({{\,\mathrm{{\textbf{1}}}\,}}(\cdot )\) is an indicator function that takes 1 if \(y_{A\Leftarrow 0, i} \ne y_{A \Leftarrow 1 \parallel \pi , i} \) (i \(\in \) \(\{1, \ldots , n\}\)) and 0 otherwise.

We computed the standard deviation in conditional mean unfair effects as follows. We separated the n individuals into K subgroups, each consisting of individuals with identical feature values \({{\varvec{X}}}\), took the mean unfair effect over the individuals in each subgroup, and computed the standard deviation across subgroups. Let the individuals in the k-th subgroup (\(k = 1, \ldots , K\)) have identical feature attributes \({{\varvec{X}}}\) \(=\) \({{\varvec{x}}}^k\), where superscript k represents the k-th subgroup. Then using \(\{(y_{A\Leftarrow 0, i}, y_{A \Leftarrow 1 \parallel \pi , i})\}_{i=1}^n\), we estimated the standard deviation of the conditional mean unfair effects over the K subgroups as

$$\begin{aligned} {\hat{\sigma }} = \sqrt{ \frac{1}{K} \sum _{k=1}^K \left( {\hat{\mu }}^k - {\hat{\mu }} \right) ^2 }. \end{aligned}$$
(46)

Here \({\hat{\mu }}^k\) is the estimated conditional mean unfair effect in the k-th subgroup of individuals with identical attributes \({{\varvec{X}}}\) \(=\) \({{\varvec{x}}}^k\), i.e.,

$$\begin{aligned} {\hat{\mu }}^k = \frac{1}{n^k} \sum _{i \in \{1, \ldots , n\} | {{\varvec{x}}}_i = {{\varvec{x}}}^k} {{\,\mathrm{{\textbf{1}}}\,}}(y_{A\Leftarrow 0, i}\ne y_{A \Leftarrow 1 \parallel \pi , i}), \end{aligned}$$

where \(n^k\) is the number of individuals in the k-th subgroup and \({\hat{\mu }}\) is the mean of \({\hat{\mu }}^k\) over \(k=1, \ldots , K\), i.e.,

$$\begin{aligned} {\hat{\mu }} = \frac{1}{K} \sum _{k=1}^K {\hat{\mu }}^k. \end{aligned}$$
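
As a small sketch, statistics (iv) and (ii) can be computed from the paired potential-outcome samples as follows; grouping individuals by identical feature vectors via numpy is an implementation assumption.

```python
import numpy as np

# Sketch of the PIU estimate and the standard deviation (46) of conditional
# mean unfair effects, given arrays y_a0, y_a1_pi of potential-outcome samples
# and the feature matrix X used to form the K subgroups.

def piu(y_a0, y_a1_pi):
    # Fraction of individuals whose two potential outcomes disagree.
    return np.mean(y_a0 != y_a1_pi)

def conditional_std(y_a0, y_a1_pi, X):
    disagree = (y_a0 != y_a1_pi).astype(float)
    # Assign each individual to the subgroup of its identical feature vector x^k.
    _, group = np.unique(X, axis=0, return_inverse=True)
    K = group.max() + 1
    mu_k = np.array([disagree[group == k].mean() for k in range(K)])  # hat-mu^k
    return np.sqrt(np.mean((mu_k - mu_k.mean()) ** 2))                # Eq. (46)
```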

1.1.3 Unfair effects of PSCF

With PSCF, we did not evaluate the two statistics of unfair effects, (i) the mean unfair effect and (iii) the upper bound on PIU, since they are not well-defined for this method.

These statistics measure the unfairness of the learned predictive model of Y (i.e., classifier \(h_{\theta }\) in our method); by contrast, PSCF aims to ensure fairness while relying on predictive models that are themselves unfair.

To do so, PSCF approximates the SEM by learning a predictive model for each variable in \({{\,\mathrm{\textrm{P}}\,}}({{\textbf {X}}}, Y)\); these models are unfair due to the discriminatory bias in the observed data, and PSCF removes this unfairness by sampling fair feature values based on the approximated SEM.

For instance, in the case of synthetic data experiments, PSCF approximates the SEM (42) as follows. Using latent variable \(H_D\), PSCF learns the predictive models of A, Q, D, M, and Y and the distribution of \(H_D\), which are expressed as follows:

$$\begin{aligned} \begin{aligned}&A \sim {{\,\mathrm{\textrm{P}}\,}}_{\theta }(A), \\&Q \sim {{\,\mathrm{\textrm{P}}\,}}_{\theta }(Q), \\&D \sim {{\,\mathrm{\textrm{P}}\,}}_{\theta }(D | A, Q, H_D), \quad H_D \sim {{\,\mathrm{\textrm{P}}\,}}_{\theta }(H_D),\\&M \sim {{\,\mathrm{\textrm{P}}\,}}_{\theta }(M | A, Q),\\&Y \sim {{\,\mathrm{\textrm{P}}\,}}_{\theta }(Y | A, Q, D, M), \end{aligned} \end{aligned}$$
(47)

where \({{\,\mathrm{\textrm{P}}\,}}_{\theta }\) denotes a (conditional) distribution, which is parameterized as a neural network. Here latent variable \(H_D\) approximates the additive noise in the structural equation over D.

By using the approximated SEM (47), PSCF aims to make fair predictions. For each individual i \(\in \) \(\{1, \ldots , n\}\) with attributes \(\{a_i, q_i, d_i, m_i\}\), PSCF samples their fair attribute of D by

$$\begin{aligned} {\hat{d}}(0)_i \sim {{\,\mathrm{\textrm{P}}\,}}_{\theta }(D | A=0, q_i, h_{D, i}), \quad h_{D, i} \sim {{\,\mathrm{\textrm{P}}\,}}_{\theta }(H_D). \end{aligned}$$
(48)

Using this attribute, PSCF makes a prediction using the following Monte Carlo estimate:

$$\begin{aligned}&{\hat{y}}^{PSCF}_i = \frac{1}{J} \sum _{j=1}^J {\hat{y}}^{PSCF}_{i, j} \quad \text {where}\quad {\hat{y}}^{PSCF}_{i, j} \sim {{\,\mathrm{\textrm{P}}\,}}_{\theta }(Y | A=0, q_i, {\hat{d}}(0)_i, m_i). \end{aligned}$$
(49)

Here J is the number of Monte Carlo samples, which is set to \(J=5\) in our experiments.
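
A minimal sketch of the sampling steps (48) and (49) is given below; sample_H, sample_D, and p_Y are hypothetical stand-ins for the learned distributions \({{\,\mathrm{\textrm{P}}\,}}_{\theta }(H_D)\), \({{\,\mathrm{\textrm{P}}\,}}_{\theta }(D | A, Q, H_D)\), and \({{\,\mathrm{\textrm{P}}\,}}_{\theta }(Y=1 | A, Q, D, M)\) in (47), and are assumptions of this sketch.

```python
import numpy as np

# Sketch of the PSCF prediction for one individual with attributes (a_i, q_i, d_i, m_i).

def pscf_predict(q_i, m_i, sample_H, sample_D, p_Y, J=5, rng=None):
    rng = rng or np.random.default_rng()
    samples = []
    for _ in range(J):
        h_d = sample_H()                                  # h_{D,i} ~ P_theta(H_D)
        d_fair = sample_D(0, q_i, h_d)                    # fair attribute of D, Eq. (48)
        y = rng.binomial(1, p_Y(0, q_i, d_fair, m_i))     # draw from P_theta(Y | A=0, ...)
        samples.append(y)
    return np.mean(samples)                               # Monte Carlo estimate, Eq. (49)
```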

According to Chiappa and Gillam (2019), if the mismatch between the approximated and true SEMs (i.e., (47) and (42)) is only slight, PSCF can eliminate the conditional mean unfair effect and achieve individual-level fairness. Intuitively, this is because in (48) and (49), A’s values are fixed in the approximated structural equations over Y and D, which involve unfair pathways \(\pi = \{A \rightarrow Y, A \rightarrow D \rightarrow Y\}\).

In this way, PSCF aims to achieve fairness by sampling the fair feature values using unfair predictive models. Therefore, we cannot measure the unfairness of this method using the two statistics (i) and (iii), which measure the unfairness of the predictive model.

By contrast, it is appropriate to measure unfairness based on two other statistics, i.e., (ii) the standard deviation in the conditional mean unfair effects and (iv) the PIU value. Since both are formulated using the true SEM, they can be used to quantify the unfairness due to SEM’s approximation error.

To compute the unfairness of PSCF using these two statistics, we made a prediction in the same way as (49), except that we used the true SEM (42). Let such predicted values be \(\{y^{PSCF}_i\}_{i=1}^n\). Then using n pairs of predicted values \(\{(y^{PSCF}_{i}, {\hat{y}}^{PSCF}_{i})\}_{i=1}^n\), we estimated (iv) the PIU value as

$$\begin{aligned} {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(Y_{A \Leftarrow 0}\ne Y_{A \Leftarrow 1 \parallel \pi }) = \frac{1}{n} \sum _{i=1}^n {{\,\mathrm{{\textbf{1}}}\,}}( y^{PSCF}_{i} \ne {\hat{y}}^{PSCF}_{i} ), \end{aligned}$$

and (ii) the standard deviation of conditional mean unfair effects in the same way as (46) except that \({\hat{\mu }}^k\) is estimated as

$$\begin{aligned} {\hat{\mu }}^k = \frac{1}{n^k} \sum _{i \in \{1, \ldots , n\} | {{\varvec{x}}}_i = {{\varvec{x}}}^k} {{\,\mathrm{{\textbf{1}}}\,}}(y^{PSCF}_{i}\ne {\hat{y}}^{PSCF}_{i}). \end{aligned}$$

1.2 Settings in real-world data experiments

In real-world data experiments, to evaluate the two statistics of unfair effects (i.e., (i) and (iii)), we computed the marginal probabilities of potential outcomes \({\hat{p}}^{A \Leftarrow 0}_{\theta }\) and \({\hat{p}}^{A \Leftarrow 1 \parallel \pi }_{\theta }\) based on the existing estimators (Huber 2014).

With the German credit dataset, using the attributes of n individuals \(\{a_i, {{\varvec{c}}}_i, {{\varvec{s}}}_i, {{\varvec{r}}}_i\}_{i=1}^n\), we estimated the marginal probabilities as

$$\begin{aligned}&{\hat{p}}^{A \Leftarrow 0}_{\theta }= \frac{1}{n} \sum _{i=1}^n {{\,\mathrm{{\textbf{1}}}\,}}(a_i = 0) {\hat{w}}^{A \Leftarrow 0}_ic_{\theta }(a_i, {{\varvec{c}}}_i, {{\varvec{s}}}_i, {{\varvec{r}}}_i), \nonumber \\&{\hat{p}}^{A \Leftarrow 1 \parallel \pi }_{\theta }= \frac{1}{n} \sum _{i=1}^n {{\,\mathrm{{\textbf{1}}}\,}}(a_i = 1) {\hat{w}}^{A \Leftarrow 1 \parallel \pi }_ic_{\theta }(a_i, {{\varvec{c}}}_i, {{\varvec{s}}}_i, {{\varvec{r}}}_i), \end{aligned}$$
(50)

respectively, where weights \({\hat{w}}^{A \Leftarrow 0}_i\) and \({\hat{w}}^{A \Leftarrow 1 \parallel \pi }_i\) are expressed as

$$\begin{aligned} {\hat{w}}^{A \Leftarrow 0}_i= \frac{1}{{\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=0 | {{\varvec{c}}}_i)}, \quad {\hat{w}}^{A \Leftarrow 1 \parallel \pi }_i= \frac{{\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=1 | {{\varvec{c}}}_i, {{\varvec{s}}}_i) {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=0 | {{\varvec{c}}}_i, {{\varvec{s}}}_i, {{\varvec{r}}}_i)}{{\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=1 | {{\varvec{c}}}_i) {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=0 | {{\varvec{c}}}_i, {{\varvec{s}}}_i) {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=1 | {{\varvec{c}}}_i, {{\varvec{s}}}_i, {{\varvec{r}}}_i)}. \end{aligned}$$

To compute these weights, we inferred the conditional probabilities of A by fitting the logistic regression model to the training data beforehand.
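
As a minimal sketch under these definitions, the conditional probabilities of A and the resulting weights can be computed as follows; the use of scikit-learn's LogisticRegression, and fitting on the arrays passed in rather than on a separate training split as described above, are simplifying assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of the weight computation for the German credit dataset.
# A is the binary sensitive feature; C, S, R are the feature blocks c_i, s_i, r_i.

def fit_prob(A, *features):
    X = np.column_stack(features)
    clf = LogisticRegression(max_iter=1000).fit(X, A)
    return lambda *f: clf.predict_proba(np.column_stack(f))[:, 1]  # P-hat(A=1 | .)

def weights(A, C, S, R):
    p_c   = fit_prob(A, C)(C)          # P-hat(A=1 | C)
    p_cs  = fit_prob(A, C, S)(C, S)    # P-hat(A=1 | C, S)
    p_csr = fit_prob(A, C, S, R)(C, S, R)  # P-hat(A=1 | C, S, R)
    w_a0 = 1.0 / (1.0 - p_c)
    w_a1_pi = (p_cs * (1.0 - p_csr)) / (p_c * (1.0 - p_cs) * p_csr)
    return w_a0, w_a1_pi
```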

For the Adult dataset, given the attributes of n individuals \(\{a_i, m_i, l_i, {{\textbf {r}}}_i, {{\varvec{c}}}_i\}_{i=1}^n\), we estimated the marginal probabilities as

$$\begin{aligned}&{\hat{p}}^{A \Leftarrow 0}_{\theta }= \frac{1}{n} \sum _{i=1}^n {{\,\mathrm{{\textbf{1}}}\,}}(a_i = 0) {\hat{w}}^{A \Leftarrow 0}_ic_{\theta }(0, m_i, l_i, {{\varvec{r}}}_i, {{\varvec{c}}}_i), \\&{\hat{p}}^{A \Leftarrow 1 \parallel \pi }_{\theta }= \frac{1}{n} \sum _{i=1}^n {{\,\mathrm{{\textbf{1}}}\,}}(a_i = 1) {\hat{w}}^{A \Leftarrow 1 \parallel \pi }_ic_{\theta }(1, m_i, l_i, {{\varvec{r}}}_i, {{\varvec{c}}}_i), \end{aligned}$$

where weights \({\hat{w}}^{A \Leftarrow 0}_i\) and \({\hat{w}}^{A \Leftarrow 1 \parallel \pi }_i\) are provided by

$$\begin{aligned}&{\hat{w}}^{A \Leftarrow 0}_i= \frac{1}{{\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=0|{{\varvec{c}}}_i)}, \\&{\hat{w}}^{A \Leftarrow 1 \parallel \pi }_i= \frac{{\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=1|m_i, {{\varvec{c}}}_i) {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=0|m_i,l_i, {{\varvec{r}}}_i, {{\varvec{c}}}_i) }{{\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=1|{{\varvec{c}}}_i) {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=0|m_i, {{\varvec{c}}}_i) {\hat{{{\,\mathrm{\textrm{P}}\,}}}}(A=1|m_i,l_i, {{\varvec{r}}}_i, {{\varvec{c}}}_i) }. \end{aligned}$$

To obtain these weight values, we estimated each conditional probability of A by fitting the logistic regression model to the data beforehand.

1.3 Computing infrastructure

In our experiments, we used Python 3.6.8 with PyTorch 1.6.0 as an implementation of the optimization algorithm (Sutskever et al. 2013) and a 64-bit CentOS machine with 2.6GHz Xeon E5-2697A-v4 (x2) CPUs and 512-GB RAM.

Cite this article

Chikahara, Y., Sakaue, S., Fujino, A. et al. Making individually fair predictions with causal pathways. Data Min Knowl Disc 37, 1327–1373 (2023). https://doi.org/10.1007/s10618-022-00885-6

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-022-00885-6

Keywords

Navigation