1 Introduction

Contemporary machine learning (ML) techniques surpass traditional statistical methods in predictive power and in their ability to process a large number of attributes. However, these ML algorithms generate models with a complex structure, which makes their outputs difficult to interpret precisely. Another important issue is that a highly accurate predictive model may lack fairness, producing outputs that lead to discriminatory outcomes for protected subgroups. Thus, it is imperative to design predictive systems that are not only accurate but also achieve the desired level of fairness.

When used in certain contexts, predictive models, and strategies that rely on such models, are subject to laws and regulations that ensure fairness. For instance, a hiring process in the United States (US) must comply with the Equal Employment Opportunity Act (EEOA 1972). Similarly, financial institutions (FI) in the US that are in the business of extending credit to applicants are subject to the Equal Credit Opportunity Act (ECOA 1974), the Fair Housing Act (FHA 1968), and other fair lending laws. These laws often specify protected attributes that FIs must consider when maintaining fairness in lending decisions.

Examples of protected attributes include race, gender, age, ethnicity, national origin, marital status, and others. Under the ECOA, for example, it is unlawful for a creditor to discriminate against an applicant for a loan on the basis of race, gender or age. Even though direct usage of protected attributes in building a model is often prohibited by law (e.g. overt discrimination), some otherwise benign attributes can serve as “proxies” because they may share dependencies with a protected attribute. For this reason, it is crucial for data scientists to conduct a fairness review of their trained models in consultation with compliance professionals in order to evaluate the predictive modeling system for potential unfairness. In this paper, we develop a fairness interpretability framework to aid in this important task.

At an algorithmic level, bias can be viewed as an ability to differentiate between two subpopulations at the level of data or outcomes. Regardless of its definition, if bias is present in data when training an ML model, the ability to differentiate between subgroups might potentially lead to discriminatory outcomes. For this reason, the model bias can be viewed as a measure of unfairness and hence its measurement is central to the model fairness assessment.

There is a comprehensive body of research on ML fairness that discusses bias measurements and mitigation methodologies. Kamiran et al. (2009) introduced a classification scheme for learning unbiased models by modifying biased data sets without direct knowledge of the protected attribute. Kamishima et al. (2012) proposed a regularization approach for discriminative probabilistic models. Zemel et al. (2013) designed an optimization problem that incorporates fairness constraints. Feldman et al. (2015) proposed a geometric repair scheme to remove disparate impact in classifiers by making data sets unbiased. Hardt et al. (2015) introduced post-processing techniques that remove discrimination in classifiers based on the equalized odds and equal opportunity fairness criteria. Woodworth et al. (2017) designed a framework for learning nearly-optimal predictors under an equalized odds fairness constraint. Zhang et al. (2018) proposed the use of adversarial learning to mitigate bias, and Jiang (2020) suggested a bias correction technique via re-weighting the data.

The work of Dwork et al. (2012) studies Lipschitz randomized classifiers and their statistical parity bias. It establishes a bound on that bias in terms of a transport-like distance between the input subpopulation distributions. The bound aids in constructing an optimal Lipschitz classifier with control over the statistical parity bias by transporting one of the subpopulation input datasets into the other. The work of Gordaliza et al. (2019) establishes a similar bound for non-randomized classifiers in terms of the total variation distance between input subpopulation distributions. Guided by the bound and utilizing optimal transport theory, their method focuses on repairing input datasets in a way that allows for control of the total variation distance, and hence of the statistical parity bias.

Though the bounds in the aforementioned works are of theoretical and practical importance, they provide little information on how each component of the input contributes to the bias in the output. The main reason for that is that the bias from the inputs propagates through the model structure in a non-trivial way. For this reason, in our work, we focus on designing a fairness interpretability framework that evaluates how each predictor contributes to the model bias, incorporating the predictor’s favorability with respect to protected (or minority) class into the framework. The construction is carried out by employing optimal transport theory and game-theoretic techniques.

Another issue regarding the ML fairness literature is that it mainly focuses on classifiers. Specifically, given the data (X, G, Y), where \(X\in \mathbb {R}^n\) are predictors, \(G\in \{0,1\}\) is a protected attribute and \(Y\in \{0,1\}\) is a binary output variable, with favorable outcome \(Y=1\), the bias measurements are often based on fairness criteria such as statistical parity, which reads \(\mathbb {P}(\hat{Y}=1|G=0)=\mathbb {P}(\hat{Y}=1|G=1)\), or alternative criteria such as equalized odds and equal opportunity (Feldman et al. 2015; Hardt et al. 2015).

Many models in the financial industry, however, are regressors \(f=\mathbb {E}[Y|X]\). In turn, classification models are usually obtained by thresholding the regressor, \(Y_t(X)=\mathbbm {1}_{\{f(X)>t\}}\), but the thresholds are in general not chosen during the model development stage. Instead, data scientists evaluate the classification score \(f(X)=\widehat{\mathbb {P}}(Y=1|X)\) based on its overall performance across all thresholds. The same is true for the fairness assessment, which is conducted at the level of the whole classification score. The main reason for this is that the strategies and decision-making procedures in FIs may rely on the classification score or its distribution, not on a single classifier with a fixed threshold. This motivates us to measure and explain the bias exclusively in the regressor model.

Our interpretability framework in principle can be applied to a wide range of predictive ML systems. For instance, it can provide insight into predictor attributions for models that appear in economics, social sciences, medicine, and other fields.

Another application of the framework is bias mitigation under regulatory constraints. In FIs, bias mitigation methodologies that require explicit consideration of protected class status in the training or prediction stages are not acceptable in view of the ECOA. Consequently, bias mitigation methods such as those in Dwork et al. (2012); Feldman et al. (2015); Gordaliza et al. (2019) are not feasible. However, a probabilistic proxy model for a protected attribute G, such as the Bayesian Improved Surname Geocoding (BISG), is allowed to be used for fairness assessment and subsequent post-processing (Elliot et al. 2009; Hall et al. 2021); for an alternative proxy model, see Chen et al. (2019). This setup allows for the use of our framework in the following regulatory-compliant fashion:

  1. (S1)

    Given a model f and the proxy protected attribute \(\tilde{G}\), perform a fairness assessment by measuring the bias across the subpopulation distributions \(f(X)|\tilde{G}=k\), \(k\in \{0,1\}\).

  2. (S2)

    If the model bias exceeds a certain threshold, determine the main drivers for the bias, that is, determine the list of predictors \(X_{i_1}, X_{i_2},\dots ,X_{i_r}\) contributing the most to that bias.

  3. (S3)

    Mitigate the bias by constructing a post-processed model \(\tilde{f}(X; f)\) utilizing the information on the most biased predictors \(\{X_{i_1},X_{i_2},\dots ,X_{i_r}\}\) and without the direct use of \(\tilde{G}\) or any information on the joint distribution \((X,\tilde{G})\).

In this article, the interpretability framework we develop addresses steps (S1) and (S2). The post-processing methods (S3) are investigated in our companion paper Miroshnikov et al. (2021b). In what follows, we provide a summary of the key ideas and main results.


Problem setup. We consider the joint distribution (X, G, Y), where \(X\in \mathbb {R}^n\) are predictors, \(G \in \{0,1\}\) is the protected attribute, with the non-protected class \(G=0\), and Y is either a response variable with values in \(\mathbb {R}\) (not necessarily a continuous random variable) or a binary one with values in \(\{0,1\}\). We denote a trained model by \(f(x)=\widehat{\mathbb {E}}[Y|X=x]\), assumed to be trained on (X, Y) without access to G. We assume that there is a predetermined favorable model direction, denoted by \(\uparrow\) or \(\downarrow\); if the favorable direction is \(\uparrow\) then the relationship \(f(x)>f(y)\) favors the input x, and if it is \(\downarrow\) the input y. In the case of binary \(Y\in \{0,1\}\), the favorable direction \(\uparrow\) is equivalent to \(Y=1\) being the favorable outcome, and \(\downarrow\) to \(Y=0\). To simplify the exposition, the main text focuses on the case of a binary protected attribute G. However, the framework and all of the results in the article extend naturally to the multi-labeled case.


Key components of the framework

  • Motivated by optimal transport theory, we focus on the bias measurement in the model output via the Wasserstein metric \(W_1\)

    $$\begin{aligned} \mathrm{Bias}_{W_1}(f|G) = \inf _{\pi \in \mathscr {P}(\mathbb {R}^2)} \Big \{ \int _{\mathbb {R}^2} |x_1-x_2| \, d\pi (x_1,x_2), \,\, \hbox { with marginals}\ P_{f(X)|G=0}, P_{f(X)|G=1} \Big \}, \end{aligned}$$

    which measures the minimal cost of transporting one distribution into another; see Santambrogio (2015). More importantly, we introduce the model bias decomposition into the sum of the positive and negative model biases, \(\mathrm{Bias}_{W_1}^{\pm }(f|G)\), which measure the transport effort for moving points of the unprotected subpopulation distribution \(f(X)|G=0\) in the non-favorable and favorable directions, respectively. This allows us to obtain a more informed perspective on the predictor’s impact; see Sects. 3.2 and 3.3.

  • We establish the connection of the model bias with that of a classifier. We show that the positive and negative model bias can be viewed as the integrated statistical parity bias over the family of classifiers induced by the regressor. This integral relationship is then used to construct an extended family of transport metrics for regressor bias. Via integration, these metrics incorporate generic group parity fairness criteria for classifiers induced by the given regressor. Furthermore, we prove a more general version of (Dwork et al. 2012, Theorem 3.3) that establishes the connection between the Wasserstein-based bias and the randomized classifier-based bias; see Sects. 3.3 and 3.4.

  • We introduce bias predictor attributions called bias explanations in order to understand how predictors contribute to the model bias. The bias explanation \(\beta _i\) of predictor \(X_i\) is computed as the cost of transporting the distribution of \(E_i|G=0\) to that of \(E_i|G=1\), where \(E_i(X;f)\) quantifies the contribution of \(X_i\) to the model value. The transport theory gives rise to the decomposition \(\beta _i=\beta _i^++\beta _i^-\) into the sum of positive and negative model bias explanations. Roughly speaking, \(\beta _i^+\) quantifies the combined predictor contribution to the increase of the positive model bias and decrease in the negative model bias, and vice versa for \(\beta _i^-\); see Sect. 4.3.

  • The bias explanations are in general not additive, even if the predictor explanations are. To construct additive bias explanations and to better capture interactions at the distribution level, we employ a cooperative game theory approach motivated by the ideas of Štrumbelj and Kononenko (2010). We design a cooperative bias game \(v^{bias}\) which evaluates the bias in the model attributed to coalitions \(X_S\), \(S \subset \{1,\dots ,n\}\), and define bias explanations via the Shapley value \(\varphi [v^{bias}]\), which yields additivity. A similar approach is applied to construct additive positive and negative bias explanations; see Sect. 4.5.

  • We choose to design the bias explanations based upon model explainers \(E_i\) that are either conditional or marginal expectations, or game-theoretic explainers in the form of the Shapley value \(\varphi [v]\) where v is either a conditional game \(v^{ C\!E}\) or a marginal game \(v^{ M\!E}\). For each \(v \in \{v^{ C\!E},v^{ M\!E}\}\) we perform the stability analysis of non-additive and additive bias explanations. By adapting the grouping techniques from Miroshnikov et al. (2021a), we reduce the complexity of game-theoretic bias explanations and unite marginal and conditional approaches; see Sects. 4.4, 4.5 and 4.6.

Structure of the paper. In Sect. 2, we introduce the requisite notation and fairness criteria for classifiers, and discuss ML fairness literature related to our work. In Sect. 3, we introduce the Wasserstein-based regressor bias and investigate its properties. In addition, we discuss a wide class of transport metrics that could be used for fairness assessment. In Sect. 4, we provide a theoretical characterization of the bias explanations and investigate their properties. In Sect. 5 we discuss some regulatory aspects of bias mitigation, and present an application of the framework to a UCI dataset. In Appendix A, we discuss the Kantorovich transport problem. In Appendix B, we state and prove auxiliary lemmas.

2 Preliminaries

2.1 Notation and hypotheses

We consider the joint distribution (X, G, Y), where \(X=(X_1,X_2,\dots ,X_n) \in \mathbb {R}^n\) are the predictors, \(G\in \{0,1,\dots ,K-1\}\) is the protected attribute and Y is either a response variable with values in \(\mathbb {R}\) (not necessarily a continuous random variable) or a binary one with values in \(\{0,1\}\). We encode the non-protected class as \(G=0\) and assume that all random variables are defined on the common probability space \((\Omega ,\mathscr{F},\mathbb {P})\), where \(\Omega\) is a sample space, \(\mathbb {P}\) a probability measure, and \({\mathscr{F}}\) a \(\sigma\)-algebra of sets.

The true model and a trained one, which is assumed to be trained without access to G, are denoted by

$$\begin{aligned} f(X)=\mathbb {E}[Y|X] \quad \text {and} \quad \hat{f}(X)=\widehat{\mathbb {E}}[Y|X], \end{aligned}$$

respectively. In the case of binary Y they read \(f(X)=\mathbb {P}(Y=1|X)\) and \(\hat{f}(X)=\widehat{\mathbb {P}}(Y=1|X)\). We denote a classifier based on the trained model by

$$\begin{aligned} \widehat{Y}_t=\widehat{Y}_t(X;\hat{f})=\mathbbm {1}_{\{\hat{f}(X)>t\}}, \quad t \in \mathbb {R}. \end{aligned}$$

The subpopulation cumulative distribution function (CDF) of \(\hat{f}(X)|G=k\) is denoted by

$$\begin{aligned} F_k(t)=F_{\hat{f}(X)|G=k}(t)=\mathbb {P}(\hat{f}(X)\le t|G=k) \end{aligned}$$

and the corresponding generalized inverse (or quantile function) \(F_k^{[-1]}\) is defined by:

$$\begin{aligned} F_k^{[-1]}(p)=F_{\hat{f}(X)|G=k}^{[-1]}(p)=\inf _{x \in \mathbb {R}}\big \{ p \le F_k(x) \big \}. \end{aligned}$$

We assume that there is a predetermined favorable model direction, denoted by either \(\uparrow\) or \(\downarrow\). If the favorable direction is \(\uparrow\) then the relationship \(f(x)>f(z)\) favors the input x, and if it is \(\downarrow\) the input z. The sign of the favorable direction of f is denoted by \(\varsigma _{f}\) and satisfies

$$\begin{aligned} \varsigma _{f} = \left\{ \begin{aligned} 1,&\quad \hbox {if the favorable direction of } f \hbox { is } \uparrow \\ -1,&\quad \hbox {if the favorable direction of } f \hbox { is } \downarrow . \end{aligned} \right. \end{aligned}$$

In the case of binary Y, the favorable direction \(\uparrow\) is equivalent to \(Y=1\) being a favorable outcome, and \(\downarrow\) to \(Y=0\); see Sect. 2.4.

In what follows we first develop the framework in the context of the binary protected attribute \(G\in \{0,1\}\) and then extend it to the case of the multi-labeled protected attribute; see Sect. 3.4.
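For concreteness, the following minimal sketch (not part of the original paper; variable names and the simulated score are illustrative) computes empirical versions of the subpopulation CDFs \(F_k\) and their generalized inverses \(F_k^{[-1]}\) from samples of a trained score.

```python
import numpy as np

def subpop_cdf(scores, g, k):
    """Empirical CDF F_k(t) = P(f_hat(X) <= t | G = k)."""
    s = np.sort(np.asarray(scores)[np.asarray(g) == k])
    return lambda t: np.searchsorted(s, t, side="right") / len(s)

def subpop_quantile(scores, g, k):
    """Generalized inverse F_k^{[-1]}(p) = inf{x : p <= F_k(x)}."""
    s = np.sort(np.asarray(scores)[np.asarray(g) == k])
    n = len(s)
    def q(p):
        idx = np.ceil(np.asarray(p) * n).astype(int) - 1
        return s[np.clip(idx, 0, n - 1)]
    return q

# illustrative usage with a synthetic score and protected attribute
rng = np.random.default_rng(0)
g = rng.integers(0, 2, size=10_000)
scores = rng.beta(2 + g, 5 - 2 * g)          # hypothetical classification score
F0, Q1 = subpop_cdf(scores, g, 0), subpop_quantile(scores, g, 1)
print(F0(0.3), Q1(0.5))
```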

2.2 Fairness criteria for classifiers

When undesired biases concerning demographic groups (or protected attributes) are in the training data, well-trained models will reflect those biases. There have been numerous articles devoted to ML systems that lead to fair decisions. In these works, various measurements for fairness have been suggested. In what follows, we describe several well-known definitions which help measure fairness of classifiers.

Definition 1

Suppose that Y is binary with values in \(\{0,1\}\) and \(Y=1\) is the favorable outcome. Let \(\widehat{Y}\) be a classifier.

  • \(\widehat{Y}\) satisfies statistical parity (Feldman et al. 2015) if

    $$\begin{aligned} \mathbb {P}(\widehat{Y}=1|G=0) = \mathbb {P}(\widehat{Y}=1|G=1). \end{aligned}$$
  • \(\widehat{Y}\) satisfies equalized odds (Hardt et al. 2015) if

    $$\begin{aligned} \mathbb {P}(\widehat{Y}=1|Y=y,G=0) = \mathbb {P}(\widehat{Y}=1|Y=y,G=1), \quad y\in \{0,1\}. \end{aligned}$$
  • \(\widehat{Y}\) satisfies equal opportunity (Hardt et al. 2015) if

    $$\begin{aligned} \mathbb {P}(\widehat{Y}=1|Y=1,G=0) = \mathbb {P}(\widehat{Y}=1|Y=1,G=1). \end{aligned}$$
  • The balanced error rate (BER) of \(\widehat{Y}\) (Feldman et al. 2015) is given by

    $$\begin{aligned} BER(\widehat{Y}, G) = \tfrac{1}{2}(\mathbb {P}(\widehat{Y}=1|G=0) + \mathbb {P}(\widehat{Y}=0|G=1)). \end{aligned}$$

Statistical parity requires that the proportions of individuals in the favorable class \(\widehat{Y}=1\) within each group \(G=k\), \(k\in \{0,1\}\), are the same. The equalized odds constraint requires the classifier to have the same error rates across the classes of the protected attribute G within each label Y. The equal opportunity constraint requires the error rates to be the same across the classes \(G=k\) only for the individuals labeled \(Y=1\). The BER is the average class-conditioned error rate of \(\widehat{Y}\). The criteria reduce to simple group-conditional averages of the classifier output, as illustrated in the sketch below.
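A minimal sketch (illustrative, not from the paper) of these quantities, assuming 0/1 numpy arrays y_hat, y, g with \(G=0\) the non-protected class and \(Y=1\) the favorable outcome:

```python
import numpy as np

def fairness_gaps(y_hat, y, g):
    """Group-fairness gaps of a binary classifier per Definition 1."""
    rate = lambda mask: y_hat[mask].mean()              # P(Y_hat = 1 | mask)
    return {
        "statistical_parity": rate(g == 0) - rate(g == 1),
        "equal_opportunity": rate((g == 0) & (y == 1)) - rate((g == 1) & (y == 1)),
        "equalized_odds": [rate((g == 0) & (y == v)) - rate((g == 1) & (y == v))
                           for v in (0, 1)],
        # balanced error rate of Y_hat with respect to G, cf. Definition 1
        "BER": 0.5 * (rate(g == 0) + 1.0 - rate(g == 1)),
    }
```

A classifier satisfies the corresponding criterion when the respective gap vanishes (and \(BER=0.5\) indicates that G cannot be inferred from \(\widehat{Y}\)).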

2.3 Group classifier fairness example

There are numerous reasons why a trained classifier may lead to unfair outcomes. To illustrate, we provide an instructive example that shows how predictors and labels, as well as their relationship with the protected attribute, affect classifier fairness.

Consider a data set (X, Y, G) where the predictor X depends on \(G \in \{0,1\}\), \(Y\in \{0,1\}\) is binary, with favorable outcome \(Y=0\), and the classification score f depends explicitly on X only:

$$\begin{aligned}&X \sim N(\mu -a\cdot G,\sqrt{\mu }), \quad \mu =5, \, a=1\\&Y \sim Bernoulli(f(X)), \quad f(X)=\mathbb {P}(Y=1|X)={logistic(\mu -X)}. \qquad \qquad (\hbox {M1}) \end{aligned}$$

The data set is constructed in such a way that the proportions of \(Y=0|G=k\) in the two groups are different: \(\mathbb {P}(Y=0|G=0)=0.5\), \(\mathbb {P}(Y=0|G=1) = 0.36\). The predictor X serves as a good proxy for G, which can be seen in Fig. 1a. The plot depicts the density of X and the conditional densities of X given \(G=0\) and \(G=1\), respectively. The shifted conditional densities clearly show the dependence of X on G. Though the true score f(X) does not depend explicitly on G, a classifier trained on X will learn that the higher the value of X the more likely it is that \(Y=0\). Using the logistic regression model \(\hat{f}\) we observe that for any threshold \(t \in (0,1)\) the classifier \(\widehat{Y}_{t}\) satisfies neither the statistical parity, nor the equal opportunity, nor the equalized odds criterion. Furthermore, since both classes of G are equally likely, \(BER(\widehat{Y}_t,G)<0.5\) implies that one can potentially infer G from X; see Fig. 1b. The vertical axis in the plot represents the difference between the probabilities for each of the first three fairness metrics described in Definition 1 as well as the value of the balanced error rate. Notice how only in the trivial cases where \(t\in \{0,1\}\) are all metrics satisfied and the balanced error rate is equal to 0.5, since \(\widehat{Y}_0 = 1, \widehat{Y}_1 = 0\) for all X.
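The setup (M1) is easy to reproduce numerically. The sketch below (illustrative; a different random seed than the one behind the paper's figures) simulates the data, fits a logistic regression score, and evaluates the statistical parity gap of the induced classifiers at a few thresholds.

```python
import numpy as np
from scipy.special import expit as logistic
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, mu, a = 100_000, 5.0, 1.0

G = rng.integers(0, 2, size=n)
X = rng.normal(mu - a * G, np.sqrt(mu))        # X | G=k  ~  N(mu - a*k, sqrt(mu))
f_true = logistic(mu - X)                      # f(X) = P(Y = 1 | X)
Y = rng.binomial(1, f_true)

# proportions of Y = 0 per group, approximately 0.50 and 0.36 as in the text
print([round(1 - Y[G == k].mean(), 2) for k in (0, 1)])

score = LogisticRegression().fit(X[:, None], Y).predict_proba(X[:, None])[:, 1]
for t in (0.3, 0.5, 0.7):                      # statistical parity gap of Y_t
    print(t, (score[G == 0] > t).mean() - (score[G == 1] > t).mean())
```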

Fig. 1: Predictor distributions and fairness for the model (M1), \(\varsigma _f=-1\)

2.4 Classifier bias based on statistical parity

In this section we provide a definition for classifier bias based on the statistical parity fairness criterion and establish some basic properties of the classifier bias. In what follows, we suppress the symbol \(\,\hat{}\,\) , using it only when it is necessary to differentiate between the true model and the trained one. The same rule applies to classifiers.

Definition 2

Let f be a model, \(X\in \mathbb {R}^n\) predictors, \(G \in \{0,1\}\) protected attribute, \(G=0\) non-protected class, \(\varsigma _f\) the sign of the favorable direction, and \(F_k\) the CDF of \(f(X)|G=k\).

  • The signed classifier (or statistical parity) bias for a threshold \(t \in \mathbb {R}\) is defined by

    $$\begin{aligned} \begin{aligned} \widetilde{bias} ^{C}_t(f|X,G)&= \big ( \mathbb {P}(Y_t=\mathbbm {1}_{\{\varsigma _f=1\}}|G=0)-\mathbb {P}(Y_t=\mathbbm {1}_{\{\varsigma _f=1\}}|G=1) \big )\\&=\big ( F_1(t)-F_0(t) \big ) \cdot \varsigma _f. \end{aligned} \end{aligned}$$
  • The classifier bias at \(t \in \mathbb {R}\) is defined by

    $$\begin{aligned} \begin{aligned} bias ^C_t(f|X,G)=| \widetilde{bias} ^C_t(f|X,G)|. \end{aligned} \end{aligned}$$

We say that \(Y_t\) favors the non-protected class \(G=0\) if the signed bias is positive. Respectively, \(Y_t\) favors the protected class \(G=1\) if the signed bias is negative.

Remark 1

Suppose that \(Y \in \{0,1\}\) is binary and that the favorable direction is \(\uparrow\), which implies that \(\mathbbm {1}_{\{\varsigma _f=1\}}=1\). Then \(Y_t\) favors the non-protected class \(G=0\) if and only if there is a larger proportion of individuals from class \(G=0\) for which \(Y_t=1\) compared to the class \(G=1\). This, from a statistical parity perspective, describes the outcome \(Y=1\) as favorable. Similar remarks apply to the case when the favorable direction is \(\downarrow\). Thus, the favorable direction being \(\uparrow\) (\(\downarrow\)) is equivalent to the favorable outcome being \(Y=1\) (\(Y=0\)).
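Empirically, the signed classifier bias of Definition 2 is the difference of the subpopulation CDFs at the threshold, multiplied by the sign of the favorable direction. A minimal sketch with illustrative names, assuming numpy arrays of scores and group labels:

```python
import numpy as np

def signed_classifier_bias(scores, g, t, sign_f=1):
    """(F_1(t) - F_0(t)) * varsigma_f; positive values mean Y_t favors G = 0."""
    F0 = (scores[g == 0] <= t).mean()
    F1 = (scores[g == 1] <= t).mean()
    return (F1 - F0) * sign_f

def classifier_bias(scores, g, t, sign_f=1):
    return abs(signed_classifier_bias(scores, g, t, sign_f))
```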

2.5 Quantile bias and geometric parity

Given a model f and a threshold \(t \in \mathbb {R}\), the classifier bias based on statistical parity measures the difference in the proportions of the two groups \(G\in \{0,1\}\) for which \(Y_t=0\). This measurement, however, does not take into account the geometry of the model distribution, that is, the score values themselves.

For example, when measuring the bias in incomes among ‘females’ and ‘males’ one can view the difference of expected incomes in the two groups as ‘bias’. Alternatively, one can measure an income bias by evaluating the absolute difference of the ‘female’ median income and ‘male’ median income, which is often done in various social studies. This motivates us to take into account the geometry of the score distribution when defining bias. For this reason, we propose the notion of the quantile bias which operates on the domain of the score rather than the sample space.

Definition 3

Let \(f,X,G,\varsigma _f\) and \(F_k\) be as in Definition 2. Let \(p \in (0,1)\).

  • The signed p-th quantile bias is defined by

    $$\begin{aligned} \widetilde{bias} ^{Q}_p(f|X,G) = \big ( F^{[-1]}_0(p)-F^{[-1]}_1(p) \big ) \cdot \varsigma _f \end{aligned}$$
  • The p-th quantile bias is defined by

    $$\begin{aligned} \begin{aligned} bias ^Q_p(f|X,G)=| \widetilde{bias} ^Q_p(f|X,G)|. \end{aligned} \end{aligned}$$

As a counterpart to statistical parity, we also introduce quantile (geometric) parity.

Definition 4

(geometric parity) Let f be a model and \(G \in \{0,1\}\) the protected attribute.

  • We say that the model f satisfies p-th quantile (or geometric) parity if

    $$\begin{aligned} bias ^Q_{p}(f|X,G)=0. \end{aligned}$$
  • Let \(t \in \mathbb {R}\). The classifier \(Y_t\) satisfies quantile (or geometric) parity if

    $$\begin{aligned} bias ^Q_{p_0}(f|X,G)=0, \quad p_0=F_0(t). \end{aligned}$$

Given a score f, the quantile bias measures the difference between the subpopulation quantile values. For a given threshold t, the signed \(p_0\)-quantile bias, with \(p_0=F_0(t)\), measures by how much the corresponding score values of the protected class \(G=1\) differ from those of \(G=0\), or equivalently, by how much the threshold for the protected group should be shifted to achieve quantile parity (and in some cases statistical parity) between the two populations.
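A corresponding sketch for the p-th quantile bias of Definition 3 and its threshold-shift reading (illustrative; empirical quantiles via NumPy's inverted_cdf method, NumPy ≥ 1.22, stand in for \(F_k^{[-1]}\)):

```python
import numpy as np

def signed_quantile_bias(scores, g, p, sign_f=1):
    """(F_0^{[-1]}(p) - F_1^{[-1]}(p)) * varsigma_f, with empirical quantiles."""
    q0 = np.quantile(scores[g == 0], p, method="inverted_cdf")
    q1 = np.quantile(scores[g == 1], p, method="inverted_cdf")
    return (q0 - q1) * sign_f

def threshold_shift_for_parity(scores, g, t):
    """Shift of the threshold for G = 1 that equalizes the subpopulation rates:
    F_1^{[-1]}(F_0(t)) - t, i.e. the quantile bias read at p_0 = F_0(t)."""
    p0 = (scores[g == 0] <= t).mean()
    return np.quantile(scores[g == 1], p0, method="inverted_cdf") - t
```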

Lemma 1

Let f be a model, \(G \in \{0,1\}\) the protected attribute, and \(G=0\) the non-protected class. Suppose that \(t_0 \in \mathbb {R}\) is a point at which the CDFs \(F_0\) and \(F_1\) are continuous and strictly increasing. Then \(Y_{t_0}\) satisfies statistical parity if and only if it satisfies geometric parity.

Proof

The result follows from Definitions 2 and 3, and the fact that \(F_0\) and \(F_1\) are locally invertible at \(t_0\). \(\square\)

To better understand the classifier and quantile biases and their connection, see Fig. 2a. The conditional CDFs of the model scores are plotted given the protected attribute G. The blue line (corresponding to the scores given \(G=0\)) is above the red line (scores given \(G=1\)) for all values of t. Thus, for a given threshold \(t_0\) we have that \(F_0(t_0)-F_1(t_0)>0\), which means that if the favorable direction is \(\uparrow\) (\(\downarrow\)) then the classifier favors the class \(G=1\) (\(G=0\)). In view of the quantile bias, the green horizontal line segment represents the amount we would have to shift the threshold for one of the classes in order to achieve geometric parity. Since the CDFs are shown to be continuous and strictly increasing, the above lemma implies that doing so would achieve statistical parity as well.

Fig. 2: Classifier and quantile bias, and model bias for the model (M1)

2.6 Optimal transport use in ML classifier fairness

2.6.1 Classifier bias mitigation via repaired datasets

Two notable works that utilize optimal transport theory to reduce statistical parity bias are Feldman et al. (2015) and Gordaliza et al. (2019).

The approach in Feldman et al. (2015) seeks to create an unbiased dataset by transforming predictors and then training a classifier on it. The authors propose a geometric repair scheme, which partially moves the two subpopulation distributions \(\mu _{i,0}\) and \(\mu _{i,1}\) of predictor \(X_i\) along the Wasserstein geodesic towards their (unidimensional) Wasserstein barycenter \(\mu _{i,B}\), a distribution minimizing the variance of the collection \(\{\mu _{i,0}, \mu _{i,1}\}\); see Appendix A. The transformed dataset is then used to train a model that reduces disparate impact.

Gordaliza et al. (2019) proposes a method for transforming the multivariate distribution of predictors called random repair. Given two subpopulation distributions of predictors \(\mu _k = P_{X|G=k}\), with \(k\in \{0,1\}\), and a repair parameter \(\lambda \in [0,1]\), the algorithm randomly chooses between the Wasserstein barycenter \(\mu _B\) of \(\{\mu _0,\mu _1\}\) and the original subpopulation distribution \(\mu _k\), with \(\lambda\) determining the probability of selecting \(\mu _B\).

The authors establish an upper bound on the disparate impact (DI) and balanced error rate (BER) of classifiers with respect to (X, G) using the total variation distance between the subpopulation distributions of predictors,

$$\begin{aligned} \min _h BER(h,X,G) = \tfrac{1}{2}(1-d_{TV}(\mu _0,\mu _1)), \end{aligned}$$

and show that the TV-distance between the repaired subpopulation distributions \(\tilde{\mu }_{0,\lambda }\), \(\tilde{\mu }_{1,\lambda }\) is bounded by \(1-\lambda\). This, in turn, allows one to control the bound on the DI and BER; hence the closely related statistical parity bias on the repaired dataset is bounded by

$$\begin{aligned} \max _h bias ^C(h|G,\tilde{X}_{\lambda }) = d_{TV} (\tilde{\mu }_{0,\lambda },\tilde{\mu }_{1,\lambda })\le 1-\lambda . \end{aligned}$$

Unlike the geometric repair approach, the random repair algorithm allows for tight control of the TV-distance between the repaired subpopulations. The authors also bound the loss in performance due to modifying the data in terms of the Wasserstein distance between the two subpopulation distributions of predictors. The performance loss is expressed as the difference in classification risk between the repaired and original data on (X, G).

Remark 2

Given the regulatory constraints, the approaches of Feldman et al. (2015) and Gordaliza et al. (2019) would not be permitted in financial institutions that extend credit because a) the protected attribute cannot be used in training or prediction, and b) introducing randomness into the input dataset is prohibited; for details see (Hall et al. 2021). To take into account the regulatory constraints and practical applications, in our companion paper (Miroshnikov et al. 2021b) we propose a post-processing approach that relies on the fairness interpretability framework presented in the current article.

2.6.2 Individual fairness

The work of Dwork et al. (2012) studies individual fairness of randomized classifiers. To understand the main results of the article, we first provide relevant definitions.

Definition 5

Let \((\mathscr{X},d)\) be a metric space and D a distance on \(\mathscr {P}(\{0,1\})\).

(i):

A map \(M: \mathscr{X}\rightarrow \mathscr {P}(\{0,1\})\) is called a randomized classifier.

(ii):

\(Lip_1(\mathscr{X},\mathscr {P}(\{0,1\});d,D)=\{M: \mathscr{X}\rightarrow \mathscr {P}(\{0,1\}), \, D(M(x),M(y))\le d(x,y) \}\).

(iii):

Given \(\nu \in \mathscr {P}(\mathscr{X})\) the averaged \(M_{\nu }\) is defined by \(M_{\nu }(a)=\mathbb {E}_{x \sim \nu } [M(x)(a)]\), \(a \subset \{0,1\}\).

(iv):

The distance \(D_{rc}\) between \(\mu ,\nu \in \mathscr {P}(\mathscr{X})\) is defined by

$$\begin{aligned} D_{rc}(\mu ,\nu ; D,d) := \sup \Big \{ M_{\mu }(\{0\})-M_{\nu }(\{0\}),\,\, M \in Lip_1(\mathscr{X},\mathscr {P}(\{0,1\}); d,D) \Big \}\in [0,1]. \end{aligned}$$

Individual fairness is defined by imposing a Lipschitz property on the map \(x \rightarrow M(x)\in \mathscr {P}(\{0,1\})\), \(x\in \mathscr{X}\). As in Gordaliza et al. (2019), the work of Dwork et al. (2012) relates the bias in the output to the bias in the input. In particular, the paper establishes the upper bound \(D_{TV} (M_{P_0},M_{P_1})\le D_{rc}(P_0,P_1)\) for the statistical parity bias of Lipschitz randomized classifiers. Roughly speaking, the above bound means that when two subpopulations \(P_0,P_1\) are “similar” in the sense of the \(D_{rc}\) metric, then the Lipschitz condition ensures that the statistical parity bias is small.

The \(D_{rc}\) metric has transport-like properties and is related to the Wasserstein metric; see (Dwork et al. 2012, Theorem 3.3) and Theorem 3 in Sect. 3.5.

3 Model bias metric

In our work we shift the focus from measuring the bias in classifiers to measuring the bias in regressor outputs. This is motivated by the fact that many strategies and decisions in the real world make use of the regressor values or the classification scores of trained ML models. Furthermore, in the case of classification scores, the bias assessment in FIs is carried out before any classifier threshold is determined.

In this section, we discuss how to measure the regressor bias using optimal transport. We also establish the connection between the regressor bias and the bias in the collection of classifiers induced by thresholding the regressor, and make use of this integral relationship to design generic regressor fairness metrics that incorporate group-based parity criteria, such as equalized odds (Hardt et al. 2015), into the transport formulation.

Definition 6

(D-model bias) Let \(X\in \mathbb {R}^n\) be predictors, f be a model, and \(G\in \{0,1\}\) the protected attribute. Let \(D(\cdot ,\cdot )\) be a metric on the space of probability measures \(\mathscr {P}_q(\mathbb {R})\), with \(q \ge 0\). Provided \(\mathbb {E}[|f(X)|^q]\) is finite, the D-based model bias is defined as the distance between the subpopulation distributions of the model:

$$\begin{aligned} \mathrm{Bias}_D(f|X,G) := D(P_{f(X)|G=0},P_{f(X)|G=1}), \end{aligned}$$
(1)

where \(P_{f(X)|G=k}\) is the pushforward probability measure of \(f(X)|G=k\). We say that the model (X, f) is fair up to the D-based bias \(\epsilon \ge 0\) if \(\mathrm{Bias}_{D}(f|X,G)\le \epsilon\).

Figure 2b illustrates the model bias for two choices of D: the 1-Wasserstein metric \(W_1\) and the Kolmogorov-Smirnov distance KS. Notice the stark difference between the two model biases. This raises the general question of which metric one should use to evaluate the bias. We discuss this issue in the following section.

In what follows we suppress the explicit dependence of the model bias on X.

3.1 Wasserstein distance

Determining an appropriate metric D to be used in (1) is not a trivial task. The choice depends on the context in which the model bias is measured. We argue that it is desirable for the metric to have the following properties:

  1. (P1)

    It should be continuous with respect to the change in the geometry of the distribution.

  2. (P2)

    It should be non-invariant with respect to monotone transformations of the distributions.

The property (P1) makes sure that the metric keeps track of changes in the geometry. For instance, suppose an “income” of the group \(\{G=0\}\) is \(x_0\) and that of \(\{G=1\}\) is \(x_1\). A metric that measures income inequality should be able to sense the distance between \(x_0\) and \(x_0+\varepsilon\). That is, having two delta measures \(\delta _{x_0}\) and \(\delta _{x_0+\varepsilon }\) the metric must ensure that as \(\varepsilon \rightarrow 0\) the distance \(D(\delta _{x_0},\delta _{x_0+\varepsilon })\) approaches zero. The property (P1) also makes sure that slight changes in the subpopulation distributions lead to a slight change in bias measurements, which is important for stability with respect to changes in the dataset X.

The property (P2) makes sure that the metric is non-invariant with respect to monotone transformations. That is, given two random variables \(X_0\) and \(X_1\) and a continuous, strictly increasing transformation \(T:\mathbb {R}\rightarrow \mathbb {R}\), one would expect the distance between \(T(X_0)\) and \(T(X_1)\) to change whenever T is not a shift. For example, if \(T(x)=\alpha x\), we would expect the distance between \(T(X_0)=\alpha X_0\) and \(T(X_1)=\alpha X_1\) to depend continuously on \(\alpha\).

In what follows, we consider the Wasserstein distance \(W_q\) as a potential candidate for fairness interpretability; for use cases in the ML fairness community see Dwork et al. (2012); Feldman et al. (2015); Gordaliza et al. (2019).

To introduce the metric and investigate its properties we switch our focus to probability measures; recall that any random variable Z gives rise to the pushforward probability measure \(P_Z(A)=\mathbb{P}(Z \in A)\) on \(\mathbb {R}\) and, conversely, for any \(\mu \in \mathscr {P}(\mathbb {R})\) with CDF \(F_{\mu }(a)=\mu ((-\infty ,a])\) there is a random variable Z such that \(P_Z=\mu\). Similar remarks apply to random vectors; see Shiryaev (1980). Given \(T:\mathbb {R}^k \rightarrow \mathbb {R}^m\) and \(\mu \in \mathscr {P}(\mathbb {R}^k)\), we denote by \(T_{\#}\mu\) the measure such that \(T_{\#}\mu (B)=\mu \big (T^{-1}(B)\big )\).

The Wasserstein distance \(W_q\) is connected to the concept of optimal mass transport. Given two probability measures \(\mu _1,\mu _2 \in \mathscr {P}_q(\mathbb {R})\) with finite q-th moment and the cost function \(c(x_1,x_2)=|x_1-x_2|^q\), the Wasserstein distance \(W_q\) is defined by

$$\begin{aligned} \begin{aligned} W_q(\mu _1,\mu _2) := \mathscr {T}^{1/q}_{|x_1-x_2|^q}(\mu _1,\mu _2) \end{aligned} \end{aligned}$$

where

$$\begin{aligned} \mathscr {T}_{|x_1-x_2|^q}(\mu _1,\mu _2) = \inf _{\gamma \in \mathscr {P}(\mathbb {R}^2)} \bigg \{ \int _{\mathbb {R}^2} |x_1-x_2|^q \, d\gamma (x_1,x_2), \,\, \hbox { with marginals}\ \mu _1,\mu _2 \bigg \} \end{aligned}$$

is the minimal cost of transporting the distribution \(\mu _1\) into \(\mu _2\), and vice versa in view of the symmetry of the cost function. A joint probability measure \(\gamma \in \mathscr {P}(\mathbb {R}^2)\) with marginals \(\mu _1\) and \(\mu _2\) is called a transport plan. It specifies how each point \(x_1\) from \(\mathrm{supp(\mu _1)}\) gets distributed in the course of the transportation; specifically, the transport of \(x_1\) is described by the conditional probability measure \(\gamma _{x_2|x_1}\).

It can be shown that the Wasserstein metric for probability measures on \(\mathbb {R}\) can be expressed in terms of the quantile functions

$$\begin{aligned} W_q(\mu _1,\mu _2) = \bigg ( \int _0^1 |F_{\mu _1}^{[-1]}(p)-F_{\mu _2}^{[-1]}(p)|^q\, dp \bigg )^{1/q}, \end{aligned}$$
(2)

which makes the computation straightforward; see Theorem 7.
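Formula (2) gives a direct way to estimate \(W_q\) from samples. A minimal sketch (illustrative discretization of the p-integral; empirical quantiles via NumPy's inverted_cdf method, NumPy ≥ 1.22):

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wq_from_quantiles(x1, x2, q=1, grid=4000):
    """W_q between the empirical laws of x1 and x2 via the quantile formula (2)."""
    p = (np.arange(grid) + 0.5) / grid                  # midpoint rule on (0, 1)
    q1 = np.quantile(x1, p, method="inverted_cdf")
    q2 = np.quantile(x2, p, method="inverted_cdf")
    return np.mean(np.abs(q1 - q2) ** q) ** (1.0 / q)

rng = np.random.default_rng(0)
a, b = rng.normal(0, 1, 50_000), rng.normal(1, 2, 50_000)
# for q = 1 this should nearly agree with scipy's empirical W1
print(wq_from_quantiles(a, b, q=1), wasserstein_distance(a, b))
```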

To get an understanding of the behavior of \(W_q\), consider two delta measures located at \(x_0\) and \(x_0+\varepsilon\), respectively. By definition of the metric it follows that

$$\begin{aligned} W_q(\delta _{x_0},\delta _{x_0+\varepsilon })=\varepsilon . \end{aligned}$$

Thus, \(W_q\) is continuous with respect to a shift of a point mass. Furthermore, for any two random variables \(X_0\) and \(X_1\) and \(\alpha > 0\)

$$\begin{aligned} \begin{aligned} W_q(P_{\alpha X_0},P_{\alpha X_1})=\alpha W_q(P_{X_0},P_{X_1}) \end{aligned} \end{aligned}$$

which implies that a multiplicative map \(T(x)=\alpha x\) affects the Wasserstein distance.

To formally show that properties (P1) and (P2) are satisfied by the Wasserstein metric, we provide the following theorem.

Theorem 1

The distance \(W_q\) satisfies:

  1. (a)

    \(W_q\) on \(\mathscr {P}_q(\mathbb {R})\) is continuous with respect to the geometry of the distribution.

  2. (b)

    Let \(T:\mathbb {R}\rightarrow \mathbb {R}\) be a continuous, strictly increasing map. \(W_q\) is non-invariant under T, provided, \(T(x) \ne x+C\) and \(T_{\#}\mu \in \mathscr {P}_q(\mathbb {R})\), \(\mu \in \mathscr {P}_q(\mathbb {R})\) .

Proof

See Appendix B. \(\square\)

Theorem 1 states that the Wasserstein metric relies on the geometry of the distribution. In particular, the distance is affected in a continuous way by a change in the geometry of the distribution. This provides the desired sensitivity of the Wasserstein metric to slight changes in the dataset distribution, including shifts, which is relevant for ML models with ragged CDFs, and it makes the Wasserstein metric an appropriate candidate for measuring model bias. In addition, as we will see, the Wasserstein distance enables us to assess favorability at the level of the model, which is useful for applications in financial institutions.

3.2 Negative and positive flows under order preserving optimal transport plan

We now provide several properties of the Wasserstein metric, which we employ in the following sections.

Given two probability measures \(\mu _1,\mu _2 \in \mathscr {P}_q(\mathbb {R})\), it can be shown that the joint probability measure \(\pi ^* \in \mathscr {P}(\mathbb {R}^2)\) with the CDF

$$\begin{aligned} F_{\pi ^*}(a,b)=\min (F_{\mu _1}(a),F_{\mu _2}(b)) \end{aligned}$$
(3)

is an optimal transport plan for transporting \(\mu _1\) into \(\mu _2\) with the cost function \(c(x_1,x_2)=|x_1-x_2|^q\), and thus,

$$\begin{aligned} W_q^q(\mu _1,\mu _2) = \mathscr {T}_{|x_1-x_2|^q}(\mu _1,\mu _2) = \int _{\mathbb {R}^2} |x_1-x_2|^q d\pi ^*(x_1,x_2). \end{aligned}$$
(4)

Most importantly, \(\pi ^*\) is the only monotone (order preserving) transport plan such that

$$\begin{aligned} (x_1,x_2),(x_1',x_2')\in \mathrm{supp}(\pi ^*), \quad x_1<x_1' \quad \Rightarrow \quad x_2 \le x_2'. \end{aligned}$$

In a special case, when \(\mu _1\) is atomless, \(\pi ^*\) is determined by the monotone map

$$\begin{aligned} T^*=F^{[-1]}_{\mu _2} \circ F_{\mu _1}, \end{aligned}$$
(5)

called an optimal transport map. Specifically, each point \(x_1\) of the distribution \(\mu _1\) is transported to the point \(x_2=T^*(x_1)\); see Fig. 3a for an illustration. Thus, \(\mu _2=T^*_{\#}\mu _1\), and the conditional probability measure \(\pi ^*_{x_2|x_1}=\delta _{T^*(x_1)}\) for \(x_1 \in \mathrm{supp}(\mu _1)\). In this case, (4) reads

$$\begin{aligned} W_q^q(\mu _1,\mu _2) = \mathscr {T}_{|x_1-x_2|^q}(\mu _1,\mu _2) = \int _{\mathbb {R}} |x_1-T^*(x_1)|^q d\mu _1(x_1). \end{aligned}$$
(6)

The results (3)-(6) follow from Theorem 7 for the cost function \(c(x_1,x_2)=|x_1-x_2|^q\).

In a general case, under the transport plan \(\pi ^*\), points \(x_1 \in \mathrm{supp(\mu _1)}\) for which \(\mu _1(\{x_1\})=0\) are transported as a whole, while the “atoms”, points \(x_1\) for which \(\mu _1(\{x_1\})>0\), are allowed to be split or spread along \(\mathbb {R}\); see Fig. 3b that illustrates the transport flow under \(\pi ^*\) in the general case. The plot also provides a depiction of the order preservation; notice how the arrows do not intersect.

Fig. 3: Transporting \(\mu _1\) to \(\mu _2\) under the monotone transport plan \(\pi ^*\)
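The optimal map (5) is easy to realize empirically when \(\mu _1\) is (effectively) atomless: compose the empirical CDF of the source sample with the empirical quantile function of the target sample. A small illustrative sketch (names and data are assumptions, not from the paper):

```python
import numpy as np

def monotone_transport_map(x_source, x_target):
    """Empirical T* = F_{mu_2}^{[-1]} o F_{mu_1}, cf. (5)."""
    xs, xt = np.sort(x_source), np.sort(x_target)
    def T(x):
        p = np.searchsorted(xs, x, side="right") / len(xs)        # F_{mu_1}(x)
        idx = np.clip(np.ceil(p * len(xt)).astype(int) - 1, 0, len(xt) - 1)
        return xt[idx]                                            # F_{mu_2}^{[-1]}(p)
    return T

rng = np.random.default_rng(0)
src, tgt = rng.normal(0, 1, 20_000), rng.normal(2.0, 0.5, 20_000)
T = monotone_transport_map(src, tgt)
print(np.mean(T(src)), np.std(T(src)))     # pushforward approximately matches N(2, 0.5)
```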

To compute the portion of the transport cost used for moving points of \(\mu _1\) to the right or left, it is necessary to restrict the attention to the regions \(x_1 < x_2\) and \(x_1>x_2\), respectively.

Lemma 2

Let \(\mu _1,\mu _2 \in \mathscr {P}_q(\mathbb {R})\), \(q \in [1,\infty )\). Under the monotone plan \(\pi ^*\) the transport efforts to the left and right for the cost function \(c(x_1,x_2)=|x_1-x_2|^q\) are given by:

$$\begin{aligned} \begin{aligned} \mathscr {T}^{\leftarrow }_{|x_1-x_2|^q}(\mu _1,\mu _2)&:=\int _{\{x_1>x_2\}} |x_1-x_2|^q \, d\pi ^*(x_1,x_2) = \int _{\{p:\, F_{\mu _1}^{[-1]}(p)>F_{\mu _2}^{[-1]}(p)\}} |F_{\mu _1}^{[-1]}(p)-F_{\mu _2}^{[-1]}(p)|^q \, dp\\ \mathscr {T}^{\rightarrow }_{|x_1-x_2|^q}(\mu _1,\mu _2)&:=\int _{\{x_1<x_2\}} |x_1-x_2|^q \, d\pi ^*(x_1,x_2) = \int _{\{p:\, F_{\mu _1}^{[-1]}(p)<F_{\mu _2}^{[-1]}(p)\}} |F_{\mu _1}^{[-1]}(p)-F_{\mu _2}^{[-1]}(p)|^q \, dp. \end{aligned} \end{aligned}$$
(7)

Hence, the Wasserstein distance \(W_q\) can be expressed as

$$\begin{aligned} W_q(\mu _1,\mu _2) = \big ( \mathscr {T}_{|x_1-x_2|^q}^{\leftarrow }(\mu _1,\mu _2) + \mathscr {T}_{|x_1-x_2|^q}^{\rightarrow }(\mu _1,\mu _2) \big )^{1/q}. \end{aligned}$$
(8)

Furthermore, if \(\mu _1\) is atomless, (7) reads

$$\begin{aligned} \begin{aligned} \mathscr {T}^{\leftarrow }_{|x_1-x_2|^q}(\mu _1,\mu _2)&=\int _{\{x_1:\, T^*(x_1)<x_1\}} |x_1-T^*(x_1)|^q \, d\mu _1(x_1)\\ \mathscr {T}^{\rightarrow }_{|x_1-x_2|^q}(\mu _1,\mu _2)&=\int _{\{x_1:\, T^*(x_1)>x_1\}} |x_1-T^*(x_1)|^q \, d\mu _1(x_1). \end{aligned} \end{aligned}$$
(9)

Proof

By (3) the monotone plan can be expressed as

$$\begin{aligned} \pi ^* = (F^{-1}_{\mu _1},F^{-1}_{\mu _2})_{\#} \lambda |_{[0,1]} \in \mathscr {P}(\mathbb {R}^2) \end{aligned}$$

where \(\lambda |_{[0,1]}\) denotes the Lebesgue measure restricted to [0, 1]. Then, by Proposition 6, for any Borel set \(B \subset \mathbb {R}^2\) we have

$$\begin{aligned} \int _{B} |x_1-x_2|^q d\pi ^*(x_1,x_2) = \int _{\{ p\in (0,1): \, (F_{\mu _1}^{[-1]}(p),F_{\mu _2}^{[-1]}(p)) \in B \}} |F^{[-1]}_{\mu _1}(p)-F^{[-1]}_{\mu _2}(p)|^q dp . \end{aligned}$$

Then (7) follows from the above identity with \(B=\{(x_1,x_2): \pm (x_1-x_2)>0\}\). Next, by (4) and (7), we obtain (8).

Finally, if \(\mu _1\) is atomless, by Theorem 7 the monotone plan \(\pi ^*=(I,T^*)_{\#}\mu _1\), where \(T^*\) is the optimal transport map given by (5). Then using Proposition 6 we obtain (9). \(\square\)
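Numerically, (7) amounts to splitting the discretized quantile integral by the sign of \(F^{[-1]}_{\mu _1}(p)-F^{[-1]}_{\mu _2}(p)\). A minimal sketch under the same illustrative conventions as the earlier snippets (grid discretization, empirical quantiles via NumPy's inverted_cdf method):

```python
import numpy as np

def transport_efforts(x1, x2, q=1, grid=4000):
    """Left/right transport efforts of the monotone plan, cf. (7), and W_q via (8)."""
    p = (np.arange(grid) + 0.5) / grid
    d = (np.quantile(x1, p, method="inverted_cdf")
         - np.quantile(x2, p, method="inverted_cdf"))
    left  = np.mean(np.where(d > 0,  d, 0.0) ** q)   # points of mu_1 moved left  (x_1 > x_2)
    right = np.mean(np.where(d < 0, -d, 0.0) ** q)   # points of mu_1 moved right (x_1 < x_2)
    return left, right, (left + right) ** (1.0 / q)
```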

3.3 \(W_1\)-based model bias and its components

For \(q=1\) the Wasserstein distance \(W_1\) is known as the Earth Mover distance. Since the distance is symmetric, \(\mathrm{Bias}_{W_1}(f|X,G)\) is the cost of transporting the distribution of \(f(X)|G=0\) into that of \(f(X)|G=1\) or vice versa.

The \(W_1\)-based model bias is consistent with both the statistical parity criterion and the quantile parity criterion, as the following lemma shows.

Lemma 3

Let f be a model and \(G\in \{0,1\}\) the protected attribute. Then

$$\begin{aligned} \begin{aligned} \mathrm{Bias}_{W_1}(f|G)&= \int _0^1 bias ^Q_p(f|G) \, dp = \int _{\mathbb {R}} bias ^C_t(f|G) \,dt.\\ \end{aligned} \end{aligned}$$

Proof

By assumption \(\mathbb {E}|f(X)|<\infty\) and hence \(\mathbb {E}[|f(X)|\,|\,G=k]<\infty\) for \(k\in \{0,1\}\). Then we have (Shorack and Wellner 1986)

$$\begin{aligned} \begin{aligned} W_1\big (f(X)|G=0,f(X)|G=1\big )&=\int _0^1 |F_{f(X)|G=0}^{[-1]}(p)-F_{f(X)|G=1}^{[-1]}(p)|\,dp\\&= \int _{\mathbb {R}} |F_{f(X)|G=0}(t)-F_{f(X)|G=1}(t)|\,dt \, < \, \infty . \end{aligned} \end{aligned}$$

Hence, the result follows from Definitions 2 and 3, and the above equality. \(\square\)

Remark 3

The above lemma establishes the representation of the model bias as an integration over the statistical parity bias of classifiers obtained by considering all thresholds. Here, the consistency of the model bias with statistical parity is understood in the sense of the equality in the above lemma. In comparison, Dwork et al. (2012) establishes a connection of statistical parity of Lipschitz randomized classifiers and subpopulations in a dataset upon which the models are built.

While the results in Dwork et al. (2012) do not imply the above lemma, it is appealing to provide a connection between the two. For example, consider the triplet (X, G, Y) with \(Y\in \{0,1\}\) and a smooth regressor \(f(X) = P(Y=1|X)\). Consider a randomized classifier \(z\rightarrow \mu _z\) where \(z=(x,g,y)\) and \(\mu _z(1) = f(x)\). Let \(P_g = P_{Z|G=g}\). Then, the upper bound on the statistical parity bias of \(\mu _z\) provided by Dwork et al. (2012) reads

$$\begin{aligned} D_{TV}(\mu _{P_0},\mu _{P_1}) = |\mathbb {E}[f(X)|G=0]-\mathbb {E}[f(X)|G=1]| \le W_1(P_0,P_1), \end{aligned}$$

which illustrates the difference between Lemma 3.1 of Dwork et al. (2012) and our lemma.

Positive and negative model bias. According to Lemma 2, the cost of transporting a distribution is the sum of the transport effort to the left and the transport effort to the right. This motivates us to define the positive bias as the transport effort for moving the particles of \(f(X)|G=0\) in the non-favorable direction and the negative bias as the transport effort for moving them in the favorable one; equivalently, the former is the transport effort for moving the particles of \(f(X)|G=1\) in the favorable direction and the latter the transport effort in the non-favorable one.

Motivated by Lemma 2 we define positive and negative model biases as follows:

Definition 7

Let \(f,G,\varsigma _f\) and \(F_k\) be as in Definition 2.

  • The positive and negative \(W_1\) based model biases are defined by

    $$\begin{aligned} \begin{aligned} \mathrm{Bias}_{W_1}^{\pm }(f|G) = \int _{\mathcal {P}_{\pm }} \pm (F_0^{[-1]}(p)-F_1^{[-1]}(p)) \cdot \varsigma _f \, dp \end{aligned} \end{aligned}$$

    where

    $$\begin{aligned} \mathcal {P}_{\pm } =\Big \{ p \in (0,1):\,\, \pm \widetilde{bias} _p^Q(f|G)=\pm (F_0^{-1}(p)-F_1^{-1}(p)) \cdot \varsigma _f > 0 \Big \}. \end{aligned}$$

    In this case, the model bias is disaggregated as follows:

    $$\begin{aligned} \mathrm{Bias}_{W_1}(f|G)=\mathrm{Bias}_{W_1}^+(f|G)+\mathrm{Bias}_{W_1}^-(f|G). \end{aligned}$$
  • The net model bias is defined by

    $$\begin{aligned} \mathrm{Bias}_{W_1}^{net}(f|G)=\mathrm{Bias}_{W_1}^+(f|G)-\mathrm{Bias}_{W_1}^-(f|G). \end{aligned}$$
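A minimal empirical sketch of Definition 7 (illustrative; the positive and negative parts of the signed quantile bias are averaged over a grid of p values, with empirical quantiles standing in for \(F_k^{[-1]}\)):

```python
import numpy as np

def w1_model_bias_split(scores, g, sign_f=1, grid=4000):
    """Positive, negative, total, and net W1-based model bias (Definition 7)."""
    p = (np.arange(grid) + 0.5) / grid
    q0 = np.quantile(scores[g == 0], p, method="inverted_cdf")
    q1 = np.quantile(scores[g == 1], p, method="inverted_cdf")
    signed = (q0 - q1) * sign_f                  # signed quantile bias at each p
    pos = np.mean(np.clip(signed, 0.0, None))    # integral over P_+
    neg = np.mean(np.clip(-signed, 0.0, None))   # integral over P_-
    return {"positive": pos, "negative": neg, "total": pos + neg, "net": pos - neg}
```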

We next establish that the positive and negative \(W_1\) model biases can be expressed in terms of classifier biases. To establish this, we first prove the following auxiliary lemma.

Lemma 4

Let \(X_0,X_1\) be random variables with \(\mathbb {E}|X_i|<\infty\), \(i \in \{0,1\}\). Let \(F_i\) denote the CDF of \(X_i\) and let

$$\begin{aligned} \begin{aligned} \mathcal {T}_0&=\{t \in \mathbb {R}: F_1(t)< F_0(t) \},&\mathcal {T}_1&=\{t \in \mathbb {R}: F_0(t)<F_1(t) \}\\ \mathcal {P}_0&=\{ p\in (0,1): F^{[-1]}_1(p)< F_0^{[-1]}(p) \},&\mathcal {P}_1&=\{p \in (0,1): F_0^{[-1]}(p)<F_1^{[-1]}(p) \}. \end{aligned} \end{aligned}$$

Then

$$\begin{aligned} \begin{aligned} 0 \le \int _{\mathcal {T}_0} F_0(t)-F_1(t) \, dt&= \int _{\mathcal {P}_1} F_1^{[-1]}(p)-F_0^{[-1]}(p) \, dp< \infty \\ 0 \le \int _{\mathcal {T}_1} F_1(t)-F_0(t) \, dt&= \int _{\mathcal {P}_0} F_0^{[-1]}(p)-F_1^{[-1]}(p) \, dp < \infty . \\ \end{aligned} \end{aligned}$$

Proof

See Appendix B. \(\square\)

Theorem 2

Let \(f,G,\varsigma _f\), \(\mathcal {P}^{\pm }\) and \(F_k\) be as in Definition 7. Then

$$\begin{aligned} \begin{aligned} \mathrm{Bias}_{W_1}^{\pm }(f|G)&= \int _{\mathcal {P}_{\pm }} bias ^{Q}_p(f|G) \,dp = \int _{\mathcal {T}_{\pm }} bias ^{C}_t(f|G) \, dt \end{aligned} \end{aligned}$$
(10)

where

$$\begin{aligned} \mathcal {T}_{\pm } =\big \{ t \in \mathbb {R}:\,\, \pm \widetilde{bias} _t^C(f|G)=\pm (F_1(t)-F_0(t)) \cdot \varsigma _f >0 \big \}. \end{aligned}$$

The net bias satisfies

$$\begin{aligned} \begin{aligned} \mathrm{Bias}_{W_1}^{net}(f|G)&= \int _{0}^1 \widetilde{bias} ^Q_p(f|G)dp = \int _{\mathbb {R}} \widetilde{bias} ^C_t(f|G) \, dt \\&= \big ( \mathbb {E}[f(X)|G=0]-\mathbb {E}[f(X)|G=1] \big ) \cdot \varsigma _f \end{aligned} \end{aligned}$$
(11)

Proof

Suppose first that the favorable direction is \(\uparrow\). Since \(\mathbb {E}|f(X)|<\infty\), we have \(\mathbb {E}[|f(X)|\,|\,G=k]<\infty\) for \(k\in \{0,1\}\). Then by Lemma 4

$$\begin{aligned} \mathrm{Bias}^{\pm }\big (f|G\big )=\pm \int _{\mathcal {P}^{\pm }} F_{f|G=0}^{[-1]}(p)-F_{f|G=1}^{[-1]}(p) \,dp =\pm \int _{\mathcal {T}^{\pm }} F_{f|G=1}(t)-F_{f|G=0}(t) \,dt < \infty . \end{aligned}$$

Hence (10) follows from Definitions 2 and 3, and the above equality.

Next, by (10) and Lemma 17 we have

$$\begin{aligned} \begin{aligned} \mathrm{Bias}^{net}(f|G)&= \mathrm{Bias}^+(f|G)-\mathrm{Bias}^-(f|G) \\&= \int _{\mathcal {T}^+} \big ( F_{f|G=1}(t)-F_{f(X)|G=0}(t) \big ) dt - \int _{\mathcal {T}^-} \big ( F_{f|G=0}(t)-F_{f|G=1}(t) \big ) dt \\&= \int _{-\infty }^0 \big ( F_{f|G=1}(t)-F_{f|G=0}(t) \big ) dt + \int _{0}^{\infty } \big ( (1-F_{f|G=0}(t))-(1-F_{f|G=1}(t)) \big ) dt \\&= \mathbb {E}[f(X)|G=0]-\mathbb {E}[f(X)|G=1]. \end{aligned} \end{aligned}$$

This proves (11). If the favorable direction is \(\downarrow\), the proof of (10) and (11) is similar. \(\square\)

In the context of classification, Theorem 2 states that the positive \(W_1\)-based model bias is the integrated classifier bias over the set of thresholds \(t \in \mathcal {T}_+\) where the classifiers \(Y_t=\mathbbm {1}_{\{f(X)>t\}}\) favor the non-protected class \(G=0\). A similar remark holds for the negative model bias.

Furthermore, property (10) of \(\mathrm{Bias}_{W_1}^{\pm }\) allows one to use thresholds and quantiles interchangeably, which is beneficial in classification problems. For this reason, we choose \(W_1\) as our primary metric.

Example

To understand the statement of Theorem 2 consider the following classification risk model (\(\varsigma _f=-1\)) with a predictor whose variance depends on the attribute G:

$$\begin{aligned}&X \sim N(\mu ,(1+G)\sqrt{\mu }), \quad \mu =5\\&Y \sim Bernoulli(f(X)), \quad f(X)=\mathbb {P}(Y=1|X)={\sigma (\mu -X)}.\qquad \qquad (\hbox {M2}) \end{aligned}$$

This setup leads to the presence of both positive and negative bias components in the score distribution. Figure 4 depicts the subpopulation score CDFs of the trained GBM classifier and illustrates the fact that the integrated positive quantile and classifier biases yield the positive model bias (green region), with a similar relationship holding for the negative model bias (purple region). The monotone transport flows are also depicted, showing the connection between the signed model bias and favorability. Since \(\varsigma _f=-1\), in the green region the non-protected class is transported in the non-favorable direction, while in the purple region it is transported in the favorable one.

Fig. 4: Positive and negative model biases for the trained XGBoost model (M2), \(\varsigma _f=-1\)
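A simulation sketch of (M2) that reports the bias split (illustrative throughout: scikit-learn's gradient boosting stands in for the XGBoost model used for the figure, and the quantile grid is a simple discretization):

```python
import numpy as np
from scipy.special import expit
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n, mu = 50_000, 5.0
G = rng.integers(0, 2, size=n)
X = rng.normal(mu, (1 + G) * np.sqrt(mu))              # variance depends on G
Y = rng.binomial(1, expit(mu - X))                     # f(X) = sigma(mu - X)

score = (GradientBoostingClassifier()
         .fit(X[:, None], Y).predict_proba(X[:, None])[:, 1])

p = (np.arange(4000) + 0.5) / 4000
q0 = np.quantile(score[G == 0], p, method="inverted_cdf")
q1 = np.quantile(score[G == 1], p, method="inverted_cdf")
signed = (q0 - q1) * (-1)                              # varsigma_f = -1 for this risk model
pos, neg = np.clip(signed, 0, None).mean(), np.clip(-signed, 0, None).mean()
print(pos, neg, pos + neg)                             # Bias^+, Bias^-, Bias_{W1}
```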

On renormalization of model bias. If f(X) is a classification score, then \(\mathrm{Bias}_{W_1}(f|G) \in [0,1]\), which makes it easy to interpret the amount of bias in the model distribution.

For regressors, however, the model bias can take any value in \([0,\infty )\). One approach is to normalize the model bias as follows. First, pick an appropriate reference scale \(L>0\) corresponding to the response variable. Given the scale L one can define a generalized Wasserstein-based model bias as follows:

$$\begin{aligned} Bias_{g,W_1}(f|G) = g\Big ( \frac{1}{L}\mathrm{Bias}_{W_1}(f|G) \Big ) \end{aligned}$$
(12)

where the link function g is strictly increasing and satisfies

$$\begin{aligned} g(x)=\left\{ \begin{aligned}&x,&\quad x\in [0,0.5],\\&g \hbox { increases to } 1,&\quad x\in (0.5,\infty ). \end{aligned} \right. \end{aligned}$$

With this setup, \(Bias_{g,W_1}(f|G)=\frac{1}{L}Bias_{W_1}(f|G)\) whenever the transport effort is within the scale of interest L, that is, when \(Bias_{W_1}(f|G) \le \frac{L}{2}\). In practice, for bounded distributions one can take L to be the length of \(\mathrm{supp}\, P_{f(X)}\), while for unbounded distributions one can take \(L=2 \sigma (f(X))\).
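A sketch of the renormalization (12) with one concrete, illustrative choice of link function g that is the identity on [0, 0.5] and increases to 1 (this particular g is an assumption, not prescribed by the text):

```python
import numpy as np

def link(x):
    """g(x) = x on [0, 0.5]; increases smoothly to 1 on (0.5, inf)."""
    x = np.asarray(x, dtype=float)
    return np.where(x <= 0.5, x, 1.0 - 0.5 * np.exp(-2.0 * (x - 0.5)))

def normalized_bias(w1_bias, scale_L):
    """Generalized Wasserstein-based model bias of (12)."""
    return link(w1_bias / scale_L)
```

This choice is continuous and strictly increasing, matches the identity (with matching slope) at 0.5, and saturates at 1 for large normalized bias.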

In our work, we develop the bias explanation methods to explain the actual amount of transport effort between subpopulations. The generalization to (12) is trivial.

3.4 Generalized group-based parity model bias

In this section, we generalize the notion of the Wasserstein-based model bias to generic group-based parity criteria and to protected attributes with multiple classes. We then apply the generalization to the equalized odds and equal opportunity parity conditions.

Definition 8

Let f be a model, \(X \in \mathbb {R}^n\) predictors, \(G \in \{0,1,\dots ,K-1\}\) protected attribute, \(G=0\) non-protected class, and \(\varsigma _f\) the sign of the favorable direction of f. Let \(\mathcal {A}=\{A_1,\dots ,A_M\}\) be a collection of disjoint subsets of the sample space \(\Omega\). Define events

$$\begin{aligned} A_{km}=\{G=k\} \cap A_m, \quad k \in \{0,1,\dots ,K-1\}, \quad m \in \{1,\dots ,M\}. \end{aligned}$$
  1. (i)

    We say that \(Y_t= \mathbbm {1}_{\{f(X)>t\}}\) satisfies \(\mathcal {A}\) group-based parity if

    $$\begin{aligned} \mathbb {P}(Y_t=\mathbbm {1}_{\{\varsigma _f=1\}}|A_{km})=\mathbb {P}(Y_t=\mathbbm {1}_{\{\varsigma _f=1\}}|A_{0m}), \quad k \in \{1,\dots ,K-1\}, \quad m \in \{1,\dots ,M\}. \end{aligned}$$
    (13)
  2. (ii)

    \((W_1,\mathcal {A})\)-based (weighted) model bias is defined by

    $$\begin{aligned} \mathrm{Bias}_{W_1,\mathcal {A}}^{(w)}(f|G) = \sum _{k=1}^{K-1}\sum _{m=1}^{M} w_{km} \mathrm{Bias}_{W_1}(f|\{A_{0m},A_{km}\}), \quad w_{km} \ge 0, \end{aligned}$$

    where the weights satisfy \(\sum _{k=1}^{K-1}\sum _{m=1}^{M} w_{km}=1\).

  3. (iii)

    The positive and negative \((W_1,\mathcal {A})\) weighted model biases are defined by

    $$\begin{aligned} \mathrm{Bias}_{W_1,\mathcal {A}}^{(w)\pm }(f|G) = \sum _{k,m} w_{km} \mathrm{Bias}_{W_1}^{\pm }( f| \{A_{0m},A_{km}\}). \end{aligned}$$

Lemma 5

Let G and \(\mathcal {A}\) be as in Definition 8. The \((W_1,\mathcal {A})\) model bias is consistent with the generic parity criterion (13) as given by the following:

$$\begin{aligned} \begin{aligned} \mathrm{Bias}_{W_1,\mathcal {A}}^{(w)}(f|G)&= \sum _{k,m} w_{km} \int _0^1 |F^{[-1]}_{f|A_{0m}}(p)-F^{[-1]}_{f|A_{km}}(p)| \, dp \\&=\sum _{k,m} w_{km} \int _{\mathbb {R}} | \mathbb {P}(Y_t=\mathbbm {1}_{\{\varsigma _f=1\}}|A_{km})-\mathbb {P}(Y_t=\mathbbm {1}_{\{\varsigma _f=1\}}|A_{0m}) | \, dt. \end{aligned} \end{aligned}$$

Similarly, the signed model biases can be expressed

$$\begin{aligned} \begin{aligned} \mathrm{Bias}_{W_1,\mathcal {A}}^{(w)\pm }(f|G)&:= \sum _{k,m} w_{km} \int _{\mathcal {P}_{km\pm }} \pm \big (F^{[-1]}_{f|A_{0m}}(p)-F^{[-1]}_{f|A_{km}}(p)\big ) \cdot \varsigma _f \ \, dp \\&=\sum _{k,m} w_{km} \int _{\mathcal {T}_{km \pm }} | \mathbb {P}(Y_t=\mathbbm {1}_{\{\varsigma _f=1\}}|A_{km})-\mathbb {P}(Y_t=\mathbbm {1}_{\{\varsigma _f=1\}}|A_{0m}) | \, dt, \end{aligned} \end{aligned}$$

where

$$\begin{aligned} \begin{aligned} \mathcal {P}_{km \pm }&= \Big \{p \in [0,1]: \pm \big (F^{[-1]}_{f|A_{0m}}(p)-F^{[-1]}_{f|A_{km}}(p)\big ) \cdot \varsigma _f> 0 \Big \}\\ \mathcal {T}_{km \pm }&= \Big \{t \in \mathbb {R}: \pm \big (F_{f|A_{km}}(t)-F_{f|A_{0m}}(t)\big ) \cdot \varsigma _f > 0 \Big \}. \end{aligned} \end{aligned}$$

Proof

The claim follows directly from Theorem 4. \(\square\)

Example

Suppose that the favorable direction is \(\uparrow\), that \(G \in \{0,1\}\), and that the response variable \(Y \in \{0,1\}\). Let \(\mathcal {A}=\{ \{Y=0\}, \{Y=1\} \}\). In that case, the group-based parity condition (13) reads

$$\begin{aligned} \mathbb {P}(Y_t=1|G=0,Y=m)=\mathbb {P}(Y_t=1|G=1,Y=m), \quad m=0,1, \end{aligned}$$

which is the equalized odds criterion (Hardt et al. 2015). The representation of the corresponding model bias then follows from the above lemma.
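As an illustration of Definition 8 with \(\mathcal {A}=\{\{Y=0\},\{Y=1\}\}\), a minimal sketch with uniform weights (the weights, names, and quantile discretization are illustrative choices):

```python
import numpy as np

def w1_bias(u, v, grid=2000):
    """W1 between empirical laws via the quantile formula (2)."""
    p = (np.arange(grid) + 0.5) / grid
    return np.mean(np.abs(np.quantile(u, p, method="inverted_cdf")
                          - np.quantile(v, p, method="inverted_cdf")))

def w1_equalized_odds_bias(scores, g, y, weights=(0.5, 0.5)):
    """(W1, A)-based weighted model bias with A = {{Y=0}, {Y=1}} and G in {0, 1}."""
    total = 0.0
    for m, w in zip((0, 1), weights):
        total += w * w1_bias(scores[(g == 0) & (y == m)],
                             scores[(g == 1) & (y == m)])
    return total
```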

3.5 Integral probability metrics for fairness assessment

When assessing the fairness of model regressors, it is crucial to pick an appropriate metric because the model output is often used to make decisions. A wide class of candidate metrics is given by integral probability metrics (IPMs). These provide a notion of “distance” between probability distributions and generalize the Kantorovich-Rubinstein variational formula; they are defined directly via variational formulas (Müller 1997; Sriperumbudur et al. 2009). Specifically, an IPM is obtained by maximizing the difference of expected values over a function space \(\mathcal {A}\),

$$\begin{aligned} W_{\mathcal {A}}(\nu _0, \nu _1) := \sup _{\varphi \in \mathcal {A}}\left\{ \int \varphi (x)\,\nu _0(dx) - \int \varphi (x)\,\nu _1(dx)\right\} , \end{aligned}$$
(14)

where \(\nu _0,\nu _1 \in \mathscr {P}(\mathscr{X})\) and \((\mathscr{X},d)\) is a metric space. For example, the Wasserstein metric is obtained by taking \(\mathcal {A}= \{ \varphi : [\varphi ]_{Lip} \le 1 \}\) in (14), where \([\varphi ]_{Lip}\) is the Lipschitz constant of \(\varphi\); the Dudley metric is obtained by taking \(\mathcal {A}= \{\varphi : [\varphi ]_{Lip} + \Vert \varphi \Vert _{\infty } \le 1 \}\). Dropping the regularity requirement on the test functions yields metrics that respond discontinuously to shifts of point masses; for example, setting \(\mathcal {A}= \{ \varphi : \Vert \varphi \Vert _{\infty } \le 1 \}\) gives the total variation metric \(D_{TV}\). An interesting aspect of the variational formula (14) is that it can be generalized to a broader family of distances between probability distributions, namely divergences such as the Kullback-Leibler divergence; see Birrell et al. (2020) for more information.

Thus, IPMs with regular test functions serve as good candidates for assessing the fairness of the regressor via formula (1). One interesting contender is \(W_{\mathcal {A}^*}\) with \(\mathcal {A}^* := \{ \varphi : \Vert \varphi \Vert _{\infty } \le \tfrac{1}{2}, [\varphi ]_{Lip}\le 1 \}\), which is equivalent to the Dudley metric and has the appealing property that its values lie in the unit interval. \(W_{\mathcal {A}^*}\) is meaningful for fairness assessment because it can be expressed as a supremum over all “agents”, namely regular randomized classifiers that detect the differences between the two subpopulation distributions. Specifically, it can be shown that \(W_{\mathcal {A}^*}\) coincides with the \(D_{rc}\) metric introduced in Dwork et al. (2012) and discussed in Sect. 2.6.

Lemma 6

Let \((\mathscr{X},d)\) be a metric space. Then \(D_{rc}(\mu ,\nu ; D_{TV},d) = W_{\mathcal {A}^*}(\mu ,\nu )\).

Proof

See Appendix B. \(\square\)

Recall that Dwork et al. (2012) established that the statistical parity bias of a randomized classifier is bounded by the \(D_{rc}\) distance between subpopulation input distributions. In contrast, we focus on measuring and explaining the bias in the output of non-randomized regressors, including classification scores, for which the notion of statistical parity is not, in general, applicable. In particular, we assess the distance between regressor output subpopulations via the \(W_1\) metric. In general, any transport metric can be considered for this task, such as \(W_{\mathcal {A}^*}\). Furthermore, we propose a framework that quantifies the contribution of predictors to that distance, which serves as a mechanism that pinpoints the main drivers of the regressor bias.

The lemma below illustrates the different behavior of the two metrics under scaling.

Lemma 7

Let \(d(x,y)\) be a norm on \(\mathbb {R}^n\). Let \(T(x)=cx+x_0\) with \(c > 0\). Then

$$\begin{aligned} \begin{aligned} D_{rc}(T_{\#}\mu ,T_{\#}\nu ; D_{TV},d ) = D_{rc}(\mu ,\nu ;D_{TV},d_c), \quad \frac{1}{c} W_1(T_{\#}\mu ,T_{\#}\nu ;d) = W_1(\mu ,\nu ;d) \end{aligned} \end{aligned}$$

where \(\mu ,\nu \in \mathscr {P}_1(\mathbb {R}^n;d)\) and \(d_c(x,y)=cd(x,y)\).

Proof

See Appendix B. \(\square\)

Notice that for large c the values of \(D_{rc}\) with the metric \(d_c\) saturate and approach one, the upper bound for the metric. In contrast, \(W_1\) is unbounded and the distance between the pushforward measures \(T_{\#}\mu , T_{\#}\nu\) scales linearly in c, which is an appealing property.
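The scaling behavior of \(W_1\) is easy to verify numerically; the following sketch, with an arbitrary shift and scale chosen as assumptions, checks the identity of Lemma 7 on one-dimensional samples.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
mu_samples = rng.normal(0.0, 1.0, 100_000)
nu_samples = rng.normal(0.5, 1.0, 100_000)

c, x0 = 10.0, 3.0                                   # T(x) = c*x + x0
w1 = wasserstein_distance(mu_samples, nu_samples)
w1_push = wasserstein_distance(c * mu_samples + x0, c * nu_samples + x0)

# the two printed values agree up to sampling error: (1/c) W_1(T#mu, T#nu) = W_1(mu, nu)
print(w1, w1_push / c)
```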

Dwork et al. (2012) establishes the connection between \(D_{rc}\) and \(W_1\) under the assumption that the subpopulation distributions are discrete and \(d\le 1\). In what follows, we prove a more general version of (Dwork et al. 2012, Theorem 3.3) that connects the two metrics and holds for all probability measures with bounded support.

Theorem 3

Let \(\mu ,\nu \in \mathscr {P}_1(\mathbb {R}^n;d)\) have bounded supports and let \(d(x,y)\) be a norm. Then

$$\begin{aligned} \begin{aligned} \frac{1}{L} W_1(\mu ,\nu \,; d) = D_{rc}(\mu ,\nu ;D_{TV},d_{(1/L)})\\ \end{aligned} \end{aligned}$$
(15)

for any \(L>0\) such that \(\mathrm{supp}(\mu ) , \mathrm{supp}(\nu ) \subset B(x_*,\tfrac{L}{2};d)=\{x: d(x,x_*)\le \tfrac{L}{2} \}\).

Proof

See Appendix B. \(\square\)

When using \(D_{rc}\) for fairness assessment, the above theorem implies that saturation can be partially avoided via scaling; for example, the rescaling factor can be chosen based on the second moments of the two probability measures. In our paper, however, we focus on the Wasserstein metric because of its appealing scaling property.

4 Bias explanations

4.1 Relationship between model fairness and predictors

It is shown in Gordaliza et al. (2019) that the statistical parity bias of (non-randomized) classifiers can be bounded by the total variation distance between the predictor subpopulations, while the Wasserstein metric, in general, does not allow for such control (in the sense of a bound). In contrast to the bound in Gordaliza et al. (2019), the \(W_1\)-bias in predictors controls the statistical parity bias of Lipschitz randomized classifiers, as shown in Dwork et al. (2012), as well as the \(W_1\)-regressor bias, as shown by the following lemma.

Lemma 8

Let X, G, f be as in Definition 8. If f is Lipschitz continuous then

$$\begin{aligned} \mathrm{Bias}_{W_1}(f|X,G) \le [f]_{Lip} \mathrm{Bias}_{W_1}(X|G). \end{aligned}$$
(16)

Proof

The proof follows directly from the Kantorovich-Rubinstein variational formula. \(\square\)

While bounding the model bias by the bias in the predictors is of theoretical importance, such a bound provides little information on the contribution of each predictor to the model unfairness. This is because fairness of predictors is a sufficient requirement for fairness of the model, but not a necessary one. In particular, a model can be only slightly unfair while having highly biased predictors. For example, consider the data generating model

$$\begin{aligned} X_1 \sim N(\tau G,\sigma ), \quad X_2 \sim N(0,\sigma ), \quad Y = f(X) = \frac{\varepsilon }{\tau }X_1 + X_2. \end{aligned}$$
(17)

Note that \(\mathrm{Bias}_{W_1}(X|G) \rightarrow \infty\) as \(\tau \rightarrow \infty\), while \(\mathrm{Bias}_{W_1}(f|X,G) = \varepsilon\) for any \(\tau >0\).
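The following sketch checks this numerically for the data generating model (17); the values of \(\sigma\), \(\varepsilon\), and the sample size are assumptions.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
sigma, eps, n = 1.0, 0.1, 200_000

for tau in (1.0, 10.0, 100.0):
    g = rng.integers(0, 2, n)
    x1 = rng.normal(tau * g, sigma)                 # X_1 ~ N(tau*G, sigma)
    x2 = rng.normal(0.0, sigma, n)                  # X_2 ~ N(0, sigma)
    f = (eps / tau) * x1 + x2                       # f(X) = (eps/tau) X_1 + X_2
    bias_x1 = wasserstein_distance(x1[g == 0], x1[g == 1])
    bias_f = wasserstein_distance(f[g == 0], f[g == 1])
    print(f"tau={tau:6.1f}  Bias_W1(X1|G)={bias_x1:7.2f}  Bias_W1(f|X,G)={bias_f:.3f}")
```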

This pedagogical example motivates us to directly assess the contribution of predictors to the model bias. To accomplish this, we design an interpretability framework that employs optimal transport theory in order to pinpoint the main drivers of the model bias. Information from these drivers can then be used for policy decision-making, regulatory-compliant bias mitigation (Miroshnikov et al. 2021b), as well as in other settings.

4.2 Model interpretability

The bias explanations we develop in the next section make use of model explainers, whose objective is to quantify the contribution of each predictor to the value of f(x). Several methods of interpreting ML model outputs have been designed and used over the years. Some notable ones are Partial Dependence Plots (PDP) (Friedman 2001) and SHAP values (Lundberg and Lee 2017).

Partial dependence plots The PDP marginalizes out the variables whose impact on the output is not of interest, quantifying the overall impact of the values of the remaining features.

Let \(X \in \mathbb {R}^n\) be predictors, \(X_S\) with \(S \subseteq \{1, 2, \dots , n\}\) a subvector of X, and \(-S\) the complement set. Given a model f, the partial dependence plot of f on \(X_S\) is defined by

$$\begin{aligned} P\!D\!P _S(x; f) = \mathbb {E}[ f(x_S,X_{-S}) ] \approx \frac{1}{N} \sum _{j=1}^N f(x_S,X^{(j)}_{-S}), \end{aligned}$$
(18)

where we abuse the notation and ignore the variable ordering in f.
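A minimal Monte Carlo sketch of the approximation in (18) is given below; the toy model, background sample, and index set are assumptions used for illustration.

```python
import numpy as np

def pdp(model, X_background, S, x_S):
    """Estimate PDP_S(x_S; f) = E[f(x_S, X_{-S})] by averaging over a background sample."""
    X_mod = np.array(X_background, dtype=float, copy=True)
    X_mod[:, S] = x_S                       # freeze the coordinates in S at the value x_S
    return model(X_mod).mean()              # average over the empirical distribution of X_{-S}

# usage sketch with a toy linear model
rng = np.random.default_rng(3)
X_bg = rng.normal(size=(1000, 3))
model = lambda X: X[:, 0] + 2.0 * X[:, 1] - X[:, 2]
print(pdp(model, X_bg, S=[0], x_S=1.5))     # approximately 1.5: the other terms average to ~0
```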

Shapley additive explanations In their original form, Shapley values appear in the context of cooperative games; see Shapley (1953); Young (1985). A cooperative game with n players is a super-additive set function v acting on the subsets of \(N=\{1,2,\dots ,n\}\) and satisfying \(v(\varnothing )=0\). Shapley was interested in determining the contribution of each player to the game value v(N). It turns out that under certain symmetry assumptions the contributions are unique, and they are called Shapley values; furthermore, the super-additivity assumption can in principle be dropped (existence and uniqueness still hold).

It is shown in Shapley (1953) that there exists a unique collection of values \(\{\varphi _i\}_{i=1}^n\) satisfying the axioms of symmetry, efficiency, and law of aggregation ((A1)-(A3) in Shapley 1953); it is given by

$$\begin{aligned} \varphi _i[v] = \sum _{S \subseteq N \backslash \{i\}} \frac{s!(n-s-1)!}{n!} [ v(S \cup \{i\}) - v(S) ], \quad s=|S|, \, n=|N|. \end{aligned}$$
(19)

The values provide a disaggregation of the worth v(N) of the game into n parts that represent each player's contribution: \(\sum _{i=1}^n \varphi _{i}[v] = v(N).\)

The explanation techniques explored in Štrumbelj and Kononenko (2010) and Lundberg and Lee (2017) utilize cooperative game theory to compute the contribution of each predictor to the model value. In particular, given a model f, Lundberg and Lee (2017) consider the games

$$\begin{aligned} v^{ C\!E}(S; X, f)=\mathbb {E}[f(X)|X_S], \quad v^{ M\!E}(S;X,f)=\mathbb {E}[ f(x_S,X_{-S}) ]\big |_{x_S=X_S} \end{aligned}$$
(20)

with

$$\begin{aligned} v^{ C\!E}(\varnothing ; X, f)=v^{ M\!E}(\varnothing ; X, f)=\mathbb {E}[f(X)]. \end{aligned}$$

The games defined in (20) are not cooperative since they do not satisfy the condition \(v(\varnothing )=0\). However, by setting \(\varphi _0=\mathbb {E}[f(X)]\), the values satisfy the additivity property:

$$\begin{aligned} \sum _{i=0}^n \varphi _i[v(\cdot \,; X, f)]=f(X), \quad v \in \{v^{ C\!E},v^{ M\!E}\}. \end{aligned}$$

Throughout the text, when the context is clear, we suppress the explicit dependence of v(S; X, f) on X and f. Furthermore, we refer to the values \(\varphi _i[v^{ M\!E}]\) and \(\varphi _i[v^{ C\!E}]\) as SHAP values and, abusing the notation, we write

$$\begin{aligned} \varphi _i(X; f, v)=\varphi _i[v(S;X,f)], \quad v \in \{v^{ C\!E},v^{ M\!E}\}. \end{aligned}$$

Conditional and marginal games In our work, we refer to the games \(v^{ C\!E}\) and \(v^{ M\!E}\) as conditional and marginal, respectively. If predictors X are independent, the two games coincide. In the presence of dependencies, however, the games are very different. Roughly speaking, the conditional game explores the data by taking into account dependencies, while the marginal game explores the model f in the space of its inputs, ignoring the dependencies. Strictly speaking, the conditional game is determined by the probability measure \(P_X\), while the marginal game is determined by the product probability measures \(P_{X_{S}} \otimes P_{X_{-S}}\), \(S \subset N\) as stated below.
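For small n, the Shapley values of the marginal game can be computed by brute force directly from (19) and (20); the sketch below is an illustration rather than a production implementation, with the toy model and background sample as assumptions, and with \(v^{ M\!E}\) estimated by Monte Carlo averaging.

```python
import itertools
import math
import numpy as np

def v_marginal(S, x, model, X_background):
    """v^ME(S) = E[f(x_S, X_{-S})], estimated by averaging over the background sample."""
    X_mod = np.array(X_background, dtype=float, copy=True)
    if S:
        X_mod[:, list(S)] = x[list(S)]
    return model(X_mod).mean()

def marginal_shap(x, model, X_background):
    """Brute-force Shapley values (19) of the marginal game; exponential in n."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[i] += w * (v_marginal(S + (i,), x, model, X_background)
                               - v_marginal(S, x, model, X_background))
    return phi

# toy usage: the contributions sum to f(x) - E[f(X)] for the additive model below
rng = np.random.default_rng(4)
X_bg = rng.normal(size=(2000, 3))
model = lambda X: X[:, 0] + 2.0 * X[:, 1] - X[:, 2]
x = np.array([1.0, -1.0, 0.5])
phi = marginal_shap(x, model, X_bg)
print(phi, phi.sum(), model(x[None, :])[0] - model(X_bg).mean())
```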

Lemma 9

(stability) The SHAP explanations have the following properties:

  1. (i)

    \(\Vert \varphi (X;f,v^{ C\!E})\Vert _{L^2(\mathbb {P})} \le \Vert f\Vert _{L^2(P_X)}\).

  2. (ii)

    \(\Vert \varphi (X;f,v^{ M\!E})\Vert _{L^2(\mathbb {P})} \le C\Vert f\Vert _{L^2(\widetilde{P}_X)}\), with \(\widetilde{P}_X=\frac{1}{2^n}\sum _{S \subset N} P_{X_S} \otimes P_{X_{-S}}\).

Proof

By the properties of the conditional expectation and (19) we have

$$\begin{aligned} \Vert \varphi _i(X;f,v^{ C\!E})\Vert _{L^2(\Omega )} \le \sum _{S \subset N \backslash \{i\}} \frac{s!(n-s-1)!}{n!} \Vert \mathbb {E}[f(X)|X_S]\Vert _{L^2(\Omega )} \le \Vert f\Vert _{L^2(P_X)}. \end{aligned}$$

Since \(\varphi\) is linear, the map in (i) is a bounded, linear operator with the unit norm. This proves (i).

By (19) and (20) we have

$$\begin{aligned} \Vert \varphi _i(X;f,v^{ M\!E})\Vert _{L^2(\Omega )} \le \max _{s \in \{0,\dots ,n-1\}} \frac{s!(n-s-1)!}{n!} \sum _{S \subset N \backslash \{i\}} \Vert f\Vert _{L^2(P_{X_S} \otimes P_{X_{-S}})} \le C \Vert f\Vert _{L^2(\widetilde{P}_X)}, \end{aligned}$$

where \(C=C(n)\) is a constant that depends on n. This proves (ii). \(\square\)

To clarify the notation, we let \(L^2(\widetilde{P}_X)\) denote the space of functions defined on \(\mathbb {R}^n\) such that

$$\begin{aligned} \int f^2(x) \widetilde{P}_X (dx) := \frac{1}{2^n} \sum _{S \subset N} \int f^2(x_S, x_{-S}) [P_{X_S} \otimes P_{X_{-S}}](dx_S,dx_{-S}) < \infty , \end{aligned}$$

where as before we ignore the variable ordering in f, and for \(S=\varnothing\) we assign \(P_{X_{\varnothing }} \otimes P_X=P_X\).

We should point out that under dependencies the marginal explanation map (ii) in Lemma 9 is in general not continuous in \(L^2(P_X)\). Hence an algorithm that produces marginal explanations may fail to satisfy the stability bounds in the sense discussed in Kearns and Ron (1999); Bousquet and Elisseeff (2002). For a more general version of the above lemma, see Miroshnikov et al. (2021a).

In general, SHAP values are computationally intensive to evaluate due to the number of predictor combinations that need to be considered; in addition, computing \(\varphi [v^{ C\!E}]\) is challenging when the predictor dimension is large, in light of the curse of dimensionality; see Hastie et al. (2016). Lundberg et al. (2019) created a fast method called TreeSHAP, but it can only be applied to ML algorithms that incorporate tree-based techniques. The algorithm evaluates \(\varphi [v]\) for a game v that can be chosen either as one based on tree paths, which resembles \(v^{ C\!E}\), or as the marginal game \(v^{ M\!E}\). To understand the difference between the two games, see Janzing et al. (2019); Sundararajan and Najmi (2019); Chen et al. (2020); Miroshnikov et al. (2021a).

4.3 Bias explanations of predictors

In this section, given a model, we define the bias explanation (or contribution) of each predictor. An extension to groups of predictors may be found in Sect. 4.6.

In what follows we use the notation below. Given predictors \(X =(X_1,X_2,\dots ,X_n)\) and a model f, a generic single-feature explainer of f that quantifies the attribution of each predictor \(X_i\) to the model value f(X) is denoted by

$$\begin{aligned} E(X; f) = (E_1(X;f),E_2(X;f),\dots ,E_n(X;f)). \end{aligned}$$

For example, a simple way of setting up an explainer \(E_i\) is by specifying each component via a conditional or marginal expectation \(E_i(X;f)=v(\{i\};X,f)\), \(v \in \{v^{ C\!E},v^{ M\!E}\}\).

A more advanced way of computing single feature explanations is via the Shapley value \(E(X;f)=\varphi [v(\cdot ;X,f)]\), \(v \in \{v^{ C\!E},v^{ M\!E}\}\). For more details on appropriate game values and their properties see Miroshnikov et al. (2021a).

Definition 9

Let \(X \in \mathbb {R}^n\) be predictors, f a model, \(G \in \{0,1\}\) the protected attribute, \(G=0\) the non-protected class, and \(\varsigma _{f}\) the sign of the favorable direction of f. Let E(Xf) be an explainer of f that satisfies \(\mathbb {E}\big [|E(X;f)|\big ]<\infty\).

  • The bias explanation of the predictor \(X_i\) is defined by

    $$\begin{aligned} \beta _i(f|X,G; E_i) = W_1(E_i(X;f)|G=0 , E_i(X;f)|G=1 ) = \int _0^1 |F_{E_i|G=0}^{[-1]}-F_{E_i|G=1}^{[-1]}| \, dp. \end{aligned}$$
  • The positive bias and negative bias explanations of the predictor \(X_i\) are defined by

    $$\begin{aligned} \begin{aligned} \beta _i^{\pm }(f|X,G; E_i) = \int _{\mathcal {P}_{i\pm }} (F_{E_i|G=0}^{[-1]}-F_{E_i|G=1}^{[-1]}) \cdot \varsigma _{f} \, dp \end{aligned} \end{aligned}$$

    where

    $$\begin{aligned} \begin{aligned} \mathcal {P}_{i\pm } = \{p \in [0,1]: \pm (F^{[-1]}_{E_i|G=0}-F^{[-1]}_{E_i|G=1})\cdot \varsigma _{f} > 0 \}. \end{aligned} \end{aligned}$$

    In this case the \(X_i\) bias explanation is disaggregated as follows:

    $$\begin{aligned} \beta _i(f|X,G; E_i) = \beta _i^{+}(f|X,G; E_i) +\beta _i^{-}(f|X,G; E_i). \end{aligned}$$
  • The \(X_i\) net bias explanation is defined by

    $$\begin{aligned} \beta _i^{net}(f|X,G; E_i) = \beta _i^{+}(f|X,G;E_i) - \beta _i^{-}(f| X,G; E_i). \end{aligned}$$
  • The classifier (or statistical parity) bias of the explainer \(E_i\) for a threshold \(t \in \mathbb {R}\) is defined by

    $$\begin{aligned} \begin{aligned} \widetilde{bias} ^{C}_t(E_i|G)&= \big ( F_{E_i|G=1}(t)-F_{E_i|G=0}(t) \big ) \cdot \varsigma _f. \end{aligned} \end{aligned}$$

By design, the contribution \(\beta _i^+\) measures the positive contribution to the total model bias, not the contribution to the positive model bias alone. In particular, it measures the contribution to the increase of the positive flow and to the decrease of the negative one. The meaning of \(\beta _i^-\) is similar. To better understand their meaning, consider the following data generating model:

$$\begin{aligned} f(X)=X_1+X_2, \quad X_1 \sim N(\mu +\tau G,\sigma ), \quad X_2 \sim N(\mu -\tau G,\sigma ) \end{aligned}$$
(21)

where \(X_1,X_2\) are independent. Note that \(\mathrm{Bias}_{W_1}(f|X,G)=0\), while the bias explanations are \(\beta _1^+=\tau\), \(\beta _1^-=0\), \(\beta _2^+=0\), \(\beta _2^- = \tau\) for either model explainer discussed in this section. Note also that both the positive and negative model biases are zero. The positive contribution \(\beta _1^+=\tau\) measures how much in total is added to the positive model bias and subtracted from the negative one; a similar interpretation holds for \(\beta _i^-\). Thus, the amount that \(X_1\) contributes to the positive bias is offset by the amount that \(X_2\) resists its increase, which leads to zero positive model bias. A similar discussion applies to the negative model bias.
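The short sketch below (with \(\mu\), \(\sigma\), \(\tau\), and the sample size as assumptions) reproduces this offsetting effect numerically for model (21), using \(E_i(X)=X_i\), which agrees with the marginal explanations up to a constant shift, and the empirical quantile functions of Definition 9.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def bias_explanation(e0, e1, varsigma=1, grid=1000):
    """Return (beta, beta_plus, beta_minus) for explainer samples e0 = E_i|G=0 and e1 = E_i|G=1."""
    p = (np.arange(grid) + 0.5) / grid
    diff = (np.quantile(e0, p) - np.quantile(e1, p)) * varsigma
    beta_plus = diff[diff > 0].sum() / grid
    beta_minus = -diff[diff < 0].sum() / grid
    return beta_plus + beta_minus, beta_plus, beta_minus

rng = np.random.default_rng(5)
n, mu, sigma, tau = 200_000, 0.0, 1.0, 2.0
g = rng.integers(0, 2, n)
x1 = rng.normal(mu + tau * g, sigma)      # X_1 ~ N(mu + tau*G, sigma)
x2 = rng.normal(mu - tau * g, sigma)      # X_2 ~ N(mu - tau*G, sigma)
f = x1 + x2                               # f(X) = X_1 + X_2

print("Bias_W1(f|G)        :", wasserstein_distance(f[g == 0], f[g == 1]))   # approximately 0
print("beta_1 (total, +, -):", bias_explanation(x1[g == 0], x1[g == 1]))     # total approximately tau
print("beta_2 (total, +, -):", bias_explanation(x2[g == 0], x2[g == 1]))     # total approximately tau
```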

Lemma 10

Let X, f, G, \(E_i(X;f)\), and \(\varsigma _{f}\) be as in Definition 9. Then

$$\begin{aligned} \beta _i^{net}(f|X,G;E_i)= \big (\mathbb {E}[E_i(X;f)|G=0]-\mathbb {E}[E_i(X;f)|G=1] \big ) \cdot \varsigma _f. \end{aligned}$$
(22)

Proof

Similar to the proof of Theorem 2 with the assumption \(\varsigma _{E_i}=\varsigma _f\). \(\square\)

Observe that the bias explanations for a classification score always lie in the unit interval.

Lemma 11

Let f be a classification score and \(G \in \{0,1\}\) the protected attribute. Let the explainer \(E_i\) be either \(v(\{i\};X,f)\) or \(\varphi _i[v(\cdot ;X,f)]\), where \(v \in \{v^{ C\!E},v^{ M\!E}\}\). Then \(\beta _i,\beta _i^-,\beta _i^+ \in [0,1]\).

Proof

The lemma follows from the fact that \(f \in [0,1]\) and the definition of explainer values. \(\square\)

The explainer \(E_i\) that appears in Definition 9 is a generic one. In the examples that follow we choose to work with explainers based on marginal SHAP values because of the ease of computation. Note that when predictors are independent the two types of explanations coincide; for the case when dependencies are present, see the discussion at the end of the section.

Intuition. For a given model f and the explainer \(E_i\) the explanation \(\beta _i\) quantifies the \(W_1\) distance between the distributions of the explainer \(E_i|G=0\) and \(E_i|G=1\), that is, this value is an assessment of the bias introduced by the predictor \(X_i\). The value \(\beta _i\) is the area between the corresponding subpopulation explainer CDFs \(F_{E_i|G=k}\), \(k \in \{0,1\}\), similar to the area depicted in Fig. 4. The value \(\beta _i^+\) represents the bias across quantiles of the explainer \(E_i\) for which the predictor \(X_i\) favors the non-protected class \(G=0\) and \(\beta _i^-\) represents the bias across quantiles for which \(X_i\) favors the protected class \(G=1\). The \(\beta ^{net}_i\) assesses the net contribution across different quantiles and represents an explanation that allows one to assess whether on average the predictor \(X_i\) favors class \(G=0\) or class \(G=1\); see Lemma 10.

In what follows we consider several simple examples to gain more intuition about the bias explanation values and to discuss their additivity, or the lack thereof. To avoid complex notation, when the context is clear we suppress the dependence of the bias explanations on X and the explainer E.

Definition 10

Let f, X, G, and \(E_i\) be as in Definition 9.

  • We say that \(E_i\) strictly favors class \(G=0\,(G=1)\) if \(\beta _i^-(f|G;E_i)=0\) (\(\beta _i^+(f|G;E_i)=0\)).

  • We say that \(X_i\) has mixed bias explanations if \(\beta _i^{\pm }(f|G;E_i)>0\).

Offsetting. Since each predictor may favor one class or the other, the predictors may offset each other in terms of the bias contributions to the model bias. To understand the offsetting effect consider a binary classification risk model (\(\varsigma _f=-1\)) with two predictors:

$$\begin{aligned}&X_1 \sim N(\mu +G,1), \quad X_2 \sim N(\mu -G,1) \\&Y \sim Bernoulli(f(X)), \quad f(X)=\mathbb {P}(Y=1|X)={logistic(2\mu -X_1-X_2)}\qquad \qquad (\hbox {M3}) \end{aligned}$$

where \(\mu =5\), and \(\{X_i|G=k\}_{i,k}\) are independent and \(\mathbb {P}(G=0)=\mathbb {P}(G=1)\). We next train a logistic regression score \(\hat{f}(X)\), with \(\varsigma _{\hat{f}}=-1\), and choose the explainer to be \(E_i= P\!D\!P _i\). By construction, the explanation \(E_1\) of the predictor \(X_1\) strictly favors class \(G=0\), while that of \(X_2\) strictly favors class \(G=1\). Moreover,

$$\begin{aligned} \beta _1(\hat{f}|G; E_1)=\beta ^+_1(\hat{f}|G; E_1)=\beta _2(\hat{f}|G; E_2)=\beta ^-_2(\hat{f}|G;E_2)\approx 0.17. \end{aligned}$$

Combining the two predictors at the model level leads to bias offsetting. By construction, the resulting model bias is \(\mathrm{Bias}_{W_1}({f}|G)=0\). Figure 5 displays the CDFs for the trained score subpopulations \(\hat{f}|G=k\) and the corresponding explainers \(E_i|G=k\), which illustrates the offsetting phenomenon numerically.

Fig. 5

Model and PDP biases for the model (M3), \(\varsigma _{\hat{f}}=-1\)

Another important point we need to make is that the equality \(\beta _i^{net}=0\) does not in general imply that the predictor \(X_i\) has no effect on the model bias. This is a consequence of (22). Moreover, predictors with mixed bias might amplify the model bias as well as offset it. To understand how mixed bias predictors interact at the level of the model bias consider the following risk classification model (\(\varsigma _f=-1\)).

$$\begin{aligned}&X_1 \sim N(\mu , 1+G), \quad X_2 \sim N(\mu , 1+G) \\&Y \sim Bernoulli(f(X)), \quad f(X)=\mathbb {P}(Y=1|X)={logistic(2\mu -X_1-X_2)}.\qquad \qquad (\hbox {M4}) \end{aligned}$$

where \(\mu =5\), and \(\{X_i|G=k\}_{i,k}\) are independent and \(\mathbb {P}(G=0)=\mathbb {P}(G=1)\). As before we train a logistic regression score \(\hat{f}\), with \(\varsigma _{\hat{f}}=-1\), and choose \(E_i= P\!D\!P _i\). By construction, the true classification score f satisfies \(\beta ^{net}_i(f|G)=0\) for each predictor \(X_i\). Furthermore, the CDFs of explainers satisfy

$$\begin{aligned} (F_{E_i(X,f)|G=0}(t)-F_{E_i(X,f)|G=1}(t)) \cdot {\mathrm{sgn}}(t-0.5)>0 \end{aligned}$$

for any threshold \(t \ne 0.5\). Combining the two predictors at the level of the model amplifies the positive and negative model biases, and hence the model bias itself. Figure 6 displays the CDFs for the trained score subpopulations \(\hat{f}|G=k\) and the corresponding explainers \(E_i(\hat{f})|G=k\). The numerics illustrate that, as long as the regions of positive and negative bias of the mixed predictors agree, combining them increases the model bias.

Fig. 6

Model and PDP biases for the model (M4), \(\varsigma _{\hat{f}}=-1\)

If the regions of positive and negative bias for two predictors do not agree, then offsetting will happen. To see this, let us modify the above example as follows:

$$\begin{aligned}&X_1 \sim N(\mu , 2-G ), X_2 \sim N(\mu , 1+G ) \\&Y \sim Bernoulli(f(X)), \quad f(X)=\mathbb {P}(Y=1|X)={logistic(2\mu -X_1-X_2)}.\qquad \qquad (\hbox {M5}) \end{aligned}$$

By construction, \(\beta ^{net}_i(f|G)=0\) for each predictor. However, the region of thresholds where the explainer \(E_1(f)\) favors class \(G=0\) coincides with the region where \(E_2(f)\) favors class \(G=1\), and the same holds for the two complementary regions. This leads to bias offsetting so that \(\mathrm{Bias}_{W_1}(f|G)=0\). The numerical results for this example are displayed in Fig. 7.

Fig. 7

Model and PDP biases for the model (M5), \(\varsigma _{\hat{f}}=-1\)

Bias explanation plots. Given a machine learning model f, predictors \(X \in \mathbb {R}^n\), protected attribute G, and the explainers \(E_i\), the corresponding bias explanations

$$\begin{aligned} \big \{(\beta _i,\beta _i^+,\beta _i^-,\beta ^{net}_i)(f|G;E_i)\big \}_{i=1}^n \end{aligned}$$

are sorted according to any desired entry in the 4-tuple and then displayed in that order. This plot is called a Bias Explanation Plot (BEP).

To showcase how the BEP works, consider a classification risk model (\(\varsigma _f=-1\)):

$$\begin{aligned}&\mu =5, \quad a=\tfrac{1}{20}(10,-4,16,1,-3)\\&X_1 \sim N(\mu -a_1 (1-G), 0.5+G ), \quad X_2 \sim N(\mu -a_2 (1-G), 1 ) \\&X_3 \sim N(\mu -a_3 (1-G), 1 ), \quad X_4 \sim N(\mu -a_4 (1-G), 1-0.5 G ) \\&X_5 \sim N(\mu -a_5 (1-G),1-0.75 G ) \\&Y \sim Bernoulli(f(X)), \quad f(X)=\mathbb {P}(Y=1|X)={logistic(\textstyle {\sum _{i}} X_i-24.5)}.\qquad \qquad (\hbox {M6}) \end{aligned}$$

where \(\{X_i|G=k\}_{i,k}\) are independent and \(\mathbb {P}(G=0)=\mathbb {P}(G=1)\). We next generate 20,000 samples from the distribution of (X, Y) and train a regularized XGBoost model, which produces the score \(\hat{f}\). Figure 8 displays the CDFs of the subpopulation scores \(\hat{f}|G=k\) (top left) and those of the explainers \(E_i=\varphi _i(\hat{f}, v^{ M\!E})\). The CDF plot shows a positive model bias; thus class \(G=0\) is favored. For the predictors, the bias explanation plots show that \(X_1\), \(X_4\), and \(X_5\) have mixed biases, which arise due to differences in the subpopulation variances of the predictors, while the bias in \(X_2\) strictly favors class \(G=1\) and the bias in \(X_3\) favors \(G=0\).
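The sketch below generates data from (M6) and fits a gradient-boosted score; the specific XGBoost hyperparameters are assumptions and do not reproduce the exact regularization used for Fig. 8.

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(7)
n, mu = 20_000, 5.0
a = np.array([10, -4, 16, 1, -3]) / 20.0
g = rng.integers(0, 2, n)

# subpopulation means mu - a_i*(1 - G) and standard deviations as specified in (M6)
scale = np.column_stack([0.5 + g, np.ones(n), np.ones(n), 1 - 0.5 * g, 1 - 0.75 * g])
X = rng.normal(mu - a * (1 - g)[:, None], scale)
p = 1.0 / (1.0 + np.exp(-(X.sum(axis=1) - 24.5)))      # true score f(X)
y = rng.binomial(1, p)

clf = XGBClassifier(n_estimators=200, max_depth=3, learning_rate=0.1,
                    reg_lambda=1.0, subsample=0.8)
clf.fit(X, y)
f_hat = clf.predict_proba(X)[:, 1]                     # trained classification score
```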

Fig. 8

Model bias and SHAP explainer biases for trained XGBoost (M6), \(\varsigma _{\hat{f}}=-1\)

The numerically computed model bias and its disaggregation are given by

$$\begin{aligned} ( \mathrm{Bias}_{W_1}, \mathrm{Bias}^+_{W_1}, \mathrm{Bias}^-_{W_1}, \mathrm{Bias}^{net}_{W_1})(\hat{f}|G)=(0.1533,0.1533,0,0.1533) \end{aligned}$$

The bias explanations are then computed as the Earth Mover's distance, and its disaggregation, between the distributions of the subpopulation explainers \(E_i(\hat{f})|G=k\). The bias explanations are given by

$$\begin{aligned} \begin{aligned} (\beta _1,\beta _1^+,\beta _1^-,\beta ^{net}_1)&=(0.0860,0.0799,0.0061,0.0738)\\ (\beta _2,\beta _2^+,\beta _2^-,\beta ^{net}_2)&=(0.0328,0, 0.0328,-0.0328)\\ (\beta _3,\beta _3^+,\beta _3^-,\beta ^{net}_3)&=(0.1100,0.1100,0,0.1100)\\ (\beta _4,\beta _4^+,\beta _4^-,\beta ^{net}_4)&=(0.0289,0.0169,0.0119,0.0050)\\ (\beta _5,\beta _5^+,\beta _5^-,\beta ^{net}_5)&=(0.0584,0.0127,0.0457,-0.0330)\\ \end{aligned} \end{aligned}$$

Figure 9 displays the above bias explanations for each predictor in increasing order of total bias (left) and positive bias (middle), together with the ranked net bias (right). These plots convey clearer information than Fig. 8; for example, one can now see how mixed \(X_1,X_4,X_5\) are and how the positive and negative parts compare.

Fig. 9

Bias explanations ranked by \(\beta _i\) and \(\beta _i^+\) and ranked \(\beta ^{net}_i\) for the model (M6), \(\varsigma _{\hat{f}}=-1\)

Relationship with model bias The positive and negative bias explanations provide an informative way to determine the main drivers of the positive and negative bias among the predictors, which can be done by ranking the bias attributions. They are, however, not additive. That is, in general

$$\begin{aligned} \textstyle \mathrm{Bias}_{W_1}^{\pm }(\hat{f}|G) \ne \sum _{i=1}^n \beta _i^{\pm }(\hat{f}|G; E_i). \end{aligned}$$

The main reason for lack of additivity is the presence of bias interactions which happen at the level of quantiles, or thresholds. The bias explanations by design compute the contribution to the cost of transport but do not track how mass is transported; see Figs. 6, 7. To better understand the bias interactions, motivated by Štrumbelj and Kononenko (2010), we introduce a game theoretic approach in Sect. 4.5 that yields additive bias explanations.

For additive models with independent predictors, however, we have the following result.

Lemma 12

Let \(X \in \mathbb {R}^n\) be predictors. Let the model f be additive, that is, \(f(X)=\sum _{i=1}^n f_i(X_i)\). Let an explainer \(E_i\) be either \(v^{ M\!E}(\{i\};X,f)\) or \(\varphi _i[v^{ M\!E}(\cdot ;X,f)]\). Let \(\{\beta _i,\beta _i^+,\beta _i^-,\beta ^{net}_i\}_i\) be the bias explanations of (Xf). Then

$$\begin{aligned} \mathrm{Bias}^{net}_{W_1}(f|G)= \mathrm{Bias}^{+}_{W_1}(f|G)- \mathrm{Bias}^{-}_{W_1}(f|G) = \sum _{i=1}^n \big (\beta ^+_i - \beta ^-_i\big )=\sum _{i=1}^n \beta ^{net}_i. \end{aligned}$$

If X are independent then the lemma holds for \(E_i\) in the form \(v^{ C\!E}(\{i\};X,f)\) or \(\varphi _i[v^{ C\!E}(\cdot ;X,f)]\).

Proof

Suppose that \(E_i(X;f)=v^{ M\!E}(\{i\};X,f)\). Then, in view of the additivity of f, we have

$$\begin{aligned} v^{ M\!E}(\{i\};X,f) = f_i(X_i)-\mathbb {E}[f_i(X_i)]+\mathbb {E}[f(X)] \end{aligned}$$

and hence by Lemma 10 we have

$$\begin{aligned} \beta _i^{net}(f|G; v^{ M\!E}) = \big ( \mathbb {E}[f_i(X_i)|G=0]-\mathbb {E}[f_i(X_i)|G=1] \big ) \cdot \varsigma _{f}. \end{aligned}$$

Summing up the net bias explanations gives

$$\begin{aligned} \begin{aligned} \sum _i \beta _i^{net}(f|G; v^{ M\!E})&= \sum _i \big ( \mathbb {E}[f_i(X_i)|G=0]-\mathbb {E}[f_i(X_i)|G=1] \big ) \cdot \varsigma _{f} \\&= \big ( \mathbb {E}[f(X)|G=0]-\mathbb {E}[f(X)|G=1] \big ) \cdot \varsigma _{f} = \mathrm{Bias}^{net}_{W_1}(f|G). \end{aligned} \end{aligned}$$
(23)

Suppose that \(E_i(X;f)=\varphi _i(X; f, v^{ M\!E})\). Since \(\{X_i\}_{i=1}^n\) are independent and f is additive,

$$\begin{aligned} \varphi _i(X; f, v^{ M\!E})=\varphi _i(X; f, v^{ C\!E}) = f_i(X_i)-\mathbb {E}[f_i(X_i)] = v^{ M\!E}(\{i\};X,f) - \mathbb {E}[f(X)]. \end{aligned}$$

Since a constant shift in the distribution does not affect the bias, the bias explanations based on \(\varphi _i[v^{ M\!E}]\) coincide with those of \(v^{ M\!E}\). This, together with (23) and the independence assumption, proves the lemma. \(\square\)

Example

Let f be as in Lemma 12. Suppose that f is either positively biased or negatively biased, that is, \(\mathrm{Bias}_{W_1}(f|G) = (1-\delta ) \cdot \mathrm{Bias}_{W_1}^+(f|G)+\delta \cdot \mathrm{Bias}_{W_1}^-(f|G)\) with \(\delta \in \{0,1\}\). Then

$$\begin{aligned} \mathrm{Bias}_{W_1}(f|G)= (-1)^{\delta } \sum _{i=1}^n (\beta _i^+ - \beta _i^-). \end{aligned}$$

4.4 Stability of marginal and conditional bias explanations

Under dependencies, the marginal and conditional bias explanations differ in what they describe. The conditional bias explanations rely on the joint distribution of (X, Y) and encapsulate the interaction between the bias in the predictors and the response variable, while the marginal explanations encapsulate the interaction between the bias in the predictors and the structure of the model, that is, the map \(x\rightarrow f(x)\); for details see Miroshnikov et al. (2021a). In particular, we have the following result.

Theorem 4

(stability) Let \(X \in \mathbb {R}^n\) be predictors. Let \(E_i=\varphi _i[v]\), \(v \in \{v^{ C\!E},v^{ M\!E}\}\). The bias explanations based on the marginal and conditional Shapley values satisfy the following:

  1. (i)

    For all \(f,g \in L^2(P_X)\), we have

    $$\begin{aligned} |\beta _i^{\pm }(f|X,G,\varphi _i[v^{ C\!E}])-\beta _i^{\pm }(g|X,G,\varphi _i[v^{ C\!E}]) | \le C\Vert f-g\Vert _{L^2(P_X)}. \end{aligned}$$
  2. (ii)

    For all \(f,g \in L^2(\widetilde{P}_X)\), we have

    $$\begin{aligned} |\beta _i^{\pm }(f|X,G,\varphi _i[v^{ M\!E}])-\beta _i^{\pm }(g|X,G,\varphi _i[v^{ M\!E}]) | \le C\Vert f-g\Vert _{L^2(\widetilde{P}_X)}. \end{aligned}$$

Proof

Take \(f,g \in L^2(P_X)\). Take \(i \in \{1,2,\dots ,n\}\) and set

$$\begin{aligned} A=\varphi _i[ v^{ C\!E}(\cdot ;X,f)], \quad B=\varphi _i[ v^{ C\!E}(\cdot ;X,g)]. \end{aligned}$$

Let \(\mu _k=P_{A|G=k}\), \(\nu _k=P_{B|G=k}\), and \(\gamma _k=P_{(A,B)|G=k}\) for \(k \in \{0,1\}\). By construction \(\gamma _k \in \Pi (\mu _k,\nu _k)\) and hence

$$\begin{aligned} \begin{aligned} \sum _{k \in \{0,1\}} W_1(\mu _k,\nu _k)&\le \sum _{k \in \{0,1\}} \int |x_1 - x_2 | P_{(A,B)|G=k}(dx_1,dx_2) \\&\le \sum _{k \in \{0,1\}} \mathbb {E}[ |A-B| \,|\, G=k]\\&\le C \Vert A-B\Vert _{L^2(\mathbb {P})} \le C \Vert f-g\Vert _{L^2(P_X)} \end{aligned} \end{aligned}$$

where \(C=\max _{ k \in \{0,1\} }\big \{ \tfrac{1}{\mathbb {P}(G=k)} \big \}\) and the last inequality follows from Lemma 9(i).

Then, using the triangle inequality and the inequality above, we obtain

$$\begin{aligned} \begin{aligned} |\beta _i(f|X,G,\varphi _i[v^{ C\!E}])-\beta _i(g|X,G,\varphi _i[v^{ C\!E}])|&=|W_1(\mu _0,\mu _1)-W_1(\nu _0,\nu _1)| \\&\le W_1(\mu _0,\nu _0)+W_1(\mu _1,\nu _1) \\&\le C \Vert f-g\Vert _{L^2(P_X)}. \end{aligned} \end{aligned}$$

We next establish the bounds for the net-bias explanations. Assuming \(\varsigma _{f}=\varsigma _{g}\) and using Lemma 10 we obtain

$$\begin{aligned} \begin{aligned}&|\beta _i^{net}(f|X,G,\varphi _i[v^{ C\!E}])-\beta _i^{net}(g|X,G,\varphi _i[v^{ C\!E}])| \\&=| \mathbb {E}[A|G=0] - \mathbb {E}[ A|G=1] - \mathbb {E}[B|G=0] + \mathbb {E}[B|G=1] |\\&\le \sum _{k \in \{0,1\}} \mathbb {E}[ |A-B| | G=k]\\&\le C \Vert A-B\Vert _{L^2(P)} \le C \Vert f-g\Vert _{L^2(P_X)}. \end{aligned} \end{aligned}$$

Combining the above inequalities and using the fact that \(\beta ^{\pm }=\frac{1}{2}(\beta \pm \beta ^{net})\) gives (i). To prove (ii), we follow the same steps as above and use Lemma 9(ii). \(\square\)

Remark 4

Theorem 4 implies that the map \(f \rightarrow \beta _i^{\pm }(f|X,G,\varphi _i[v^{ C\!E}])\) is continuous in \(L^2(P_X)\) and the map \(f \rightarrow \beta _i^{\pm }(f|X,G,\varphi _i[v^{ M\!E}])\) is continuous in \(L^2(\widetilde{P}_X)\).

4.5 Shapley-bias explanations

As discussed in Sect. 4.3, the non-additive bias explanations measure the positive and negative contributions to the total model bias, but not to each individual flow. To address this, we measure the signed contributions to the positive and negative model biases by employing a game-theoretic approach, which has been explored in numerous works in the area of machine learning interpretability; see Lipovetsky and Conklin (2001); Štrumbelj and Kononenko (2010); Lundberg and Lee (2017). In the spirit of Štrumbelj and Kononenko (2010), we define a cooperative game in which the players are predictors and the payoff is their bias contribution, and then compute the corresponding additive Shapley values.

Group explainers. Let \(X \in \mathbb {R}^n\) be predictors and f a model. A generic group explainer of f is denoted by

$$\begin{aligned} E(S; X, f), \quad S \subset \{1,2,\dots ,n\}. \end{aligned}$$

We assume that E(S; X, f) quantifies the attribution of the subvector \(X_S\), \(S \subset \{1,2,\dots ,n\}\), to the model value f(X) and satisfies

$$\begin{aligned} E(\varnothing ; X, f)=\mathbb {E}[f(X)], \quad E(\{1,2,\dots ,n\}; X,f)=f(X). \end{aligned}$$

Relatively straightforward group explainers can be constructed using the conditional and marginal games or the corresponding game values. In particular, for a nonempty \(S \subset \{1,2,\dots ,n\}\) one can set a trivial group explainer as

$$\begin{aligned} v(S; X,f) \quad \text {or} \quad \varphi _S[v]=\varphi _S(X;f,v)=\sum _{i \in S} \varphi _i(X;f,v) \quad \text {where} \quad v \in \{ v^{ C\!E}, v^{ M\!E}\}. \end{aligned}$$
(24)

Definition 11

Let \(X,G,f,\varsigma _f\) be as in Definition 9. Let \(E(\cdot \,; X,f)\) be a group explainer.

  • Cooperative bias-game \(v^{bias}\) associated with X, G, f and E is defined by

    $$\begin{aligned} v^{bias}(S; G, E(\cdot ; X,f))=W_1(E(S; X,f)|G=0,E(S; X,f)|G=1), \quad S\subset \{1,2,\dots ,n\}. \end{aligned}$$

    \(v^{bias}(S)\) is the minimal cost of transporting \(E(S)|G=0\) to \(E(S)|G=1\) and vice versa.

  • Under optimal transport the positive bias-game and negative bias-game, respectively, are defined by:

    • \(v^{bias+}(S)\) is the effort of transporting \(E(S)|G=0\) in the non-favorable direction.

    • \(v^{bias-}(S)\) is the effort of transporting \(E(S)|G=0\) in the favorable direction.

    The above values are specified in Lemma 2 for \(q=1\).

  • Net bias-game is defined by

    $$\begin{aligned} v^{bias,net}=v^{bias+}-v^{bias-}. \end{aligned}$$
  • The Shapley-bias explanations of (Xf) based on the group explainer E are defined by

    $$\begin{aligned} \begin{aligned} \varphi ^{bias}(f|G) = \varphi [v^{bias}], \quad \varphi ^{bias\pm }(f|G)=\varphi [v^{bias\pm }], \quad \varphi ^{bias,net}(f|G) = \varphi [v^{bias,net}] \end{aligned} \end{aligned}$$
    (25)

    where \(\varphi\) denotes the Shapley value (19) and where we suppressed the dependence on X and E.

Unlike the regular bias explanations, which by construction are always non-negative, the Shapley-bias explanations are signed, that is, they can be either positive or negative.

Lemma 13

Given (X, f) and the explainer E, the Shapley-bias explanations defined in (25) satisfy

$$\begin{aligned} \begin{aligned} \sum _{i=1}^n \varphi _i^{bias}= \mathrm{Bias}_{W_1}(f|G), \quad \sum _{i=1}^n \varphi _i^{bias\pm }= \mathrm{Bias}_{W_1}^{\pm }(f|G), \quad \sum _{i=1}^n \varphi _i^{bias,net}= \mathrm{Bias}_{W_1}^{net}(f|G) \end{aligned} \end{aligned}$$

and, thus,

$$\begin{aligned} \begin{aligned} \varphi [v^{bias}]&= \varphi [v^{bias+}] + \varphi [v^{bias-}] \\ \varphi [v^{bias,net}]&= \varphi [v^{bias+}] - \varphi [v^{bias-}]. \end{aligned} \end{aligned}$$

Proof

The result follows from Shapley (1953) and the properties of the \(W_1\)-based model bias. \(\square\)
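A brute-force sketch of the Shapley-bias explanations (25) is given below, using the trivial group explainer \(E(S)=v^{ M\!E}(S;X,f)\) from (24); it is exponential in the number of predictors and is intended only for small n, with the toy model, data, and background sample being assumptions.

```python
import itertools
import math
import numpy as np
from scipy.stats import wasserstein_distance

def marginal_group_explainer(S, X, model, X_background):
    """E(S)(x) = E[f(x_S, X_{-S})], estimated by averaging over the background sample."""
    out = np.empty(len(X))
    for j, x in enumerate(X):
        X_mod = np.array(X_background, dtype=float, copy=True)
        if S:
            X_mod[:, list(S)] = x[list(S)]
        out[j] = model(X_mod).mean()
    return out

def shapley_bias(X, g, model, X_background):
    n = X.shape[1]
    v = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            e = marginal_group_explainer(S, X, model, X_background)
            v[S] = wasserstein_distance(e[g == 0], e[g == 1])      # v^bias(S)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(n):
            for S in itertools.combinations(others, r):
                w = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
                phi[i] += w * (v[tuple(sorted(S + (i,)))] - v[S])
    return phi                                       # sums to Bias_W1(f|X,G) by Lemma 13

# toy usage: two predictors whose biases offset each other at the model level
rng = np.random.default_rng(6)
g = rng.integers(0, 2, 4000)
X = np.column_stack([rng.normal(g, 1.0), rng.normal(-g, 1.0)])
model = lambda Z: Z[:, 0] + Z[:, 1]
print(shapley_bias(X, g, model, X[:500]))            # both attributions are close to zero
```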

For Shapley-bias explanations based on the conditional and marginal games we have the following.

Theorem 5

Given (X, f), let the conditional and marginal bias games be defined by

$$\begin{aligned} \begin{aligned} v^{bias, C\!E}(S;X,f)&=v^{bias}(S \,;\varphi _S[v^{CE}(\cdot ;X,f)])\\ v^{bias, M\!E}(S;X,f)&=v^{bias}(S \,;\varphi _S[v^{ME}(\cdot ;X,f)]) \end{aligned} \end{aligned}$$

The conditional and marginal Shapley-bias explanations have the following properties:

  1. (i)

    \(|\varphi _i^{bias\pm }(f|G,\varphi _S[v^{ C\!E}])-\varphi _i^{bias\pm }(g|G,\varphi _S[v^{ C\!E}])| \le C\Vert f-g\Vert _{L^2(P_X)}\).

  2. (ii)

    \(|\varphi _i^{bias\pm }(f|G,\varphi _S[v^{ M\!E}])-\varphi _i^{bias\pm }(g|G,\varphi _S[v^{ M\!E}])| \le C\Vert f-g\Vert _{L^2(\widetilde{P}_X)}\).

Proof

The proof follows the same steps as in Theorem 4. \(\square\)

Example

Applying the above methodology to \(\hat{f}\) and G of the model (M6) we compute the Shapley-bias explanations of predictors \(X_i\), \(i\in \{1,2,\dots ,5\}\) using the group explainer \(E(S)=\varphi _S[v^{ M\!E}]\) defined in (24) for the construction of the bias-game.

The results are displayed in Fig. 10. On the left, the explanations are plotted in increasing order of the positive bias, and in the middle plot by the total bias, while the right plot contains the information on all four types of biases. By comparing these to the non-additive bias explanation plots in Fig. 9 we see how the signed values provide further information on how the predictors contribute to the model bias.

Fig. 10

Additive Shapley-bias explanations based on the game \(v^{bias,ME}\) for the model (M6)

For example, from (M6) we have that \(X_3\), as a contributor to the model \(\hat{f}\), favors the class \(G=0\) since \(\beta _3^+>0\) and \(\beta _3^-=0\). Recall that \(\beta _3^+\) captures the total contribution to the increase of the positive model bias plus the decrease (or resistance) to the negative model bias. The Shapley-bias explanations, however, allow one to estimate separately the (signed) contributions to both positive and negative model bias.

In particular, the left plot of Fig. 10 informs us that \(X_3\) in \(\hat{f}\) contributes to the increase of the positive model bias (green), measuring the contribution to pushing the subpopulation of the non-protected class towards the favorable direction, while its contribution to the negative model bias (blue) is negative, which indicates the resistance towards the subpopulation’s pull in the non-favorable direction.

4.6 Group Shapley-bias explanations

It might be important for a practitioner to understand the main factors within the data itself that contribute to the bias in the response variable and not how the model structure contributes to it. To do this, one needs to generate bias explanations based on the conditional game \(v^{ C\!E}\). The conditional game, when predictors are independent, coincides with the marginal game and the conditional expectations \(\mathbb {E}[f(X)|X_S]\) can be computed through averaging with error control. However, under dependencies, the conditional expectations and corresponding Shapley-bias explanations are difficult to compute in light of the curse of dimensionality.

Another important aspect to consider is that highly dependent predictors carry similar information. For instance, in the case where a group of predictors is represented via a smaller collection of latent variables, the latent variable explanations are spread out among the predictors in that group; see Chen et al. (2020). Under dependencies, for practical and business purposes, one may want to explain the information carried by the entire group rather than the predictors themselves.

The two issues mentioned above can be addressed simultaneously by adapting the ideas from Aas et al. (2020); Miroshnikov et al. (2021a). In particular, grouping predictors based on dependencies and utilizing specially-designed group explainers to compute the contribution of the group help unite the marginal and conditional approaches. Therefore, applying similar techniques, one can approximate the conditional Shapley-bias explanations of weakly independent groups using the marginal approach, which only requires averaging over a small dataset. Furthermore, grouping allows one to reduce complexity.

In what follows we adapt the techniques from Miroshnikov et al. (2021a) to construct group Shapley-bias explanations. To this end, we first introduce the required notation. Let \(X \in \mathbb {R}^n\) and let \(\{S_j\}_{j=1}^m\) be disjoint sets that partition the set of predictor indices,

$$\begin{aligned} \textstyle N = \{1,2,\dots ,n\} = \bigcup _{j=1}^m S_j, \quad \mathcal {P}=\{S_1,S_2,\dots ,S_m\}, \end{aligned}$$
(26)

so that \(X_{S_1},X_{S_2},\dots ,X_{S_m}\) form weakly independent groups such that within each group the predictors share a significant amount of mutual information. Given a cooperative game v on N, we define the quotient game by

$$\begin{aligned} \textstyle v^{\mathcal {P}}(A)=v\big ( \bigcup _{j \in A} S_j \big ), \quad A \subset M=\{1,2,\dots ,m\}. \end{aligned}$$

By design, the quotient game \(v^{\mathcal {P}}\) is played by the groups, viewing the elements of the partition as players.

Definition 12

Let X, f, G be as in Definition 9 and let the partition \(\mathcal {P}\) be as in (26).

  • The conditional and marginal group bias-games are defined by

    $$\begin{aligned} v^{bias}_{\mathcal {P}}(A;X,G,f,v)=W_1\big ( v^{\mathcal {P}}(A) |G=0, v^{\mathcal {P}}(A)|G=1 \big ), \quad v \in \{v^{ C\!E},v^{ M\!E}\}. \end{aligned}$$
    (27)
  • The corresponding Shapley-bias explanations of \(\{X_{S_j}\}_{j=1}^m\) are then defined by

    $$\begin{aligned} \begin{aligned} \varphi ^{bias,\mathcal {P}}_{S_j}(f|X,G; v)&= \varphi _j[v^{bias}_{\mathcal {P}}(\cdot \,;v)], \quad v \in \{v^{ C\!E},v^{ M\!E}\}. \end{aligned} \end{aligned}$$

Lemma 14

Let \(X,f,G,\mathcal {P}\) be as in Definition 12. If \(\{X_{S_j}\}_{j=1}^m\) are independent, then

$$\begin{aligned} \varphi ^{bias,\mathcal {P}}_{S_j}(f|X,G; v^{ C\!E})=\varphi ^{bias,\mathcal {P}}_{S_j}(f|X,G; v^{ M\!E}), \quad S_j \in \mathcal {P}. \end{aligned}$$
(28)

Consequently,

$$\begin{aligned} |\varphi ^{bias,\mathcal {P}}_{S_j}(f|X,G; v)-\varphi ^{bias,\mathcal {P}}_{S_j}(g|X,G; v)| \le C\Vert f-g\Vert _{L^2(P_X)}, \quad v \in \{v^{ C\!E},v^{ M\!E}\}. \end{aligned}$$

Proof

By independence, we have \(v^{ M\!E,\mathcal {P}}=v^{ C\!E,\mathcal {P}}\). Hence by (27) we obtain

$$\begin{aligned} v^{bias}_{\mathcal {P}}(A;v^{ C\!E})=v^{bias}_{\mathcal {P}}(A;v^{ M\!E}), \quad A \subset M \end{aligned}$$

and this yields (28). The stability argument can be carried out similarly to Lemma 4. \(\square\)

A similar construction is used to compute the positive and negative group bias explanations \(\varphi ^{bias+,\mathcal {P}}_{S_j}\) and \(\varphi ^{bias-,\mathcal {P}}_{S_j}\), respectively.

Remark 5

The importance of equality (28) is that the expression on the right-hand side can be computed via averaging using a dataset with \(O(\tau ^{-2})\) samples for a given error tolerance \(\tau\). This makes the computation of the conditional explanations feasible. Furthermore, the complexity of the computations becomes \(O(2^m)\), where m is the number of independent groups. For example, given a classification score and \(X \in \mathbb {R}^{100}\) with the 100 predictors split into 10 independent groups, it suffices to use a dataset with 10000 samples to compute the conditional group Shapley-bias explanations with error tolerance \(\tau =0.01\) and complexity \(O(2^{10} \cdot 10000^2)\), which is feasible and easily parallelizable. If the number of independent groups is still large, the above technique can be modified to incorporate recursive groupings.

5 On the application of the framework

5.1 Bias mitigation under regulatory constraints

In this section, we discuss how the fairness interpretability framework can be used for real-world applications in financial institutions that operate under regulatory constraints.

An operational flow for model development in many FIs may consist of the following stages: (1) Model training, (2) Fair Lending Compliance governance review, and (3) Production, which includes the model prediction and decision-making steps. Steps 1 and 3 are carried out by quantitative departments, while step 2 is carried out by the dedicated Compliance Office (CO), a department separate from the business. The CO provides oversight of the company's compliance with federal and state regulations.

FIs are explicitly prohibited from collecting some protected information on customers, such as race and ethnicity (apart from mortgage lending), as stated by the ECOA. Furthermore, protected attributes cannot be used in training or inference. However, proxy information on the protected attribute, such as that derived from Bayesian Improved Surname Geocoding (BISG), is allowed to be used by the compliance office solely for fairness analysis (Elliot et al. 2009). Proxy information, however, must remain within the compliance office, and the business does not (and should not) have access to the proxy data.

The CO carries out the bias assessment step. Using our method, the CO can determine the main drivers contributing to the model bias and subsequently utilize bias mitigation methods. The bias mitigation step can include model post-processing. However, in order to adhere to regulations, a post-processed model must not utilize the proxy attribute \(\tilde{G}\) or any information on the joint distribution \((X,\tilde{G})\), such as the probabilities \(\mathbb {P}(\tilde{G}|X)\). The reasons are that (a) in the production step one only has access to X, and (b) a post-processed model is shared with business units, which should be prevented from inferring the protected attribute from X.

Some rudimentary techniques for bias mitigation include recommendations on which predictors to drop from training or model post-processing via nullifying a given predictor by fixing its value. A more efficient technique has been proposed in our companion paper Miroshnikov et al. (2021b). There we construct an efficient frontier over a family of compliant post-processed models utilizing the interpretability framework developed in the current article. Other examples of compliant methods include those that vary hyper-parameters to get an efficient frontier, such as those in Schmidt et al. (2021).

5.2 Pedagogical example on bias mitigation

In this section we provide a pedagogical example that showcases how to properly utilize the information on the positive and negative bias explanations when it comes to bias mitigation. A rudimentary mitigation technique one can employ is to construct a regulatory-compliant post-processed model by neutralizing an appropriate collection of predictors \(X_S\). This is accomplished by fixing their values in the model to some reference values \(x_S^*\) and setting \(\tilde{f}(x;S,x^*)=f(x_S^*,x_{-S})\).

Often the objective of the bias mitigation in FIs is the reduction of the positive model bias which quantifies how much the model favors the majority class. In practice, regressor models are usually positively-biased, meaning \(\mathrm{Bias}_{W_1}^+(f|G)>0\) and \(\mathrm{Bias}_{W_1}^-(f|G)=0\).

Taking into account the above discussion, let us assume for the sake of explanation that \(f(X)=\sum _{i=1}^n f_i(X_i)\) is an additive and positively-biased model. Let \(\beta ^+_i\), \(\beta ^-_i\), where \(i \in N=\{1,\dots ,n\}\), be the positive and negative marginal bias explanations, respectively. Finally, let us decompose the predictor index set as follows: \(N=N_+ \cup N_- \cup N_0\) where

$$\begin{aligned} N_+=\{i: \beta _i^+>\beta _i^-\}, \quad N_-=\{i: \beta _i^->\beta _i^+ \}, \quad N_0=\{i: \beta _i^+=\beta _i^-\}. \end{aligned}$$

In this case, by Lemma 12 the model bias is given by

$$\begin{aligned} \mathrm{Bias}_{W_1}(f|X,G)=\mathrm{Bias}_{W_1}^+(f|X,G)= \sum _{i \in N_+} (\beta _i^+ - \beta _i^-) - \sum _{i \in N_-} (\beta _i^- - \beta _i^+) > 0 \end{aligned}$$

which illustrates the bias offsetting mechanism.

Note that neutralizing a predictor \(X_{i_0}\) with \(i_0 \in N_-\) would cause the model bias, which is equal to the positive model bias, to increase, while neutralizing \(X_{i_1}\) with \(i_1 \in N_+\) would cause the model bias to decrease.

Thus, one approach to reduce the model bias is to rank order the predictors in \(N_+\) by their net-bias explanations and, subsequently, neutralize them one by one in that order. This will incrementally reduce the positive model bias until the point where neutralizing the next predictor causes the model bias to become equal to the negative model bias (with the positive model bias being zero), which operates as a stopping criterion of the approach. This simple and rather naïve strategy illustrates that a) the decomposition of explanations is useful for bias mitigation and that b) neutralization of biased predictors ranked by total bias contribution is not always the optimal strategy.
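A naive sketch of this loop is given below; the quantile grid, reference values, tolerance, and toy data are all assumptions, and the sketch is meant only to illustrate the ranking-and-stopping logic rather than a recommended mitigation procedure.

```python
import numpy as np

def positive_negative_bias(scores, g, varsigma=1, grid=1000):
    """Empirical positive and negative model biases via the subpopulation quantile functions."""
    p = (np.arange(grid) + 0.5) / grid
    diff = (np.quantile(scores[g == 0], p) - np.quantile(scores[g == 1], p)) * varsigma
    return diff[diff > 0].sum() / grid, -diff[diff < 0].sum() / grid

def neutralize_in_order(model, X, g, candidates, x_ref, varsigma=1, tol=1e-3):
    """candidates: indices in N_+ ranked by decreasing net bias explanation beta_i^net."""
    X_work = np.array(X, dtype=float, copy=True)
    neutralized = []
    for i in candidates:
        X_try = X_work.copy()
        X_try[:, i] = x_ref[i]          # neutralize X_i at a reference value x_i^*
        pos, _ = positive_negative_bias(model(X_try), g, varsigma)
        if pos <= tol:                  # next step would leave only negative bias: stop (one could also apply it)
            break
        X_work = X_try
        neutralized.append(i)
    return neutralized, X_work

# toy usage with an additive, positively biased score; N_+ = {0, 1} (net biases 0.4 and 0.3), N_- = {2}
rng = np.random.default_rng(8)
g = rng.integers(0, 2, 50_000)
X = np.column_stack([rng.normal(0.4 * (1 - g), 1.0),
                     rng.normal(0.3 * (1 - g), 1.0),
                     rng.normal(0.2 * g, 1.0)])
model = lambda Z: Z.sum(axis=1)
print(neutralize_in_order(model, X, g, candidates=[0, 1], x_ref=X.mean(axis=0))[0])
```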

5.3 Example on census income dataset

In this section, we showcase the application of the framework to the 1994 Census Income dataset from the UCI Machine Learning Repository (Dheeru et al. 2017).

This dataset contains fourteen predictors and a dependent variable Y that indicates whether an individual earns more or less than $50K annually. After investigating the predictors, we removed the protected attributes ‘sex’, ‘race’, ‘age’, and ‘native-country’. We also excluded ‘fnlwgt’ and ‘relationship’, the latter due to its high dependence on ‘sex’, since in the dataset the categorical values ‘Husband’ and ‘Wife’ correspond to ‘Male’ and ‘Female’, respectively. The remaining seven predictors used for model training are ‘workclass’, ‘education-num’, ‘occupation’, ‘marital-status’, ‘capital-gain’, ‘capital-loss’, and ‘hours-per-week’.

Algorithm 1

Fig. 11

Model training and protected attribute analysis

For the model training, we use the training dataset \(D_{train}\) with 32561 samples to build a classification score

$$\begin{aligned}\hat{f}(x)=\widehat{\mathbb {P}}(Y=\text {`>50K'}\,|\,X=x),\end{aligned}$$

using Gradient Boosting. For training we use the following parameters: n_estimators=200, min_samples_split=5, subsample=0.8, learning_rate=0.1. The feature importance of each predictor can be seen in Fig. 11a, with the most significant predictors being ‘marital-status’, ‘capital-gain’, and ‘education-num’.
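The following sketch reproduces the training step under stated assumptions: the local file path, the column names assigned to the raw UCI file, and the ordinal encoding of the categorical predictors are not specified in the paper and are chosen here only for illustration.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OrdinalEncoder

cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
        'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
        'hours-per-week', 'native-country', 'income']
df_train = pd.read_csv('adult.data', names=cols, skipinitialspace=True)  # assumed local copy of the data

features = ['workclass', 'education-num', 'occupation', 'marital-status',
            'capital-gain', 'capital-loss', 'hours-per-week']
categorical = ['workclass', 'occupation', 'marital-status']

X_train = df_train[features].copy()
X_train[categorical] = OrdinalEncoder().fit_transform(X_train[categorical])  # assumed encoding
y_train = (df_train['income'] == '>50K').astype(int)

gbm = GradientBoostingClassifier(n_estimators=200, min_samples_split=5,
                                 subsample=0.8, learning_rate=0.1)
gbm.fit(X_train, y_train)
scores = gbm.predict_proba(X_train)[:, 1]      # classification scores \hat{f}(x)
```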

We evaluate performance metrics for the GBM model on the training dataset and on the test dataset with 16251 samples. Specifically, the mean logloss on the train and test sets is approximately 0.288 and 0.292, respectively, and the AUC is 0.922 and 0.918, respectively.

The focus of the application is to evaluate and explain the model bias with respect to the protected attribute \(G=\)‘sex’, with values ‘Female’ and ‘Male’, where ‘Female’ is the protected class. To this end, following the steps in Algorithm 1, we form the dataset S containing the classification scores

$$\begin{aligned} S = \big \{\hat{f}(x^{(i)}): (x^{(i)},y^{(i)})\in D_{train} \big \}, \end{aligned}$$

and partition it based on each class of G. This yields the sets \(S_{F}\) and \(S_{M}\) containing the classification scores for ‘Female’ and ‘Male’, respectively, which we use to construct the empirical CDFs of the subpopulation scores, \(\hat{F}_{Female}\) and \(\hat{F}_{Male}\), using the ECDF class from the statsmodels library.

Figure 11b depicts the empirical CDFs, where we see that the model has almost exclusively positive bias; the favorable direction is taken to be positive, that is, \(\varsigma _{\hat{f}}=1\). To confirm this observation, we subsequently compute the positive and negative model biases by integrating the difference of the two CDFs over the sets where \(\hat{F}_{Female}> \hat{F}_{Male}\) and \(\hat{F}_{Female}<\hat{F}_{Male}\), respectively, as indicated in Definition 7. This yields the following values:

$$\begin{aligned} \mathrm{Bias}^+_{W_1}(\hat{f}|X,G) \approx 0.19, \quad \mathrm{Bias}^-_{W_1}(\hat{f}|X,G) \approx 0.00. \end{aligned}$$
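A sketch of this computation is shown below; the array names S_F_scores and S_M_scores holding the subpopulation scores, as well as the threshold grid, are assumptions.

```python
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

def signed_model_bias(scores_protected, scores_nonprotected, varsigma=1, grid=2000):
    """Positive and negative model biases obtained by integrating the CDF difference over thresholds."""
    F1 = ECDF(scores_protected)        # e.g. the 'Female' (protected) subpopulation
    F0 = ECDF(scores_nonprotected)     # e.g. the 'Male' (non-protected) subpopulation
    lo = min(scores_protected.min(), scores_nonprotected.min())
    hi = max(scores_protected.max(), scores_nonprotected.max())
    t = np.linspace(lo, hi, grid)
    diff = (F1(t) - F0(t)) * varsigma  # positive where the non-protected class is favored
    dt = t[1] - t[0]
    return diff[diff > 0].sum() * dt, -diff[diff < 0].sum() * dt

# usage with the subpopulation score sets S_F and S_M described above (hypothetical array names):
# pos_bias, neg_bias = signed_model_bias(np.array(S_F_scores), np.array(S_M_scores), varsigma=1)
```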

To understand the contributions of the predictors to the model bias, we next construct the bias explanations based on the marginal model explainer. To accomplish this, we subsample the predictors from the training set, and obtain a background dataset \(D_X\) with \(m=4000\) samples. Next, we compute the model explanations for each predictor \(X_i\) yielding the sets

$$\begin{aligned} S_{E_i} = \Big \{ \tfrac{1}{m}\sum _{x\in D_X}\hat{f}(x_i^{*},x_{-i}), \ x^{*}\in D_X \Big \}. \end{aligned}$$

Similar to obtaining the model bias, we then partition \(S_{E_i}\) based on each class of G and obtain the empirical CDFs of \(E_i(X)|G=g\), \(g\in \{\text {`Female'},\text {`Male'}\}\), which are then used to compute the bias explanations \(\beta _i^{\pm }\) according to Definition 9. These are depicted in Fig. 12a and are ranked in ascending order of the positive bias. All the values of the negative bias explanations are close to zero, which further indicates the positively biased nature of the predictors. Observe that the predictor contributing most to the positive model bias is by far ‘marital-status’, with a value of approximately 0.12.

Since ‘marital-status’ is the most impactful predictor, its effect on the model bias merits further investigation. To this end, we group the different values of ‘marital-status’ into three categories: \(M_1 =\)‘never-married’, \(M_2=\)‘married’, and \(M_3=\)‘was-married’. Then, we segment the dataset S of classification scores into three subsets \(S_{M_i}\), \(i\in \{1,2,3\}\), that correspond to the aforementioned categories. To gain further understanding of how each of these categories contributes to the model bias, we compute the model bias on each segment. The negative model bias on each segment turns out to be zero, while the positive model biases are plotted in Fig. 12b. The plot indicates that the category ‘never-married’ exhibits an insignificant level of bias, while there is substantial positive bias in ‘married’ and ‘was-married’.

Fig. 12

Model bias explanations

Fig. 13

Bias explanations for the re-trained model without ‘marital-status’ predictor

Given the above analysis, one can attempt to reduce the model bias either by applying the postprocessing technique discussed in Sect. 5.2 or by retraining the model without some of the biased predictors. We showcase the latter approach by dropping ‘marital-status’ and retraining the model with the same parameters. On the train and test sets, the retrained model has a mean logloss of 0.358 and 0.363, respectively, and an AUC of 0.862 and 0.855, respectively. We then compute the model bias and bias explanations; see Fig. 13. The positive model bias is reduced to approximately 0.10, while the negative bias remains zero. The trade-off is the drop in performance reflected in the metrics above. The bias explanations of the remaining predictors have slightly increased, since dropping ‘marital-status’ raises their relative importance.
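A minimal sketch of this drop-and-retrain step is given below; the scikit-learn `GradientBoostingClassifier` stands in for the paper's GBM implementation, the hyperparameters are omitted, and the predictors are assumed to be already numerically encoded.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Drop the most biased predictor and retrain (placeholder GBM and hyperparameters).
X_reduced = X_train.drop(columns=['marital-status'])
model_reduced = GradientBoostingClassifier().fit(X_reduced, y_train)

# Recompute the model bias of the retrained model from its training scores.
scores_reduced = model_reduced.predict_proba(X_reduced)[:, 1]
bias_pos_new, bias_neg_new = pos_neg_bias(scores_reduced, sex)
```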

We would like to point out that the technique used above might not lead to bias reduction in the presence of strong dependencies, since other predictors can serve as proxies for the dropped one. The postprocessing technique outlined in Sect. 5.2, by contrast, modifies the model score directly, so such dependencies do not play a significant role. Keep in mind, however, that this technique is rather crude; one may opt instead for the postprocessing methods described in Miroshnikov et al. (2021b), which apply to numerical predictors but can be adjusted for categorical ones.

6 Conclusion

In this paper, we presented a novel bias interpretability framework for measuring and explaining the bias of classification and regression models at the level of the output distribution, utilizing the Wasserstein metric and the theory of optimal mass transport. We introduced and theoretically characterized predictor attributions to the model bias and constructed additive bias explanations utilizing cooperative game theory. To our knowledge, bias interpretability methods at the level of a regressor distribution have not been addressed in the literature before.

At a higher level, the model bias is a non-trivial superposition of predictor bias attributions. The bias explanations we introduced determine the contribution of a given predictor to the model bias. However, two or more predictors may interact in the context of the bias explanations. For example, if one predictor favors the non-protected class and another favors the protected class, it is possible that, when both predictors are utilized by the model, their total effect on the model bias is zero. This phenomenon opens up numerous avenues for future research into the interactions of predictors across subpopulation distributions in the context of bias explanations. This is where ML interpretability techniques can come into play and aid the study of predictor interactions in the model bias.

To make the bias explanations additive, we utilized cooperative game theory, which led to the additive Shapley-bias explanations. These explanations rely on the Shapley formula, which makes them computationally expensive. The intractability of such calculations can be mitigated by grouping predictors based on dependencies and computing the Shapley bias attributions for each group (via a quotient game), which reduces the dimensionality. However, if the number of groups is large, the issue of computational intensity remains. Thus, a possible research direction is to investigate methods that allow for approximation of the additive bias explanations and their fast computation.

In this paper, we formulated a methodology that computes the model bias and quantifies the contribution of predictors to that bias. However, an important application of the bias explanation methodology lies in bias mitigation, which will be useful in regulatory settings such as the financial industry, and may utilize information about the main drivers of the model bias. This will be investigated in our upcoming paper. The framework is generic and in principle can be applied to a wide range of predictive ML systems. For instance, it might be insightful to understand the predictor attributions to probabilistic differences of populations studied in physics, biology, medicine, economics, etc.