Wasserstein-based fairness interpretability framework for machine learning models

The objective of this article is to introduce a fairness interpretability framework for measuring and explaining the bias in classification and regression models at the level of a distribution. In our work, we measure the model bias across sub-population distributions in the model output using the Wasserstein metric. To properly quantify the contributions of predictors, we take into account the favorability of both the model and the predictors with respect to the non-protected class. The quantification is accomplished through the use of transport theory, which gives rise to the decomposition of the model bias and bias explanations into positive and negative contributions. To gain further insight into the role of favorability and to allow for additivity of bias explanations, we adapt techniques from cooperative game theory.


Introduction
Contemporary machine learning (ML) techniques surpass traditional statistical methods in predictive power and in their capability to process a large number of attributes. However, these novel ML algorithms generate models with a complex structure, which makes it difficult to interpret their outputs with high precision. Another important issue is that a highly accurate predictive model might lack fairness by generating outputs that result in discriminatory outcomes for protected subgroups. Thus, it is imperative to design predictive systems that are not only accurate but also achieve the desired fairness level.
When used in certain contexts, predictive models, and strategies that rely on such models, are subject to laws and regulations that ensure fairness. For instance, a hiring process in the United States (US) must comply with the Equal Employment Opportunity Act (EEOA, 1972). Similarly, financial institutions (FIs) in the US that are in the business of extending credit to applicants are subject to the Equal Credit Opportunity Act (ECOA, 1974), the Fair Housing Act (FHA, 1968), and other fair lending laws. These laws often specify protected attributes that FIs must consider when maintaining fairness in lending decisions.
Examples of protected attributes include race, gender, age, ethnicity, national origin, marital status, and others. Under the ECOA, for example, it is unlawful for a creditor to discriminate against an applicant for a loan on the basis of race, gender, or age. Even though direct usage of protected attributes in building a model is often prohibited by law (e.g., overt discrimination), some otherwise benign attributes can serve as "proxies" because they may share dependencies with a protected attribute. For this reason, it is crucial for data scientists to conduct a fairness review of their trained models in consultation with compliance professionals in order to evaluate the predictive modeling system for potential unfairness. In this paper, we develop a fairness interpretability framework to aid in this important task.
At an algorithmic level, bias can be viewed as the ability to differentiate between two subpopulations at the level of data or outcomes. Regardless of its definition, if bias is present in the data when training an ML model, the ability to differentiate between subgroups might potentially lead to discriminatory outcomes. For this reason, the model bias can be viewed as a measure of unfairness, and hence its measurement is central to the model fairness assessment.
There is a comprehensive body of research on ML fairness that discusses bias measurements and mitigation methodologies. Kamiran et al. (2009) introduced a classification scheme for learning unbiased models by modifying biased data sets without direct knowledge of the protected attribute. Kamishima et al. (2012) proposed a regularization approach for discriminative probabilistic models. Zemel et al. (2013) designed an optimization problem that incorporates fairness constraints. Feldman et al. (2015) proposed a geometric repair scheme to remove disparate impact in classifiers by making data sets unbiased. Hardt et al. (2015) introduced post-processing techniques that remove discrimination in classifiers based on the equalized odds and equal opportunity fairness criteria. Woodworth et al. (2017) designed a framework for learning nearly-optimal predictors under the equalized odds fairness constraint. Zhang et al. (2018) proposed the use of adversarial learning to mitigate bias, and Jiang (2020) suggested a bias correction technique via re-weighting the data.
The work of Dwork et al. (2012) studies Lipschitz randomized classifiers and their statistical parity bias. It establishes a bound on that bias by a transport-like distance between the input subpopulation distributions. The bound aids in constructing an optimal Lipschitz classifier with control over the statistical parity bias by transporting one of the subpopulation input datasets into the other. The work of Gordaliza et al. (2019) establishes a similar bound for non-randomized classifiers by the total variation distance between input subpopulation distributions. Guided by the bound and utilizing optimal transport theory, their method focuses on repairing input datasets in a way that allows for control of the total variation distance, and hence of the statistical parity bias.
Though the bounds in the aforementioned works are of theoretical and practical importance, they provide little information on how each component of the input contributes to the bias in the output. The main reason is that the bias from the inputs propagates through the model structure in a nontrivial way. For this reason, in our work, we focus on designing a fairness interpretability framework that evaluates how each predictor contributes to the model bias, incorporating the predictor's favorability with respect to the protected (or minority) class into the framework. The construction is carried out by employing optimal transport theory and game-theoretic techniques.
Another issue regarding the ML fairness literature is that it mainly focuses on classifiers. Specifically, given the data $(X, G, Y)$, where $X \in \mathbb{R}^n$ are predictors, $G \in \{0, 1\}$ is a protected attribute and $Y \in \{0, 1\}$ is a binary output variable with favorable outcome $Y = 1$, the bias measurements are often based on fairness criteria such as statistical parity, which reads $P(\hat{Y} = 1 \,|\, G = 0) = P(\hat{Y} = 1 \,|\, G = 1)$, or alternative criteria such as equalized odds and equal opportunity (Feldman et al., 2015; Hardt et al., 2015).
Many models in the financial industry, however, are regressors, $f = \mathbb{E}[Y|X]$. In turn, classification models are usually obtained by thresholding the regressor, $\hat{Y}_t(X) = \mathbb{1}_{\{f(X) > t\}}$, but the thresholds are in general not chosen during the model development stage. Thus, data scientists select the classification score $f(X) = P(Y = 1|X)$ based on the overall performance across all thresholds. The same is true for the fairness assessment, which is conducted at the level of the whole classification score. The main reason for this is that the strategies and decision-making procedures in FIs may rely on the classification score or its distribution, not on a single classifier with a fixed threshold. This motivates us to measure and explain the bias exclusively in the regressor model. Our interpretability framework can in principle be applied to a wide range of predictive ML systems. For instance, it can provide insight into predictor attributions for models that appear in economics, social sciences, medicine, and other fields.
Another application of the framework is bias mitigation under regulatory constraints. In FIs, bias mitigation methodologies that require explicit consideration of the protected class status in the training or prediction stages are not acceptable in view of the ECOA. Consequently, bias mitigation methods such as those in Dwork et al. (2012), Feldman et al. (2015), Gordaliza et al. (2019) are not feasible. However, a probabilistic proxy model for a protected attribute G, such as the Bayesian Improved Surname Geocoding (BISG) model, can be employed by compliance departments for fairness assessment.¹
Our main contributions are as follows:
• Motivated by optimal transport theory, we focus on the bias measurement in the model output via the Wasserstein metric
$$W_1(\mu_1, \mu_2) = \inf_{\pi \in \Pi(\mu_1, \mu_2)} \int |x_1 - x_2| \, d\pi(x_1, x_2),$$
applied to the marginals $\mu_1 = P_{f(X)|G=0}$, $\mu_2 = P_{f(X)|G=1}$, which measures the minimal cost of transporting one distribution into the other; see Santambrogio (2015). More importantly, we introduce the decomposition of the model bias into the sum of positive and negative model biases, $Bias^{\pm}_{W_1}(f|G)$, which measure the transport effort for moving points of the unprotected subpopulation distribution $f(X)|G = 0$ in the non-favorable and favorable directions, respectively. This allows us to obtain a more informed perspective on the predictor's impact; see Sections 3.2 and 3.3.
• We establish the connection of the model bias with that of a classifier. We show that the positive and negative model biases can be viewed as the integrated statistical parity bias over the family of classifiers induced by the regressor. This integral relationship is then used to construct an extended family of transport metrics for the regressor bias. Via integration, these metrics incorporate generic group parity fairness criteria for classifiers induced by the given regressor. Furthermore, we prove a more general version of Dwork et al. (2012, Theorem 3.3) that establishes the connection between the Wasserstein-based bias and the randomized classifier-based bias; see Sections 3.3, 3.4.
• We introduce bias predictor attributions called bias explanations in order to understand how predictors contribute to the model bias. The bias explanation $\beta_i$ of predictor $X_i$ is computed as the cost of transporting the distribution of $E_i|G = 0$ to that of $E_i|G = 1$, where $E_i(X; f)$ quantifies the contribution of $X_i$ to the model value. The transport theory gives rise to the decomposition $\beta_i = \beta_i^+ + \beta_i^-$ into the sum of positive and negative bias explanations. Roughly speaking, $\beta_i^+$ quantifies the combined predictor contribution to the increase of the positive model bias and the decrease of the negative model bias, and vice versa for $\beta_i^-$; see Section 4.3.
• The bias explanations are in general not additive, even if the predictor explanations are. To construct additive bias explanations and to better capture the interactions at the distribution level, we employ a cooperative game theory approach motivated by the ideas of Štrumbelj and Kononenko (2010). We design a cooperative bias game $v^{bias}$ which evaluates the bias in the model attributed to coalitions $X_S$, $S \subset \{1, \ldots, n\}$, and define bias explanations via the Shapley value $\varphi[v^{bias}]$, which yields additivity. A similar approach is applied to construct additive positive and negative bias explanations; see Section 4.5.
¹ Compliance departments employ the proxy model for compliance purposes only.
• We choose to design the bias explanations based upon model explainers $E_i$ that are either conditional or marginal expectations, or game-theoretic explainers in the form of the Shapley value $\varphi[v]$, where $v$ is either a conditional game $v^{CE}$ or a marginal game $v^{ME}$. For each $v \in \{v^{CE}, v^{ME}\}$ we perform the stability analysis of non-additive and additive bias explanations. By adapting the grouping techniques from Miroshnikov et al. (2021a), we reduce the complexity of game-theoretic bias explanations and unite the marginal and conditional approaches; see Sections 4.4-4.6.
Structure of the paper. In Section 2, we introduce the requisite notation and fairness criteria for classifiers, and discuss the ML fairness literature related to our work. In Section 3, we introduce the Wasserstein-based regressor bias and investigate its properties. In addition, we discuss a wide class of transport metrics that could be used for fairness assessment. In Section 4, we provide a theoretical characterization of the bias explanations and investigate their properties. In Section 5, we discuss some regulatory aspects of bias mitigation and present an application of the framework to a UCI dataset. In Appendix A, we discuss the Kantorovich transport problem. In Appendix B, we state and prove auxiliary lemmas.

Notation and hypotheses
We consider the joint distribution $(X, G, Y)$, where $X = (X_1, X_2, \ldots, X_n) \in \mathbb{R}^n$ are the predictors, $G \in \{0, 1, \ldots, K - 1\}$ is the protected attribute, and $Y$ is either a response variable with values in $\mathbb{R}$ (not necessarily a continuous random variable) or a binary one with values in $\{0, 1\}$. We encode the non-protected class as $G = 0$ and assume that all random variables are defined on the common probability space $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a sample space, $P$ a probability measure, and $\mathcal{F}$ a $\sigma$-algebra of sets.
The true model and a trained one, which is assumed to be trained without access to $G$, are denoted by $f$ and $\hat{f}$, respectively. In the case of binary $Y$ they read $f(X) = P(Y = 1|X)$ and $\hat{f}(X) \approx P(Y = 1|X)$. We denote a classifier based on the trained model by $\hat{Y}_t = \mathbb{1}_{\{\hat{f}(X) > t\}}$, where $t \in \mathbb{R}$ is a threshold. The subpopulation cumulative distribution function (CDF) of $f(X)|G = k$ is denoted by $F_k(t) = P(f(X) \le t \,|\, G = k)$, and the corresponding generalized inverse (or quantile function) is defined by $F_k^{-1}(p) = \inf\{t \in \mathbb{R} : F_k(t) \ge p\}$. We assume that there is a predetermined favorable model direction, denoted by either ↑ or ↓. If the favorable direction is ↑ then the relationship $f(x) > f(z)$ favors the input $x$, and if it is ↓ the input $z$. The sign of the favorable direction of $f$ is denoted by $\varsigma_f$ and satisfies $\varsigma_f = 1$ if the direction is ↑ and $\varsigma_f = -1$ if it is ↓. In the case of binary $Y$, the favorable direction ↑ is equivalent to $Y = 1$ being the favorable outcome, and ↓ to $Y = 0$. In what follows we first develop the framework in the context of the binary protected attribute $G \in \{0, 1\}$ and then extend it to the case of the multi-labeled protected attribute; see Section 3.4.

Fairness criteria for classifiers
When undesired biases concerning demographic groups (or protected attributes) are present in the training data, well-trained models will reflect those biases. There have been numerous articles devoted to ML systems that lead to fair decisions. In these works, various measurements of fairness have been suggested. In what follows, we describe several well-known definitions which help measure the fairness of classifiers.
Definition 2.1. Suppose that $Y$ is binary with values in $\{0, 1\}$ and $Y = 1$ is the favorable outcome. Let $\hat{Y}$ be a classifier.
• $\hat{Y}$ satisfies statistical parity (Feldman et al., 2015) if $P(\hat{Y} = 1 \,|\, G = 0) = P(\hat{Y} = 1 \,|\, G = 1)$.
• $\hat{Y}$ satisfies equalized odds (Hardt et al., 2015) if $P(\hat{Y} = 1 \,|\, Y = y, G = 0) = P(\hat{Y} = 1 \,|\, Y = y, G = 1)$ for $y \in \{0, 1\}$.
• $\hat{Y}$ satisfies equal opportunity (Hardt et al., 2015) if $P(\hat{Y} = 1 \,|\, Y = 1, G = 0) = P(\hat{Y} = 1 \,|\, Y = 1, G = 1)$.
• The balanced error rate (BER) of $\hat{Y}$ with respect to $G$ (Feldman et al., 2015) is given by $BER(\hat{Y}, G) = \tfrac{1}{2}\big(P(\hat{Y} = 0 \,|\, G = 1) + P(\hat{Y} = 1 \,|\, G = 0)\big)$.
The statistical parity criterion requires that the proportions of people in the favorable class $Y = 1$ within each group $G = k$, $k \in \{0, 1\}$, are the same. The equalized odds constraint requires the classifier to have the same misclassification error rates for each class of the protected attribute $G$ and each label $Y$. The equal opportunity constraint requires the misclassification rates to be the same for each class $G = k$ only for the individuals labeled as $Y = 1$. The BER is the average class-conditioned error rate of $\hat{Y}$ viewed as a predictor of $G$.
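The parity gaps behind these criteria can be estimated directly from samples. The sketch below is illustrative and assumes hypothetical helper names (`statistical_parity_gap`, etc.); the BER formula follows the adapted Feldman et al. form above, with $\hat{Y}$ read as a predictor of $G$.

```python
import numpy as np

def statistical_parity_gap(y_hat, g):
    """P(Yhat=1 | G=0) - P(Yhat=1 | G=1); zero under statistical parity."""
    return y_hat[g == 0].mean() - y_hat[g == 1].mean()

def equalized_odds_gaps(y_hat, y, g):
    """Per-label gaps P(Yhat=1 | Y=y, G=0) - P(Yhat=1 | Y=y, G=1), y in {0, 1};
    the y=1 entry alone is the equal-opportunity gap."""
    return {label: y_hat[(y == label) & (g == 0)].mean()
                   - y_hat[(y == label) & (g == 1)].mean()
            for label in (0, 1)}

def balanced_error_rate(y_hat, g):
    """Average class-conditioned error rate of Yhat as a predictor of G=1."""
    return 0.5 * (np.mean(y_hat[g == 1] == 0) + np.mean(y_hat[g == 0] == 1))
```

A value of `balanced_error_rate` well below 0.5 indicates that the classifier output carries information about `G`.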

Group classifier fairness example
There are numerous reasons why a trained classifier may lead to unfair outcomes. To illustrate, we provide an instructive example that shows how predictors and labels, as well as their relationship with the protected attribute, affect classifier fairness.
Consider a data set $(X, Y, G)$ where the predictor $X$ depends on $G \in \{0, 1\}$, $Y \in \{0, 1\}$ is binary with favorable outcome $Y = 0$, and the classification score $f$ depends explicitly on $X$ only. The data set is constructed in such a way that the proportions of $Y = 0|G = k$ in the two groups are different: $P(Y = 0|G = 0) = 0.5$, $P(Y = 0|G = 1) = 0.36$. The predictor $X$ serves as a good proxy for $G$, which can be seen in Figure 1a. The plot depicts the density of $X$ and the conditional densities of $X$ given $G = 0$ and $G = 1$, respectively. The shifted conditional densities clearly show the dependence of $X$ on $G$. Though the true score $f(X)$ does not depend explicitly on $G$, a classifier trained on $X$ will learn that the higher the value of $X$, the more likely it is that $Y = 0$. Using the logistic regression model $\hat{f}$, we observe that for any threshold $t \in (0, 1)$ the classifier $\hat{Y}_t$ satisfies neither the statistical parity, nor the equal opportunity, nor the equalized odds criterion. Furthermore, since both classes of $G$ are equally likely, $BER(\hat{Y}_t, G) < 0.5$ implies that one can potentially infer $G$ from $X$; see Figure 1b. The vertical axis in the plot represents the difference between the probabilities for each of the first three fairness metrics described in Definition 2.1, as well as the value of the balanced error rate. Notice that only in the trivial cases where $t \in \{0, 1\}$ are all the metrics satisfied and the balanced error rate equal to 0.5, since $\hat{Y}_0 \equiv 1$, $\hat{Y}_1 \equiv 0$ for all $X$.
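The qualitative behavior of this example can be reproduced with a small simulation. The distributions and the fixed logistic score below are illustrative choices, not the paper's actual dataset: the two groups are equally likely, $X|G=k$ is a shifted Gaussian, and the score is a monotone (sigmoid) function of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
g = rng.integers(0, 2, n)                     # both classes of G equally likely
x = rng.normal(loc=0.5 - 1.0 * g, scale=1.0)  # X | G=k: shifted conditional densities
score = 1.0 / (1.0 + np.exp(-x))              # monotone logistic score standing in for f

def parity_gap(t):
    """|P(score > t | G=0) - P(score > t | G=1)| for the thresholded classifier."""
    yhat = (score > t).astype(int)
    return abs(yhat[g == 0].mean() - yhat[g == 1].mean())

# Interior thresholds violate statistical parity; only t in {0, 1} trivially satisfy it,
# since the score lies strictly in (0, 1).
gaps = [parity_gap(t) for t in (0.25, 0.5, 0.75)]
```

With these shifted subpopulations, every interior threshold yields a sizable parity gap, mirroring Figure 1b.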

Classifier bias based on statistical parity
In this section we provide a definition of the classifier bias based on the statistical parity fairness criterion and establish some of its basic properties. In what follows, we suppress the symbol ˆ, using it only when it is necessary to differentiate between the true model and the trained one. The same rule applies to classifiers.
Definition 2.2. Let $f$ be a model, $X \in \mathbb{R}^n$ predictors, $G \in \{0, 1\}$ the protected attribute, $G = 0$ the non-protected class, $\varsigma_f$ the sign of the favorable direction, and $F_k$ the CDF of $f(X)|G = k$.
• The signed classifier (or statistical parity) bias for a threshold $t \in \mathbb{R}$ is defined by $\widetilde{bias}_t(f|X, G) = \varsigma_f \big(F_1(t) - F_0(t)\big)$.
• The classifier bias at $t \in \mathbb{R}$ is defined by $bias_t(f|X, G) = \big|F_1(t) - F_0(t)\big|$.
We say that $\hat{Y}_t$ favors the non-protected class $G = 0$ if the signed bias is positive. Respectively, $\hat{Y}_t$ favors the protected class $G = 1$ if the signed bias is negative.
Remark 2.1. Suppose that $Y \in \{0, 1\}$ is binary and that the favorable direction is ↑, so that $\varsigma_f = 1$. Then $\hat{Y}_t$ favors the non-protected class $G = 0$ if and only if there is a larger proportion of individuals from class $G = 0$ for which $\hat{Y}_t = 1$ compared to the class $G = 1$. This, from a statistical parity perspective, describes the outcome $Y = 1$ as favorable. Similar remarks apply to the case when the favorable direction is ↓. Thus, the favorable direction ↑ (↓) is equivalent to the favorable outcome $Y = 1$ ($Y = 0$).
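The signed classifier bias of Definition 2.2 can be estimated from samples via empirical subpopulation CDFs. This is a minimal sketch; the function names are hypothetical, and the sign convention follows $\varsigma_f (F_1(t) - F_0(t))$, so a positive value indicates favoring $G = 0$.

```python
import numpy as np

def ecdf(samples, t):
    """Empirical CDF F(t) = P(samples <= t)."""
    return np.mean(np.asarray(samples) <= t)

def signed_classifier_bias(scores, g, t, favorable_up=True):
    """varsigma_f * (F1(t) - F0(t)); positive values favor the class G = 0."""
    sign = 1.0 if favorable_up else -1.0
    return sign * (ecdf(scores[g == 1], t) - ecdf(scores[g == 0], t))
```

The unsigned classifier bias is simply the absolute value of this quantity.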

Quantile bias and geometric parity
Given a model $f$ and a threshold $t \in \mathbb{R}$, the classifier bias based on statistical parity measures the difference in population sizes corresponding to the groups $G \in \{0, 1\}$ for which $\hat{Y}_t = 0$. This measurement, however, does not take into account the geometry of the model distribution, that is, the score values themselves.
For example, when measuring the bias in incomes between 'females' and 'males' one can view the difference of expected incomes in the two groups as 'bias'. Alternatively, one can measure an income bias by evaluating the absolute difference of the 'female' median income and the 'male' median income, which is often done in various social studies. This motivates us to take into account the geometry of the score distribution when defining bias. For this reason, we propose the notion of the quantile bias, which operates on the domain of the score rather than the sample space.

Definition 2.3. Let $f$, $X$, $G$, $\varsigma_f$ and $F_k$ be as in Definition 2.2. Let $p \in (0, 1)$.
• The signed $p$-th quantile bias is defined by $\widetilde{bias}^Q_p(f|X, G) = \varsigma_f \big(F_0^{-1}(p) - F_1^{-1}(p)\big)$, and the $p$-th quantile bias by $bias^Q_p(f|X, G) = \big|F_0^{-1}(p) - F_1^{-1}(p)\big|$.
As a counterpart to statistical parity, we also introduce quantile (geometric) parity.
• We say that the model $f$ satisfies $p$-th quantile (or geometric) parity if $F_0^{-1}(p) = F_1^{-1}(p)$.
• Let $t \in \mathbb{R}$. The classifier $\hat{Y}_t$ satisfies quantile (or geometric) parity if $f$ satisfies $p_0$-th quantile parity with $p_0 = F_0(t)$.
Given a score $f$, the quantile bias measures the difference between subpopulation quantile values. For a given threshold $t$, the $p_0$-quantile signed bias, with $p_0 = F_0(t)$, measures by how much the corresponding score values of the protected class $G = 1$ differ from those of $G = 0$, or, equivalently, by how much the threshold for the protected group should be shifted to achieve quantile parity (and in some cases statistical parity) between the two populations.
Lemma 2.1. Let $f$ be a model, $G \in \{0, 1\}$ the protected attribute, and $G = 0$ the non-protected class. Suppose that $t_0 \in \mathbb{R}$ is a point at which the CDFs $F_0$ and $F_1$ are continuous and strictly increasing. Then $\hat{Y}_{t_0}$ satisfies statistical parity if and only if it satisfies geometric parity.
Proof. The result follows from Definition 2.2, Definition 2.3, and the fact that $F_0$ and $F_1$ are locally invertible at $t_0$.
To better understand the classifier and quantile biases and their connection, see Figure 2a, which plots the conditional CDFs of the model scores given the protected attribute $G$. The blue line (corresponding to the scores given $G = 0$) is above the red line (scores given $G = 1$) for all values of $t$. Thus, for a given threshold $t_0$ we have $F_0(t_0) - F_1(t_0) > 0$, which means that if the favorable direction is ↑ (↓) then the classifier favors the class $G = 1$ ($G = 0$). In view of the quantile bias, the green horizontal line segment represents the amount by which we would have to shift the threshold for one of the classes in order to achieve geometric parity. Since the CDFs are continuous and strictly increasing, the above lemma implies that doing so would achieve statistical parity as well.
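The quantile bias and the threshold shift described above can be computed empirically. A minimal sketch with hypothetical helper names, following the sign convention $\varsigma_f (F_0^{-1}(p) - F_1^{-1}(p))$:

```python
import numpy as np

def signed_quantile_bias(scores, g, p, favorable_up=True):
    """varsigma_f * (F0^{-1}(p) - F1^{-1}(p)); positive values favor G = 0."""
    sign = 1.0 if favorable_up else -1.0
    q0 = np.quantile(scores[g == 0], p)
    q1 = np.quantile(scores[g == 1], p)
    return sign * (q0 - q1)

def parity_threshold_shift(scores, g, t0):
    """Shift of the threshold t0 for group G=1 that achieves quantile parity
    at p0 = F0(t0): returns F1^{-1}(p0) - t0."""
    p0 = np.mean(scores[g == 0] <= t0)           # empirical F0(t0)
    return np.quantile(scores[g == 1], p0) - t0  # F1^{-1}(p0) - t0
```

A negative shift means the protected group's threshold would have to move down to equalize the quantiles, matching the green segment in Figure 2a.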

Classifier bias mitigation via repaired datasets
Two notable works that utilize optimal transport theory to reduce statistical parity bias are Feldman et al. (2015) and Gordaliza et al. (2019).
The approach in Feldman et al. (2015) seeks to create an unbiased dataset by transforming the predictors and then training a classifier on it. The authors propose a geometric repair scheme, which partially moves the two subpopulation distributions $\mu_{i,0}$ and $\mu_{i,1}$ of predictor $X_i$ along the Wasserstein geodesic towards their (unidimensional) Wasserstein barycenter $\mu_{i,B}$, a distribution minimizing the variance of the collection $\{\mu_{i,0}, \mu_{i,1}\}$; see Appendix A. The transformed dataset is then used to train a model that reduces disparate impact. Gordaliza et al. (2019) propose a method for transforming the multivariate distribution of predictors called random repair. Given two subpopulation distributions of predictors $\mu_k = P_{X|G=k}$, with $k \in \{0, 1\}$, and a repair parameter $\lambda \in [0, 1]$, the algorithm randomly chooses between the Wasserstein barycenter $\mu_B$ of $\{\mu_0, \mu_1\}$ and the original subpopulation distribution $\mu_k$, with $\lambda$ determining the probability of selecting $\mu_B$.
The authors establish upper bounds on the disparate impact (DI) and the balanced error rate (BER) of classifiers with respect to $(X, G)$ in terms of the total variation distance $d_{TV}(\mu_0, \mu_1)$ between the subpopulation distributions of predictors, and show that the TV-distance between the repaired subpopulation distributions $\tilde{\mu}_{0,\lambda}, \tilde{\mu}_{1,\lambda}$ is bounded by $1 - \lambda$. This, in turn, allows one to control the bounds on the DI and BER, and hence the closely related statistical parity bias, on the repaired dataset. The random repair algorithm allows for tight control of the TV-distance between repaired subpopulations, unlike the geometric repair approach. The authors also establish bounds on the loss in performance due to modifying the data, expressed via the Wasserstein distance between the two subpopulation distributions of predictors; the performance loss is the difference in classification risk between the repaired and original data on $(X, G)$.
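A minimal one-dimensional sketch of the geometric repair idea for a single predictor: each group's values are partially moved toward the Wasserstein barycenter, whose quantile function (for two equally weighted 1-D distributions) is the average of the group quantile functions. The function name and the empirical rank construction are our own illustrative choices; the paper's cited schemes are richer (multivariate, and randomized in the case of random repair).

```python
import numpy as np

def geometric_repair(x, g, lam):
    """Partially move each group's values toward the 1-D Wasserstein barycenter.

    lam = 0 leaves the data unchanged; lam = 1 maps both groups onto the
    barycenter, whose quantile function is the average of the group quantiles.
    """
    x = np.asarray(x, dtype=float)
    repaired = x.copy()
    for k in (0, 1):
        xk = x[g == k]
        # Empirical ranks as a stand-in for F_k(x), mapped into (0, 1).
        p = (np.argsort(np.argsort(xk)) + 0.5) / len(xk)
        bary = 0.5 * (np.quantile(x[g == 0], p) + np.quantile(x[g == 1], p))
        repaired[g == k] = (1.0 - lam) * xk + lam * bary
    return repaired
```

After a full repair (`lam = 1`) the two groups have identical empirical distributions, so any classifier trained on the repaired predictor alone satisfies statistical parity.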
Remark 2.2. Given the regulatory constraints, the approaches of Feldman et al. (2015) and Gordaliza et al. (2019) would not be permitted in financial institutions that extend credit because a) the protected attribute cannot be used in training or prediction, and b) introducing randomness into the input dataset is prohibited; for details see Hall et al. (2021). To take into account the regulatory constraints and practical applications, in our companion paper (Miroshnikov et al., 2021b) we propose a post-processing approach that relies on the fairness interpretability framework presented in the current article.

Individual fairness
The work of Dwork et al. (2012) studies the individual fairness of randomized classifiers. A randomized classifier is a map $x \mapsto M(x) \in \mathcal{P}(\{0, 1\})$, $x \in \mathcal{X}$, and individual fairness is defined by imposing a Lipschitz property on this map with respect to a transport-like distance $D_{rc}$ between probability measures. As in Gordaliza et al. (2019), the work of Dwork et al. (2012) relates the bias in the output to the bias in the input. In particular, the paper establishes the upper bound $D_{TV}(M_{P_0}, M_{P_1}) \le D_{rc}(P_0, P_1)$ for the statistical parity bias of Lipschitz randomized classifiers. Roughly speaking, this bound means that when the two subpopulations $P_0, P_1$ are "similar" in the sense of the $D_{rc}$ metric, the Lipschitz condition ensures that the statistical parity bias is small.
The Drc metric has transport-like properties and is related to the Wasserstein metric; see Dwork et al. (2012, Theorem 3.3) and Theorem 3.3 in Section 3.5.

Model bias metric
In our work we shift the focus from measuring the bias in classifiers to measuring the bias in regressor outputs. This is motivated by the fact that many real-world strategies and decisions make use of the regressor values or the classification scores of trained ML models. Furthermore, in the case of classification scores, the bias assessment in FIs is carried out before any classifier threshold is determined.
In this section, we discuss how to measure the regressor bias using optimal transport. We also establish the connection between the regressor bias and the bias in the collection of classifiers induced by thresholding the regressor, and make use of this integral relationship to design generic regressor fairness metrics that incorporate group-based parity criteria, such as equalized odds (Hardt et al., 2015), into the transport formulation.
Definition 3.1 ($D$-model bias). Let $X \in \mathbb{R}^n$ be predictors, $f$ a model, and $G \in \{0, 1\}$ the protected attribute. Let $D(\cdot, \cdot)$ be a metric on the space of probability measures $\mathcal{P}_q(\mathbb{R})$, with $q \ge 0$. Provided $\mathbb{E}[|f(X)|^q]$ is finite, the $D$-based model bias is defined as the distance between the subpopulation distributions of the model output:
$$Bias_D(f|X, G) = D\big(P_{f(X)|G=0},\, P_{f(X)|G=1}\big), \tag{3.1}$$
where $P_{f(X)|G=k}$ denotes the distribution of $f(X)|G = k$.
Figure 2b illustrates the model bias for two choices of $D$: the 1-Wasserstein metric $W_1$ and the Kolmogorov-Smirnov distance $KS$. Notice the stark difference between the two model biases. This raises the general question of which metric one should use to evaluate the bias. We discuss this issue in the following subsection.
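The two choices of $D$ illustrated in Figure 2b can be estimated from score samples. A sketch with hypothetical helper names: `w1_bias` uses the quantile-function representation of $W_1$ on a midpoint grid, and `ks_bias` computes $\sup_t |F_0(t) - F_1(t)|$ from empirical CDFs.

```python
import numpy as np

def w1_bias(scores, g, grid_size=1000):
    """W1 between subpopulation score distributions, via quantile functions."""
    p = (np.arange(grid_size) + 0.5) / grid_size
    q0 = np.quantile(scores[g == 0], p)
    q1 = np.quantile(scores[g == 1], p)
    return np.mean(np.abs(q0 - q1))

def ks_bias(scores, g):
    """Kolmogorov-Smirnov distance sup_t |F0(t) - F1(t)| over the samples."""
    ts = np.sort(scores)
    f0 = np.searchsorted(np.sort(scores[g == 0]), ts, side='right') / (g == 0).sum()
    f1 = np.searchsorted(np.sort(scores[g == 1]), ts, side='right') / (g == 1).sum()
    return np.max(np.abs(f0 - f1))
```

On the same data the two metrics can differ substantially, which is precisely the point made by Figure 2b.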
In what follows we suppress the explicit dependence of the model bias on X.

Wasserstein distance
Determining an appropriate metric $D$ to be used in (3.1) is not a trivial task. The choice depends on the context in which the model bias is measured. We argue that it is desirable for the metric to have the following properties:
(P1) It should be continuous with respect to changes in the geometry of the distribution.
(P2) It should be non-invariant with respect to monotone transformations of the distributions.
Property (P1) ensures that the metric keeps track of changes in the geometry. For instance, suppose an "income" of the group $\{G = 0\}$ is $x_0$ and that of $\{G = 1\}$ is $x_1$. A metric that measures income inequality should be able to sense the distance between $x_0$ and $x_0 + \varepsilon$. That is, for the two delta measures $\delta_{x_0}$ and $\delta_{x_0+\varepsilon}$, the metric must ensure that the distance $D(\delta_{x_0}, \delta_{x_0+\varepsilon})$ approaches zero as $\varepsilon \to 0$. Property (P1) also ensures that slight changes in the subpopulation distributions lead to a slight change in the bias measurements, which is important for stability with respect to changes in the dataset $X$.
Property (P2) makes sure that the metric is non-invariant with respect to monotone transformations. That is, given two random variables $X_0$ and $X_1$ and a continuous, strictly increasing transformation $T: \mathbb{R} \to \mathbb{R}$, one would expect the distance between $T(X_0)$ and $T(X_1)$ to change whenever $T$ is not a shift. For example, if $T(x) = \alpha x$, we would expect the distance between $T(X_0) = \alpha X_0$ and $T(X_1) = \alpha X_1$ to depend continuously on $\alpha$.
In what follows, we consider the Wasserstein distance Wq as a potential candidate for fairness interpretability; for use cases in the ML fairness community see Dwork et al. (2012), Feldman et al. (2015), Gordaliza et al. (2019).
To introduce the metric and investigate its properties, we switch our focus to probability measures; recall that any random variable $Z$ gives rise to the pushforward probability measure $P_Z(A) = P(Z \in A)$ on $\mathbb{R}$, and the reverse is true: for any $\mu \in \mathcal{P}(\mathbb{R})$ with the CDF $F_\mu(a) = \mu((-\infty, a])$ there is a random variable $Z$ such that $P_Z = \mu$. Similar remarks apply to random vectors; see Shiryaev (1980). Given $T: \mathbb{R}^k \to \mathbb{R}^m$ and $\mu \in \mathcal{P}(\mathbb{R}^k)$, we denote by $T_\#\mu$ the pushforward measure defined by $T_\#\mu(B) = \mu(T^{-1}(B))$ for every Borel set $B \subset \mathbb{R}^m$.
The Wasserstein distance $W_q$ is connected to the concept of optimal mass transport. Given two probability measures $\mu_1, \mu_2 \in \mathcal{P}_q(\mathbb{R})$ with finite $q$-th moments and the cost function $c(x_1, x_2) = |x_1 - x_2|^q$, the Wasserstein distance $W_q$ is defined by
$$W_q(\mu_1, \mu_2) = \Big( \inf_{\pi \in \Pi(\mu_1, \mu_2)} \int_{\mathbb{R}^2} |x_1 - x_2|^q \, d\pi(x_1, x_2) \Big)^{1/q},$$
where $\Pi(\mu_1, \mu_2)$ denotes the set of joint probability measures on $\mathbb{R}^2$ with marginals $\mu_1$ and $\mu_2$; the infimum is the minimal cost of transporting the distribution $\mu_1$ into $\mu_2$, and vice versa in view of the symmetry of the cost function. A joint probability measure $\gamma \in \Pi(\mu_1, \mu_2)$ is called a transport plan. It specifies how each point $x_1$ from $\mathrm{supp}(\mu_1)$ gets distributed in the course of the transportation; specifically, the transport of $x_1$ is described by the conditional probability measure $\gamma_{x_2|x_1}$.
It can be shown that the Wasserstein metric for probability measures on R can be expressed in terms of the quantile functions which makes the computation straightforward; see Theorem A.2.
To get an understanding of the behavior of $W_q$, consider two delta measures located at $x_0$ and $x_0 + \varepsilon$, respectively. By the definition of the metric it follows that $W_q(\delta_{x_0}, \delta_{x_0+\varepsilon}) = |\varepsilon|$. Thus, $W_q$ is continuous with respect to a shift of a point mass. Furthermore, for any two random variables $X_0$ and $X_1$ and $\alpha > 0$ we have $W_q(\alpha X_0, \alpha X_1) = \alpha \, W_q(X_0, X_1)$, which implies that a multiplicative map $T(x) = \alpha x$ affects the Wasserstein distance.
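Both observations can be checked numerically. For empirical distributions with equal sample sizes, $W_1$ reduces to the mean absolute difference of sorted samples (the quantile coupling); the helper name below is our own.

```python
import numpy as np

def w1_samples(a, b):
    """W1 between two empirical distributions with equal sample sizes:
    mean absolute difference of the sorted samples (quantile coupling)."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

# Point masses: W1(delta_{x0}, delta_{x0+eps}) = |eps|, continuous in the shift.
x0, eps = 2.0, 0.3
d_shift = w1_samples(np.array([x0]), np.array([x0 + eps]))

# Scaling map T(x) = alpha * x rescales the distance: non-invariance under T.
rng = np.random.default_rng(1)
a, b = rng.normal(0, 1, 1000), rng.normal(1, 1, 1000)
alpha = 3.0
```

Sorting commutes with multiplication by a positive scalar, so the scaling identity holds exactly for the empirical distance as well.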
To formally show that properties (P1) and (P2) are satisfied by the Wasserstein metric, we provide the following theorem.

Theorem 3.1. Let $q \in [1, \infty)$ and $\mu_1, \mu_2 \in \mathcal{P}_q(\mathbb{R})$.
(a) $W_q$ satisfies the continuity property (P1): small perturbations of $\mu_1, \mu_2$ lead to small changes in $W_q(\mu_1, \mu_2)$.
(b) Let $T: \mathbb{R} \to \mathbb{R}$ be a continuous, strictly increasing map. $W_q$ is non-invariant under $T$, provided $T$ is not a shift.

Proof. See Appendix B.
Theorem 3.1 states that the Wasserstein metric relies on the geometry of the distribution. In particular, the distance is affected in a continuous way by changes in the geometry of the distribution. This provides the desired sensitivity of the Wasserstein metric with respect to slight changes in the dataset distribution, including shifts, which is relevant for ML models with ragged CDFs, and makes the Wasserstein metric an appropriate candidate for model bias measurement. In addition, as we will see, the Wasserstein distance enables us to assess favorability at the level of the model, which is useful for applications in financial institutions.

Negative and positive flows under order preserving optimal transport plan
We now provide several properties of the Wasserstein metric, which we employ in the following sections.
Given two probability measures $\mu_1, \mu_2 \in \mathcal{P}_q(\mathbb{R})$, it can be shown that the joint probability measure
$$\pi^* = (F_{\mu_1}^{-1}, F_{\mu_2}^{-1})_\# \lambda|_{[0,1]}, \tag{3.3}$$
where $\lambda|_{[0,1]}$ denotes the Lebesgue measure restricted to $[0, 1]$ and $F_{\mu_i}^{-1}$ are the quantile functions, is an optimal transport plan for transporting $\mu_1$ into $\mu_2$ with the cost function $c(x_1, x_2) = |x_1 - x_2|^q$, and thus
$$W_q^q(\mu_1, \mu_2) = \int |x_1 - x_2|^q \, d\pi^*(x_1, x_2). \tag{3.4}$$
Most importantly, $\pi^*$ is the only monotone (order preserving) transport plan. In a special case, when $\mu_1$ is atomless, $\pi^*$ is determined by the monotone map
$$T^*(x) = F_{\mu_2}^{-1}(F_{\mu_1}(x)), \tag{3.5}$$
called an optimal transport map. Specifically, each point $x_1$ of the distribution $\mu_1$ is transported to the point $x_2 = T^*(x_1)$; see Figure 3a for an illustration. Thus, $\mu_2 = T^*_\# \mu_1$, and the conditional probability measure $\pi^*_{x_2|x_1} = \delta_{T^*(x_1)}$ for $x_1 \in \mathrm{supp}(\mu_1)$. In this case, (3.4) reads
$$W_q^q(\mu_1, \mu_2) = \int |x - T^*(x)|^q \, d\mu_1(x).$$
In the general case, under the transport plan $\pi^*$, points $x_1 \in \mathrm{supp}(\mu_1)$ for which $\mu_1(\{x_1\}) = 0$ are transported as a whole, while the "atoms", points $x_1$ for which $\mu_1(\{x_1\}) > 0$, are allowed to be split or spread along $\mathbb{R}$; see Figure 3b, which illustrates the transport flow under $\pi^*$ in the general case. The plot also provides a depiction of the order preservation; notice how the arrows do not intersect.
To compute the portion of the transport cost used for moving points of $\mu_1$ to the right or left, it is necessary to restrict attention to the regions $x_1 < x_2$ and $x_1 > x_2$, respectively.

Lemma 3.1. Let $\mu_1, \mu_2 \in \mathcal{P}_q(\mathbb{R})$, $q \in [1, \infty)$. Under the monotone plan $\pi^*$ the transport efforts to the left and right for the cost function $c(x_1, x_2) = |x_1 - x_2|^q$ are given by
$$\mathcal{T}^-(\mu_1, \mu_2) = \int_{\{x_1 > x_2\}} |x_1 - x_2|^q \, d\pi^*(x_1, x_2), \qquad \mathcal{T}^+(\mu_1, \mu_2) = \int_{\{x_1 < x_2\}} |x_1 - x_2|^q \, d\pi^*(x_1, x_2). \tag{3.7}$$
Hence, the Wasserstein distance $W_q$ can be expressed as
$$W_q^q(\mu_1, \mu_2) = \mathcal{T}^-(\mu_1, \mu_2) + \mathcal{T}^+(\mu_1, \mu_2). \tag{3.8}$$
If, in addition, $\mu_1$ is atomless, then
$$\mathcal{T}^-(\mu_1, \mu_2) = \int_{\{x > T^*(x)\}} |x - T^*(x)|^q \, d\mu_1(x), \tag{3.9}$$
and similarly for $\mathcal{T}^+$.

Proof. By (3.3) the monotone plan can be expressed as $\pi^* = (F_{\mu_1}^{-1}, F_{\mu_2}^{-1})_\# \lambda|_{[0,1]}$, where $\lambda|_{[0,1]}$ denotes the Lebesgue measure restricted to $[0, 1]$. Then, by Proposition A.1, for any Borel set $B \subset \mathbb{R}^2$ we have $\pi^*(B) = \lambda\big(\{p \in [0, 1] : (F_{\mu_1}^{-1}(p), F_{\mu_2}^{-1}(p)) \in B\}\big)$, which yields (3.7). Finally, if $\mu_1$ is atomless, by Theorem A.2 the monotone plan is $\pi^* = (I, T^*)_\# \mu_1$, where $T^*$ is the optimal transport map given by (3.5). Then, using Proposition A.1, we obtain (3.9).
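Using the quantile representation of the monotone plan, the left and right transport efforts of Lemma 3.1 can be approximated on a grid over $p \in (0, 1)$. The function name is hypothetical and the computation is a midpoint-rule sketch of (3.7):

```python
import numpy as np

def transport_efforts(a, b, q=1, grid_size=2000):
    """Left/right transport efforts of the monotone plan moving law(a) to law(b).

    Returns (effort_left, effort_right), whose sum approximates W_q(a, b)^q.
    Under the monotone plan, mass at quantile level p moves from F_a^{-1}(p)
    to F_b^{-1}(p); a positive difference means it moves to the left.
    """
    p = (np.arange(grid_size) + 0.5) / grid_size
    qa, qb = np.quantile(a, p), np.quantile(b, p)
    diff = qa - qb                                  # > 0: moves left, < 0: moves right
    left = np.mean((np.abs(diff) ** q) * (diff > 0))
    right = np.mean((np.abs(diff) ** q) * (diff < 0))
    return left, right
```

For a distribution shifted entirely to the right of the target, all effort is to the left, and exchanging the arguments swaps the two efforts.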

W1-based model bias and its components
For q = 1, the Wasserstein distance W1 is known as the Earth Mover distance. Since the distance is symmetric, BiasW1(f|G) = W1(Pf(X)|G=0, Pf(X)|G=1) does not depend on which class is listed first. It can be shown that the W1-based model bias formulation is consistent with both the statistical parity fairness criterion and the quantile parity criterion, as the following lemma shows.
Lemma 3.2. Let f be a model and G ∈ {0, 1} the protected attribute. Then

BiasW1(f|G) = ∫_R |F_{f(X)|G=0}(t) − F_{f(X)|G=1}(t)| dt.

Proof. We have (Shorack and Wellner (1986)) W1(µ1, µ2) = ∫_R |F_{µ1}(t) − F_{µ2}(t)| dt. Hence, the result follows from Definition 2.2, Definition 2.3, and the above equality.
Remark 3.1. The above lemma establishes the representation of the model bias as an integral, over all thresholds, of the statistical parity bias of the corresponding threshold classifiers. Here, the consistency of the model bias with statistical parity is understood in the sense of the equality in the above lemma. In comparison, Dwork et al. (2012) establishes a connection between the statistical parity of Lipschitz randomized classifiers and the subpopulations in a dataset upon which the models are built. While the results in Dwork et al. (2012) do not imply the above lemma, it is appealing to provide a connection between the two. For example, consider the triplet (X, G, Y) with Y ∈ {0, 1} and a smooth regressor f(X) = P(Y = 1|X). Consider a randomized classifier z → µz, where z = (x, g, y) and µz(1) = f(x). Let Pg = P_{Z|G=g}. Then the upper bound on the statistical parity bias of µz provided by Dwork et al. (2012) reads |E_{P0}[µZ(1)] − E_{P1}[µZ(1)]| ≤ Drc(P0, P1), which illustrates the difference between Lemma 3.1 of Dwork et al. (2012) and our lemma.
Positive and negative model bias. According to Lemma 3.1, the cost of transporting a distribution is the sum of the transport effort to the left and the transport effort to the right. This motivates us to define the positive bias as the transport effort for moving the particles of f(X)|G = 0 in the non-favorable direction and the negative bias as the transport effort in the favorable one; equivalently, the former is the transport effort for moving the particles of f(X)|G = 1 in the favorable direction and the latter is the transport effort in the non-favorable one.
Motivated by Lemma 3.1, we define the positive and negative model biases as follows.

Definition 3.2. Let f, G, ςf and Fk be as in Definition 2.2.

• The positive and negative W1-based model biases Bias±W1(f|G) are defined as the transport efforts, under the monotone plan, of moving f(X)|G = 0 in the non-favorable and favorable directions, respectively. In this case, the model bias is disaggregated as follows: BiasW1(f|G) = Bias+W1(f|G) + Bias−W1(f|G).

• The net model bias is defined by Bias^net_W1(f|G) = Bias+W1(f|G) − Bias−W1(f|G).

We next establish that the positive and negative W1 model biases can be expressed in terms of classifier biases. To establish this, we rely on an auxiliary lemma (Lemma B.3).
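A minimal numerical sketch of Definition 3.2, assuming samples of the model scores by protected class (the function name and sign convention flag are ours):

```python
import numpy as np

def signed_model_bias(scores0, scores1, fav_sign=+1, n_grid=10_000):
    """Positive/negative W1-based model bias: the efforts of transporting
    f(X)|G=0 in the non-favorable and favorable directions, respectively.
    fav_sign = +1 if higher outputs are favorable, -1 otherwise."""
    p = (np.arange(n_grid) + 0.5) / n_grid
    d = fav_sign * (np.quantile(scores0, p) - np.quantile(scores1, p))
    pos = float(np.mean(np.clip(d, 0, None)))   # Bias^+: G=0 moved non-favorably
    neg = float(np.mean(np.clip(-d, 0, None)))  # Bias^-: G=0 moved favorably
    return pos, neg   # model bias = pos + neg, net bias = pos - neg
```

Flipping `fav_sign` swaps the roles of the two components, mirroring the dependence of Definition 3.2 on the favorable direction ςf.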
Theorem 3.2. Let f, G, ςf, P± and Fk be as in Definition 3.2. Then the signed model biases admit the representations (3.10) and (3.11) in terms of integrated quantile and classifier biases.

Proof. For the favorable direction ↑, (3.10) follows from Definition 2.2, Definition 2.3, and the quantile representation of the transport efforts. Next, by (3.10) and Lemma B.3 we obtain (3.11). If the favorable direction is ↓, the proof of (3.10) and (3.11) is similar.

Example. To understand the statement of Theorem 3.2, consider a classification risk model (ςf = −1) with a predictor whose variance depends on the attribute G, which leads to the presence of both positive and negative bias components in the score distribution. Figure 4 depicts the subpopulation score CDFs of the trained GBM classifier and illustrates the fact that the integrated positive quantile and classifier biases yield the positive model bias (green region), and a similar relationship holds for the negative model bias (purple region). The monotone transport flows are also depicted, showing the connection between the signed model bias and the favorability. Since ςf = −1, in the green region the non-protected class is transported towards the non-favorable direction, while in the purple region it is transported towards the favorable one.
On renormalization of model bias.
For classification scores, the model bias takes values in the unit interval, which makes it easy to interpret the amount of the bias in the model distribution.
For regressors, however, the model bias can take any value in [0, ∞). One approach is to normalize the model bias as follows. First, pick an appropriate reference scale L > 0 corresponding to the response variable. Given the scale L, one can define a generalized Wasserstein-based model bias Biasg,W1(f|G), where the link function g is strictly increasing and chosen so that Biasg,W1(f|G) = (1/L) BiasW1(f|G) whenever the transport effort is within the scale of interest L, that is, when BiasW1(f|G) ≤ L/2. In practice, for bounded distributions one can pick L to be the length of supp(Pf(X)), while for unbounded distributions one can pick L = 2σ(f(X)).
In our work, we develop the bias explanation methods to explain the actual amount of transport effort between subpopulations. The generalization to (3.12) is trivial.

Generalized group-based parity model bias
In this section, we generalize the notions of the Wasserstein-based bias to the case of a generic group-based parity for protected attributes with multiple classes. We then apply the generalization to the equalized odds and the equal opportunity parity conditions.

Definition 3.3. Let f be a model, X ∈ R^n predictors, G ∈ {0, 1, . . ., K − 1} the protected attribute, G = 0 the non-protected class, and ςf the sign of the favorable direction of f. Let A = {A1, . . ., AM} be a collection of disjoint subsets of the sample space Ω, and define the corresponding events.

(i) We say that Yt = 1{f(X)>t} satisfies the A group-based parity if (3.13) holds, where the weights satisfy Σ_{k=1}^{K−1} Σ_{m=1}^{M} wkm = 1.

(iii) The positive and negative (W1, A) weighted model biases are defined analogously to Definition 3.2.

Lemma 3.4. Let G and A be as in Definition 3.3. The (W1, A) model bias is consistent with the generic parity criterion (3.13). Similarly, the signed model biases can be expressed in terms of the corresponding weighted classifier biases.

Proof. The claim follows directly from Theorem 3.3.

Integral probability metrics for fairness assessment
When assessing the fairness of regressors, it is crucial to pick an appropriate metric because the model output is often used to make decisions. A wide class of candidate metrics is given by integral probability metrics (IPMs). These provide a notion of "distance" between probability distributions and are designed as generalizations of the Kantorovich-Rubinstein variational formula. They can be defined directly using variational formulas (Müller, 1997, Sriperumbudur et al., 2009). Specifically, IPMs are defined by maximizing the difference of expected values over a function space A,

WA(ν0, ν1) := sup_{ϕ∈A} { ∫ ϕ dν0 − ∫ ϕ dν1 }.

Dropping the regularity of test functions leads to a discontinuous response to the shifting of delta masses. For example, by setting A = {ϕ : ‖ϕ‖∞ ≤ 1}, one obtains the total variation metric DTV. An interesting aspect of the above variational formula is that it can be generalized to include a broader family of distances between probability distributions, namely divergences such as the Kullback-Leibler divergence; see Birrell et al. (2020) for more information. Thus, IPMs with regular test functions serve as good candidates for assessing the fairness of the regressor via formula (3.1). One interesting contender is WA*, where A* = {ϕ : 0 ≤ ϕ ≤ 1, |ϕ(x) − ϕ(y)| ≤ d(x, y)}, which is a metric equivalent to the Dudley metric and has the appealing property that its values lie in the unit interval. WA* is meaningful for fairness assessment, as it can be expressed via a supremum over all "agents" in the form of regular randomized classifiers that detect the differences between the two subpopulation distributions. Specifically, it can be shown that WA* coincides with the Drc metric introduced in Dwork et al. (2012) and discussed in Section 2.6.

Lemma 3.5. Let (X, d) be a metric space. Then Drc(µ, ν; DTV, d) = WA*(µ, ν).

Proof. See Appendix B.
Recall that Dwork et al. (2012) established that the statistical parity bias of a randomized classifier is bounded by the Drc distance between the subpopulation input distributions. In contrast, we focus on measuring and explaining the bias in the output of non-randomized regressors, including classification scores, for which the notion of statistical parity is not, in general, applicable. In particular, we assess the distance between the regressor output subpopulations via the W1 metric. In general, any transport metric can be considered for this task, such as WA*. Furthermore, we propose a framework that quantifies the contribution of predictors to that distance, which serves as a mechanism that pinpoints the main drivers of the regressor bias.
The lemma below illustrates the different behavior of the two metrics under scaling.
Proof.See Appendix B.
Notice that for large c the values of Drc with the dc norm saturate and approach one, which is an upper bound for the metric. However, W1 is unbounded and the distance between the pushforward measures T#µ, T#ν scales linearly in c, which is an appealing property. Dwork et al. (2012) establishes the connection between Drc and W1 under the assumption that the subpopulation distributions are discrete and d ≤ 1. In what follows, we prove a more general version of (Dwork et al., 2012, Theorem 3.3) that connects the two metrics and holds for all probability measures with bounded support.
When using Drc for fairness, the above theorem implies that saturation can be partially avoided via scaling. For example, the rescaling factor can be chosen as the second moment of the two probability measures. However, in our paper we focus on the Wasserstein metric because of its appealing scaling property.

Relationship between model fairness and predictors
It is shown in Gordaliza et al. (2019) that the statistical parity bias of (non-randomized) classifiers can be bounded by the total variation distance between predictor subpopulations, while the Wasserstein metric, in general, does not allow for such control (in the sense of a bound). In contrast to the bound in Gordaliza et al. (2019), the W1-bias in predictors can control the statistical parity bias of Lipschitz randomized classifiers, as shown in Dwork et al. (2012), as well as the W1 regressor bias, as shown by the following lemma.
Lemma 4.1. Let X, G, f be as in Definition 3.3. If f is Lipschitz continuous with constant Lip(f), then

BiasW1(f|G) ≤ Lip(f) · W1(PX|G=0, PX|G=1).

Proof. The proof follows directly from the Kantorovich-Rubinstein variational formula.
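The Lipschitz control can be checked numerically. The sketch below is a one-dimensional illustration (the lemma is stated for X ∈ R^n); the coupled shift of the subpopulations, the choice f = 0.5·tanh, and the helper name are our assumptions:

```python
import numpy as np

def w1_1d(a, b, n_grid=5_000):
    """Empirical W1 distance between two 1-D samples via quantiles."""
    p = (np.arange(n_grid) + 0.5) / n_grid
    return float(np.mean(np.abs(np.quantile(a, p) - np.quantile(b, p))))

rng = np.random.default_rng(0)
x0 = rng.normal(0.0, 1.0, 4_000)   # predictor subpopulation X|G=0
x1 = x0 + 1.0                      # X|G=1: the same sample shifted by 1
L = 0.5
f = lambda x: L * np.tanh(x)       # Lipschitz with constant L = 0.5

bias_model = w1_1d(f(x0), f(x1))   # W1 bias of the model output
bias_pred = w1_1d(x0, x1)          # W1 bias in the predictor (= 1.0 here)
assert bias_model <= L * bias_pred + 1e-9
```

The model bias stays below Lip(f) times the predictor bias, as the lemma asserts.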
While the fairness of predictors as a bound is of theoretical importance, it provides little information on the contribution of each predictor to the model unfairness. This is because fairness of predictors is a sufficient requirement for fairness of the model, but not a necessary one. In particular, one can construct a data generating model whose predictors are wildly biased while the model output is only slightly unfair. This pedagogical example motivates us to directly assess the contribution of predictors to the model bias. To accomplish this, we design an interpretability framework that employs optimal transport theory in order to pinpoint the main drivers of the model bias. Information from these drivers can then be used for policy decision-making, regulatory-compliant bias mitigation (Miroshnikov et al., 2021b), as well as in other settings.

Model interpretability
The bias explanations we develop in the next section make use of model explainers, whose objective is to quantify the contribution of each predictor to the value of f(x). Several methods of interpreting ML model outputs have been designed and used over the years. Some notable ones are Partial Dependence Plots (PDP) (Friedman, 2001) and SHAP values (Lundberg and Lee, 2017).
Partial dependence plots. The PDP marginalizes out the variables whose impact on the output is not of interest, quantifying the overall impact of the values of the remaining features.
Let X ∈ R^n be predictors, XS with S ⊆ {1, 2, . . ., n} a subvector of X, and −S the complement index set. Given a model f, the partial dependence plot of f on XS is defined by

PDPS(xS) = E[f(xS, X−S)], (4.3)

where we abuse the notation and ignore the variable ordering in f.
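The expectation in (4.3) is typically estimated by averaging over the data. A minimal sketch (function and argument names are ours):

```python
import numpy as np

def partial_dependence(f, X, S, grid):
    """Empirical PDP of f on X_S: fix the S-coordinates at each grid value
    and average f over the empirical distribution of the complement X_{-S}."""
    X = np.asarray(X, dtype=float)
    pdp = []
    for xS in grid:
        Xm = X.copy()
        Xm[:, S] = xS          # broadcast the fixed value(s) over all rows
        pdp.append(float(f(Xm).mean()))
    return np.array(pdp)
```

For a linear model the resulting curve recovers the linear dependence on X_S up to the constant E[f applied to the remaining features].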

Shapley additive explanations.
In its original form, the Shapley value appears in the context of cooperative games; see Shapley (1953), Young (1985). A cooperative game with n players is a superadditive set function v that acts on subsets of N = {1, 2, . . ., n} and satisfies v(∅) = 0. Shapley was interested in determining the contribution of each player to the game value v(N). It turns out that under certain symmetry assumptions the contributions are unique and they are called Shapley values; furthermore, the superadditivity assumption can in principle be dropped (uniqueness and existence still hold). It is shown in Shapley (1953) that there exists a unique collection of values {ϕi}_{i=1}^n satisfying the axioms of symmetry, efficiency, and law of aggregation ((A1)-(A3) in Shapley (1953)); it is given by

ϕi[v] = Σ_{S ⊆ N\{i}} |S|! (n − |S| − 1)! / n! · (v(S ∪ {i}) − v(S)). (4.4)

The values provide a disaggregation of the value v(N) of the game into n parts that represent the contribution to the worth by each player: Σ_{i=1}^n ϕi[v] = v(N). The explanation techniques explored in Štrumbelj and Kononenko (2010) and Lundberg and Lee (2017) utilize cooperative game theory to compute the contribution of each predictor to the model value.
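The Shapley formula (4.4) can be computed exactly for small n by enumerating coalitions. A minimal sketch (the function name is ours; the cost is O(2^n)):

```python
from itertools import combinations
from math import factorial

def shapley_values(v, n):
    """Exact Shapley values of a game v on N = {0, ..., n-1} via the
    coalition-weighted sum in (4.4)-style form."""
    players = set(range(n))
    phi = [0.0] * n
    for i in players:
        for size in range(n):
            for S in combinations(players - {i}, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (v(frozenset(S) | {i}) - v(frozenset(S)))
    return phi
```

For an additive game the values reduce to each player's own worth, and their sum always equals v(N) (efficiency).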
In particular, given a model f, Lundberg and Lee (2017) consider the games

v^CE(S; X, f) = E[f(X)|XS],  v^ME(S; X, f) = E[f(xS, X−S)]|_{xS=XS}. (4.5)

The games defined in (4.5) are not cooperative since they do not satisfy the condition v(∅) = 0. However, by setting ϕ0 = E[f(X)], the values satisfy the additivity property f(X) = ϕ0 + Σ_{i=1}^n ϕi. Throughout the text, when the context is clear, we suppress the explicit dependence of v(S; X, f) on X and f. Furthermore, we refer to the values ϕi[v^ME] and ϕi[v^CE] as SHAP values and, abusing the notation, write ϕi(X; f, v) = ϕi[v(·; X, f)].

Conditional and marginal games. In our work, we refer to the games v^CE and v^ME as conditional and marginal, respectively. If the predictors X are independent, the two games coincide. In the presence of dependencies, however, the games are very different. Roughly speaking, the conditional game explores the data by taking into account dependencies, while the marginal game explores the model f in the space of its inputs, ignoring the dependencies. Strictly speaking, the conditional game is determined by the probability measure PX, while the marginal game is determined by the product probability measures PXS ⊗ PX−S, S ⊂ N, as stated below.

Lemma 4.2 (stability). The SHAP explanations satisfy the continuity properties (i) and (ii) with respect to the corresponding probability measures.

Proof. By the properties of the conditional expectation and (4.4), and since ϕ is linear, the map in (i) is a bounded linear operator with unit norm. This proves (i). By (4.4) and (4.5), the bound in (ii) holds with a constant C = C(n) that depends on n. This proves (ii).
To clarify the notation, we let L^2(P̃X) denote the space of functions defined on R^n that are square integrable with respect to each product measure PXS ⊗ PX−S, S ⊂ N, where as before we ignore the variable ordering in f, and for S = ∅ we assign PX∅ ⊗ PX = PX. We should point out that under dependencies the marginal explanation map (ii) in Lemma 4.2 is in general not continuous in L^2(PX). Hence an algorithm that produces marginal explanations may fail to satisfy the stability bounds in the sense discussed in Kearns and Ron (1999), Bousquet and Elisseeff (2002). For a more general version of the above proposition see Miroshnikov et al. (2021a).
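The marginal/conditional distinction can be seen on a toy example of our construction: X2 is an exact copy of X1 and the model depends on X1 only, so the marginal game, which averages over the product measure and thus breaks the dependence, assigns X2 no role:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=5_000)
X = np.column_stack([x1, x1.copy()])   # X2 duplicates X1 exactly
f = lambda Z: Z[:, 0]                  # model uses X1 only

def v_me(S, x):
    """Marginal game v_ME(S) at point x: substitute x_S and average f
    over the empirical marginal of X_{-S}, ignoring the dependence."""
    Xm = X.copy()
    Xm[:, list(S)] = np.asarray(x)[list(S)]
    return float(f(Xm).mean())

x = np.array([2.0, 2.0])
# Fixing X1 pins the output; fixing X2 does nothing for the marginal game,
# even though X2 = X1, so v_ME({2}) collapses to E[f(X)].
assert abs(v_me([0], x) - 2.0) < 1e-9
assert abs(v_me([1], x) - x1.mean()) < 1e-9
# The conditional game would instead give E[f(X) | X2 = 2] = 2.
```

This illustrates why, under dependencies, the two games (and the resulting SHAP values) can differ substantially.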
In general, SHAP values are computationally intensive to evaluate due to the different combinations of predictors that need to be considered; in addition, computing ϕ[v^CE] is challenging when the predictor dimension is large, in light of the curse of dimensionality; see Hastie et al. (2016). Lundberg et al. (2019) created a fast method called TreeSHAP, but it can only be applied to ML algorithms that incorporate tree-based techniques. The algorithm evaluates ϕ[ν] for a game ν that can be chosen as either one that is based upon tree paths and resembles v^CE, or the marginal game v^ME. To understand the difference between the two games, see Janzing et al. (2019), Sundararajan and Najmi (2019), Chen et al. (2020), Miroshnikov et al. (2021a).

Bias explanations of predictors
In this section, given a model, we define the bias explanation (or contribution) of each predictor. An extension to groups of predictors may be found in Section 4.6.
In what follows we use the following notation. Given predictors X = (X1, X2, . . ., Xn) and a model f, a generic single-feature explainer of f that quantifies the attribution of each predictor Xi to the model value f(X) is denoted by E(X; f) = (E1(X; f), . . ., En(X; f)). For example, a simple way of setting up an explainer Ei is by specifying each component via a conditional or marginal expectation, Ei(X; f) = v^CE({i}; X, f) or Ei(X; f) = v^ME({i}; X, f). A more advanced way of computing single-feature explanations is via the Shapley value, Ei(X; f) = ϕi(X; f, v). For more details on appropriate game values and their properties see Miroshnikov et al. (2021a).
Definition 4.1. Let X ∈ R^n be predictors, f a model, G ∈ {0, 1} the protected attribute, G = 0 the non-protected class, and ςf the sign of the favorable direction of f. Let E(X; f) be an explainer of f that satisfies E[|E(X; f)|] < ∞.
• The bias explanation βi of the predictor Xi is defined as the W1 distance between the distributions of Ei(X; f)|G = 0 and Ei(X; f)|G = 1.

• The positive bias and negative bias explanations β±i of the predictor Xi are defined as the corresponding transport efforts in the non-favorable and favorable directions, respectively. In this case, the Xi bias explanation is disaggregated as follows: βi = β+i + β−i.

• The Xi net bias explanation is defined by β^net_i = β+i − β−i.

• The classifier (or statistical parity) bias of the explainer Ei for a threshold t ∈ R is defined as the statistical parity bias of the classifier 1{Ei(X;f)>t}.

By design, the contribution β+i measures the positive contribution to the total model bias, not the contribution to the positive model bias alone. In particular, it measures the contribution to the increase in the positive flow and to the decrease in the negative one. The meaning of β−i is similar. To better understand their meaning, consider a data generating model with two independent predictors X1, X2 whose bias contributions offset each other exactly. Note that BiasW1(f|X, G) = 0, while the bias explanations are τ for either model explainer discussed in this section. Note also that both the positive and the negative model biases are zero. The positive contribution β+1 = τ measures how much in total is added to the positive model bias and subtracted from the negative one. A similar discussion holds for β−i. Thus, the amount that X1 contributes to the positive bias is offset by the amount that X2 resists its increase. This leads to zero positive model bias. A similar discussion applies to the negative model bias.
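Given per-sample explainer values for one predictor, the bias explanation and its signed components can be estimated exactly as in the model-bias case, with the explainer distribution in place of the score distribution (function and argument names are ours):

```python
import numpy as np

def bias_explanation(e_i, g, fav_sign=+1, n_grid=10_000):
    """Bias explanation of one predictor from its per-sample explainer
    values e_i and the protected attribute g in {0, 1}.
    Returns (beta, beta_plus, beta_minus), beta = beta_plus + beta_minus."""
    e_i, g = np.asarray(e_i, float), np.asarray(g)
    p = (np.arange(n_grid) + 0.5) / n_grid
    d = fav_sign * (np.quantile(e_i[g == 0], p) - np.quantile(e_i[g == 1], p))
    beta_plus = float(np.mean(np.clip(d, 0, None)))
    beta_minus = float(np.mean(np.clip(-d, 0, None)))
    return beta_plus + beta_minus, beta_plus, beta_minus
```

The net explanation is then `beta_plus - beta_minus`, mirroring the disaggregation in Definition 4.1.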
Lemma 4.3. Let X, f, G, Ei(X; f), and ςf be as in Definition 4.1. Then the bias explanations admit the threshold representation analogous to Theorem 3.2.

Proof. Similar to the proof of Theorem 3.2 with the assumption ςEi = ςf.
Observe that the bias explanations for a classification score always lie in the unit interval.
Lemma 4.4. Let f be a classification score and G ∈ {0, 1} the protected attribute. Let the explainer Ei be either v^CE({i}; X, f) or v^ME({i}; X, f). Then βi(f|G) ∈ [0, 1].

Proof. The lemma follows from the fact that f ∈ [0, 1] and the definition of the explainer values.
The explainer Ei that appears in Definition 4.1 is a generic one. In the examples that follow we chose to work with explainers based on marginal SHAP values because of the ease of computation. Note that when the predictors are independent, the two types of explanations coincide; for the case when dependencies are present, see the discussion at the end of the section.
Intuition. For a given model f and explainer Ei, the explanation βi quantifies the W1 distance between the distributions of Ei|G = 0 and Ei|G = 1; that is, this value is an assessment of the bias introduced by the predictor Xi. The value βi is the area between the corresponding subpopulation explainer CDFs FEi|G=k, k ∈ {0, 1}, similar to the area depicted in Figure 4. In what follows we consider several simple examples to gain more intuition about the bias explanation values, and we discuss their additivity or the lack thereof. To avoid complex notation, when the context is clear we suppress the dependence of the bias explanations on X and the explainer E.
Definition 4.2.Let f , X, G, and Ei be as in Definition 4.1.
• We say that Ei strictly favors a class if it favors that class at every threshold.

Offsetting. Since each predictor may favor one class or the other, the predictors may offset each other in terms of their bias contributions to the model bias. To understand the offsetting effect, consider a binary classification risk model (ςf = −1) with two predictors, where µ = 5, {Xi|G = k}i,k are independent, and P(G = 0) = P(G = 1). We next train a logistic regression score f(X), with ςf = −1, and choose the explainer to be Ei = PDPi. By construction, the explanation E1 of the predictor X1 strictly favors class G = 0, while that of X2 strictly favors class G = 1.

Another important point we need to make is that the equality β^net_i = 0 does not in general imply that the predictor Xi has no effect on the model bias. This is a consequence of (4.7). Moreover, predictors with mixed bias might amplify the model bias as well as offset it. To understand how mixed-bias predictors interact at the level of the model bias, consider the following risk classification model (ςf = −1),
where µ = 5, and {Xi|G = k}i,k are independent and P(G = 0) = P(G = 1). As before, we train a logistic regression score f, with ςf = −1, and choose Ei = PDPi. By construction, the true classification score f satisfies β^net_i(f|G) = 0 for each predictor Xi. Furthermore, the CDFs of the explainers differ at every threshold t ≠ 0.5. Combining the two predictors at the level of the model leads to amplifying the positive and negative model biases and hence the model bias itself. Figure 6 displays the CDFs for the trained score subpopulations f|G = k and the corresponding explainers Ei(f)|G = k. The numerics illustrate that as long as the regions of positive and negative bias of mixed predictors agree, when combined they will increase the model bias.
If the regions of positive and negative bias for two predictors do not agree, then offsetting will happen. To see this, let us modify the above example so that, by construction, β^net_i(f|G) = 0 for each predictor. However, the region of thresholds where the explainer E1(f) favors class G = 0 coincides with the region where E2(f) favors class G = 1, and the same holds for the two complementary regions. This leads to bias offsetting, so that BiasW1(f|G) = 0. The numerical results for this example are displayed in Figure 7.

Bias explanation plots. Given a machine learning model f, predictors X ∈ R^n, protected attribute G, and the explainers Ei, the corresponding bias explanations are sorted according to any desired entry in the 4-tuple (βi, β+i, β−i, β^net_i) and then displayed in that order. This plot is called a Bias Explanation Plot (BEP).
To showcase how a BEP works, consider a classification risk model (ςf = −1) in which {Xi|G = k}i,k are independent and P(G = 0) = P(G = 1). We next generate 20,000 samples from the distribution of (X, Y) and train a regularized XGBoost model, which produces the score f. Figure 8 displays the CDFs of the subpopulation scores f|G = k (top left) and those of the explainers Ei = ϕi(f, v^ME). We see that there is positive model bias in the plot showing the CDFs; thus class G = 0 is favored. For the predictors, the bias explanation plots show that X1, X4 and X5 have mixed biases that arise due to differences in the subpopulation variances of the predictors, while the bias in X2 strictly favors class G = 1 and the bias in X3 strictly favors class G = 0.
Relationship with the model bias. The positive and negative bias explanations provide an informative way to determine the main drivers of the positive and negative bias among the predictors, which can be done by ranking the bias attributions. However, though informative, the positive and negative bias explanations are not additive; that is, in general the signed bias explanations do not sum up to the corresponding signed model biases. The main reason for the lack of additivity is the presence of bias interactions, which happen at the level of quantiles, or thresholds. The bias explanations by design compute the contribution to the cost of transport but do not track how the mass is transported; see Figures 6, 7. To better understand the bias interactions, motivated by Štrumbelj and Kononenko (2010), we introduce a game-theoretic approach in Section 4.5 that yields additive bias explanations.
For additive models with independent predictors, however, we have the following result.
Lemma 4.5. Let X ∈ R^n be predictors, and let the model f be additive, that is, f(x) = Σ_{i=1}^n fi(xi). Let Ei = v^ME({i}; X, f) and let {βi}i be the bias explanations of (X, f). Then the net bias explanations are additive: Σ_{i=1}^n β^net_i = Bias^net_W1(f|G). If X are independent, then the lemma also holds for Ei in the form v^CE({i}; X, f) or ϕi(X; f, v^ME).

Proof. Suppose that Ei(X; f) = v^ME({i}; X, f). Then, in view of the additivity of f, we have Ei(X; f) = fi(Xi) + Σ_{j≠i} E[fj(Xj)], and hence by Lemma 4.3 the net bias explanation of Xi equals ςf (E[fi(Xi)|G = 0] − E[fi(Xi)|G = 1]). Summing up the net bias explanations gives the net model bias, (4.8). Suppose next that Ei(X; f) = ϕi(X; f, v^ME). Since {Xi}_{i=1}^n are independent and f is additive, ϕi[v^ME] coincides with v^ME({i}) up to a distributional shift. Since a shift in the distribution does not affect the bias, the bias explanations based on ϕi[v^ME] coincide with those of v^ME. This, together with (4.8) and the independence assumption, proves the lemma.
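The summing of net bias explanations for an additive model can be verified numerically. In the sketch below (our construction), f(x) = x1 + x2, the G = 0 subpopulation of X1 is shifted up by 1.0 and that of X2 down by 0.4, so the net model bias and the summed net explanations both come out to 0.6 (favorable direction up):

```python
import numpy as np

def net_effort(a, b, n_grid=4_000):
    """Net transport effort between samples a and b (favorable direction
    up): the integral of F_a^{-1}(p) - F_b^{-1}(p) over p in (0, 1)."""
    p = (np.arange(n_grid) + 0.5) / n_grid
    return float(np.mean(np.quantile(a, p) - np.quantile(b, p)))

rng = np.random.default_rng(1)
z1, z2 = rng.normal(size=2_000), rng.normal(size=2_000)
f_g0 = (z1 + 1.0) + (z2 - 0.4)   # f(X)|G=0, shifted subpopulations
f_g1 = z1 + z2                   # f(X)|G=1

net_model = net_effort(f_g0, f_g1)
net_expl = net_effort(z1 + 1.0, z1) + net_effort(z2 - 0.4, z2)
assert abs(net_model - net_expl) < 1e-8   # both approximately 0.6
```

Shifts commute with empirical quantiles, so the agreement here is exact up to floating-point error; for generic subpopulation changes the identity still holds by the argument in the proof.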
Example. Let f be as in Lemma 4.5. Suppose that f is either positively biased or negatively biased, that is, BiasW1(f|G) equals Bias+W1(f|G) or Bias−W1(f|G), respectively.

Stability of marginal and conditional bias explanations
Under dependencies, the marginal and conditional bias explanations differ in their description. The conditional bias explanations rely on the joint distribution of (X, Y) and encapsulate the interaction between the bias in the predictors and the response variable, while the marginal explanations encapsulate the interaction between the bias in the predictors and the structure of the model, that is, the map x → f(x); for details see Miroshnikov et al. (2021a). In particular, we have the following result.
Theorem 4.1. The bias explanations based on the marginal and conditional Shapley values satisfy the following:

(i) For all f, g ∈ L^2(PX), the conditional bias explanations are stable with respect to the L^2(PX) distance between f and g.

(ii) For all f, g ∈ L^2(P̃X), the marginal bias explanations are stable with respect to the L^2(P̃X) distance.

Proof. Take f, g ∈ L^2(PX), take i ∈ {1, 2, . . ., n}, and set A = ϕi(X; f, v^CE), B = ϕi(X; g, v^CE). Let µk = PA|G=k, νk = PB|G=k, and γk = P(A,B)|G=k for k ∈ {0, 1}. By construction γk ∈ Π(µk, νk), and hence W1(µk, νk) does not exceed the transport cost under γk, which by Lemma 4.2(i) is controlled by the L^2(PX) distance between f and g up to a constant C maximized over k ∈ {0, 1}. Then, using the triangle inequality and the inequality above, we obtain the bound on the bias explanations. We next establish the bounds for the net bias explanations. Assuming ςf = ςg and using Lemma 4.3, we obtain the analogous bound. Combining the above inequalities and using the fact that β± = (β ± β^net)/2 gives (i). To prove (ii), we follow the same steps as above and use Lemma 4.2(ii).

Shapley-bias explanations
As discussed in Section 4.2, the non-additive bias explanations measure the positive and negative contributions to the model bias, but not to each flow. To this end, we measure the signed contributions to the positive and negative model biases by employing a game-theoretic approach, which has been explored in numerous works in the area of machine learning interpretability; see Lipovetsky and Conklin (2001), Štrumbelj and Kononenko (2010), Lundberg and Lee (2017). In the spirit of Štrumbelj and Kononenko (2010), we define a cooperative game in which the players are predictors and the payoff is their bias contribution, and then compute the corresponding additive Shapley values.
Group explainers.Let X ∈ R n be predictors and f a model.A generic group explainer of f is denoted by E(S; X, f ), S ⊂ {1, 2, . . ., n}.
We assume that E(S; X, f) quantifies the attribution of each group XS, S ⊂ {1, 2, . . ., n}, to the model value f(X). Relatively straightforward group explainers can be constructed using the conditional and marginal games or game values. In particular, for a nonempty S ⊂ {1, 2, . . ., n}, one can set a trivial group explainer via the game or the corresponding Shapley values, for example E(S; X, f) = v(S; X, f) or E(S; X, f) = ϕS[v]. (4.9)

Definition 4.3. Let X, G, f, ςf be as in Definition 4.1. Let E(· ; X, f) be a group explainer.
• The cooperative bias-game v^bias associated with X, G, f and E is defined by v^bias(S) = W1(E(S)|G = 0, E(S)|G = 1), the minimal cost of transporting E(S)|G = 0 to E(S)|G = 1 and vice versa.
• Under the optimal transport, the positive bias-game and the negative bias-game, respectively, are defined as follows:

• v^bias+(S) is the effort of transporting E(S)|G = 0 in the non-favorable direction.
• v^bias−(S) is the effort of transporting E(S)|G = 0 in the favorable direction.
The above values are specified in Lemma 3.1 for q = 1.
• The net bias-game is defined by v^bias,net = v^bias+ − v^bias−.
• The Shapley-bias explanations of (X, f) based on the group explainer E are defined by ϕ^bias_i = ϕi[v^bias] and ϕ^bias±_i = ϕi[v^bias±], (4.10) where ϕ denotes the Shapley value (4.4) and where we suppressed the dependence on X and E.
Unlike the regular bias explanations which by construction are always non-negative, the Shapley-bias explanations are signed, that is, they can be both positive and negative.
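The construction above can be sketched end to end for a tiny example. The sketch below (function names are ours) builds the bias-game v^bias(S) from the marginal group explainer E(S; x) = E[f(xS, X−S)] and feeds it into the exact Shapley formula; the signed games v^bias± follow analogously by replacing the W1 helper with the signed transport efforts:

```python
import numpy as np
from itertools import combinations
from math import factorial

def w1_1d(a, b, n_grid=2_000):
    p = (np.arange(n_grid) + 0.5) / n_grid
    return float(np.mean(np.abs(np.quantile(a, p) - np.quantile(b, p))))

def shapley_bias(f, X, g, n_grid=2_000):
    """Shapley-bias explanations: players are predictors; the payoff
    v_bias(S) is the W1 distance between the subpopulation distributions
    of the marginal group explainer E(S; x) = E[f(x_S, X_{-S})]."""
    X, g = np.asarray(X, float), np.asarray(g)
    m, n = X.shape
    def explainer(S):                 # E(S) evaluated at every sample
        idx = list(S)
        out = np.empty(m)
        for j in range(m):
            Xm = X.copy()
            Xm[:, idx] = X[j, idx]
            out[j] = f(Xm).mean()
        return out
    def v(S):
        if not S:
            return 0.0
        e = explainer(S)
        return w1_1d(e[g == 0], e[g == 1], n_grid)
    phi = np.zeros(n)
    players = set(range(n))
    for i in players:
        for size in range(n):
            for S in combinations(players - {i}, size):
                wgt = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += wgt * (v(set(S) | {i}) - v(set(S)))
    return phi
```

By the efficiency of the Shapley value, the explanations sum to v^bias(N), the model bias of the explainer distribution.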
Lemma 4.6. Given (X, f) and the explainer E, the Shapley-bias explanations defined in (4.10) satisfy Σ_{i=1}^n ϕ^bias±_i = Bias±W1(f|G) and, thus, Σ_{i=1}^n ϕ^bias_i = BiasW1(f|G).

Proof. The result follows from Shapley (1953) and the properties of the W1-based model bias.
For Shapley-bias explanations based on the conditional and marginal games we have the following.
Theorem 4.2. Given (X, f), let the conditional and marginal bias-games be defined via the group explainers based on v^CE and v^ME, respectively. The conditional and marginal Shapley-bias explanations then satisfy stability properties analogous to those of Theorem 4.1.

Proof. The proof follows the same steps as in Theorem 4.1.
Example. Applying the above methodology to f and G of the model (M6), we compute the Shapley-bias explanations of the predictors Xi, i ∈ {1, 2, . . ., 5}, using the group explainer E(S) = ϕS[v^ME] defined in (4.9) for the construction of the bias-game.
The results are displayed in Figure 10. On the left, the explanations are plotted in increasing order of the positive bias, in the middle plot by the total bias, while the right plot contains the information on all four types of biases. By comparing these to the non-additive bias explanation plots in Figure 9, we see how the signed values provide further information on how the predictors contribute to the model bias.
For example, from (M6) we have that X3, as a contributor to the model f, favors the class G = 0, since β+3 > 0 and β−3 = 0. Recall that β+3 captures the total contribution to the increase of the positive model bias plus the decrease of (or resistance to) the negative model bias. The Shapley-bias explanations, however, allow one to estimate separately the (signed) contributions to both the positive and the negative model bias.
In particular, the left plot of Figure 10 informs us that X3 in f contributes to the increase of the positive model bias (green), measuring the contribution to pushing the subpopulation of the non-protected class towards the favorable direction, while its contribution to the negative model bias (blue) is negative, which indicates the resistance towards the subpopulation's pull in the non-favorable direction.

Group Shapley-bias explanations
It might be important for a practitioner to understand the main factors within the data itself that contribute to the bias in the response variable, and not how the model structure contributes to it. To do this, one needs to generate bias explanations based on the conditional game v^CE. The conditional game, when the predictors are independent, coincides with the marginal game, and the conditional expectations E[f(X)|XS] can be computed through averaging with error control. However, under dependencies, the conditional expectations and the corresponding Shapley-bias explanations are difficult to compute in light of the curse of dimensionality.
Another important aspect to consider is that highly dependent predictors carry similar information. For instance, in the case where a group of predictors is represented via a smaller collection of latent variables, the latent variable explanations are spread out among the predictors in that group; see Chen et al. (2020). Under dependencies, for practical and business purposes, one may want to explain the information carried by the entire group rather than the individual predictors themselves.
The two issues mentioned above can be addressed simultaneously by adapting the ideas from Aas et al. (2020) and Miroshnikov et al. (2021a). In particular, grouping predictors based on dependencies and utilizing specially-designed group explainers to compute the contribution of each group helps unite the marginal and conditional approaches. Therefore, applying similar techniques, one can approximate the conditional Shapley-bias explanations of weakly independent groups using the marginal approach, which only requires averaging over a small dataset. Furthermore, grouping allows one to reduce the complexity.
In what follows we adapt the techniques from Miroshnikov et al. (2021a) to construct group Shapley-bias explanations. To this end, we first introduce the required notation. Let X ∈ R^n and let {Sj}_{j=1}^m be disjoint sets that partition the set of the predictors' indexes,

N = {1, 2, . . ., n} = ∪_{j=1}^m Sj,  P = {S1, S2, . . ., Sm}, (4.11)

so that XS1, XS2, . . ., XSm form weakly independent groups such that within each group the predictors share a significant amount of mutual information. Given a cooperative game v on N, we define the quotient game by v^P(A) = v(∪_{j∈A} Sj), A ⊂ M = {1, 2, . . ., m}. By design, v^P(A) is played by the groups, viewing the elements of the partition as players.
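The quotient-game construction is mechanical and can be sketched directly (function names are ours):

```python
def quotient_game(v, partition):
    """Lift a game v on N to the quotient game v^P on the partition:
    a coalition A of group indexes plays the union of its members."""
    def v_p(A):
        members = set()
        for j in A:
            members |= set(partition[j])
        return v(members)
    return v_p
```

Applying the Shapley value to `v_p` then treats each group as a single player, which is exactly how the group explanations below are built.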
Definition 4.4. Let X, f, G be as in Definition 4.1, and let the partition P be as in (4.11).
• The conditional and marginal group bias-games are defined by (4.12).
• The corresponding Shapley-bias explanations of {X_{S_j}}_{j=1}^m are then defined via the respective group bias-games.

Lemma 4.7. Let X, f, G, P be as in Definition 4.4. If {X_{S_j}}_{j=1}^m are independent, then the conditional and marginal group Shapley-bias explanations coincide; consequently, (4.13) holds.

Proof. By independence, we have v^{ME,P} = v^{CE,P}. Hence by (4.12) the corresponding explanations are equal, and this yields (4.13). The stability argument can be carried out similarly to Lemma 4.1.
A similar construction is used to compute the positive and negative bias explanations φ^{bias+,P}_{S_j} and φ^{bias−,P}_{S_j}, respectively.
Remark 4.2. The importance of equality (4.13) is that the expression on the right-hand side can be computed via averaging using a dataset with O(τ^{-2}) samples for a given error tolerance τ. This makes the computation of the conditional explanations feasible. Furthermore, the computational complexity becomes O(2^m), where m is the number of independent groups. For example, given a classification score and X ∈ R^100, with the 100 predictors split into 10 independent groups, it is sufficient to use a dataset with 10000 samples in order to compute the conditional group Shapley-bias explanations with error tolerance τ = 0.01 and complexity O(2^10 · 10000^2), which is feasible and easily parallelizable. If the number of independent groups is still large, the above technique can be modified to incorporate recursive groupings.
5 On the application of the framework

Bias mitigation under regulatory constraints
In this section, we discuss how the fairness interpretability framework can be used in real-world applications within financial institutions that operate under regulatory constraints.
An operational flow for model development in many FIs may consist of the following stages: 1) Model training, 2) Fair Lending Compliance governance review, and 3) Production, which includes the model prediction and decision-making steps. Steps 1 and 3 are carried out by quantitative departments, while step 2 is performed by a dedicated Compliance Office (CO), a department separate from the business. The CO provides oversight of the company's compliance with federal and state regulations.
As stated by the ECOA, FIs are explicitly prohibited from collecting certain protected information on customers, such as race and ethnicity (apart from mortgage lending). Furthermore, protected attributes cannot be used in training or inference. However, proxy information on the protected attribute, such as that derived from Bayesian Improved Surname Geocoding (BISG), is allowed to be used by the compliance office solely for fairness analysis (Elliot et al., 2009). Proxy information, however, must remain within the compliance office, and the business does not (and should not) have access to the proxy data.
For fairness assessment, the CO carries out the bias assessment step. Using our method, the CO can determine the main drivers contributing to the model bias and subsequently utilize bias mitigation methods. The bias mitigation step can include model post-processing. However, in order to adhere to regulations, a post-processed model must not utilize the proxy attribute G or any information on the joint distribution (X, G), such as the probabilities P(G|X). The reasons are that a) in the production step one only has access to X, and b) a post-processed model is shared with business units that should be prevented from inferring the protected attribute from X.
Some rudimentary techniques for bias mitigation include recommendations on which predictors to drop from training, or model post-processing that nullifies a given predictor by fixing its value. A more efficient technique is proposed in our companion paper, Miroshnikov et al. (2021b), where we construct an efficient frontier over a family of compliant post-processed models utilizing the interpretability framework developed in the current article. Other examples of compliant methods include those that vary hyper-parameters to obtain an efficient frontier, such as those in Schmidt et al. (2021).

Pedagogical example on bias mitigation
In this section we provide a pedagogical example that showcases how to properly utilize the information on the positive and negative bias explanations when it comes to bias mitigation. A rudimentary mitigation technique one can employ is to construct a regulatory-compliant post-processed model by neutralizing an appropriate collection of predictors X_S. This is accomplished by fixing their values in the model to some reference values x*_S, setting f(x; S, x*) = f(x*_S, x_{-S}). Often the objective of bias mitigation in FIs is the reduction of the positive model bias, which quantifies how much the model favors the majority class. In practice, regressor models are usually positively biased, meaning Bias^+_{W_1}(f|G) > 0 and Bias^-_{W_1}(f|G) = 0.

Taking into account the above discussion, let us assume for the sake of explanation that f(X) = Σ_{i=1}^n f_i(X_i) is an additive, positively biased model. Let β^+_i, β^-_i, i ∈ N = {1, ..., n}, be the positive and negative marginal bias explanations, respectively. Finally, let us decompose the predictor index set as N = N_+ ∪ N_- ∪ N_0, where the net bias explanation β^+_i − β^-_i is positive on N_+, negative on N_-, and zero on N_0. In this case, by Lemma 12 the model bias is given by

  Bias_{W_1}(f|G) = Σ_{i∈N_+} (β^+_i − β^-_i) − Σ_{i∈N_-} (β^-_i − β^+_i),

which illustrates the bias offsetting mechanism. Note that neutralizing a predictor i_0 ∈ N_- would cause the model bias, which is equal to the positive model bias, to increase, while neutralizing i_1 ∈ N_+ would cause the model bias to decrease. Thus, one approach to reduce the model bias is to rank-order the predictors in N_+ by their net bias explanations and subsequently neutralize them one by one in that order. This incrementally reduces the positive model bias up to the point where neutralizing the next predictor would cause the model bias to become equal to the negative model bias (with the positive model bias being zero), which operates as a stopping criterion. This simple and rather naïve strategy illustrates that a) the decomposition of explanations is useful for bias mitigation and that b) neutralization of biased predictors ranked by total bias contribution is not always the optimal strategy.
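The naive strategy above can be sketched as follows for an additive model, assuming the marginal bias explanations β^±_i have already been computed; the function name and the dictionary representation of the explanations are ours.

```python
def naive_neutralization_order(beta_pos, beta_neg):
    """Rank-order predictors for the naive neutralization strategy.

    beta_pos, beta_neg: dicts i -> positive / negative marginal bias explanation.
    Returns the predictors in N_+ to neutralize, in descending order of net bias,
    stopping before the remaining (signed) model bias would flip sign.
    """
    net = {i: beta_pos[i] - beta_neg[i] for i in beta_pos}
    n_plus = [i for i, b in net.items() if b > 0]
    # For an additive model, the signed model bias is the sum of net explanations.
    total = sum(net.values())
    selected = []
    for i in sorted(n_plus, key=lambda j: net[j], reverse=True):
        if total - net[i] < 0:  # stopping criterion: positive bias exhausted
            break
        total -= net[i]
        selected.append(i)
    return selected
```

For instance, with net explanations {1: 0.3, 2: 0.2, 3: -0.1}, neutralizing predictor 1 leaves a signed bias of 0.1, and neutralizing predictor 2 as well would push the bias negative, so the procedure stops after predictor 1.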

Example on census income dataset
In this section, we showcase the application of the framework to the 1994 Census Income dataset from the UCI Machine Learning Repository (Dheeru et al., 2017).
This dataset contains fourteen predictors and a dependent variable Y that indicates whether an individual earns more or less than $50K annually. After investigating the predictors, we removed the protected attributes 'sex', 'race', 'age', and 'native-country'. We also excluded 'fnlwght' and 'relationship', the latter due to its high dependence on 'sex', since in the dataset the categorical values 'Husband' and 'Wife' correspond to 'Male' and 'Female', respectively. The remaining seven predictors used for model training are 'workclass', 'education-num', 'occupation', 'marital-status', 'capital-gain', 'capital-loss', and 'hours-per-week'. For the model training, we use the training dataset D_train with 32561 samples to build a classification score f(x) = P(Y = '>50K' | X = x) using Gradient Boosting. For training we use the following parameters: n_estimators=200, min_samples_split=5, subsample=0.8, learning_rate=0.1. The feature importance of each predictor can be seen in Figure 11a, with the most significant predictors being 'marital-status', 'capital-gain', and 'education-num'.
Performance metrics for the GBM model were evaluated on the training dataset and on a test dataset with 16251 samples. Specifically, the mean log-loss on the train and test sets is approximately 0.288 and 0.292, respectively, and the AUC is 0.922 and 0.918, respectively.
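The training setup can be sketched with scikit-learn's GradientBoostingClassifier using the hyper-parameters reported above. The data below is a synthetic stand-in (loading and encoding the UCI census predictors is omitted), so the computed metrics will not match the figures reported in the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss, roc_auc_score

rng = np.random.default_rng(0)

# Stand-in for the seven encoded census predictors; in the actual application,
# the categorical predictors are encoded numerically before fitting.
X = rng.normal(size=(2000, 7))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

# Hyper-parameters as reported in the text.
clf = GradientBoostingClassifier(
    n_estimators=200, min_samples_split=5, subsample=0.8, learning_rate=0.1
).fit(X, y)

# Classification score f(x) = P(Y = 1 | X = x)
scores = clf.predict_proba(X)[:, 1]
train_logloss = log_loss(y, scores)
train_auc = roc_auc_score(y, scores)
```

In practice one would evaluate `log_loss` and `roc_auc_score` on a held-out test split as well, mirroring the train/test comparison above.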
The focus of the application is to evaluate and explain the model bias with respect to the protected attribute G = 'sex', with values 'Female' and 'Male', where 'Female' is the protected class. To this end, following the steps in Algorithm 1, we form the dataset S containing the classification scores, S = {f(x^(i)) : (x^(i), y^(i)) ∈ D_train}, and partition it based on each class of G. This yields the sets S_M and S_F containing the classification scores for 'Male' and 'Female', respectively, which we use to construct the empirical CDFs of the subpopulation scores, F_Female and F_Male, using the ECDF class from the statsmodels library.
Figure 11b depicts the empirical CDFs, where we see that the model has almost exclusively positive bias, and the positive direction is assumed to be ς_f = 1. To confirm this observation, we subsequently compute the positive and negative model biases by integrating the difference of the two CDFs over the sets where F_Female > F_Male and F_Female < F_Male, respectively, as indicated in Definition 3.2.

To understand the contributions of the predictors to the model bias, we next construct the bias explanations based on the marginal model explainer. To accomplish this, we subsample the predictors from the training set and obtain a background dataset D_X with m = 4000 samples. Next, we compute the model explanations for each predictor X_i, yielding the sets S_{E_i}. Similar to obtaining the model bias, we then partition S_{E_i} based on each class of G and obtain the empirical CDFs of E_i(X)|G = g, g ∈ {'Female', 'Male'}, which are then used to compute the bias explanations β^±_i according to Definition 4.1. These are depicted in Figure 12a and are ranked in ascending order of the positive bias. All the values of the negative bias explanations are close to zero, which further indicates the positively biased nature of the predictors. Observe that the most positively contributing predictor to the model bias is by far 'marital-status', with value ≈ 0.12.
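The computation of the positive and negative model biases from the two subpopulation score samples can be sketched as follows, using the ECDF class from statsmodels as in the text; the function name and the grid discretization are ours.

```python
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

def model_bias_pos_neg(scores_prot, scores_nonprot, grid_size=10_000):
    """Positive/negative model bias: integrate the difference of the two
    subpopulation score CDFs over the sets where it is positive / negative,
    in the spirit of Definition 3.2.

    scores_prot / scores_nonprot: model scores of the protected ('Female')
    and non-protected ('Male') classes, as numpy arrays.
    """
    F_prot = ECDF(scores_prot)
    F_nonprot = ECDF(scores_nonprot)
    lo = min(scores_prot.min(), scores_nonprot.min())
    hi = max(scores_prot.max(), scores_nonprot.max())
    x = np.linspace(lo, hi, grid_size)
    diff = F_prot(x) - F_nonprot(x)
    dx = (hi - lo) / (grid_size - 1)
    bias_pos = np.clip(diff, 0, None).sum() * dx   # region F_prot > F_nonprot
    bias_neg = np.clip(-diff, 0, None).sum() * dx  # region F_prot < F_nonprot
    return bias_pos, bias_neg
```

When one subpopulation's scores are a constant shift of the other's, the positive bias recovers the shift and the negative bias vanishes, matching the positively biased picture described above.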
Since 'marital-status' is the most impactful predictor, it merits further investigation into its effect on the model bias. To this end, we group the different values of 'marital-status' into three categories: M1 = 'never-married', M2 = 'married', and M3 = 'was-married'. Then, we segment the dataset S of classification scores into three subsets S_{M_i}, i ∈ {1, 2, 3}, that correspond to the aforementioned categories. To gain further understanding of how each of these categories contributes to the model bias, we compute the model bias on each segment. The negative model bias on each segment turns out to be zero, while the positive model biases are plotted in Figure 12b. The plot indicates that the category 'never-married' exhibits an insignificant level of bias, while there is substantial positive bias in 'married' and 'was-married'.
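The segmentation analysis can be sketched similarly; the helper names and the synthetic data used to exercise them are ours, with an artificial downward shift injected into the 'married' segment of the protected class for illustration.

```python
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

def positive_bias(scores_prot, scores_nonprot, grid_size=5_000):
    """Positive model bias: integral of (F_prot - F_nonprot)^+ over the score range."""
    F0, F1 = ECDF(scores_prot), ECDF(scores_nonprot)
    lo = min(scores_prot.min(), scores_nonprot.min())
    hi = max(scores_prot.max(), scores_nonprot.max())
    x = np.linspace(lo, hi, grid_size)
    return np.clip(F0(x) - F1(x), 0, None).sum() * (hi - lo) / (grid_size - 1)

def segment_biases(scores, gender, marital, categories):
    """Positive model bias on each marital-status segment of the score dataset."""
    out = {}
    for c in categories:
        m = marital == c
        out[c] = positive_bias(scores[m & (gender == 'Female')],
                               scores[m & (gender == 'Male')])
    return out
```

On synthetic scores where only the 'married' segment is biased against the protected class, the segment-level positive bias isolates that category, mirroring the pattern in Figure 12b.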
Given the above analysis, one can attempt to reduce the model bias either by applying the post-processing technique discussed in Section 5.2 or, alternatively, by retraining the model after dropping some of the biased predictors. We showcase the latter approach by dropping 'marital-status' and retraining the model with the same parameters. Checking the performance of the new model on the train and test sets, the mean log-loss is 0.358 and 0.363, respectively, and the AUC is 0.862 and 0.855, respectively. We then compute the model bias and bias explanations; see Figure 13. The positive model bias has been reduced to approximately 0.10, while the negative bias stays zero. The trade-off is a drop in performance, as seen from the metric values above. The bias explanations in the retrained model have slightly increased, since 'marital-status' was dropped and the importance of the remaining predictors increased.
We would like to point out that the technique used above might not lead to bias reduction in the presence of strong dependencies, since other predictors could serve as proxies for the dropped predictor. The post-processing technique outlined in Section 5.2, however, modifies the model score directly, so the dependencies do not play a significant role. Keep in mind that this technique is rather crude, and one may opt to employ the post-processing methods described in Miroshnikov et al. (2021b), which apply to numerical predictors but can be adjusted for categorical ones.

Conclusion
In this paper, we presented a novel bias interpretability framework for measuring and explaining the bias in classification and regression models at the level of a distribution, utilizing the Wasserstein metric and the theory of optimal mass transport. We introduced and theoretically characterized bias predictor attributions to the model bias and constructed additive bias explanations utilizing cooperative game theory. To our knowledge, bias interpretability methods at the level of a regressor distribution have not been addressed in the literature before. At a higher level, the model bias is a non-trivial superposition of predictor bias attributions. The bias explanations we introduced determine the contribution of a given predictor to the model bias. However, any two or more predictors will interact in the context of the bias explanations. For example, if one predictor favors the non-protected class and the other favors the protected class, it is possible that when both predictors are utilized by the model the total effect on the model bias is zero. This phenomenon opens up numerous avenues for future research into the interactions of predictors across subpopulation distributions in the context of bias explanations. This is where ML interpretability techniques can come into play and aid the study of predictor interactions in the model bias.
To make bias explanations additive, we utilized cooperative game theory, which led to additive Shapley-bias explanations. These explanations rely on the Shapley formula, which makes them computationally expensive. The intractability of such calculations can be mitigated by grouping predictors based on dependencies and then computing the Shapley-bias attributions for each group (via a quotient game), which reduces the dimensionality. However, if the number of groups is large, the issue of computational intensity remains. Thus, a possible research direction is to investigate methods that allow for the approximation of additive bias explanations and their fast computation.
In this paper, we formulated a methodology that computes the model bias and quantifies the contribution of predictors to that bias. An important application of the bias explanation methodology lies in bias mitigation, which will be useful in regulated settings such as the financial industry and may utilize information about the main drivers of the model bias. This will be investigated in our upcoming paper. The framework is generic and in principle can be applied to a wide range of predictive ML systems. For instance, it might be insightful to understand the predictor attributions to probabilistic differences between populations studied in physics, biology, medicine, economics, etc.

Proof. The lemma follows directly from the definition of W_1 and the fact that d is a norm.
Proof of Lemma 3.6. The proof follows from Lemma B.1 and Lemma B.2.
Proof of Theorem 3.3. Take any L > 0 and x* such that the supports of μ and ν are contained in B(x*, L/2; d). By the Kantorovich-Rubinstein duality theorem (Kantorovich, 1958; Dudley, 1976), the claim follows.

Proof. Suppose that f − g ∈ L^1(R). Since f and g are measurable, the set {(x, y) : g(x) < y < f(x)} is measurable with respect to the product measure λ_2 = λ ⊗ λ. Then by Tonelli's theorem we obtain (B.3). Conversely, suppose that λ_2({(x, y) : g(x) < y < f(x)}) < ∞. Following the calculations above in reverse order, we conclude that f − g ∈ L^1(R), and hence (B.3) holds. This proves (ii).
Figure 2: Classifier and quantile bias, and model bias for the model (M1). (a) Classifier and quantile biases. (b) Model bias based on W_1 and KS.

Theorem 3.1. The distance W_q satisfies: (a) W_q on P_q(R) is continuous with respect to the geometry of the distribution.

Figure 4: The value β^+_i represents the bias across the quantiles of the explainer E_i for which the predictor X_i favors the non-protected class G = 0, and β^-_i represents the bias across the quantiles for which X_i favors the protected class G = 1.

Combining the two predictors at the model level leads to bias offsetting. By construction, the resulting model bias is Bias_{W_1}(f|G) = 0. Figure 5 displays the CDFs of the trained score subpopulations f|G = k and the corresponding explainers E_i|G = k, which illustrates the offsetting phenomenon numerically.

Figure 10: Additive Shapley-bias explanations based on the game v^{bias,ME} for the model (M6).

Algorithm 1: Model bias and bias explanations
Data: Model f, dataset D = (X, G) with m samples, X ∈ R^n and G ∈ {0, 1}, and model explainers E_i.
Result: Model biases Bias^±_{W_1}(f|X, G) and predictor bias explanations {β^±_i}_{i=1}^n.
1 Partition the predictors in D according to G = k, k ∈ {0, 1}. This yields subsets D_{X,0}, D_{X,1}.
2 For each k ∈ {0, 1} evaluate f(x) on D_{X,k} to obtain the set of subpopulation model values S_k.
3 For each k ∈ {0, 1} compute the empirical CDF F̃_k of f(X)|G = k based upon S_k.
4 Compute Bias^±_{W_1}(f|X, G) = ∫ |F̃_0^{-1}(p) − F̃_1^{-1}(p)| dp over the sets P^±, with P^± as in Definition 3.2.
5 for i in {1, ..., n} do
6   For each k ∈ {0, 1} evaluate E_i(x) on D_{X,k} to obtain the set of subpopulation values S_{i,k}.
7   For each k ∈ {0, 1} compute the empirical CDF F̃_{i,k} of E_i(X)|G = k based upon S_{i,k}.
8   Compute β^±_i = ∫ |F̃_{i,0}^{-1}(p) − F̃_{i,1}^{-1}(p)| dp over the sets P^{i±}, with P^{i±} as in Definition 4.1.
9 end

(a) Feature importance. (b) Score subpopulation CDFs; bias ≈ 0.19.

Figure 11: Model training and protected attribute analysis.
Figure 12: (a) Bias explanations. (b) Model bias segmented by marital status.

Figure 13: Bias explanations for the re-trained model without the 'marital-status' predictor.