1 Background

Automated decision-making models are progressively being used in situations that affect humans in a broad range of areas, such as credit risk analysis [2], criminal recidivism prediction [3, 4], hiring [5] and the provision of social services [6]. For example, a bank may decide to extend credit based on whether a machine learning (ML) model predicts that an individual may default on a loan. Similarly, a judge may determine that a defendant should not be released while awaiting trial if an artificial intelligence (AI) model suggests that the defendant has a high risk of recommitting a crime. The growing prevalence of ML algorithms in decisions that affect humans is due, in part, to their perceived accuracy and ability to detect hidden patterns in data. Yet, in some cases, these models have been demonstrated to incorporate biases, such as in hiring decisions [7], face recognition [8] and even translation [9], resulting in concerns about the fairness of machine learning algorithms [10]. Despite these concerns, the use of automated decision-making is likely to increase as AI becomes more widespread in society, government and business. Therefore, there is a growing awareness that ML algorithms should be both accurate and fair, which is underscored by the recently released Artificial Intelligence Act—a legal framework promulgated by the European Commission that mandates non-discrimination, among other requirements, for ML models that affect individuals [11].

Although it is desirable for ML models to be both fair and accurate, there is often a trade-off between these two goals [3, 12, 13], such that increasing fairness comes at the cost of reduced accuracy. Accuracy and fairness are often at odds because both are affected by imbalanced data. In many cases, the data used to train ML algorithms is imbalanced with respect to class and protected features, such that one class or group is under-represented with respect to another. Since most ML models learn parameters from data, data imbalance can cause a particular class or sub-group to be over-weighted, such that preference is given to the over-represented class or group. Hashimoto et al. [14] refer to the under-representation of a protected group in training data as representation disparity: minority groups contribute less to a ML model's objective because they are under-represented in the training data, and hence model accuracy may be lower for those groups.

Summary. The algorithmic fairness domain focuses on combating bias in decision-making originating in protected features that could affect the objectivity of the decision. At the same time, the class imbalance domain focuses on countering bias originating from skewed class distributions, as majority classes may be preferred over minority ones during classifier training. We take a step toward bridging the gap between algorithmic discrimination and imbalanced learning by discussing the key concepts and metrics that underpin both areas. Because it is often not possible for a ML algorithm to meet multiple fairness criteria (e.g., individual and group non-discrimination) at the same time [15, 16], we focus on a single element—group fairness. We show that a common approach used in imbalanced learning—data oversampling—can be used to increase model fairness and accuracy.

Main contributions. This paper offers the following insights and contributions towards fair machine learning:

  • Bridging the Gap Between Fairness and Imbalanced Learning: We take a step toward bridging the gap between the algorithmic fairness and imbalanced learning fields by discussing commonalities and differences in the approaches that both fields use to overcome bias in machine learning.

  • Fair Oversampling: We propose a new data pre-processing technique, Fair Oversampling (FOS), that enhances fairness and classifier accuracy. Unlike existing fairness pre-processing methods, which seek to balance both class instances and protected groups, FOS numerically balances classes and de-biases protected features through feature blurring. Balancing the number of class instances improves prediction accuracy and blurring protected features improves group fairness metrics.

  • Fair Utility Metric: We propose a new metric that combines fairness with imbalanced learning—Fair Utility—that relies on measures commonly used in both fields.

Organization. The paper is organized as follows. It first discusses algorithmic fairness and its key concepts. Then, it reviews the central elements of imbalanced learning. Next, it introduces our algorithm and proposed fairness metric. Finally, the paper discusses experimental results, and the commonalities and differences between algorithmic fairness and imbalanced learning.

2 Related work

2.1 Algorithmic fairness

Discrimination can generally be defined as the prejudicial treatment of an individual based on membership in a legally protected group. Algorithmic fairness is concerned with ensuring that decisions made by machine learning models are equitable with respect to protected groups [17, 18]. Algorithmic fairness commonly relies on one of the following interventions to produce equitable results: (1) modifications to the training data, (2) changes in the machine learning model, or (3) modifications to the decisions themselves [19].

We concentrate on supervised learning in the context of binary classification. In binary classification, the goal of algorithmic fairness is to fairly select between two actions, \(a_0\) and \(a_1\) (e.g., approve or decline the extension of credit in banking). In our discussion, we adopt the notation used by Speicher et al. [20] to describe our algorithmic environment. Thus, a ML decision algorithm, A, can be described as a function \(A: \mathbb {R}^d \rightarrow \{0,1\}\) that maps a d-dimensional feature vector to a binary decision. The machine learning algorithm, A, parameterized by \(\theta \), accepts as input training data, D, minimizes a loss function \(l(\theta )\), and predicts a label (i.e., 0 or 1).

More formally, A accepts as input training data, \(D=\{(x_i,y_i)\}^n_{i=1}\), with n examples, where \(x_i \in X\) are the features and \(y_i \in Y\) is the label (\(Y = \{0,1\}\)) for each individual i. The features, X, can be either discrete or continuous. We partition the set of features (or attributes) into two groups: sensitive or protected features, such as gender, race or age, and unprotected features, such that \(x = (x_p, x_u)\). We also assume that protected features can be further partitioned into two categories, privileged and unprivileged (i.e., \(x_p = (x_{pr}, x_{up})\)). For purposes of this paper, we assume that the label contained in the dataset is the correct, unbiased label.

Narayanan described at least 21 mathematical definitions of fairness that have been proposed by the fairness research community [21]. Two broad classes of algorithmic equity have gained prominence: group and individual fairness. Group fairness requires that the ML algorithm, A, produces parity in a given metric, M, across protected groups, such that \(M_{x_p}(A) = M_{x_p'}(A)\) for all protected groups \(x_p, x_p'\). Individual fairness requires that similar individuals are treated similarly. It implies the presence of a similarity metric that is capable of determining whether a pair of individuals are similar.

Because of the challenges in finding suitable individual fairness similarity metrics, we focus on group fairness in this paper. Corbett-Davies et al. describe three central concepts embodied by group fairness: anti-classification, classification parity and calibration [22]. Anti-classification requires that AI algorithms do not consider protected features when making decisions [9, 23, 24]. Thus, anti-classification provides that \(A(x) = A(x')\) for all \(x, x'\) such that \(x_u = x_u'\). Classification parity (sometimes referred to as statistical parity) requires that certain measures are equal across sensitive features. Statistical parity can be expressed in a variety of ways. Under one formulation, the proportion of members in a protected group receiving a positive classification must be identical to the proportion in the population as a whole [25]. Other measures focus on the difference in positive or negative rates (instead of proportions) between sensitive groups (e.g., equal true positive rates for both male and female applicants). Classification parity has been widely used as a fairness metric in machine learning [26, 27, 28]. As discussed below, we use classification parity in our metrics. Demographic parity requires that the proportion of positive decisions be independent of the protected feature: \(Pr(A(X)=1 \mid x_p) = Pr(A(X)=1)\) [29]. Parity of false positives, in contrast, requires that \(Pr(A(X) = 1 \mid Y = 0, x_p) = Pr(A(X) = 1 \mid Y = 0)\). We also incorporate demographic parity into our metrics, although we focus on differences in true positive, false positive, and true negative rates, instead of their relative proportions.
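To make these parity definitions concrete, the following minimal Python sketch (our illustration, not part of the original method) computes the demographic-parity and false-positive-rate gaps defined above; the array names and toy values are hypothetical.

```python
import numpy as np

def parity_gaps(y_true, y_pred, protected):
    """Demographic-parity and false-positive-rate gaps between a
    privileged group (protected == 1) and the overall population."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    protected = np.asarray(protected)

    # Demographic parity: P(A(X)=1 | privileged) vs. P(A(X)=1).
    demographic_gap = y_pred[protected == 1].mean() - y_pred.mean()

    # Parity of false positives: P(A(X)=1 | Y=0, privileged) vs. P(A(X)=1 | Y=0).
    neg = y_true == 0
    fpr_gap = y_pred[neg & (protected == 1)].mean() - y_pred[neg].mean()

    return demographic_gap, fpr_gap

# Toy usage with hypothetical labels, predictions and a protected flag.
print(parity_gaps([0, 1, 0, 1, 0, 1, 0, 1],
                  [0, 1, 1, 1, 0, 0, 0, 1],
                  [1, 1, 1, 1, 0, 0, 0, 0]))
```

Gaps close to zero indicate that the corresponding parity criterion is approximately satisfied.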

In order to achieve group fairness in machine learning, a variety of techniques have been employed, which can be broadly separated into pre-processing, in-processing and post-processing methods. Pre-processing techniques manipulate the training data before it is consumed by a classification algorithm, in-processing incorporates fairness into the ML algorithm's loss function, and post-processing adjusts the decisions of a classifier to make them fair. We briefly survey below the key pre-processing techniques that are relevant to our approach.

Kamiran and Calders propose a pre-processing method, Reweighing, that creates weights for the training instances to ensure fairness [30]. They effectively divide the training set into four groups: (1) privileged group, majority class; (2) unprivileged group, majority class; (3) privileged group, minority class; and (4) unprivileged group, minority class. They then develop separate weights for each of the four groups and apply the weights to each instance. Similar to Kamiran and Calders, Li and Liu propose to reweight data samples to improve fairness by granularly modeling the influence of each training sample [31]. Feldman et al. propose a pre-processing method, Disparate Impact Remover, that modifies features to enhance group fairness while preserving rank-ordering within protected groups [29].
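A minimal sketch of the Reweighing idea, assuming the standard expected-over-observed frequency weights described by Kamiran and Calders; the function name and arguments are our own.

```python
import numpy as np

def reweighing_weights(y, protected):
    """Instance weights in the spirit of Kamiran and Calders' Reweighing:
    w(s, c) = expected frequency / observed frequency
            = (count(s) * count(c)) / (n * count(s, c))."""
    y, protected = np.asarray(y), np.asarray(protected)
    n = len(y)
    weights = np.zeros(n, dtype=float)
    for s in np.unique(protected):
        for c in np.unique(y):
            cell = (protected == s) & (y == c)
            if cell.any():  # guard against empty (group, class) cells
                expected = (protected == s).sum() * (y == c).sum() / n
                weights[cell] = expected / cell.sum()
    return weights
```

The resulting vector can be passed as `sample_weight` to classifiers that support instance weighting.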

2.2 Imbalanced learning

Imbalanced learning is concerned with disproportions among classes. In binary classification, the instances of one class (the majority) outnumber those of the other (the minority). The skewed distribution of examples in favor of the majority class can cause classifiers to be biased toward the majority because the algorithm's parameters are more heavily weighted toward more frequently occurring examples. Classifiers can achieve high accuracy by merely selecting the majority class. However, the minority class is often the more important one from the data inference perspective because it may carry more relevant information.

There are three broad approaches within imbalanced learning: data-level methods that modify the training data to balance class distributions, algorithm-level methods that ameliorate bias in classifiers towards the majority class, and ensemble methods that combine the first two with classifier committees.

Data-level approaches. This group of methods focuses on modifying the training set by balancing the number of minority and majority class examples. Oversampling generates new minority class examples, while under-sampling removes instances from the majority class. Under-sampling can remove important data from the training set and is therefore often not preferred. Simple random oversampling (ROS) merely duplicates instances of the minority class to impose parity. SMOTE, the Synthetic Minority Oversampling Technique [1], is a popular oversampling method used in the imbalanced learning community. It selects a random nearest neighbor of a minority instance and generates a synthetic example by linear interpolation between the original instance and that neighbor. SMOTE has been adapted to enhance the importance of class borderline instances [32], to define safe regions that do not sample from noisy or overlapping instances [33], and has been applied in the deep learning [34] and big data [35] contexts. Alternative approaches to SMOTE have been proposed recently that do not rely on k-nearest neighbors, instead using measures such as class potential [36], Mahalanobis distance [37], or manifold approximation [38].
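The core SMOTE interpolation step can be sketched in a few lines of Python (a simplified illustration with our own function name; in practice a library implementation such as imbalanced-learn's SMOTE is normally used):

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, rng=0):
    """Generate synthetic minority examples by interpolating between a
    random minority instance and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        j = rng.integers(len(X_min))
        dists = np.linalg.norm(X_min - X_min[j], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]   # skip the instance itself
        nb = X_min[rng.choice(neighbours)]
        # New point lies on the segment between the instance and its neighbour.
        synthetic[i] = X_min[j] + rng.random() * (nb - X_min[j])
    return synthetic
```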

Algorithm-level approaches. This group of methods modifies the training procedure of a classifier to make it skew-insensitive, or incorporates alternative cost functions. Cost-sensitive learning, which is a form of importance sampling [39], magnifies the importance of minority examples by increasing the penalty associated with misclassifying them. Recent examples of cost-sensitive methods that have been used in imbalanced learning include the focal loss [40], the class-balanced margin loss [41], the distribution-aware margin loss [42] and the asymmetric loss [43].
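As one concrete example of a cost-sensitive objective, here is a minimal NumPy sketch of the binary focal loss [40] in its alpha-balanced form; the parameter defaults below are common choices, not values used in this paper.

```python
import numpy as np

def focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: down-weights easy examples so that rare,
    hard (often minority-class) examples dominate the objective."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    p_t = np.where(y_true == 1, p_pred, 1 - p_pred)    # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)  # class weighting term
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```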

Ensemble approaches. Combining multiple classifiers is considered one of the most effective approaches in modern machine learning [44]. Ensembles find a natural application in learning from imbalanced data, as they leverage the predictive power of multiple learners. By combining base classifiers with data-level or algorithm-level solutions, they achieve locally specialized robustness and maintain diversity among ensemble members. The most popular solutions combine resampling with Bagging [45] or Boosting [46], use mutually complementary cost-sensitive learners [47], or rely on dynamic selection mechanisms to tackle locally difficult decision regions [48].

2.3 Imbalance within algorithmic fairness

Several papers have discussed the relationship between fairness and imbalance in machine learning. Shui et al. examine fairness with respect to group sufficiency where sub-groups have a limited number of instances [49]. They apply their method to natural language processing. Subramanian et al. observe that protected features may be associated with class labels, which may result in stereotyping in natural language processing [50]. They propose an approach that algorithmically reweights class instances and protected features so that the costs associated with features and classes are balanced. Yan et al. observe that traditional imbalanced learning methods, such as SMOTE, can actually increase group discrimination [51]. They use a variant, K-Means SMOTE [52], and clustering to remove class instances from the original dataset that are near decision boundaries, which improves fairness (under-sampling). Their method is called Fair Class Balancing (FCB). Ferrari and Bacciu state that class and protected feature bias are related because they are caused by data complexity, as well as class and feature imbalance [53]. They propose to modify the standard cross-entropy loss with an adaptive hyper-parameter that takes into account feature and class imbalance. Iosifidis and Ntoutsi state that one of the main reasons for bias in ML models is under-represented features [54]. They use SMOTE and feature generation to balance the number of protected features in training data. Chakraborty et al. attribute bias to data imbalance and improper class labeling [55]. They use SMOTE-based interpolation to equalize the number of class and protected feature instances, and then proactively remove data from the training set that is deemed biased (under-sampling). Their method works directly on categorical data. Wang et al. develop a method, Fair Streaming, to balance sub-groups in streaming data [56]. They also design multiple pseudo models in order to develop a baseline related to the trade-off between fairness and accuracy, which we address with a simple metric, Fair Utility (discussed below). Tarzanagh et al. balance sub-groups with a tri-level optimization framework that uses local predictors [57].

Although some of the above works incorporate SMOTE to address fairness, they do so to balance the number of class and protected feature instances, and sometimes follow this balancing step with under-sampling. The above papers generally attribute bias in algorithmic fairness to spurious associations between labels and features, or to numerical disparity in classes and features. Because several of the works attribute bias to under-representation of protected features, they seek to numerically balance them through data augmentation, under-sampling, or equalizing costs through loss functions.

In contrast, we introduce feature blurring and add a pre-processing step that converts categorical data into integers, which facilitates feature interpolation. We also do not attempt to numerically balance protected features and do not use under-sampling. Instead, we de-bias features by causing a classifier not to rely on a protected feature for prediction—thus enhancing fairness. Separately, we balance class instances, which improves prediction accuracy.

3 Why do we need to bridge algorithmic fairness and imbalanced learning?

Different views on bias. The previous sections provided general background and reviewed recent advancements in algorithmic fairness and imbalanced learning. This allows us to see the strong parallel between them: both deal with the problem of countering class and feature bias, albeit from different perspectives.

  • Bias according to algorithmic fairness. Here, bias is seen as a lack of fairness and transparency, originating from social background and the nature of the data itself. Fairness focuses on bias based on using sensitive or protected information (e.g., race or gender) to make a decision. Fairness-aware algorithms also focus on using safe information for training classifiers and debiasing them with respect to protected features.

  • Bias according to imbalanced learning. Class imbalance focuses on bias originating in disproportion among classes, as most machine learning algorithms will become biased towards classes with a higher number of training instances. This puts smaller, yet often more important, classes at a disadvantage. Imbalance-aware algorithms focus on either balancing class distributions or removing the bias towards majority classes from the training process.

Interaction of class and feature bias. Algorithmic fairness views the source of bias with respect to protected features (e.g., race, gender) of a class instance (e.g., a student denied admission to a university), while imbalanced learning views bias as arising from a numerical disproportion among class instances themselves. In both algorithmic fairness and imbalanced learning, bias can emerge from a machine learning model performing supervised classification (e.g., a support vector machine, logistic regression classifier). In this paper, we argue that model fairness can be enhanced by addressing both class and feature bias; however, each concern requires a different remedy. Class bias can be addressed through data augmentation by numerically balancing instances; whereas feature bias can be addressed by causing a model to discount protected features—thus forcing it to pay attention to other relevant (non-protected) features when rendering a decision.

4 Fair oversampling

Our algorithm, Fair Oversampling (FOS), is designed to improve fairness and increase classifier accuracy. When training a machine learning model to accurately predict classes, it is often necessary to equalize the number of training examples between classes to ensure that models based on parametric learning are able to balance weights between specific classes. If a classifier observes very few instances of a minority class, its parameters may be biased toward recognizing the dominant class. At the same time, FOS addresses fairness by debiasing protected features. It does this by effectively mixing samples between protected group members, which causes the classifier to become confused about a particular feature, thus forcing it to rely on other features for accurate prediction.

FOS modifies a training dataset D so that it can be input to the machine learning model. FOS acts on two types of independent variables (X) in the training data, protected features \(x_p\), and unprotected features \(x_{u}\). FOS incorporates SMOTE, which uses feature interpolation to create synthetic instances. SMOTE relies on features being expressed as real numbers; therefore, as a pre-processing step, we convert categorical features to integers, if they are present in D. The pseudocode for FOS is displayed in Algorithm 1.

Algorithm 1: Fair Over-Sampling (pseudocode)

FOS first determines the minority and majority classes (\(Y = \{0,1\}\) or \(Y = \{min,maj\}\)). It then subdivides the protected features \(x_p\) into two categories: privileged and unprivileged (\(x_p = (x_{pr}, x_{up})\)). This categorization results in four sub-groups: privileged majority (\(D_{prmaj}\)), unprivileged majority (\(D_{upmaj}\)), privileged minority (\(D_{prmin}\)) and unprivileged minority (\(D_{upmin}\)).

The objective of FOS is to restore balance between the classes through random oversampling and nearest neighbor metrics, using the mechanics of the SMOTE algorithm. FOS numerically balances the classes so that the number of examples (N) in the majority class (\(N_{maj}\)) equals the number of examples in the minority class (\(N_{min}\)), i.e., \(N_{maj} = N_{min}\). It first identifies the protected group \(x_p\) in the dataset D that requires the fewest synthetic samples to match its counterpart in the majority class (denoted \(D_1\)), and selects the K nearest neighbors of a random sample of \(D_1\) (e.g., the unprivileged minority group \(D_{upmin}\)). The number of random samples selected from this group equals the number of samples required to make it equal in size to the same group in the majority class. For \(D_1\), the samples are drawn from a single protected sub-group.

Next, the same oversampling procedure is repeated for the protected group \(x_p\) that requires the larger number of samples to obtain numerical equivalency, denoted \(D_2\), except that instead of drawing the nearest neighbors exclusively from the \(D_2\) pool, they are drawn from the entire minority class. This approach reduces bias because it blurs the difference between privileged minority (\(D_{prmin}\)) and unprivileged minority (\(D_{upmin}\)) group members, since a nearest neighbor is drawn from the entire minority class \(D_{min}\), which consists of both privileged and unprivileged members. FOS balances the number of class instances within a training dataset. It does not balance protected feature ratios.
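The following is a simplified Python sketch of the procedure described above, written from this description rather than from Algorithm 1 itself; the helper names, neighbour count, and the bookkeeping of the protected attribute as a separate array are our own simplifications (in the actual method the protected attribute is one of the interpolated feature columns, which is what produces the blurring).

```python
import numpy as np

def interpolate(base, pool, n_new, k=5, rng=None):
    """SMOTE-style interpolation between samples drawn from `base`
    and nearest neighbours drawn from `pool`."""
    rng = np.random.default_rng(rng)
    out = np.empty((n_new, base.shape[1]))
    for i in range(n_new):
        x = base[rng.integers(len(base))]
        nb_idx = np.argsort(np.linalg.norm(pool - x, axis=1))[1:k + 1]
        nb = pool[rng.choice(nb_idx)]
        out[i] = x + rng.random() * (nb - x)
    return out

def fair_oversample(X, y, prot, minority=1, rng=0):
    """Simplified FOS sketch: balance class counts; for the minority
    subgroup needing fewer synthetic samples (D1), draw neighbours from
    that subgroup only; for the other subgroup (D2), draw neighbours
    from the whole minority class (feature blurring)."""
    X, y, prot = np.asarray(X, float), np.asarray(y), np.asarray(prot)
    X_min = X[y == minority]
    prot_min = prot[y == minority]
    # Samples needed to match each minority subgroup to its majority counterpart.
    deficits = {g: max(((y != minority) & (prot == g)).sum()
                       - ((y == minority) & (prot == g)).sum(), 0)
                for g in (0, 1)}          # 0 = unprivileged, 1 = privileged
    d1, d2 = sorted(deficits, key=deficits.get)   # d1 needs fewer samples
    new_parts = []
    if deficits[d1]:
        base = X_min[prot_min == d1]
        new_parts.append((interpolate(base, base, deficits[d1], rng=rng), d1))
    if deficits[d2]:
        base = X_min[prot_min == d2]
        new_parts.append((interpolate(base, X_min, deficits[d2], rng=rng), d2))
    for X_new, g in new_parts:
        X = np.vstack([X, X_new])
        y = np.concatenate([y, np.full(len(X_new), minority)])
        prot = np.concatenate([prot, np.full(len(X_new), g)])  # bookkeeping only
    return X, y, prot
```

After the call, the minority class matches the majority class in size, while the protected-group ratio within each class is left unchanged.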

5 Fair utility

5.1 Background

For purposes of our experiments (see Sect. 6), group fairness metrics were selected that can be expressed in terms of the elements of a binary classification confusion matrix: True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and False Negative Rate (FNR). Metrics were chosen that are widely used in the fairness and imbalanced learning communities: Balanced Accuracy, Average Odds Difference (AOD), Absolute Average Odds Difference (AAO), Equal Opportunity Difference (EOD), and True Negative Rate Difference (TNRD) [55, 58, 59]. AOD is the average of the differences in the False Positive Rate and the True Positive Rate between privileged and unprivileged groups [60]. It can be expressed as \(\frac{1}{2} ((TPR_p - TPR_{up}) + (FPR_p - FPR_{up}))\), where \(TPR_p\) is the TPR of privileged instances, \(TPR_{up}\) is the TPR of unprivileged instances, \(FPR_p\) is the FPR of privileged instances, and \(FPR_{up}\) is the FPR of unprivileged instances. AAO is the same as AOD, except that the TPR and FPR differences are taken as absolute values. EOD is the difference between the True Positive Rates of the privileged and unprivileged groups [60], and can be expressed as \(TPR_p - TPR_{up}\). TNRD is \(TNR_p - TNR_{up}\), where \(TNR_p\) is the TNR of privileged instances, and \(TNR_{up}\) is the TNR of unprivileged instances.
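These group-wise rates and differences are straightforward to compute from predictions. The sketch below is our illustration of the four metrics as defined above; the argument names are illustrative.

```python
import numpy as np

def group_rates(y_true, y_pred, mask):
    """TPR, FPR, TNR for the subset selected by a boolean mask."""
    yt, yp = np.asarray(y_true)[mask], np.asarray(y_pred)[mask]
    tpr = (yp[yt == 1] == 1).mean()
    fpr = (yp[yt == 0] == 1).mean()
    tnr = (yp[yt == 0] == 0).mean()
    return tpr, fpr, tnr

def fairness_metrics(y_true, y_pred, privileged):
    """AOD, AAO, EOD and TNRD as defined in the text."""
    privileged = np.asarray(privileged).astype(bool)
    tpr_p, fpr_p, tnr_p = group_rates(y_true, y_pred, privileged)
    tpr_u, fpr_u, tnr_u = group_rates(y_true, y_pred, ~privileged)
    return {
        "AOD": 0.5 * ((tpr_p - tpr_u) + (fpr_p - fpr_u)),
        "AAO": 0.5 * (abs(tpr_p - tpr_u) + abs(fpr_p - fpr_u)),
        "EOD": tpr_p - tpr_u,
        "TNRD": tnr_p - tnr_u,
    }
```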

5.2 Proposed metric

In addition to the metrics discussed above, we propose a new metric, called Fair Utility. In developing this metric, we are inspired by Corbett-Davies et al. [16]. They characterize algorithmic fairness in terms of constrained optimization in the context of the COMPAS algorithm for determining whether defendants in Broward County, FL, who were awaiting trial, were too dangerous to be released. In their formulation, the objective of algorithmic fairness is to both maximize public safety and reduce racial disparities. We also view algorithmic fairness as a multi-objective optimization problem, where the goal is to maximize the accuracy of a classifier and reduce group inequality. We approach the optimization problem with a data pre-processing technique designed to balance class accuracy prediction with protected group equity. We are also inspired by Halevy, Norvig and Pereira, who postulated that, in machine learning, a large quantity of data is more important than a strong algorithm [61]. Fair Utility is balanced accuracy multiplied by the average of one minus the absolute TPR difference and one minus the absolute FPR difference: \(BA \times \frac{1}{2} \times ((1 - |TPRD|) + (1 - |FPRD|))\), where BA is balanced accuracy, TPRD is \(TPR_p - TPR_{up}\), and FPRD is \(FPR_p - FPR_{up}\) (since \(TNR = 1 - FPR\), \(|FPRD|\) equals \(|TNRD|\)). Utility involves maximizing the benefit of taking an action, compared with its costs. Here, we treat accuracy as equivalent to utility, which assumes that the class label assigned by the dataset is correct and does not contain inherent bias. The objective of Fair Utility is to combine accuracy and fairness into a single metric by incorporating balanced accuracy (which reflects the impact of class imbalance) with two fairness measures (the true positive and true negative rate differences) that track whether a classifier consistently accepts or rejects protected group members.
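A minimal sketch of the metric (our illustration; the argument names are our own):

```python
import numpy as np

def fair_utility(y_true, y_pred, privileged):
    """Fair Utility = BA * 0.5 * ((1 - |TPRD|) + (1 - |FPRD|))."""
    yt, yp = np.asarray(y_true), np.asarray(y_pred)
    priv = np.asarray(privileged).astype(bool)

    def rates(mask):
        tpr = (yp[mask & (yt == 1)] == 1).mean()
        fpr = (yp[mask & (yt == 0)] == 1).mean()
        return tpr, fpr

    tpr_p, fpr_p = rates(priv)
    tpr_u, fpr_u = rates(~priv)

    # Balanced accuracy over the whole sample: mean of TPR and TNR.
    ba = 0.5 * ((yp[yt == 1] == 1).mean() + (yp[yt == 0] == 0).mean())

    return ba * 0.5 * ((1 - abs(tpr_p - tpr_u)) + (1 - abs(fpr_p - fpr_u)))
```

The metric equals balanced accuracy when both group differences are zero and shrinks toward zero as either difference grows.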

6 Experiments

The following experiments were designed to answer our research questions (RQ):

  • RQ1: Does FOS improve both algorithmic fairness and robustness to class imbalance for popular standard classifiers?

  • RQ2: What is the FOS trade-off between fairness and skew-insensitive metrics when handling varying imbalance ratio levels?

  • RQ3: Does FOS reduce protected feature importance?

6.1 Datasets

Table 1 Description of the Datasets

Three popular datasets that are used by the fairness research community [62] are selected for testing: German Credit [63], Adult Census Income [64], and Compas Two-Year Recidivism [65]. The key statistics of each dataset are summarized in Table 1. All three datasets involve binary classification. The German Credit dataset allows a classifier to predict whether an individual should have a positive or negative credit rating. The Adult Census Income dataset is used to predict whether an individual earns more or less than $50K. The Compas dataset can be used to predict whether a defendant will commit a crime within a two-year period.

Table 2 Class and protected feature imbalance ratios for each dataset

As we can see from Table 2, all of the datasets exhibit both class and protected attribute (gender) imbalance. Compas shows the least amount of class imbalance, with a ratio of 1.22:1, while the German Credit and Adult Census datasets have class imbalance ratios ranging from approximately 2:1 to 3:1. All three datasets show greater protected attribute imbalance than class imbalance, with the ratios ranging from approximately 2:1 to 4:1. In the minority class, the maximum protected attribute imbalance ratios are even higher, ranging from 1.75:1 in German Credit to 5.61:1 in Adult Census.

6.2 Experimental design

Experiment 1: Oversampling for standard classifiers. First, modified training data produced by FOS was used as input to two standard machine learning classifiers: SVM and Logistic Regression (LG). The performance of the models was assessed using the metrics described in Sect. 5. The performance of our algorithm was compared against four benchmarks for each standard classifier: (1) a baseline (no modifications to the training dataset); (2) a popular imbalanced learning oversampling method—SMOTE [1]; and two pre-processing algorithms that are specifically designed to improve fairness—(3) Reweighing [19] and (4) Disparate Impact Remover [29]. The purpose of this experiment was to determine how FOS compares to other data pre-processing algorithms used in both the imbalanced learning and fairness research communities.

Experiment 2: Robustness to increasing imbalance ratios. Second, we assessed how the performance of a standard ML classifier was affected by increasing levels of class and protected group imbalance. For this test, SVM was used as the ML algorithm with varying degrees of imbalance. Instances were randomly removed from classes and protected groups to achieve the intended imbalance levels. The selected imbalance levels were \(I \in \{1, 1.2, 1.4, 1.6, 1.8, 2\}\) for German Credit, \(I \in \{1, 2, 2.5, 3, 3.5, 4\}\) for Compas, and \(I \in \{1, 2, 4, 6, 8, 10\}\) for Adult Census, where I is the factor by which the number of original protected group members was reduced. The reason for the different imbalance levels per dataset is that classifiers trained with the Reweighing and Disparate Impact Remover algorithms produced unstable results for datasets with a relatively small number of examples (i.e., German Credit and Compas), such that the classifiers predicted all labels to reside in a single class, causing the number of True Negatives to be zero and yielding "NaN" metrics. In contrast, both SMOTE and FOS were able to work at the \(I = 10\) level for all datasets. Therefore, the imbalance ratio scaling was adjusted so that all pre-processing algorithms could be assessed on all datasets.
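To make the imbalance factor I concrete, the sketch below shows one way to reduce a chosen (class, protected group) cell by a factor I; the exact sampling scheme used in the experiments may differ, and the function name and arguments are our own.

```python
import numpy as np

def reduce_group(X, y, prot, target_class, target_group, I, rng=0):
    """Keep roughly 1/I of the instances belonging to the given class and
    protected group; all other instances are left untouched."""
    rng = np.random.default_rng(rng)
    X, y, prot = np.asarray(X), np.asarray(y), np.asarray(prot)
    idx = np.where((y == target_class) & (prot == target_group))[0]
    kept = rng.choice(idx, size=max(1, int(len(idx) / I)), replace=False)
    keep = np.sort(np.concatenate([np.setdiff1d(np.arange(len(y)), idx), kept]))
    return X[keep], y[keep], prot[keep]
```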

Experiment 3: Impact of oversampling on feature importance. Third, we considered whether FOS caused a standard classifier to change the selection of the features that it used to formulate its decision boundary.

Setup. All experiments were performed using five-fold cross-validation. The reported results are averaged over the respective held-out validation sets. See Tables 3 and 4.
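A minimal sketch of this evaluation loop, assuming NumPy arrays as inputs; the classifier choice and the `preprocess`/`metric_fn` signatures (e.g., the `fair_oversample` and `fair_utility` sketches above) are illustrative, not the exact experimental code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cross_validated_score(X, y, prot, metric_fn, preprocess=None, seed=0):
    """Five-fold cross-validation: optionally preprocess each training fold
    (e.g., with an oversampler), fit a classifier, and average a metric
    over the held-out folds."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train, test in skf.split(X, y):
        X_tr, y_tr, p_tr = X[train], y[train], prot[train]
        if preprocess is not None:
            X_tr, y_tr, p_tr = preprocess(X_tr, y_tr, p_tr)
        clf = SVC().fit(X_tr, y_tr)
        scores.append(metric_fn(y[test], clf.predict(X[test]), prot[test]))
    return float(np.mean(scores))
```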

6.3 RQ1: oversampling for standard classifiers

Table 3 Results for SVM classifier

FOS displays strong performance with the standard classifiers, with clear improvements in fairness as measured by AOD, AAO, TNRD, and EOD, although it shows better results with SVM than with Logistic Regression. For the SVM classifier, it consistently outperforms the other algorithms in terms of average odds, absolute average odds, TNRD, and Fair Utility. See Table 3. It comes in a close second to SMOTE in terms of Balanced Accuracy and clearly outperforms SMOTE in terms of the fairness metrics. Although SMOTE displays strong balanced accuracy, it often does not produce fair results with respect to protected groups.

This is likely because SMOTE balances the class distribution, which improves the class false positive rate, but it is not designed to improve the false positive rate with respect to specific instance features. In terms of equal odds, FOS demonstrates significant reductions in unfairness compared to the baseline, with a first-place finish for Compas and second-place finishes for German Credit and Adult Census.

Table 4 Results for Logistic Regression classifier

Fig. 1: Impact of varying imbalance levels on Equal Odds (EO) and Fair Utility for an SVM classifier on three datasets. FOS shows high resilience to increasing imbalance levels

For Logistic Regression, FOS consistently produced the top Fair Utility results, with first-place finishes in terms of average odds and absolute average odds, and first-place misses of less than 0.0057 points. It also showed significant reductions in equal odds and TNRD when compared to baselines, with first- or second-place results. See Table 4.

In this experiment, FOS consistently improves both accuracy and fairness over the baselines. It also outperforms the other fairness pre-processing algorithms on a number of measures. Thus, this experiment shows that an oversampling technique adopted from imbalanced learning can achieve significant improvements in group fairness measures. It also shows the close relationship between class and protected group imbalance and fairness—by jointly improving class and protected group imbalance ratios, we can effect a substantial improvement in group fairness measures. These results also indicate that it is possible to increase both balanced accuracy and fairness simultaneously (RQ1 answered).

Fig. 2: Impact of increasing imbalance levels on the F1 and recall measures. For the Compas and Adult Census datasets, which experience relatively more class and protected group imbalance, FOS shows greater resilience

6.4 RQ2: robustness to increasing imbalance ratios

FOS performs at the top of the benchmark group in terms of Balanced Accuracy and Fair Utility under increasing levels of imbalance on all three datasets. See Fig. 1. However, at first glance, it does not appear to outperform the other methods in terms of discrimination mitigation at higher levels of imbalance. Upon closer inspection, we believe that the baseline and other algorithms appear more stable at higher imbalance ratios because their predictions focus on true positives at the expense of true negatives. This can be seen in the Adult Census and Compas datasets, which have higher levels of imbalance. In those cases, as depicted in Fig. 2, the precision ratios increased and the recall ratios decreased for most algorithms, except for FOS and SMOTE (RQ2 answered). As discussed in the Experiments section, it should also be remembered that the other pre-processing techniques initially failed at imbalance levels greater than 2 and 4 on the German Credit and Compas datasets, respectively.

Fig. 3: FOS generally increases the magnitude of all logistic regression model coefficients (except one), which is the opposite of weight regularization. Thus, it is able to improve sensitive feature and minority class generalization without weight regularization; notably, it markedly reduces the feature importance of gender (sex), which is the tested protected feature here

6.5 RQ3: impact of FOS on feature importance

Figure 3 displays the importance of each feature for logistic regression models. It compares feature importance for models trained with FOS and baseline imbalanced datasets. Feature importance is measured based on the absolute value of model weights. Because a LG model is shallow, there is a direct correspondence between features and weights. More important features have higher weight magnitudes. Since the model uses a single classification layer with summation, both negatively and positively signed weights are equally important; thus, we take the absolute value of the weights. The magnitudes are averaged across 5 cross-validation runs. In all cases, FOS changes the magnitude, and sometimes the sequence, of feature importance (model weights).
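A minimal sketch of this feature-importance measurement, assuming a NumPy feature matrix and binary labels; the solver settings are illustrative rather than the exact training configuration used in the experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def mean_abs_coefficients(X, y, seed=0):
    """Average absolute logistic regression coefficients across five
    cross-validation folds, used as a simple feature-importance proxy."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    coefs = []
    for train, _ in skf.split(X, y):
        clf = LogisticRegression(max_iter=1000).fit(X[train], y[train])
        coefs.append(np.abs(clf.coef_.ravel()))   # one weight per feature
    return np.mean(coefs, axis=0)
```

Comparing the output on the baseline and FOS-augmented training data reveals how the relative importance of the protected feature changes.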

Importantly, FOS increases the magnitude of almost every feature, except one—gender. Thus, FOS reduces the magnitude, or importance, of the protected feature (sex). By reducing the importance of the protected feature, it reduces model bias toward that attribute. (RQ3 answered.)

7 Conclusion

A key facet of reducing algorithmic discrimination is the simultaneous reduction of class and protected feature bias in training data. We showed that reducing data imbalance facilitates improvements in model accuracy and that debiasing protected features improves group fairness. We discussed the importance of bridging imbalanced learning and group fairness by showing how key concepts in these fields overlap, and proposed a novel oversampling algorithm, Fair Oversampling, that addresses both class and protected feature bias. We also took a step toward bridging the gap between fairness and imbalanced learning with a new metric, Fair Utility, that combines balanced accuracy with group fairness measures.