This section will propose an alternative approach that addresses the contextual complexity and provides a more precise proxy for what represents the “best” algorithm for the decision-maker. By assessing an algorithm’s effectiveness in achieving concrete objectives specific to the domain area, it provides information that cannot be derived from the fairness analysis alone. This section will demonstrate the methodology using the US mortgage lending data set.
Operationalisation of Variables
Two of a lender’s objectives that also intersect with the public interest are discussed: increasing financial inclusion and lowering the denial rate of black borrowers. An increase in access to credit represents growth in the lender’s market share and is also a global development goal, given its importance for the economy and for individuals’ upward mobility (Demirguc-Kunt and Klapper 2012; King and Levine 1993). This is affected by the accuracy of the algorithm in predicting default (i.e. potential revenue loss). The impact on racial minorities is important to consider in order to comply with regulatory requirements, to manage reputational and ethical risks, and to mitigate the racial bias embedded in the data.
Financial Inclusion
In order to estimate the impact of an algorithm on financial inclusion, the following assumptions are made:
Assumption 1
HMDA data represent the perfect model. In other words, the loans accepted by the HMDA lenders were repaid in full, and the loans denied would have defaulted. Note that in reality, we would not know whether those who are denied a loan would have defaulted; however, this information is not used in this analysis (see definition of negative impact on minorities in Sect. 4.1.2).
Assumption 2
There is one hypothetical lender. The HMDA data represent a multitude of US-based lenders, but this analysis is conducted at the lender level. Future work can assess the impact at the market level in a multi-player model.
Assumption 3
The lender has a capital limit of $1 billion and gives out loans through the following process (a minimal code sketch follows the list):
1. The lender uses an algorithm to predict whether or not the loan will default;
2. The lender sorts the loans by the highest probability of repayment; and
3. The lender accepts the loans with the most certain repayments until the capital limit has been reached.
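As a minimal sketch of this selection process (the use of NumPy and all variable names are illustrative assumptions, not taken from the original analysis):

```python
import numpy as np

def select_loans(repayment_prob, loan_amount, capital_limit=1_000_000_000):
    """Greedy selection under Assumption 3: accept loans in order of
    predicted repayment probability until the capital limit is reached."""
    order = np.argsort(-repayment_prob)         # most certain repayments first
    cumulative = np.cumsum(loan_amount[order])  # running total of committed capital
    return order[cumulative <= capital_limit]   # indices of accepted applications

# Illustrative usage with made-up predictions and loan amounts
probs = np.array([0.9, 0.4, 0.75, 0.6])
amounts = np.array([150_000, 200_000, 300_000, 150_000])
accepted = select_loans(probs, amounts, capital_limit=500_000)  # -> [0, 2]
```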
Assumption 4
The loans are either fully paid or will default, i.e. the lender gets the full amount back, or none. This simplifies the expected return calculation by ignoring the differential interest rates and partial payments.
Assumption 5
The lender aims to maximise the expected value of the loan, which is the accuracy of the algorithm × the loan amount. In other words, if the accuracy of the algorithm is 60% in predicting default, and the loan is for $1 million, then the expected value is $600,000 given there is a 60% chance of full repayment vs. default.
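Stated as a formula (an illustrative restatement of Assumptions 4 and 5, with A the loan amount and p the predicted repayment probability, here equated with the algorithm’s accuracy):

$$E[V] = p \times A, \qquad \text{e.g. } 0.6 \times \$1{,}000{,}000 = \$600{,}000.$$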
Assumption 6
There is no differentiation in terms, conditions, and interest rates between racial groups. Different rates can be used to price-discriminate, resulting in an unequal distribution of the benefits of financial inclusion. While this is an important consideration for future studies, this analysis will only consider aggregate-level financial inclusion.
With the above assumptions, financial inclusion can be roughly estimated with the expected value of the loans, which, when maximised, represents the lender’s ability to give out more loans. A total expected value of $600 million represents 4000 loans of $150,000 (the median loan value in the data set).
This is a rough definition from a single lender’s perspective. While the multi-faceted macroeconomic definition and measurement of financial inclusion have been contentious, the primary objective for building an inclusive financial system is to minimise the percentage of individuals involuntarily excluded from the market due to imperfect markets, including incomplete/imperfect information (Amidžić et al. 2014). The reduction of portfolio risk with more complete information on each loan’s credit-worthiness can be viewed as reducing this asymmetry and improving efficiency. Future work may revisit this definition to expand beyond the lender-level to a market-level and focus on the improved access specifically for low-income and high-risk applicants.
Negative Impact on Minorities
Moving away from what is fair, what is the comparative adverse impact on black mortgage applicants for each algorithm? This will be measured simply as the percentage of black applicants who were denied a loan under the algorithm, i.e. the number of denied black applicants divided by the total number of black applicants. Note this does not consider whether those who were denied a loan would have defaulted. Alternative objectives may be constructed from the fairness metrics, e.g. black-white outcome disparity (group fairness). To demonstrate this as a supplement to the fairness metrics, we will use this raw measure of impact on potential black borrowers.
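A minimal sketch of this metric (array names are hypothetical):

```python
import numpy as np

def black_denial_rate(denied, is_black):
    """Share of black applicants denied a loan under the algorithm:
    denied black applicants / all black applicants."""
    denied = np.asarray(denied, dtype=bool)      # True where the algorithm denies the loan
    is_black = np.asarray(is_black, dtype=bool)  # True where the applicant is black
    return denied[is_black].mean()

# Illustrative usage
print(black_denial_rate([True, False, True, False], [True, True, False, False]))  # 0.5
```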
Trade-off Analysis
With the variables operationalised as per above, Fig. 3 shows a chart of financial inclusion vs. negative impact on minorities. The baseline represents the outcome at 50% accuracy (random chance) in predicting default. The error bars around each expected value correspond to the standard deviation of the algorithm’s accuracy across tenfold cross-validation.
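A sketch of how the cross-validated accuracies and their spread can be obtained, assuming a scikit-learn workflow (the library choice, the placeholder data, and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder feature matrix and binary loan-outcome labels standing in
# for the prepared HMDA data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)

# Tenfold cross-validated accuracy; the fold-to-fold standard deviation
# gives the error bars around each expected value in Fig. 3.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10)
print(scores.mean(), scores.std())
```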
Note that the ordering of the expected value of the loans coincides with the algorithms’ accuracies. Some algorithms are better at predicting the loan outcome than others, yet the existing fairness metrics overlook the opportunities and customer benefit provided by a more accurate predictor, which needs to be considered in any evaluation of an algorithm. If the true relationship between default and the input features were linear, a linear model would return the best accuracy. However, given that defaulting on a loan is a complex phenomenon to model with an unknown combination of potential causal factors, it is reasonable to expect that its prediction will be better modelled by more complex (non-linear) algorithms.
Also note that overall (e.g. KNN, LR, RF), the negative impact on black applicants increases with higher algorithm accuracy. In other words, the increase in the aggregate benefit (financial inclusion) tends to be at the expense of the welfare of the minority and disadvantaged group. There is one notable exception: NB has a much higher denial rate of black applicants than RF, even though its accuracy is lower. This could suggest that NB may be overfitting to the proxies of race. This aligns with the finding of Fuster et al. using the HMDA 2011 data: in a lending system using random forest, the “losers” (applicants who would have received a loan under a logistic regression algorithm but no longer qualify) are predominantly from racial minority groups, especially black applicants (Fuster et al. 2017).
Random forest is better in absolute terms (in both financial inclusion and impact on minorities) than Naïve Bayes: its denial rate for black applicants is 10.7 percentage points lower. Given the median loan value is $150,000, moving from NB to RF results in 368 new median-value loans totalling $55.27 million. The decision is more ambiguous between CART and LR. While CART is more accurate and results in greater financial inclusion (equivalent to $15.6 million of loans, or 103 median-value loans), CART results in a 3.8 percentage point increase in denial rates for black loan applicants compared to LR.
This analysis reveals and quantifies the concrete stakes for the decision-maker, providing additional information beyond the absolute fairness tests. It would enable the lender to select the algorithm that best reflects its values and its risk appetite. Of course, not all competing objectives can be quantified; regulatory/legal requirements and the explainability of algorithms should also be considered. For example, RF may be deemed unacceptable due to the relative challenge of interpretability compared to LR. This gap may be narrowing, however, as important progress has been made in recent years in developing model-agnostic techniques to explain machine learning algorithms’ predictions in human-readable formats (Ribeiro et al. 2016; Wachter et al. 2017). While these methods are not without their challenges in bridging the gap between real-life and machine-learning objectives (Lipton 2018), they could be used to help interrogate the drivers of each individual prediction regardless of model complexity.
One of the benefits of this methodology is its flexibility. The axes can be adapted to the domain area and the decision-maker’s interests. In other use cases, e.g. hiring or insurance pricing, the trade-off curve would look very different. Other algorithms—whether a rules-based process or a stochastic model—can also be mapped on this trade-off chart.
Proxies of Race
Why would some algorithms affect black applicants more than others? Fuster et al. have shown that some machine learning algorithms are able to triangulate race from other variables and features in the model, reinforcing this racial bias (Fuster et al. 2017). This phenomenon is termed “proxy discrimination” (Datta et al. 2017).
When a logistic regression algorithm is run to predict race instead of loan outcome with the same set of features, all features are statistically significant in their associations with race except for: FSA/RHS-guaranteed loan type, owner occupancy status, and property type. The following features are associated with the applicant’s race being black: having a lower income and a lower loan amount, and being from a census tract area with a low median family income, a smaller number of 1–4 family units and owner-occupied units, a larger population size, and a lower tract-to-MSA/MD income. Of the categorical features, one with an especially high regression coefficient is the FHA-insured loan type.
With all other categorical features set at the baseline and given the data set’s median values for each of the continuous features, the probability that the applicant is black is 36.16% if the loan type is FHA-insured and 19.38% otherwise, a 16.78 percentage point difference. This supports the finding from the paired audit study (Ross et al. 2008) that black applicants tend to be referred to the more expensive FHA-insured loans.
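As an illustration of how these probabilities follow from the fitted logistic model (the coefficient values themselves are not reproduced here), with σ(z) = 1/(1 + e^(−z)), β_k the fitted coefficients for the continuous features, and x̄_k their median values:

$$
P(\text{black} \mid \text{FHA}) = \sigma\Big(\beta_0 + \beta_{\text{FHA}} + \sum_k \beta_k \bar{x}_k\Big) \approx 0.3616,
\qquad
P(\text{black} \mid \text{non-FHA}) = \sigma\Big(\beta_0 + \sum_k \beta_k \bar{x}_k\Big) \approx 0.1938.
$$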
Loan type is also statistically significant in a logistic regression to predict loan outcome, with FHA-insured loan type being negatively associated with loan approval probability. Therefore, having an FHA-insured loan type is both a proxy for risk (given the higher cost) and a proxy for race. Given the statistical significance of these features, it is clear that the applicant’s race can be inferred to a certain extent from their combination.
This discredits the existing “fairness through unawareness” approach of lenders (Ladd 1998), which attempts to argue that non-inclusion of protected features implies fair treatment. Some have suggested removing the proxies as well as the protected characteristics (Datta et al. 2017). The attempt to “repair” the proxies through pre-processing the data to remove the racial bias has been shown to be impractical and ineffective when the predictors are correlated to the protected characteristic; even strong covariates are often legitimate factors for decisions (Corbett-Davies and Goel 2018). There is no simple mathematical solution to unfairness; the proxy dependencies must be addressed on the systemic level (Ramadan et al. 2018). In the case of mortgages, lenders must review their policy on why an FHA-insured loan type may be suggested over others and how this may affect the racial distribution of loan types.
In addition, recall that one of the key limitations of both ex post and ex ante fairness metrics is their inability to account for discrimination embedded in the data (Gajane 2017). Given that the proxy for risk cannot be separated from the proxies for race, individual fairness metrics would be challenging to set up. With the evidence of past discrimination embedded in the data through the marketing, paired audit, and regression studies, mortgage lending fairness also cannot be evaluated through a simple disparity in error rates. A perfect counterfactual would be desirable: the true underlying causal directed acyclic graph would disentangle the complex dependencies between covariates. However, this is challenging to achieve in predicting default. For this use case, the trade-off-driven approach is more appropriate than the fairness metrics.
Triangulation of Applicant’s Race
If race can, in fact, be triangulated through the remaining loan and borrower characteristics, this implies: (1) the inclusion of race in the algorithm would likely not make a difference in the algorithm’s accuracy, and (2) the extent to which the accuracy changes with the inclusion of race depends on the algorithm’s ability to triangulate this information. To test this, we included race in the features to predict loan outcome and re-ran each of the algorithms. With δij denoting the accuracy of algorithm j on cross-validation sample i with race included, the changes due to the inclusion of race are plotted in Fig. 4.
Two-sample paired t-tests were run on the accuracies before and after the inclusion of race, given that they come from the same cross-validation samples with an added predictor (a code sketch of this test follows the results below). The results are below, with * next to those that are statistically significant at the 5% level:
- LR: t-statistic = 1.71, p-value = 0.12
- KNN: t-statistic = 0.56, p-value = 0.59
- CART: t-statistic = 2.41, p-value = 0.040*
- NB: t-statistic = −6.22, p-value = 0.00015*
- RF: t-statistic = 4.04, p-value = 0.0029*
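A minimal sketch of the paired test, assuming the fold-level accuracies are available as arrays and using scipy (the placeholder accuracy values below are illustrative, not the study’s figures):

```python
import numpy as np
from scipy.stats import ttest_rel

# Tenfold cross-validation accuracies for one algorithm, without and with
# race as a predictor (placeholder values for illustration only).
acc_without_race = np.array([0.71, 0.70, 0.72, 0.69, 0.71, 0.70, 0.73, 0.70, 0.71, 0.72])
acc_with_race    = np.array([0.71, 0.71, 0.72, 0.70, 0.71, 0.71, 0.73, 0.70, 0.72, 0.72])

# Paired t-test: the two sets of accuracies are matched fold by fold,
# so ttest_rel is used rather than an independent two-sample test.
t_stat, p_value = ttest_rel(acc_with_race, acc_without_race)
print(t_stat, p_value)
```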
There is minimal difference in most algorithms’ accuracies. NB’s and KNN’s accuracies are, in fact, worsened by the inclusion of race, but only NB’s change is statistically significant. The inclusion of race in CART, LR, and RF positively affects the accuracy in predicting the loan outcome, with the results for CART and RF being statistically significant. NB’s results may be unstable given that its assumption that the predictor variables are independent is violated, and including race in the feature set may result in the algorithm overfitting to race. RF is better at handling feature dependencies and avoiding overfitting through its bootstrapping methods (Kotsiantis et al. 2007), and its robustness to redundant information may explain this result.
If, in fact, racial information is embedded in the other features, can algorithms predict race? Each algorithm was run to predict race based on the given features, rather than loan outcome. Given the imbalance in race, the performance of the algorithms is best evaluated not by accuracy metrics but by the Receiver Operating Characteristic (ROC) curve and its Area Under the Curve (AUC), which measure the true positive rate and the false positive rate of the algorithms. The ROC curves are plotted in Fig. 5. The ideal ROC curve hugs the upper left corner of the chart, with an AUC close to 1. The red dotted line can be interpreted as random chance. In this particular data sample, it appears that RF, KNN, and NB have relatively higher AUC than LR and CART, showing that they are better performers in predicting race with the given set of features. Further study is required to understand what types of mathematical models are better able to predict protected characteristics and the corresponding impact on the results.
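A sketch of how the ROC curve and AUC can be computed with scikit-learn (the library, the placeholder data, and the variable names are illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder features and a binary race label (1 = black) standing in
# for the prepared HMDA data.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
race = rng.integers(0, 2, size=2000)

X_train, X_test, r_train, r_test = train_test_split(X, race, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, r_train)
scores = clf.predict_proba(X_test)[:, 1]      # predicted probability of being black

fpr, tpr, _ = roc_curve(r_test, scores)       # points on the ROC curve (Fig. 5)
print(roc_auc_score(r_test, scores))          # area under the curve
```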
Given these outcomes, it is reasonable that the trade-off curve does not shift much with the inclusion of race. Figure 6 shows the impact of adding race on the trade-off curve. While it does tend to increase the denial rates of black applicants, the difference is small compared to the impact of algorithm selection.
The applicant’s race can be triangulated from the remaining features, highlighting the importance of algorithm selection, which has a greater impact on the trade-off between objectives than the inclusion of race as a predictor. The indicative results show that some algorithms are better at triangulating race than others, which should be explored in future studies to help inform the role of model type on the trade-offs.
In summary, the trade-off can be set up by: (1) defining and quantifying the real-world objectives into measurements, (2) building algorithms and computing relevant metrics, (3) identifying potential proxies of protected characteristics to interrogate whether their inclusion in the model is justifiable, and (4) selecting the algorithm that best reflects the prioritisation of competing objectives.
Overall, the trade-off analysis reveals the concrete and measurable impact of selecting an algorithm in relation to the alternatives. Given the ability of algorithms to triangulate race from other features, the current standard approach of excluding race is shown to be ineffective. The entanglement of proxies for race and proxies for loan outcome, as demonstrated through the analysis of FHA-insured loan types, challenges the assumptions of the existing absolute fairness metrics. When there are multiple competing objectives, this approach can provide actionable information to the lender on which algorithm best meets them.
Limitations and Future Work
This article aimed to demonstrate the trade-off methodology within the scope of one case study: racial bias in US mortgage lending. The analysis was limited by the data set, which did not provide the full set of information required to mimic the decision-making process of a lender. With additional information on default outcomes, terms and conditions of the loans, credit history, profitability of the loans, etc., a more empirically meaningful assessment of the algorithms and their effectiveness in meeting the lender’s goals would be possible. Despite the incomplete data, the methodology demonstrated that it is possible to expose an interpretable and domain-specific outcome in selecting a decision-making model. The assumptions made in the operationalisation of variables to compensate for the missing information are mutable and removable.
This paper only begins to unravel the possibilities of the relativistic trade-off technique, in contrast to the existing approaches to fairness in the literature. Some of the areas for future exploration include:
The change in trade-off curve based on different joint distributions between race and default: It would be interesting to visualise the changes in the trade-offs depending on the amount of outcome disparity between black and white borrowers in the data set. The degree to which the increase in aggregate benefit is at the expense of the protected group is likely related to the joint distribution between the predicted outcome and race.
Addition of other algorithm types: It is important to better understand what mathematical set-up leads an algorithm to be more affected by bias in the data. In addition, algorithms that are post-processed to calibrate the predictions, while critiqued in their appropriateness and efficacy (Corbett-Davies and Goel 2018), can be added to the trade-off analysis to examine how they impact the denial rates and financial inclusion.
Generalisation to other domain areas, e.g. pricing, hiring, and criminal recidivism: Depending on the competing objectives in other domain areas, the variables in the trade-off analysis may change. For example, two of the objectives in a hiring algorithm may be an increase in diversity (i.e. the percentage of minority groups hired) and an increase in the performance metrics of the team. While the technique may be generalisable to other case studies, it would be useful to identify the nuances in the differences in underlying data sets, ethical considerations, and legal and regulatory precedents.
Generalisation to other local markets: While this analysis is limited to the US context, further work should contextualise the methodology to local regional contexts, including any documented history of discriminatory practices against legally protected groups and competing priorities in the market.
Multi-player model: Only one lender’s perspective was considered; regulators and policy-makers would be interested in a market-level analysis in which all the lenders’ decisions are aggregated. While it was assumed that the lender has the sole authority to decide which algorithm is the most appropriate, this may be limited by the perspectives of customers and of the market regulators. To understand the policy implications, multiple stakeholders’ objectives must be considered.