1 Introduction

Credit is the lifeblood of the economy: its efficient provision is critical to economic growth and job creation. Regarding credit to families, avoiding discrimination in credit provision, especially in mortgages, has long been a public policy objective.

Banks and financial institutions are a particular type of business because they intermediate money deposited by people and other companies. As with any business, profit is their goal, and they invest the money deposited with them. Central banks restrict their behavior to prevent overly risky operations that could lead to their liquidation and affect the entire economy. Lending money to people with a high probability of default increases the managers’ liability in case of bank failure.

The approval of a loan depends on various borrower characteristics that reflect the ability and willingness to repay the debt. One of the essential characteristics considered is the applicant’s credit history. Unfortunately, this information is not always available: immigrants, students, and young professionals, for example, take time to build a credit history. Moreover, most poor people are invisible to banks. Dealing with this lack of information has been challenging for banks. Fintechs are addressing the issue by including other types of information, such as applicants’ behavior on social media and telecom payment records.

A credit analysis determines the risk rating to assign to a loan applicant. Society accepts that several applicant characteristics may legitimately differentiate applicants. On the other hand, discriminating on the basis of race, sex, sexual orientation, age, disability, religion, or marital status is a crime that must be identified and punished. For example, in 2019, Wells Fargo Bank wrote the City of Philadelphia a check for $10 million to settle a lawsuit alleging that the bank engaged in discriminatory lending practices.Footnote 1

Empirically identifying discrimination, however, is no easy task. Machine learning systems have been broadly used to suggest or decide upon loan approval. These systems learn from datasets that record lenders’ decisions on past loan applications. Consequently, they consolidate any existing discriminatory behavior under a veil of precision and accuracy.

Fighting discrimination has become a priority worldwide. In the US, for instance, in addition to laws aimed at preventing discrimination by ethnicity, gender, age, and religion, there are specific laws for the financial sector, such as the ECOA (Equal Credit Opportunity Act) and the FHA (Fair Housing Act), to prevent prejudice towards minorities. According to Barocas and Selbst (2016), these laws establish two legal doctrines:

  • Disparate treatment: the decision explicitly takes group membership into account (directly or indirectly), and

  • Disparate impact: decision outcomes disproportionately hurt (or benefit) individuals of certain groups or with certain sensitive attribute values.

The objective of this paper is to present a systematic literature review of current research dealing with identifying, preventing, and mitigating discrimination in the credit domain. We analyzed papers from 2017 to 2022. We raised three research questions:

  1. What were the research settings?

    • What was the discrimination theory grounding the research?

    • What was (if any) the legal framework grounding the research?

    • What was the research perspective (domain area) when addressing the algorithmic discrimination topic?

  2. Which issues were addressed?

    • What were the specific research topics addressed?

    • What was the research contribution?

  3. Which open questions still need to be addressed?

We reviewed a set of 78 papers on algorithmic discrimination in the credit domain that either formalize the concept or present methods for identifying, preventing, and mitigating discrimination. Although some papers refer to justice, the authors meant equality and, at most, equity. Equality, equity, and justice are three different concepts guiding the papers. As illustrated in Fig. 1, equality implies providing people with the same resources. Equity means providing people with the amount of resources each one needs to achieve their goal. Justice refers to providing people with means so that they will all have the same opportunity to achieve their goals. While equality and equity may be addressed through fair algorithms and methods, justice requires an outside agent, such as an affirmative action law.

Imagine a country in which the government dedicates most of its education funds to subsidizing public universities, aiming to provide high-quality college education. Since resources are limited, federal universities can admit only a limited number of students. Assume further that, to enter these universities, one has to be well ranked in a national exam. This policy tends to increase social inequality, giving better chances to well-off individuals who attended good and expensive high schools. An example of a policy measure to promote equality (in a partial equilibrium context) would be to increase the federal universities’ enrollment so that there is a place for all students (very expensive). Examples of strategies to promote equity could be implementing an affirmative action program, such as a quota system favoring economically challenged students, or granting scholarships to poor students who get into private universities. Finally, an example of a strategy to promote justice could be to allocate funds to fundamental and high school education so that all students at the college entrance level would have a similar opportunity to enter federal universities. While justice requires policymakers to act, fair algorithms may lead to equity.

Fig. 1
figure 1

Inequality vs equality vs equity vs justice: a visual explanation

The remainder of the paper is organized as follows. Section 2 presents background knowledge for understanding discrimination in the context of the credit domain, the possible sources of discrimination, and the different meanings attributed to fairness in the literature. Section 3 presents the research method followed in this systematic literature review. Section 4 presents the analysis of the overall set of papers, highlighting the research findings and the evidence addressing the research questions. Section 5 discusses issues that have not yet been sufficiently researched, the challenges involved, and a few concluding remarks.

2 Background

2.1 The effects of discrimination in the credit domain

This section discusses the possible outcomes of a loan application, in which disparate impact and disparate treatment play important roles.

2.1.1 Access discrimination

The most important outcome of a loan application analysis is the approval or rejection of the loan. Rejecting a loan application means the applicant cannot access the credit line service. The credit analysis is based on the applicant’s assigned risk of default and the maximum risk threshold defined by the bank.

Applicants are allowed to question the reasons for the rejection. In general, the applicant’s credit score is the most important reason. Nevertheless, the credit scoring technology is usually proprietary, with the methodology not publicly available.

The credit score is used to determine whether the loan is granted. It is also crucial for deciding the payment conditions, such as the interest rate, the maximum number of installments, and the collateral requirement. Pricing, as well as the denial of credit, may constitute a form of discrimination.
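To make the decision rule concrete, the sketch below shows a minimal threshold-based approval and pricing rule. It is our illustration, not taken from any reviewed paper; the risk threshold, score scale, and rate schedule are hypothetical.

```python
# Minimal sketch of a threshold-based credit decision (illustrative values only).
from dataclasses import dataclass

@dataclass
class Applicant:
    predicted_default_risk: float  # estimated probability of default in [0, 1]

MAX_RISK = 0.20  # hypothetical maximum risk the bank is willing to accept

def decide(applicant: Applicant) -> dict:
    """Approve if the assigned default risk is below the bank's threshold;
    if approved, price the loan (interest rate) as a function of the same risk."""
    approved = applicant.predicted_default_risk <= MAX_RISK
    # Hypothetical pricing rule: a base rate plus a risk premium.
    rate = 0.05 + 0.5 * applicant.predicted_default_risk if approved else None
    return {"approved": approved, "annual_rate": rate}

print(decide(Applicant(predicted_default_risk=0.08)))  # approved and priced
print(decide(Applicant(predicted_default_risk=0.35)))  # rejected
```

Both the access decision (the threshold) and the pricing rule depend on the same assigned risk, which is why both can become channels for discrimination.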

2.1.2 Price discrimination

Price discrimination in credit markets has long been associated chiefly with strategies that harm minorities or specific groups (Ladd 1998), across different types of credit, such as auto loans (Charles et al. 2008), business loans (Alesina et al. 2013), and mortgages (Bartlett et al. 2022). Nevertheless, price discrimination has also been a successful selling strategy that is not necessarily unethical. Under Stigler’s definition (Stigler 1987), price discrimination happens when similar goods are sold at different prices although produced at similar marginal costs. There are three types of price discrimination, depending on the amount of information available to sellers to infer the value customers assign to the product.

First-degree price discrimination, also called perfect discrimination, happens when sellers have perfect information on how each customer values the product and can therefore charge each one the maximum price. This strategy is challenging because such detailed information is difficult to obtain, and when it does become available to sellers, it may anger customers, especially when sellers gather information from users without their explicit consent. In 2000, Amazon experimented with selling the same DVD at very different discounts for different users, using data on each customer’s previous purchases and navigation patterns to set the discounts (Streltfeld 2020). The strategy was discovered and made the news; customers were angry, and Amazon had to apologize and discontinue it.

In the credit domain, first-degree price discrimination refers to offering different payment conditions depending on the bank’s assessment of the likelihood of repayment. For example, a lower interest rate can be offered when payment is automatically deducted from the client’s salary (possible in some countries).

Second-degree price discrimination refers to charging different prices according to the amount of the product being sold, such as discounts for larger purchases or rewards for the next purchase. This strategy differentiates the product price, increases revenues, and is well accepted by customers.

In the credit domain, second-degree pricing discrimination refers to offering different payment conditions depending on the client’s relationship with the bank (number and volume of bank products consumed).

Third-degree price discrimination refers to charging customers differently, for similar products, according to the group they belong to, inferred from group attributes such as location, age, sex, and economic status. There are acceptable business examples of this type of price discrimination, such as software pricing that depends on whether it will be used for educational or professional purposes, or senior discounts in theaters. Furthermore, there are examples in which price discrimination aims at better overall welfare, such as the example presented in Elegido (2011), transcribed below:

“A young doctor in a developing country is looking for ways to establish a medical practice in the rural community where she was born, but cannot find a way to make the practice economically viable. She can see, on average, 400 patients per month. So, to cover her costs of $4000 per month (which includes her modest salary), she should charge, on average, at least $10 per visit. However, most people in her community can afford to pay, at most, $5 per visit. An economist friend suggests that she charges 90 percent of her patients only $5 per visit but charges $55 per visit to the 10 percent of her patients who can afford to pay this amount. This way, she could cover all her costs, and the rural practice would be viable. Of course, poor patients like this solution. The rich patients also like it: they would rather pay $55 per visit than travel by bad roads to the nearest hospital 50 km away, and they also like the bonus of having a doctor close at hand in case of an emergency. The doctor also is happy: this solution would allow her to practice medicine in her own community.”
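The arithmetic behind Elegido’s example can be checked directly; the short sketch below simply reproduces the numbers quoted in the passage.

```python
# Revenue check for the third-degree price discrimination example above.
patients_per_month = 400
monthly_costs = 4000  # dollars, including the doctor's salary

uniform_price_needed = monthly_costs / patients_per_month        # $10 per visit
revenue_differentiated = 0.9 * patients_per_month * 5 \
                       + 0.1 * patients_per_month * 55           # 1800 + 2200

print(uniform_price_needed)    # 10.0
print(revenue_differentiated)  # 4000.0 -> exactly covers the monthly costs
```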

Although there are positive examples, third-degree price discrimination may harm minorities and increase structural social inequalities. In the credit domain, third-degree price discrimination refers to offering different payment conditions to clients depending on attributes that reflect the client’s group, such as race, gender, age, marital status, address, and education level. The following section will address the sources of algorithmic discrimination.

2.2 Sources of algorithmic discrimination

Any software, including machine learning software, goes through a development lifecycle that starts with identifying the needs and specifying the requirements and continues through the deployment and maintenance phases. There are many different development lifecycle models, including waterfall, spiral, unified, and rapid application development, as well as the currently fashionable agile approaches, such as extreme programming and scrum (Ruparelia 2010). Discrimination may be introduced in any software development phase, as illustrated in Fig. 2 and summarized in Table 1.

Fig. 2
figure 2

Sources of algorithmic discrimination

Table 1 Source of discrimination considering the system development phase

Discrimination can be triggered in the software specification phase when owners and companies define goals that purposely exclude groups. For example, many companies selling games require registered users to be older than 18.Footnote 2 Another example is denying credit to people living in certain zip codes (Alliance 2014; Atkins et al. 2022). While the first kind of discrimination is acceptable, and sometimes even required by law, the latter is illegal. In general, corporate decisions are registered in datasets comprising the know-how of the business.

Discrimination may also occur during the data processing phase, in which developers may introduce bias via the data acquisition and sampling method, the appropriateness of the labeling of the training data, and the dataset representativeness (Cai et al. 2020). Developers should first understand the dataset and the risks involved in using it. For example, an American health insurance company, aiming to start a preventive health care program, used the number of medical appointments per year as a proxy for health fragility. Because of this procedure, it mistakenly offered preventive treatment primarily to white patients. Blacks in that area were poorer and avoided medical visits because of deductibles and the difficulty of leaving work for medical appointments. This choice of data representation harmed black patients who were equally subject to the conditions considered (Obermeyer et al. 2019).

The model development phase (training phase) is also a source of discrimination. The selection of the technique, the blind pursuit of accuracy during training, the parameter configuration, and the sampling methods used to deal with unbalanced datasets are all sources of unintended discrimination (Schoeffer et al. 2021).

Last but not least, discrimination can be triggered by the usage of the computer systems’ results. Decision-making can be automated, letting the computer act on its results, such as ordering milk because it predicts from the owners’ behavior that they will soon run out (Aztiria et al. 2010), or in autonomous driving systems (Dikmen and Burns 2016). Discrimination in the latter case can be a matter of life and death, since the car’s image recognition might not be tuned to recognize black pedestrians crossing the street well (Gogoll and Müller 2017). Furthermore, results can also be presented to humans, keeping them in the decision-making loop or even in control, auditing results and requiring explanations for a better understanding of the results. Algorithmic fairness is the subject of the next section.

2.3 Algorithmic fairness concepts

The meanings of the words commonly used to represent fairness (equality, equity, and justice) were introduced in Section 1; this section focuses on how fairness is defined and measured algorithmically.

According to Kassam and Marino (2021), there is no unique meaning for algorithmic fairness. Table 2 describes different metrics for fairness, reflecting the different meanings, their limitations, and the kind of discrimination they address. Figure 3 illustrates the differences in the statistical meaning of the different equity concepts using the Receiver Operating Characteristic (ROC) curves of a classifier for two groups. Each ROC curve reflects the classifier’s performance in classifying individuals of a group as belonging to a class or not, such as the class of “good payers” in the credit domain, and is built from the confusion matrix. A true positive means the loan was approved for a creditworthy person; a false positive means the loan was approved for a defaulter. The area under the ROC curve (AUC) reflects the quality of the classifier; a purely random classifier would produce a diagonal straight line.
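As a concrete illustration of how such per-group ROC curves can be produced, the sketch below uses scikit-learn on small hypothetical arrays of repayment labels and model scores; the data and group labels are ours, not drawn from the reviewed papers.

```python
# Minimal sketch: per-group ROC curves and AUC for a credit-scoring classifier.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical data: 1 = repaid ("good payer"), 0 = defaulted; scores in [0, 1].
y_true  = np.array([1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([.9, .3, .8, .6, .4, .7, .2, .95, .5, .65, .55, .35])
group   = np.array(["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"])

for g in np.unique(group):
    mask = group == g
    fpr, tpr, thresholds = roc_curve(y_true[mask], y_score[mask])
    auc = roc_auc_score(y_true[mask], y_score[mask])
    print(f"group {g}: AUC = {auc:.2f}")
    # The (fpr, tpr) pairs trace the group's ROC curve, as in Fig. 3.
```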

Table 2 Fairness definitions and limitations
Fig. 3
figure 3

Different notions of equity explained via ROC curve behavior

Let us examine the statistical meaning of the several fairness criteria portrayed in Fig. 3. The Equality of Opportunity criterion requires that the classifier produce the same true positive rate for both groups. In the case of gender discrimination, for example, this means that, irrespective of gender, a creditworthy person would have the same probability of getting the loan application approved. Note, however, that, in general, the false positive rates would differ, i.e., the bank would expect more defaults from one group than from the other. Equality of Odds, on the other hand, requires that both the true positive and the false positive rates be equalized. This is only possible if the two ROC curves intersect, which may not always happen. Even if they intersect, the resulting true and false positive rates may not be compatible with the bank’s economic objectives. For example, assume that both ROC curves intersect at a 25 percent true positive rate; this rate would probably be too low for any reasonable credit scoring system. The Demographic Parity concept assumes that the bank sets a maximum false positive rate and applies it to both groups. Assuming that the creditworthiness of the two groups differs, this procedure implies possibly different credit score thresholds for accepting loan applications from the two groups: people from the more creditworthy group would have their loan applications denied, while people with similar credit scores from the less creditworthy group would have their applications accepted. The true positive rates would also presumably differ, unless this criterion coincided with the Equality of Odds criterion. Finally, the Equality of Treatment criterion requires similar ratios of false negative to false positive rates across groups.
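To make the criteria operational, the sketch below computes, for each group, the quantities that the criteria compare at a fixed decision threshold: approval rate, true positive rate, false positive rate, and the false negative/false positive ratio. The labels and decisions are hypothetical toy data, and the demographic parity line uses the common "equal approval rate" formulation.

```python
import numpy as np

def group_rates(y_true, y_pred):
    """Confusion-matrix-based rates for one group at a fixed decision threshold."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return {
        "approval_rate": (tp + fp) / len(y_true),       # compared by demographic parity
        "tpr": tp / (tp + fn),                          # compared by equality of opportunity
        "fpr": fp / (fp + tn),                          # tpr AND fpr compared by equality of odds
        "fn_fp_ratio": fn / fp if fp else float("inf"), # compared by equality of treatment
    }

# Hypothetical labels (1 = creditworthy) and decisions (1 = approved) per group.
rates_a = group_rates(np.array([1, 1, 0, 1, 0, 0]), np.array([1, 1, 0, 1, 1, 0]))
rates_b = group_rates(np.array([1, 0, 1, 0, 1, 0]), np.array([1, 0, 0, 1, 1, 0]))

for name in rates_a:
    print(f"{name}: group A = {rates_a[name]:.2f}, group B = {rates_b[name]:.2f}")
```

A given criterion is satisfied when the corresponding quantity is (approximately) equal across groups; as the surrounding text notes, the criteria generally cannot all hold at once.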

It is almost impossibleFootnote 3 to satisfy the requirements of all definitions simultaneously (Chouldechova 2017; Pleiss et al. 2017). Trade-offs between fairness and accuracy (Corbett-Davies and Goel 2018) must be considered in each application domain, from loans to job applications and parole decisions. No matter how many definitions of fairness we arrive at, they will remain contestable vis-à-vis other definitions.

Programming fair algorithms hinges on the definition of fairness, which varies, as shown in Table 2. Algorithmic fairness can be pursued through strategies applied during data collection, pre-processing, in-processing, and/or post-processing. During data collection, the focus should be on guaranteeing or verifying the representativeness of the data. Are the instances (cases) properly described using all needed variables? Is the set of instances representative of the population? These are usual issues for any robust statistical analysis. Whenever data collection is not adequate, imperfect labeling is an issue; strategies to deal with biased labels include verifying the compatibility of the outcome with other features to deal with meritocracy unfairness (Schoeffer et al. 2021; Wang and Gupta 2020).

Pre-processing strategies for algorithmic fairness refer to dataset manipulation, such as changing labels in randomized data points (Calmon et al. 2017; Gordaliza et al. 2019), adding synthetic minority class examples (“over-sampling”) (Chawla et al. 2002; Chakraborty et al. 2021), removing examples of the majority class (“under-sampling”) (Elhassan and Aljurf 2016), re-weighting data pairs (Kamiran and Calders 2012), or a combination of techniques (Banasik and Crook 2007). In-processing approaches refer to changing the way the learning process works, such as including a social welfare constraint (Cohen et al. 2022) or a regularization term in the existing optimization objective function (Zafar et al. 2017), using adversarial debiasing (Zhang et al. 2018), or even counterfactual analysis (Russell et al. 2017). Finally, post-processing approaches refer to changes in the decision function after the training process, such as finding a proper threshold that would allow a fairer result (Hardt et al. 2016) or trading off accuracy and fairness (Chen et al. 2018).
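As one concrete example among these strategies, the sketch below illustrates the reweighing idea of Kamiran and Calders (2012): each (group, label) combination receives a weight equal to the ratio between its expected frequency under independence and its observed frequency, so that, in the weighted data, the sensitive attribute and the outcome are no longer associated. The toy data frame and column names are ours.

```python
# Minimal sketch of pre-processing by reweighing (Kamiran and Calders 2012).
import pandas as pd

df = pd.DataFrame({
    "sex":      ["F", "F", "F", "M", "M", "M", "M", "M"],  # sensitive attribute (toy data)
    "approved": [ 0,   0,   1,   1,   1,   1,   0,   1 ],  # observed label
})

p_group = df["sex"].value_counts(normalize=True)            # P(S = s)
p_label = df["approved"].value_counts(normalize=True)       # P(Y = y)
p_joint = df.groupby(["sex", "approved"]).size() / len(df)  # P(S = s, Y = y)

# Weight = P(S=s) * P(Y=y) / P(S=s, Y=y): under-represented combinations
# (e.g., approved women in this toy data) get weights above 1, over-represented ones below 1.
df["weight"] = df.apply(
    lambda r: p_group[r["sex"]] * p_label[r["approved"]] / p_joint[(r["sex"], r["approved"])],
    axis=1,
)
print(df)
# These weights can then be passed to a learner, e.g. fit(X, y, sample_weight=df["weight"]).
```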

3 Systematic review method

This research followed the Kitchenham et al. systematic literature review protocol (Kitchenham et al. 2009) for planning, conducting, and reporting the results. This section describes the main activities of the planning and conducting phases.

3.1 Planning phase

The planning phase encompasses the identification of the research questions, the search strategy, the selection of proper search keywords, and the definition of the inclusion and exclusion criteria. Since we are interested in algorithmic discrimination, we first applied a more generic search on Google Scholar and the ACM and IEEE digital libraries to gauge the search space and the evolution of research interest over the years. Figure 4 illustrates the increasing interest in the “algorithmic discrimination” and “algorithmic bias” topics. As shown below, we then narrowed the search scope to address our research objective.

Fig. 4
figure 4

Number of publications addressing “algorithmic discrimination” or “algorithmic bias” through the years

To address our research questions, the search keywords we used were:

  • “algorithmic discrimination” and

  • “bank loan”.

The initial search identified related expressions, such as “algorithmic fairness” and “discriminatory” for “algorithmic discrimination”, and “mortgage” for “bank loan”. Additionally, “loan” was more general than “bank loan” and was used instead. These changes helped tune the search string to:

((“loan”) OR (“credit”) OR (“mortgage”)) AND

((“algorithmic discrimination”) OR (“algorithmic fairness”) OR (“discriminatory”))

The search string was adjusted to fit each database’s search format. Papers were retrieved from the following databases: ACM Digital Library, Google Scholar, IEEE Digital Library, Springer Link, and Scopus.

3.2 Conducting phase

After conducting an automatic search of the databases, the three authors reviewed the retrieved papers to verify each paper’s conformance to the previously agreed inclusion and exclusion criteria. The paper selection process followed the strategy presented in Fig. 5. The ParsifalFootnote 4 and Publish or PerishFootnote 5 tools were used to automate the database searches.

Fig. 5
figure 5

Literature review process

The inclusion criteria consisted of the following rules:

  • I1: describes a model for defining, detecting, preventing, or mitigating discrimination in credit domains,

  • I2: describes parameters that explain discrimination in the bank credit domain,

  • I3: is published as a conference paper, journal article, book chapter, or technical report,

  • I4: is published between 2017 and 2022.

Additional papers were included as a result of finding relevant material from citations in the original list of retrieved papers, in a process called snowballing (Wohlin 2014).

The exclusion criteria are:

  • E1: describes discrimination only theoretically,

  • E2: does not present empirical data,

  • E3: focuses on system implementation without highlighting the discrimination aspect,

  • E4: describes a general proposal without describing a discrimination model or implementation details,

  • E5: is published as a complete book, presentation slides, editorial, thesis, or has not been published yet,

  • E6: is not written in English,

  • E7: cites credit loans, but is not applicable to the domain.

The initial selection contained 1320 papers. After applying the exclusion criteria, we reached a set of 78 papers, including journal and conference papers from 2017 to April 2022. Three studies from outside this period were included for being relevant to this review.

4 Findings

This section presents the analysis and the findings for answering the research questions guiding this literature review. Appendix A presents the details of each of these papers.

4.1 Preliminary analysis

According to the publication source and declared objective, the selected papers were classified into five domain areas: computer science, economics, law, operations research, and philosophy. As shown in Table 3, most papers brought computer science or economics perspectives.

Table 3 Papers timeline: 2017–2022. 1999, 2003, and 2016 papers were included by snowballing (Wohlin 2014)

We merged the abstracts of all papers, removed the stop words, and created a word cloud. Figure 6 presents the most frequent words appearing in the set of papers. The words are not comprehensive but highlight the most frequent topics. A possible reading of the figure is:

“The set of papers addresses the issues of fairness and discrimination coming from data and algorithms that impact decisions causing consequences. Discrimination is mostly racial and gender based. Credit loans and mortgages are the tasks being studied in which subjects are individuals, borrowers, groups, and people. The research goals include either classification, prediction, definition, or perception of discrimination. Papers are proposing models, methods, approaches, frameworks, systems, or literature reviews. The papers talk about metrics, evaluation, accuracy, error, and performance of their proposals. Many studies are domain-, case- or application-oriented. Researchers also address the what, where, when, and how discrimination occurs.”

In general, the papers present research on algorithmic discrimination or algorithmic fairness, in which data play a very important role and impact decision-making. Mostly, discrimination is related to race and gender. Research goals have been mostly classification, but prediction and definition are also frequent. The studies mostly propose models, methods, approaches, frameworks, and systems. The subjects have been individuals, people, or groups. The task involves obtaining credit, either a loan or a mortgage. There are case studies for specific datasets and countries. The intervention, which may be the cause of discrimination or the tool to detect it, mostly involves machine learning, descriptive and inferential statistical models, and ontological models (mainly for defining fairness and discrimination).
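A minimal sketch of the word-cloud procedure described above (merging abstracts, removing stop words, rendering the cloud), assuming the abstracts are available as a list of strings and using the open-source wordcloud package:

```python
# Sketch of the word-cloud construction over the merged abstracts (illustrative only).
from wordcloud import WordCloud, STOPWORDS

abstracts = ["abstract text of paper one ...", "abstract text of paper two ..."]  # placeholders
text = " ".join(abstracts).lower()

cloud = WordCloud(stopwords=STOPWORDS, background_color="white",
                  width=800, height=400).generate(text)
cloud.to_file("abstract_wordcloud.png")  # remaining words sized by frequency, as in Fig. 6
```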

Fig. 6
figure 6

Word cloud containing the most frequent words that appear in the abstracts of the 78 selected papers

Fig. 7
figure 7

Geographic distribution of the publications considering the country of the dataset source. “Non-specified Countries” means the paper did not use country-specific datasets, either because it used synthetic or academic datasets or because it presented a theoretical argument

Most countries included in our survey present case studies using inferential statistics to identify discrimination in public or private loan datasets. As shown in Fig. 7, the United States leads the research on discrimination, especially in terms of creating a theoretical framework for defining fairness metrics. European countries focus on general discrimination. Developing and low-income countries have focused on identifying discrimination, mostly sex discrimination, in bank datasets. Somewhat surprisingly, the first authors of the selected papers were mostly white men, as shown in Table 4. The identification of the sex and race/ethnicity of each paper’s first author is an approximation, inferred from the author’s name and facial image available on the Internet.

Table 4 Sex and race/ethnicity of the first authors of the retrieved papers

4.2 The research settings

To answer our first research question, concerning the research settings, we analyzed each of the 78 papers looking for descriptors for clustering or distinguishing the approaches presented in the papers. We came up with an ontology for discrimination research in the credit domain. As described in Fig. 8, research on discrimination in the credit domain has its findings delimited to a domain area and to the country from which the data came. It is developed according to a type of research, using specific research methods applied over a dataset containing the data that support the findings. The dataset records instances of loans and loan applications from a country. The discrimination analysis of the data is done considering fairness metrics. The research should be grounded on a discrimination theory and related to a legal framework. The research focuses on a topic delimited by a scope and looks at the impact of discrimination on the lives of individuals or on society.

Fig. 8
figure 8

An ontology for describing research on the discrimination topic

Seven attributes describe the research settings:

  1. The domain area: research published in forums such as computer science (CS), economics/business (ECO), operations research (OR), law, philosophy, or social science (PHY).

  2. The type of research: argumentative/essay, explanatory (ex-post facto), theoretical, and empirical. Depending on the type of research, one or more research methods were applied, such as literature review, essay, inferential statistics, descriptive statistics, content analysis, counterfactual analysis, and machine learning. Depending on the type of work, inferences are drawn from the analysis of datasets that can be public, private, created for the experiments (own dataset), academic, or synthetic. The academic datasets are sample datasets, properly anonymized, donated by commercial banks, and available in public repositories, such as the UC Irvine machine learning repository.Footnote 6 Two datasets are primarily used in the credit domain: the Australian dataset, containing 600 instances described by 14 attributes, and the German dataset, containing 1000 instances described by 20 attributes. There is a benchmarking study comparing credit scoring methods (Baesens et al. 2003; Lessmann et al. 2015) that, in addition to the German and Australian datasets, also used other academic datasets provided by Benelux and UK financial institutions: the Bene1 dataset containing 3123 instances described by 27 attributes, the Bene2 dataset containing 7190 instances described by 18 attributes, and the UK dataset containing 30,000 instances described by 14 attributes. They also used the Pak dataset, containing 50,000 instances described by 37 attributes, provided by companies for the PAKDD data mining competition.Footnote 7 The datasets depict loan decisions from a country.

  3. The fairness metric used to evaluate the work: demographic parity, equalized odds, counterfactual fairness, predicted parity, individual fairness, and performance accuracy.

  4. The discrimination theory grounding the research, shown in Table 5: taste-based discrimination (Becker 2010), information asymmetry (Stiglitz and Weiss 1992), statistical discrimination focused on the outcome (Phelps 1972), statistical discrimination focused on the skills (Arrow 2015), implicit discrimination (Bertrand et al. 2005), intersectionality discrimination (Crenshaw 1989), symbolic capital discrimination (Bourdieu 2018), visual legibility (Latour 1986), statistical sampling (Heckman 1979), and race tri-stratification (Bonilla-Silva 2004). Becker’s taste-based discrimination is the fundamental grounding theory pervading all discrimination research.

  5. The legal framework: the USA Home Mortgage Disclosure Act (HMDA),Footnote 8 the USA Equal Credit Opportunity Act (ECOA),Footnote 9 the USA California Consumer Privacy Act (CCPA),Footnote 10 the European General Data Protection Regulation (GDPR),Footnote 11 and the Brazilian General Data Protection Law (LGPD).Footnote 12

  6. The research topic: formulation of the fairness/discrimination concept, perception of fairness/discrimination in outcomes and the decision-making process, diagnosing discrimination in datasets, and algorithmic discrimination. The research topic focuses on discrimination leading to impacts of unfair results, credit rejection, or price differentiation. It is delimited to a research scope, such as any loan, car loan, first-time loan, micro-credit loan, or mortgage. No paper in our reviewed set addressed educational loans.

  7. The country of the dataset: Albania, Brazil, China, France, Ghana, India, Jordan, Spain, the USA, a set of developing countries, loan data from an international funding agency covering 52 developing countries, and data from the micro-credit platform Kiva containing data from low-income countries. As noted, the academic datasets are sample datasets donated by companies; they come from Australia, Benelux, Germany, the UK, the USA, and KDD competitions.

Table 5 Theories of discrimination mentioned by the retrieved papers

As previously observed and shown in Table 3, most papers were from the computer science and economics domains, looking at generic discrimination or unfair outcomes, and at specific bias concerning sex and race/ethnicity, with a few regarding sexual orientation, poor people, people without a credit history, and small farmers. There is no evidence of studies relating to other types of discrimination, such as religion, in credit markets.

Another interesting finding is the worldwide concern with discrimination, although the USA clearly dominates in identifying discrimination through dataset analysis. Few studies explicitly cite the discrimination theory grounding their work, but they implicitly consider Becker’s taste-based discrimination behavior.

4.3 The research issues

After considering the studies, we mapped the research issues into one of seven categories, as presented in Fig. 9: literature reviews, data analytics/econometric, human perception, definition, dataset, algorithmic, and outcome issues. Some studies focus on more than one issue category.

Fig. 9
figure 9

Research issues discussed in the papers

4.4 Literature reviews and essays

Studies on race and ethnicity discrimination in credit loans are not new. Tables 6 and 7 summarize the findings of previous literature reviews and essays. In 1999, Black (1999) presented a literature review on race and ethnicity discrimination in the credit market domain. He claimed discrimination is more against properties (redlining) than individuals. He highlighted the effect of job instability and the lack of credit history as the significant factors for the loan acceptance gap between black and white applicants. Nevertheless, he did not investigate the root cause of these effects, which might be explained by structural discrimination fostered by the credit approval gap (loan rejections for blacks and Hispanics are three times those for whites—demographic parity fairness metric).

Table 6 Reviews and essays concerning algorithmic discrimination—Part 1
Table 7 Reviews and essays concerning algorithmic discrimination (part 2)

Black (1999) did not consider the impact of the new techniques for calculating individuals’ credit scores, which may themselves be a cause of discrimination. Lessmann et al. (2015) reviewed new techniques, including machine learning, to assess credit scoring. They analyzed 41 different classifiers, considering the number of datasets used for testing, the number of variables per dataset, and the technique itself, such as artificial neural networks, support vector machines, and ensemble classifiers. They evaluated the classifiers using eight academic datasets, including the Australian credit (AC) and German credit (GC) datasets from the UCI library. Their results indicated that the new techniques perform better than traditional logistic regression, especially heterogeneous ensemble classifiers. They also presented evidence of increased financial returns with more accurate scorecards. Moscato et al. (2021) proposed a benchmark study that evaluates the performance (e.g., AUC, sensitivity, specificity) of different classifiers (e.g., random forest, logistic regression, and artificial neural networks) under different sampling strategies (e.g., under-sampling and over-sampling) to deal with unbalanced datasets. They also considered the fit of different explainability methods, especially LIME and SHAP, for offering transparency in computer decision-making. They used a public dataset from the “Lending Club” fintech marketplace bank (data available in Kaggle repositoriesFootnote 13) containing 877,956 samples of loan applications.

In 2017, Aitken (2017) had already realized the usefulness of informal information from social media as an alternative credit-scoring source for the “unbanked”. The unbanked are people without formal documentation proving their credit status and without a credit history, and therefore without access to loans; they are considered too risky for lenders to loan them money. Based on Latour’s (1986) visual legibility theory for making visible the different perspectives of a subject, Aitken reflects upon the advantages and disadvantages of current approaches to making the financially invisible unbanked visible, such as the experiment using social behavior information (360-degree views of borrowers) being tried by FICO.Footnote 14 He brought up the paradox posed in the literature: the alternative scores under development may either work to increase loan inclusion, in the sense of equity, or ratify the unbanked as credit-unworthy.

More recently, reviews have looked specifically at bias in credit scoring techniques. Corrales-Barquero et al. (2021) published a literature review on sex bias in credit scoring methods. They reviewed a set of 20 papers to identify the techniques that have been used to reduce sex bias in credit scoring models. They differentiated the bias into disparate treatment (when the protected attributes were present in the dataset), associated bias (due to proxies), selection bias (datasets with misrepresented groups), and intentional bias. The authors also listed mitigation techniques and the requirements for applying them.

Kordzadeh and Ghasemaghaei (2022) reviewed 56 papers on algorithmic bias through the lens of the stimulus–organism–response theory and organizational justice theory. Their objective was to understand the impact on decision-making. Based on the literature, they present a conceptual model showing that algorithmic bias impacts decision-makers’ fairness perception of the computational outcome and, consequently, their acceptance and adoption of the suggestions. This stimulus–response flow is influenced by individual characteristics, in terms of beliefs and moral identity; task characteristics, such as automated or human control; technology characteristics, such as reasoning transparency; organizational characteristics, such as the organization’s norms and rules; and environmental characteristics, such as laws and social norms.

Proposing solutions to deal with discrimination requires a good definition of the problem. For that matter, understanding what constitutes a fair result, a fair algorithm, and a fair decision is fundamental to reaching a solution. Mehrabi et al. (2021) reviewed the literature on fairness definitions for machine learning systems and proposed a taxonomy in which fairness is divided into three groups: individual fairness, including the veil of ignorance and counterfactual fairness; group fairness, including demographic fairness and equalized odds; and subgroup fairness. They reviewed different approaches for fair machine learning systems in the pre-processing, in-processing, and post-processing phases, tested with academic datasets (UCI datasets) from a variety of domains.

Wong (2020) reviewed the literature on algorithmic discrimination, looking at the way researchers have been addressing the problem. He distinguished the different approaches for removing bias as pre-processing, during the learning process, or as a post-processing technique. He claimed that many technical solutions have been proposed to remove bias without a clear definition of a fair system.

Ragnedda (2020) discussed how the introduction of new digital technologies, such as machine learning, into society can increase social inequalities. He classifies the inequalities into three types: (a) knowledge inequalities, reflecting people’s different understandings of how the outcomes of these technologies impact our lives; (b) inequalities imprinted in the data guiding the systems’ learning process; and (c) inequalities of treatment nudged by intelligent systems according to individuals’ socio-demographic characteristics. This literature-based essay organizes the topic.

Mitchell et al. (2021) discussed fairness in the context of decision-making, either human or computational. The authors classified the existing definitions of fairness into three categories: driven by utility maximization, with a single threshold over the predictions dividing the population; driven by an equal prediction measure, with similar prediction impact across groups; and driven by an equal measure over the decisions. They used this classification to shed light on the relation between the implicit assumptions and choices of prediction-based decision-making and the fairness of the outcomes.

Bruckner (2018) discusses the benefits and challenges of the algorithmic lender 2.0. The benefits include offering loans faster, more cheaply, and with more predictive credit assessment than conventional lenders. Algorithmic lenders broaden the range of borrowers, extending to the credit invisible, since they can gather and process loan applicants’ information from varied sources, including social media. Algorithmic lenders eliminate human decision-making subjectivity and discriminatory behavior. On the other hand, since these systems learn decision-making patterns from datasets, they can perpetuate discrimination. The author emphasized the need for regulation of algorithmic lenders and discussed the adequacy of current American legislation, specifically the Equal Credit Opportunity Act (ECOA). He provided examples of the two ways an ECOA violation can be proven: disparate treatment and disparate impact. Giving more favorable credit terms to older people is an example of disparate treatment; defining a minimum value for a credit loan that heavily excludes a racial group, even if not intentionally discriminatory, is an example of disparate impact. Lenders can avoid liability if they can prove the discriminating policy has a valid purpose other than maximizing profit; they must show the relation between creditworthiness and the policy being challenged.

4.5 Data analytics: econometric issues

A great deal of research has focused on making sense of data and identifying trends that indicate discrimination in the decision-making process. The goal is to identify systematic discriminatory behaviors engraved in the datasets. Thirty-two of the seventy-eight reviewed papers addressed this issue, as summarized in Tables 8 and 9.Footnote 15
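A minimal sketch of the kind of specification these econometric studies typically estimate, assuming a hypothetical loan-level data frame with an approval indicator, a sensitive attribute, and a few controls (the file and column names are ours, not from any reviewed paper):

```python
# Sketch of a discrimination test via logistic regression (hypothetical columns).
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("loans.csv")  # assumed columns: approved, sex, income, credit_history, firm_age

model = smf.logit("approved ~ C(sex) + income + credit_history + firm_age", data=df).fit()
print(model.summary())
# A statistically significant coefficient on the sex dummy, after conditioning on the
# controls, is typically read as evidence of disparate treatment; omitted-variable and
# sample selection issues (e.g., Heckman 1979) still need to be addressed.
```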

Table 8 Econometric studies (part 1)
Table 9 Econometric studies (part 2)

4.5.1 Sex discrimination

Beck et al. (2018) investigated whether the sex match between loan officers and borrowers influences the likelihood that borrowers return to the same lender. They used loan data from a large commercial Albanian lender that lends money to small and medium-sized firms. Their results indicate that borrowers prefer to be attended by a person of the same sex and that this preference impacts credit market outcomes: first-time borrowers are less likely to return to loan officers of the opposite sex for a new loan. Blanco-Oliver et al. (2021) performed a similar study using data from the World Bank’s Microfinance Information Exchange (MIX) platform, which records financial and operational information on micro-credit from 52 developing countries. Their results also indicated the existence of sex affinity between borrowers and loan officers when deciding on a loan application. They found less discriminatory results in the overall performance of microfinance when borrowers and lenders are of the same sex (individual fairness metric).

Cozarenco and Szafarz (2018) investigated whether sex bias in microfinance institutions (MFIs) creates new barriers for female entrepreneurs to access micro-credit lines for their businesses in France. MFIs are government-subsidized institutions pursuing social performance within a budget constraint; in France, the MFI loan ceiling is EUR 10,000. Looking to broaden their clientele at low risk, banks use MFIs to co-finance loans for entrepreneurs without a credit history who need loans above the MFI ceiling. The authors empirically analyzed a manually compiled dataset of 1098 credit applicants from a French MFI, covering 2008 to 2012. Their results suggest female micro-borrowers have lower chances than male borrowers of getting loans above the MFI loan ceiling. This discrimination seems to be caused by the MFI transferring the clients’ evaluation to the banks’ assessment policy. At first glance, the strategy of combining banks and MFIs to grant loans above the ceiling is discrimination neutral, but the study shows that it feeds structural discrimination against women.

This credit discrimination against female entrepreneurs seems to be worldwide. In Spain, De Andrés et al. (2021) empirically evaluated a sample of 80,000 Spanish companies, looking at the owners’ sex and their demand for credit, credit approval ratio, and credit performance. Their dataset comes from the Spanish Central Bank (CIRBE database).Footnote 16 Their results showed that female entrepreneurs are less likely than their male peers in the same industry to have their loan applications approved in their firm’s founding year. Interestingly, they also found that women who received loans in the founding year are less likely to default. All these differences fade out as the company builds a record of profits and losses.

In Vietnam, Le and Stefańczyk (2018) empirically analyzed data from national census interviews conducted in 2005, 2007, 2009, 2011, and 2013 with managers and employees of small, medium, and large firms. Despite the Vietnamese government’s efforts to reduce sex discrimination in entrepreneurship, their research showed that the likelihood of loans being denied to women-led enterprises increases to 67% in male-intensive industries and 71% in periods of tight monetary policy. Tran et al. (2018) showed similar results related to the likelihood of obtaining any loan, using public data from the Vietnam Access to Resources Household Survey.

Sackey and Amponsah (2018) investigated the factors that could explain the sex bias against women in Ghana. They looked at micro-credit application forms randomly selected from commercial banks’ micro-credit databases. There were 1,408 borrowers who received all or part of the requested amount, of whom 872 were male and 536 were female; there was no information on rejected applications. They applied counterfactual analysis, but only found a strong correlation with the endowment amount (counterfactual fairness metric). This fact could signify structural prejudice, but confirming it would require a macro-level investigation. In 2020, Sackey and Amponsah (2020) extended their investigation of sex bias in the credit market by looking at loan applications from micro- and small-scale companies in Ghana. According to the 2019 Mastercard Index of Women Entrepreneurship,Footnote 17 Ghana has the second highest percentage of women business owners relative to all businesses in the country (demographic parity fairness metric). Despite this large female entrepreneurial presence, women still have timid participation in the credit market. Their research investigated the reasons for this paradox, based on the empirical analysis of questionnaire responses from 678 micro- and small-industry entrepreneurs.Footnote 18 Their results indicated that demographic characteristics, including age, gender, education, size of the house, and credit history, explain the credit participation gap better than sex alone.

Chen et al. (2017), using data from a Chinese online lending platform, showed that women have a higher probability of getting loans than men. However, their results also showed that, although women presented a lower default rate than men, their loans were approved at higher interest rates.

Maaitah (2018) investigated factors that could explain differences in loan allocation for entrepreneur borrowers (price differentiation). He empirically analyzed a sample dataset containing 88,055 Jordanian loan applications from the Micro Fund for Women, covering the 2011–2017 period. He analyzed borrowers’ information, including sex, years of formal education, address area, and nationality, and did not find any statistical explanation for the differences in the number of loans borrowed by male and female borrowers (demographic parity fairness metric).

Sex discrimination has not been restricted to personal, small, or micro-business entrepreneurs. In India, Li (2021) developed an explanatory model using inferential statistics over a commercial financial institution’s dataset of auto-loan applications.

Contrary to the results of the main literature on discrimination, Salgado and Aires (2018) presented a study in which female entrepreneurs had better chances of getting loans than their male counterparts. They used data from a large Brazilian bank’s branches in Paraíba, a northeastern state of Brazil. Their results are consistent with the high female political participation in that state. They did not mention interest rate differences.

4.5.2 Race discrimination

One of the most studied discriminatory behaviors in the world is related to race or ethnicity, especially in the USA.

Bayer et al. (2017) empirically analyzed a very informative dataset containing a complete census of housing (mortgage) transactions from a private real-estate monitoring service (DataQuick), combined with registry information gathered under the HMDA. The data covered the period from 1990 to 2008 for four American cities: San Francisco, Los Angeles, Chicago, and Baltimore. They were interested in price differentials in the housing market. Their results showed that black and Hispanic buyers pay more for housing regardless of the race or ethnicity of the seller. Their results suggested the estimated premia cannot be explained by racial prejudice, but by the persistence of racial differences in home ownership, the segregation of neighborhoods, and the dynamics of wealth accumulation, i.e., signs of structural discrimination that are difficult to break.

Faber (2018) analyzed mortgage applications in metropolitan statistical area housing markets, looking at data from the 2014 HMDA database. He examined the role of race in mortgage outcomes: mortgage approval and loan pricing. His analysis showed racial inequalities in the USA mortgage domain, with very different mortgage approval rates (71% for whites, 68% for Asians, 63% for Latinos, and 54% for blacks—demographic disparity). Additionally, black and Latino borrowers were three times more likely to receive high-cost loans compared with whites (demographic disparity), a practice that has accelerated since the 2007–2008 subprime crisis.

Ambrose et al. (2021) performed an analogous study, also considering the brokers’ race. The authors performed an empirical analysis of US mortgage loan data from a large private mortgage lender.Footnote 19 Their data encompassed the period from January 2003 to March 2007, and the dataset contained information on the borrowers’ as well as the brokers’ race (inferred from their names). Their results showed that Asian borrowers get lower fees than white borrowers when dealing with Asian brokers, and that white brokers demand higher fees from black, Asian, and Latino borrowers. There is no evidence of price differentiation when brokers are black or Latino.

Bhutta and Hizmo (2021) analyzed a dataset combining data from USA FHA*-insured loans originated in 2014 and 2015 with information from the Optimal Blue mortgage market indexFootnote 20 to identify traces of racial and ethnic discrimination in mortgage pricing. They looked at the interest rate, discount points, and mortgage fees. They found that race and ethnicity impact interest rates, but these gaps are mitigated by discount points.

Loya (2022) departs from the tri-racial stratification theory (Bonilla-Silva 2004), which considers a racial hierarchy comprising Caucasian whites at the top, followed by honorary whitesFootnote 21 and the collective black group.Footnote 22 His research focused on racial/ethnic prejudice against the Latino community (demographic parity). Looking at a stratified random sample of about 115 thousand complete mortgage applications (single-family homes) drawn from the American HMDA for the period 2010 to 2017, he showed that black Latinos faced the highest mortgage fees and mortgage denial rates compared with honorary white groups. This result indicates racial prejudice by skin color even within an ethnic group.

Yu (2022) investigated the impact of algorithmic underwriting on mortgage approval behavior over the course of the month. He looked at 25 years of American HMDA mortgage approval data, from 1994 to 2019, accessing high-frequency data that allowed a detailed within-month analysis. He observed that approval rates increase towards the end of the month, reaching the highest rate on the last day, and that the approval gap between blacks and whites drops from 7% to 3.5%. A deeper analysis indicated this behavior could be explained by the incentive structure in the mortgage bank domain: loan officers must meet their monthly loan quota, which impacts their monthly income, and non-compliance threatens their jobs. Further details of this approach can be found in Giacoletti et al. (2021).

Steil et al. (2018) departed from the variety of quantitative research available in the loan literature, which reports significantly higher mortgage costs for blacks and Latinos compared to whites in the USA, to investigate the lenders’ strategies to reach these borrowers (demographic parity). The authors performed a content analysis (qualitative research) of the textual depositions of four litigation cases against lenders whose facts were considered strong enough to sustain a lawsuit and that were still ongoing (2022). Two cases involved the Wells Fargo Bank (the City of Baltimore vs. Wells Fargo Bank and the City of Memphis vs. Wells Fargo Bank), one case involved Morgan Stanley (Adkins et al. v. Morgan Stanley), and one case involved Olympia Mortgage (Barkley v. Olympia Mortgage). These cases reported a “reverse redlining” effect: identifying minorities and using a trusted social network (such as local priests) to induce them to obtain high-rate (predatory) mortgages via subprime loans.

4.5.3 Both race and sex discrimination

Some studies look at data focusing on both race and sex discrimination. Hassani (2021) used a public dataset from KaggleFootnote 23 to show how algorithmic bank loan approval reverberates racial/ethnic and sex discrimination. The dataset contained 400 instances (race/ethnicity: 99 African-American, 102 Asian, 199 Caucasian; sex: 207 women and 193 men), of which 31% of the applications were rejected. Using credit scoring information, he predicted applicants’ race and sex. His results suggested that racial and sex credit loan discrimination will persist even after removing the sensitive attributes from the data.
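A sketch of the kind of proxy check Hassani’s result points to: train a classifier to predict the sensitive attribute from the remaining features, and compare its accuracy against the majority-class baseline; accuracy clearly above the baseline indicates that proxies for the sensitive attribute remain in the data. The file and column names below are hypothetical.

```python
# Sketch: testing whether non-sensitive features act as proxies for a sensitive attribute.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("loan_applications.csv")        # assumed to include a "race" column
X = pd.get_dummies(df.drop(columns=["race"]))    # remaining features, one-hot encoded
y = df["race"]

acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="accuracy").mean()
baseline = y.value_counts(normalize=True).max()  # accuracy of always predicting the majority group

print(f"proxy-model accuracy = {acc:.2f} vs. majority baseline = {baseline:.2f}")
# If accuracy is clearly above the baseline, dropping "race" alone does not remove the
# information; downstream models can still discriminate through proxies.
```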

Park (2022) empirically analyzed data on Federal Housing Administration insured mortgagesFootnote 24 to investigate discrimination concerning race, ethnicity, and sex. He analyzed over 7.6 million loans collected between 2010 and 2019, looking at rates of insurance endorsement and default among purchase mortgage applications. His results indicated that white male applicants with a female co-applicant (e.g., a white family) have higher endorsement rates (access to service) than any other applicant demographics with the same characteristics (equality of opportunity fairness metric). His findings also showed that default rates did not alter the bias against Asian, Hispanic, and female applicants.

4.5.4 Sexual orientation discrimination

Loan applications do not ask for sexual orientation. Nevertheless, this information can be inferred with some statistical noise: the disclosed sex of the borrower and co-borrower can hint at it, although there might be cases of father–son, mother–daughter, or friends applying for a loan together. Dillbary and Edwards (2019) found evidence of sexual orientation discrimination by looking at a 20% sample of HMDA mortgage data between 2010 and 2015, covering over five million mortgage applications. Their findings indicated that same-sex male co-applicants are significantly less likely to have their loan request approved than white heterosexual co-applicants. They also found that this type of discrimination happened across many different lenders: big and small banks, in urban and rural areas. Other studies corroborate sexual orientation discrimination. Looking at USA mortgage data, Sun and Gao (2019) showed that same-sex applicant and co-applicant pairs were 73% more likely to have their request denied, or to get a loan with higher fees, than different-sex pairs.

4.5.5 No-credit-history discrimination

Sometimes, the barrier to accessing credit lines is the lack of a credit history. Cai et al. (2020) addressed discrimination against people with no credit history, whose lack of information usually prevents good-payer applicants from receiving a loan. Gathering more information about applicants' creditworthiness involves costs. Considering the lenders' goal of maximizing profit and the applicants' need for loans, the authors propose an algorithm (an in-processing prevention technique) to identify applicants close to the decision threshold, for whom gathering more information would benefit all parties. They tested their ideas using a synthetic dataset and an academic dataset (the German Credit dataset, Hofmann 1994) containing 1,000 loan applications.

Liu et al. (2019) also addressed discrimination against poor people without a credit history, looking at discrimination in micro-lending in developing countries. They acknowledged the benefits of micro-lending for reaching people who do not have access to formal loan channels, since the crowd looks at criteria other than profit. Nevertheless, based on data from the Kiva lending platform,Footnote 25 they identified unbalanced lending among different groups of borrowers (demographic parity fairness metric). They propose a Fairness-Aware Re-ranking (FAR) algorithm to balance ranking quality and borrower-side fairness, an in-processing discrimination prevention technique for fair lending. They tested their algorithm for accuracy and fairness using a sample of lending transactions from 9,597 lenders taken from an 8-month period of a proprietary Kiva dataset.
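The following is a minimal sketch of this re-ranking idea in Python. It is an illustrative greedy procedure that trades off a borrower's relevance score against keeping each group's share of the recommended list close to its share of the candidate pool; it is not the authors' exact FAR algorithm, and the `lam` trade-off parameter is our own.

```python
import numpy as np

def fairness_aware_rerank(scores, groups, k, lam=0.5):
    """Illustrative greedy re-ranking (not Liu et al.'s exact FAR method):
    build a top-k list of borrowers by mixing ranking quality (the original
    relevance score) with borrower-side fairness (an under-representation
    bonus for groups currently below their target share)."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    target_share = {g: np.mean(groups == g) for g in np.unique(groups)}

    selected, remaining = [], list(range(len(scores)))
    while len(selected) < k and remaining:
        counts = {g: sum(groups[i] == g for i in selected) for g in target_share}
        best, best_util = None, -np.inf
        for i in remaining:
            g = groups[i]
            # positive when group g holds less than its target share of the list
            deficit = target_share[g] - counts[g] / max(len(selected), 1)
            util = (1 - lam) * scores[i] + lam * deficit
            if util > best_util:
                best, best_util = i, util
        selected.append(best)
        remaining.remove(best)
    return selected
```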

4.5.6 Line of work discrimination

Otieno et al. (2020) discussed the discrimination small-scale farmers face in accessing bank loans. The authors argued that small-scale farmers cannot provide much of the information banks require to assess creditworthiness, such as land titles and guarantees from buyers of their products. Kumar et al. (2021) did a similar study and concluded that the lack of credit history may keep small farmers away from mainstream banking transactions.

Pi et al. (2020) showed that small farmers living in Chinese provinces face a similar discriminatory effect related to the borrower's place of residence. They analyzed data from a Chinese peer-to-peer online platform (RenRen) and concluded that small farmers pay higher interest rates on loans than city applicants.

4.5.7 Technology-based discrimination

Considering the rapid growth of FinTech in the mortgage market, Haupert (2022) studied how the entry of FinTech lenders affected racial discrimination (individual fairness) compared to traditional lenders (banks). Since the loan application and approval are online, any discrimination is more likely to be statistical rather than taste-based. Haupert considered five racial values: Whites, Blacks, Latinos, Asians, and Others. He analyzed HMDA data (American public mortgage data) from 2015 to 2017 and classified neighborhoods into five classes according to the percentage of non-White residents (<20%, 20–40%, 40–60%, 60–80%, >80%). The dataset comprised 7,630,193 first-lien mortgage applications, of which 625,474 were in the FinTech category, and 6,830,311 subprime applications, of which 529,415 were from FinTech lenders. Haupert concluded that, except for Latinos, FinTech mortgage applicants have lower mean incomes than applicants to traditional lenders. Disparities in approval rates and pricing between Whites and non-Whites remained among FinTech lenders, although discrimination in approval rates and interest rates was slightly lower. When considering neighborhood composition, Haupert concluded that predicted subprime rates increase as the share of non-White residents grows, especially for Latino applicants.

Fuster et al. (2019) studied the impact of the introduction of Fintechs into the mortgage market on borrowers with low access to formal financial lenders, in terms of access to loans and pricing differentiation. The authors looked at American mortgage data insured by the FHA from 2010 to 2016. Their sample contained 51,448,444 bank loans, 3,473,506 Fintech loans, and 25,604,501 loans from other lenders. The data showed that Fintechs process mortgages about 20% faster than banks and run a more elastic business. These facts may explain the increasing presence of Fintechs in the mortgage market, whose share grew from 2% (2010) to 8% (2016).

Allen (2019) argues that structural race discrimination is increased by algorithmic decision-making. He highlighted the risks of algorithmic redlining caused by systems that learn from biased datasets, and the bias of treating current credit scores as a token of creditworthiness; high-accuracy outcomes veil poor social fairness performance. His argument focuses on race discrimination in the US housing domain, especially on mortgage access and pricing discrimination. He mentioned legal actions to restrain race discrimination in the economic environment, such as the American FHA, which broadened home ownership to a more diverse group of people, and the ECOA, which fights racial redlining. He claimed the need for transparency and auditing of training datasets and algorithms.

4.6 Human perception issues

There are two ways of looking at perception issues: human perception of the concept of fairness and human perception of the validity of computational results. Human perception of the fairness and accuracy of computer outcomes is fundamental for accepting and adopting ML systems. Table 10 summarizes the research findings on the human perspective of adopting automated loan approval systems, and this section discusses the different approaches to dealing with human perception issues.

Table 10 Human perception issues

Many technical solutions have been proposed to produce fair decision-making systems. However, there is no consensus on what fairness means. Wong (2020) claimed that before proposing technical solutions, it is fundamental that society agrees on the meaning of algorithmic fairness; he sees this definition as a political issue. Wong grounded the discussion in Daniels and Sabin's accountability for reasonableness framework (Daniels and Sabin 2008), which requires the design of any algorithm to consider the interests of the affected people. Algorithm developers must consider three conditions: (a) Publicity—the fairness definition, the fairness metrics, and the trade-offs between fairness and accuracy of any algorithmic outcome should be made public; (b) Acceptability—decisions should be acceptable to the affected people; (c) Revision & Appeal—conflict resolution strategies should be available.

Albach and Wright (2021) investigated laypeople's perception of fairness across different domains, extending the findings of Grgić-Hlača et al. (2019) to identify which attributes laypeople consider fair for justifying algorithmic decisions. Albach and Wright surveyed 2,157 Amazon Mechanical Turk workers ("turkers") in 2019–2020. The questionnaire asked turkers to rate, on a Likert scale from fair to unfair, the features used by a machine learning system to produce an outcome, as well as their agreement with the system's output. They evaluated the results across six domains, including loans, to identify differences in moral reasoning. Their findings indicated that perceptions are largely consistent across domains, except for the insurance and health domains; a single dominant predictor that heavily impacts accuracy in each of these fields may explain this behavior. Moreover, participants were turkers and not actual decision-affected people.

Binns et al. (2018) conducted an empirical lab study asking 19 UK participants (11 male, eight female) for their perception of fairness regarding decisions proposed by an ML algorithm in three different scenarios, including credit loan applications. The authors examined constructs such as the lack of human touch in the explanation, the difficulty of interpreting the machine results, the possibility of acting upon the factors leading to the computer outcome to change it, and the relation of the machine outcomes to moral considerations. Their results suggest that the explanation style used to present the machine suggestion (supporting understanding and the ability to act) strongly impacts participants' perception and acceptance of the outcome.

Saxena et al. (2019) also investigated people's perception of algorithmic outcomes, but they took a different approach: they studied people's acceptance of the various definitions of fairness proposed in the algorithmic fairness literature. They built different decision-making scenarios, including loan approval, presented results according to different fairness metrics (individual fairness Dwork et al. 2012, meritocracy Kearns et al. 2017; Joseph et al. 2016, and calibration Liu et al. 2017), and asked Amazon Mechanical Turk participants for their perception of fairness. Their results indicated that the "calibration" definition of fairness got the best acceptance rate from participants.

Karimi et al. (2021) propose a counterfactual model to improve people's perception of the "feasible" actions available to change the outcome of algorithmic decision-making. For example, a bank newcomer facing a computer outcome that jeopardized a loan application may act by asking for small loans, even without needing them, to build a positive credit history and later get a bigger loan approved. An example of a non-feasible action would be "change your sex" or "get younger" to improve your credit score. The authors propose augmenting counterfactual explanations to take into account the causal consequences of actions and the set of (physical) laws restricting them. They present a mathematical model for providing the minimal set of feasible actions a loan applicant can take to change the algorithmic outcome.
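The sketch below illustrates the underlying recourse idea, assuming a scikit-learn-style classifier where class 1 means approval and a hypothetical `feasible_changes` map listing which features may change and to which values. It is a brute-force toy under those assumptions, not Karimi et al.'s causal model.

```python
import itertools
import numpy as np

def minimal_feasible_recourse(model, x, feasible_changes):
    """Illustrative brute-force search (not Karimi et al.'s exact method)
    for the smallest set of feasible actions that flips a loan denial.

    feasible_changes: dict mapping a mutable feature index to candidate
    new values; immutable attributes such as sex or age are simply not
    listed, which encodes the 'feasibility' constraint."""
    if model.predict(x.reshape(1, -1))[0] == 1:
        return []                      # already approved, nothing to do
    features = list(feasible_changes)
    for size in range(1, len(features) + 1):
        for subset in itertools.combinations(features, size):
            for values in itertools.product(*(feasible_changes[f] for f in subset)):
                x_cf = x.copy()
                x_cf[list(subset)] = values
                if model.predict(x_cf.reshape(1, -1))[0] == 1:
                    return list(zip(subset, values))   # minimal action set found
    return None                        # no feasible recourse exists
```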

Rebitschek et al. (2021) analyzed people's error estimations and their willingness to accept errors from automated decision systems. Based on a questionnaire answered by 3,086 respondents in Germany, they identified that people underestimate the accuracy of credit scoring systems and that highly educated people are more likely to underestimate systems' errors. Surprisingly, in the credit scoring domain, respondents did not even accept the number of mistakes they expected the system to make. Further, respondents were willing to accept more errors from human experts than from computer systems.

4.7 Definition issues

Defining decision fairness, either human or computational, is not a new issue, but it is still getting attention from researchers. Table 11 presents a summary of these studies further detailed in this section.

Table 11 Research focus on defining fairness in ML context

Researchers have sought to define requirements for achieving fair algorithms (Kleinberg et al. 2016), to mathematically define metrics that measure a system's degree of fairness (Kearns 2017), or to offer programming libraries containing many different definitions so that they can be computed and compared (Bellamy et al. 2019).

Kleinberg et al. (2016) formalized three conditions for a fair algorithm: calibration within groups, balance for the negative class across groups, and balance for the positive class across groups. They showed that the three conditions cannot be simultaneously satisfied except in degenerate cases: a perfect predictor or equal base rates across the groups.
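As an illustration only (our own sketch, not the authors' formulation or code), the three conditions can be checked empirically on a set of risk scores:

```python
import numpy as np

def check_kleinberg_conditions(scores, labels, groups, n_bins=10):
    """Empirically inspect, per group, the three quantities behind the
    conditions of Kleinberg et al. (2016).

    scores : predicted probability of the positive outcome, in [0, 1]
    labels : observed outcome (1 = positive, 0 = negative)
    groups : group membership (e.g., values of a sensitive attribute)
    """
    report = {}
    for g in np.unique(groups):
        s, y = scores[groups == g], labels[groups == g]
        # 1) Calibration within the group: among instances scored ~p,
        #    roughly a fraction p should actually be positive.
        bins = np.clip((s * n_bins).astype(int), 0, n_bins - 1)
        calibration = [(b, s[bins == b].mean(), y[bins == b].mean())
                       for b in range(n_bins) if (bins == b).any()]
        report[g] = {
            "calibration_by_bin": calibration,
            # 2) Balance for the negative class: mean score of true negatives.
            "mean_score_negatives": s[y == 0].mean(),
            # 3) Balance for the positive class: mean score of true positives.
            "mean_score_positives": s[y == 1].mean(),
        }
    return report
```

Comparing `mean_score_negatives` and `mean_score_positives` across groups, and the per-bin calibration within each group, makes the tension between the three conditions concrete on real data.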

Kearns (2017) presented various fairness metrics, highlighting the lack of consensus and conflicting definitions among them. He claimed that although regulation is essential to prevent and mitigate biased algorithms, the rules should be “endogenized” into the learning process to lead to fair machine learning algorithms. His emphasis is on the need for “in-processing” solutions.

There are also efforts to define metrics for evaluating the trade-offs between results’ accuracy and fairness (in the sense of diminishing discrimination against groups).

Cohen et al. (2022) present four theoretical models for fairness in the credit domain: fairness in price, demand, consumer surplus, and no-purchase valuation. Price fairness refers to providing a similar price for the goods (loans) across the groups being considered, which can be defined by race, sex, or other sensitive attributes (analogous to the equalized odds fairness metric). Demand fairness refers to offering the same access to loans across groups (analogous to the demographic parity fairness metric). Surplus fairness refers to having a similar difference between consumer valuation and the price paid for the loan (analogous to the equality of opportunity fairness metric). No-purchase valuation fairness refers to a similar average valuation of the loan across groups (analogous to the equalized odds fairness metric). The authors mathematically show the impossibility of scoring well on all four metrics at the same time. They also present a simulation study showing the behavior of these four metrics compared to the social welfare function described in Hu and Chen (2020). They show that enforcing no-purchase valuation fairness increases social welfare and that small increases in price fairness may increase social welfare; however, as price fairness increases further, outcomes worsen for both lenders and borrowers. Moreover, increasing demand or surplus fairness always reduces social welfare.
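One possible schematic reading of these four notions, written in our own notation rather than the authors' exact formulation, with price \(p\), consumer valuation \(v\), and groups \(a\) and \(b\):

\[
\begin{aligned}
\text{Price fairness:} &\quad \mathbb{E}[p \mid a] = \mathbb{E}[p \mid b]\\
\text{Demand fairness:} &\quad \Pr(\text{purchase} \mid a) = \Pr(\text{purchase} \mid b)\\
\text{Surplus fairness:} &\quad \mathbb{E}[v - p \mid \text{purchase}, a] = \mathbb{E}[v - p \mid \text{purchase}, b]\\
\text{No-purchase valuation fairness:} &\quad \mathbb{E}[v \mid \text{no purchase}, a] = \mathbb{E}[v \mid \text{no purchase}, b]
\end{aligned}
\]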

Lee and Floridi (2021) address racial discrimination in the USA mortgage domain as a trade-off analysis among alternative ways of implementing algorithmic decision-making. They empirically showed the impact on mortgage denial for blacks of using different machine learning (ML) techniques instead of the usual logistic regression. The non-linear character of ML techniques magnifies differences, boosting racial discrimination, as shown by Fuster et al. (2022). They also acknowledged the multitude of fairness definitions in the literature and the impossibility of satisfying all of them at the same time. Based on these two facts (the ML booster effect and conflicting fairness metrics), they propose rephrasing the problem: instead of a single evaluation, decision-makers should understand the trade-offs between algorithmic performance, across different ML techniques, and discrimination performance, according to a set of fairness metrics. This strategy fosters decision-makers' awareness as well as the transparency and auditability of the algorithms. They demonstrated their method using a dataset containing 50,000 accepted loans and 50,000 denied loans randomly drawn from 2011 HMDA data.Footnote 26 They only considered black and white borrowers, and 90.7% of the sampled applicants were white. Their study showed, for instance, that a system using the Random Forest technique had the best computational performance (AUC=78%) but would deny mortgages to almost 85% of black applicants, while the K-nearest neighbors technique provided the second-best performance (AUC=72%) but would deny mortgages to 65% of black applicants.
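A minimal sketch of this kind of trade-off reporting, assuming scikit-learn, a feature matrix `X`, a binary denial label `y`, and a boolean indicator `race_is_black` (all hypothetical names); the models and the 0.5 threshold are illustrative choices, not Lee and Floridi's exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def tradeoff_table(X, y, race_is_black):
    """For each candidate model, report predictive performance (AUC)
    alongside a simple discrimination measure (denial rate for black
    applicants), so decision-makers can inspect the trade-off instead
    of optimizing a single number."""
    X_tr, X_te, y_tr, y_te, _, g_te = train_test_split(
        X, y, race_is_black, test_size=0.3, random_state=0)
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=200),
        "k_nearest_neighbors": KNeighborsClassifier(),
    }
    rows = []
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores = model.predict_proba(X_te)[:, 1]   # predicted P(denial)
        auc = roc_auc_score(y_te, scores)
        denied = scores >= 0.5
        black_denial_rate = denied[g_te].mean()
        rows.append((name, auc, black_denial_rate))
    return rows
```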

Kozodoi et al. (2022) empirically compared different machine learning techniques under various fairness metrics to analyze the cost, in terms of profit, of increasing fairness. They used seven academic datasets to assess their claims.

4.8 Dataset issues

Problems in the dataset are one of the main sources of discrimination or, at least, of unfair computational results. Issues related to the dataset include the data source, label trustworthiness, missing data, proxies, and repair techniques. Table 12 summarizes the issues.

Table 12 Research focus on dataset issues

4.8.1 Data source issues

Technological advances create opportunities to incorporate other sources of information into loan applicants' credit scoring, such as cell phone payment history and social media data. According to Knight (2019), blacks are less likely to have a credit history. Including additional unstructured and semi-structured data sources improves the chances of increasing the approval rate of "unbanked" minorities. Knight also advocates lighter regulation of AI systems to encourage companies to innovate and to broaden the ways of gathering information to assess prospective clients without stereotyping. On the other hand, using informal sources of information may violate people's privacy. Bryant et al. (2019) proposed using variational auto-encoder technology to create synthetic instances from the original data, preserving privacy while retaining the utility of the original data.

4.8.2 Labels

Even for people with a credit history, the dataset presents the challenge of trusting the labels assigned to the records. Chakraborty et al. (2021) claim that bias and discrimination in machine learning systems come from misleading labels. They dealt with the labeling problem by proposing a pre-processing technique called Fair-SMOTE that removes records whose labels are suspected of being erroneous. Instead of removing records, Wakchaure and Sane (2018) proposed a pre-processing technique that manipulates instances considered biased according to some fairness metric. Their method comprises three steps: recognize categories and groups of instances that have been directly or indirectly discriminated against, change the labels of these instances to remove bias, and use the new pre-processed dataset to train the system. They tested their approach using two academic datasets: the Adult Census Income and German credit datasets.
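A minimal sketch of such a label-massaging pre-processing step, assuming scikit-learn, a feature matrix `X`, binary labels `y`, and a boolean `protected` indicator; it follows the generic relabeling idea rather than the specific methods of Chakraborty et al. or Wakchaure and Sane.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def massage_labels(X, y, protected):
    """Illustrative label-massaging pre-processing (hypothetical sketch):
    relabel the instances closest to the decision boundary until the
    positive-label rate is equal across protected and unprotected groups."""
    ranker = LogisticRegression(max_iter=1000).fit(X, y)
    score = ranker.predict_proba(X)[:, 1]          # estimated P(y = 1 | x)

    y_new = y.copy()
    # candidates: protected negatives most likely to deserve a positive label,
    # and unprotected positives least likely to deserve theirs
    prot_neg = np.where(protected & (y == 0))[0]
    unprot_pos = np.where(~protected & (y == 1))[0]
    prot_neg = prot_neg[np.argsort(-score[prot_neg])]
    unprot_pos = unprot_pos[np.argsort(score[unprot_pos])]

    i = 0
    while (y_new[protected].mean() < y_new[~protected].mean()
           and i < min(len(prot_neg), len(unprot_pos))):
        y_new[prot_neg[i]] = 1     # promote a likely discriminated instance
        y_new[unprot_pos[i]] = 0   # demote a likely favoured instance
        i += 1
    return y_new
```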

4.8.3 Missing data

Often the problem is not an untrustworthy label but missing labels and attribute values. Bogen et al. (2020) discuss the importance of collecting sensitive data, such as race and sex, to effectively combat discrimination. The authors surveyed the sensitive data collection practices of American organizations in different domains, including credit. There is no single legal standard; even within the credit domain, rules differ. For instance, mortgage lenders must collect sensitive data from their borrowers and make the data public: the HMDA dataset contains race, ethnicity, sex, marital status, and age, among other attributes. The same does not hold for consumer lenders regulated by the ECOA, which prohibits gathering these data except when used as monitoring information to avoid discriminatory behavior.

Kallus and Zhou (2018) pointed out that any dataset will always have some missing data and untrustworthy labels; for this reason, there will always be some degree of discrimination. They showed that even with fairness-adjusted algorithms, the "residual discrimination" caused by this intrinsic asymmetry of information enforces structural discrimination on the same groups targeted by the fairness adjustments. They represented residual unfairness as distributions of the conditional risk score across censored and target groups. Singh et al. (2022) proposed a sampling pre-processing technique in which missing data is generated in the "neighborhood" of the minority group. Their approach is tuned to create fair classifiers for the USA mortgage domain concerning sex and race, and it accounts for more than one sensitive attribute at a time. Their method was empirically tested using data from the HMDA national and state databases (2018 to 2020).

4.8.4 Proxies

In addition to missing data and untrustworthy labels, some attributes are highly correlated with the sensitive attributes: the proxies. Even within the credit domain, there is conflicting legal guidance on removing or keeping sensitive attributes. For instance, the sensitive attribute should remain when dealing with mortgages but not with micro-credit. Cofone (2018) acknowledged this conflict and claimed it is ineffective to block sensitive attributes from the training data, given the existence of many proxies for them. He defended the benefits of modifying the sensitive characteristics in the training data using pre-processing techniques to avoid discrimination. Measures to avoid algorithmic bias include (a) properly configuring training set data, (b) continuously monitoring the outcomes to detect misbehavior, and (c) regulation to enforce the accountability of the person deciding with or without a decision support system.

Kallus et al. (2022) analyzed the unfairness (equalized odds) of algorithmic decision-making in lending. They showed that many attributes, such as loan applicants' surnames and addresses, act as proxies for sex or race. They looked at a sample of 14,903 American mortgage applications from the HMDA 2011–2012 dataset, containing only black and white applicants with annual incomes of no more than $100,000, to construct models that predict race from the proxy variables. They showed that inferring race from other variables may challenge fair lending even more, leading to more mortgage denials for minorities. Their point is that removing protected class membership from the data still allows algorithms to infer it and may implicitly increase prejudice.
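A simple way to probe this proxy effect, sketched below under the assumption of a scikit-learn setting, is to train a classifier to reconstruct the sensitive attribute from the remaining features; the function name and interface are ours, not Kallus et al.'s.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def proxy_leakage(X_nonsensitive, sensitive):
    """Estimate how well a binary sensitive attribute (e.g., race) can be
    reconstructed from the remaining features. An AUC well above 0.5
    indicates strong proxies: dropping the sensitive column does not
    prevent a model from implicitly 'knowing' it."""
    X_tr, X_te, s_tr, s_te = train_test_split(
        X_nonsensitive, sensitive, test_size=0.3, random_state=0)
    clf = GradientBoostingClassifier().fit(X_tr, s_tr)
    return roc_auc_score(s_te, clf.predict_proba(X_te)[:, 1])
```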

Furthermore, Hort and Sarro (2021) claimed that techniques for removing proxies that may discriminate against minorities can also remove essential attributes, leading to a distorted reality. For example, students who turn in homework should perform better in exams; in this context, homework delivery should be considered an "anti-protected attribute". The authors used the academic Adult Census dataset to show that increasing fairness with respect to sensitive attributes prevents the discriminatory effect of anti-protected attributes, and that grid search mitigates gender bias on this dataset.

4.8.5 Dataset repair

Problems in datasets can be seen as classic database problems. Salimi et al. (2019) treated bias in datasets as a database repair problem, focusing on bias hidden by different degrees of statistical grouping (Simpson's paradox). They proposed a pre-processing technique that takes, in addition to the standard input and output attributes, a list of admissible attributes that may legitimately impact the outcome. The technique then removes or adds instances depending on whether they feed a causal pathway to the outcome that involves inadmissible attributes.

Valentim et al. (2019) studied the effect on fairness and performance metrics of applying different pre-processing techniques, such as removing sensitive attributes, encoding categorical features (integer and one-hot encoding), and removing instances. They considered statistical parity, disparate impact, and the normalized prejudice index as fairness metrics. They used two academic datasets for their experiments: Adult Income and German credit. Their results indicated that, as expected, there is no overall best pre-processing technique; the findings suggest a strong dependency on domain characteristics when trading off fairness against outcome accuracy.
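For reference, two of the fairness metrics used in such comparisons can be computed in a few lines (a sketch with toy data, not Valentim et al.'s code):

```python
import numpy as np

def statistical_parity_difference(y_pred, protected):
    """P(yhat = 1 | protected) - P(yhat = 1 | unprotected)."""
    return y_pred[protected].mean() - y_pred[~protected].mean()

def disparate_impact_ratio(y_pred, protected):
    """P(yhat = 1 | protected) / P(yhat = 1 | unprotected).
    Ratios below 0.8 are often flagged under the 'four-fifths rule'."""
    return y_pred[protected].mean() / y_pred[~protected].mean()

# Toy usage with hypothetical model outputs
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
protected = np.array([True, True, True, True, False, False, False, False])
print(statistical_parity_difference(y_pred, protected))  # 0.5
print(disparate_impact_ratio(y_pred, protected))         # 3.0
```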

4.9 Algorithmic issues

The machine learning algorithm itself may introduce bias and cause morally unacceptable discrimination. Issues related to the algorithms include the amplification effect of non-linear methods, classifiers' performance, the knowledge representation of fairness, and the challenge of learning from imperfect labels, as summarized in Table 13.

Table 13 Summary of the algorithmic issues

4.9.1 Amplification effect from non-linear techniques

Bono et al. (2021) empirically showed the outcome discrimination effect of using machine learning algorithms instead of traditional logit credit scoring techniques. Since machine learning techniques are non-linear, slight differences are boosted; this effect not only improves accuracy but also increases the gap among groups. The authors investigated this effect using a private dataset of detailed credit data on 800,000 UK borrowers. Similarly, Fuster et al. (2022) showed that machine learning techniques worsened mortgage arrangements for blacks and Latinos compared to whites, using an American dataset of 9.37 million mortgage loans from the 2009 to 2013 HMDA augmented with McDash, a private dataset from the Black Knight company. Acknowledging the fast adoption of machine learning to evaluate creditworthiness, they mathematically showed that interest rates increase as a group's predicted risk becomes more dispersed: risky borrowers are rated even riskier, while creditworthy borrowers are rated even more creditworthy. Brotcke (2022) also discussed the challenges of introducing machine learning techniques into the credit market, focusing on compliance with USA anti-discrimination laws (e.g., the FHA and the ECOA).

4.9.2 Classifiers’ performance issues

Otieno, Wabwoba, and Musumba claim that the Boolean classification into bad and good applicants leads to a significant number of loan rejections for people who have difficulty proving their assets, such as small farmers demonstrating a steady clientele. They proposed a fuzzy classifier that deals with this information imprecision. Schoeffer et al. (2021) proposed a fair ranking-based decision method that uses the relationship between legitimate features and the outcomes to compensate for the lack of such information. They tested their approach using the German credit database (an academic database) and a synthetic dataset, evaluating their fair ranking technique with meritocratic unfairness and accuracy metrics. Their examples focused on the sensitive attribute sex, and they measured the cost (in terms of accuracy) of reaching different levels of fairness.

Coenen et al. (2020) took a different approach: instead of improving the certainty of the outcome, they proposed a default-estimation method for the credit domain that identifies the scenarios in which the outcome should be "unknown" because a reliable answer cannot be generated. Their method uses unlabeled rejected instances to improve, in a semi-supervised fashion, the performance of a classifier trained on granted instances. In the credit domain, they tested it on two datasets: the Lending Club dataset, containing public data on loans issued between 2017 and 2018, and a private dataset from a European spot-factoring company with individual credit-lending invoices collected over two years.

4.9.3 Knowledge representation of fairness

Fairness and discrimination have mainly been represented as patterns inferred from data. Cai et al. (2020) represented fairness as a criterion that lenders should consider in this resource allocation problem, alongside profit maximization. They proposed an algorithm to identify applicants on the border of having their loan application approved, close to the decision threshold; lenders should consider spending some money gathering extra information about these borderline applicants, and the authors show it is worthwhile in terms of the return on lenders' investment. Elzayn et al. (2019) also rephrase the problem of algorithmic decision fairness as a resource allocation problem, including equality of opportunity as an additional criterion. They model the problem so that the cost of fairness can be measured through the solution's utility given the available resources. The resource allocation algorithm starts with an unknown distribution of candidates in each group needing resources. At each round, the algorithm allocates resources so that individuals from any group have similar probabilities of receiving them; the allocation is evaluated, and the feedback is used by the learning algorithm to adjust the allocation in the next round. This in-processing approach has received a great deal of attention because it places the algorithmic fairness problem within the well-known area of resource allocation with multiple objectives.
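A minimal sketch of the threshold-border idea, assuming scikit-learn and a default-prediction setting; the band width, model, and function name are illustrative choices, not Cai et al.'s algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def borderline_applicants(X_train, y_train, X_new, threshold=0.5, band=0.05):
    """Illustrative sketch: flag applicants whose predicted default
    probability lies within a band around the approval threshold. For these
    borderline cases, paying for extra information (e.g., alternative
    payment records) is most likely to change the decision and pay off."""
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    p_default = model.predict_proba(X_new)[:, 1]
    near_threshold = np.abs(p_default - threshold) <= band
    return np.where(near_threshold)[0], p_default
```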

Another approach toward a symbolic representation of fairness comes from the logic domain. Farnadi et al. (2018) proposed a machine learning algorithm for relational datasets that takes fairness patterns into account as first-order logic axioms (an in-processing algorithmic technique). They tested their framework (FairPSL) using synthetic data, evaluating result accuracy and fairness, and suggested that their approach can lead to decisions that are both accurate and fair.

4.9.4 Learning from imperfect labels

A lack of data labels and other uncertainties are unavoidable problems that the algorithm must deal with. Kallus and Zhou (2018) highlighted the discrimination caused by improper, but the only feasible, data collection. For instance, in the credit loan domain, loan default is only observed for approved applicants, and these observations are used to train machine learning credit systems, perpetuating discrimination. The authors showed that even with fairness-adjusted algorithms, the "residual discrimination" caused by this intrinsic information asymmetry enforces structural discrimination on the same groups targeted by the fairness adjustments. They represented residual unfairness as distributions of the conditional risk score across censored and target groups.

Frequently, pre-processing techniques are insufficient to overcome this challenge. Moreover, some methods for handling missing labels may worsen the problem. For example, Ghosh et al. (2021) have shown that using labels inferred from demographic information can even increase unfair results: when values for sensitive attributes are automatically obtained from people's photos, names, and addresses, the inference errors lead to mistakes that must be accounted for but rarely are.

4.10 Outcome issues

Last but not least, as summarized in Table 14, there are issues related to the outcomes of intelligent systems, ranging from mistaken outcomes, to practical explanations that allow users to overcome barriers in future applications, to the governance of algorithmic decision-making based on these results.

Table 14 The issues related to the outcome of machine learning systems

Lohia et al. (2019) proposed a method that prioritizes specific instances (via instance weights) whose classifier outputs should be changed. Similarly to Kamiran and Calders (2012), who select instances for which the classifier's outcome is changed, Lohia et al. propose changing the instances most likely to be affected by bias with respect to sex, race, and age. They tested their approach on academic datasets, including the German credit dataset, analyzing sex, race, and age bias separately.

Karimi et al. (2021) claimed that instead of acting solely on the computational side, humans should be more active in understanding the system's outcomes. In the case of a loan, a good explanation for rejection should include elements people can act upon to change their chances of getting a positive result in a future application.

Mendes and Mattiuzzo (2022) discussed the governance of algorithmic decision-making in the credit scoring domain in light of current Brazilian legislation. They focused on discrimination caused by statistical error, generalization, use of sensitive information, and inadequate correlation. The literature indicates transparency and accountability as essential strategies to combat algorithmic discrimination; nevertheless, because of business confidentiality, the authors do not believe transparency is feasible in the credit scoring domain. Accountability, on the other hand, can be enforced by legislation. The Brazilian general data protection law helps to move forward, but it is open to different interpretations, and its effect depends on consistent and firm action by the Data Protection Authority.

5 Conclusion and open issues

Although fair algorithms and discrimination-free decision-making have been intensely studied by current research in the data-driven financial domain, there are still many unexplored areas that deserve attention. This section addresses the most substantial ones, summarized in Fig. 10.

Fig. 10 Research opportunities

5.1 Broadening discrimination scope

Most papers addressed prejudice in loan applications against blacks (racial discrimination), Latinos (ethnic discrimination), women (sex discrimination), or general "unfair results". Few papers addressed prejudice concerning applicants' line of work, such as small farmers (Otieno et al. 2020; Pi et al. 2020), people without a credit history (Cai et al. 2020; Liu et al. 2019), and sexual orientation (Sun and Gao 2019; Dillbary and Edwards 2019). Moreover, research on prejudice against certain sexual orientations is restricted to mortgages due to the difficulty of gathering data. While gathering more information on sensitive attributes such as sexual orientation and physical disabilities may feed algorithmic prejudice, this information can also help to identify and monitor prejudice in human or computational decision-making. Another important observation from the papers is the need for regulation.

5.2 Analyze multiple sensitive attributes together

Most studies have examined discrimination by looking at a single sensitive attribute at a time. Still, as explained by Crenshaw's intersectionality theory (Crenshaw 1989), sex and race combined bring a stronger form of discrimination. Dillbary and Edwards (2019) considered the combination of race and sexual orientation and found that discrimination against black homosexual men leads to higher loan rejection rates than for white heterosexual men. On the algorithmic side, Singh et al. (2022) proposed a method that considers more than one sensitive attribute at a time. Since discrimination is usually broader than a single factor, there is a call for more research considering multiple attributes simultaneously.

5.3 Reverse redlining challenge

For a long time, there have been studies showing indirect racial discrimination considering the person’s home address (Black 1999). This fact could be crucial in banks’ decisions on loan applications. A related distortion is that banks and other financial institutions started using this information to push loan offers at a high rate disguised as a good opportunity. This reverse redlining has been studied in the USA mortgage domain (Steil et al. 2018).

Redlining and reverse redlining constitute a paradox: people are denied a mortgage due to some specific attribute value (e.g., race) but are also targeted by marketing campaigns to get a mortgage under much worse pricing conditions, i.e., a higher interest rate (Swan 2019).

Understanding redlining and reverse redlining effects concerning race is even more complicated in global south countries. Taking into account Bourdieu's theory of skin color as a symbolic asset in society (Bourdieu 2018), Prasad (2022) argues that, while in countries with a white majority race is the factor of prejudice, in global south countries, where there is strong miscegenation, skin tone and facial attributes that resemble European characteristics are the determining factors. For this reason, it is more difficult to pinpoint the problem: people with lighter skin tones within the same racial category have better chances of getting a loan.

There is a need for studies on reverse redlining in global south (developing) countries using large datasets that could shed light on the effect of a country's development and inequality on credit provision.

5.4 Consensus on fairness definition

There are many definitions of algorithmic fairness and discrimination, as presented in previous sections, and there are even computational libraries to calculate the degree of fairness according to many metrics (Bellamy et al. 2019). Nevertheless, Segal et al. (2021) proposed a fairness certification for machine learning systems, focusing on the training dataset; to maintain data confidentiality, they proposed using cryptographic techniques to verify the training dataset.

Economists and operational researchers know how to model the problem once they understand its scope and context. Computer scientists have demonstrated that they know how to implement fair algorithms and prevent discrimination, as long as they have a clear definition of the problem: what is fair? What is unacceptable discrimination? Analogously, regulators know how to write laws to restrict undesired behaviors once the problem becomes clear.

On the other hand, a consensus definition accepted by society is needed. Moreover, since culture influences society's perception, the definition will probably not be uniform worldwide. Understanding the differences in accepted definitions and their impact on decisions and systems is a challenge.

5.5 Technical limitations

A fair algorithm, for any definition, does not guarantee discrimination-free results. Society carries many inequalities that are very persistent. Sometimes, discrimination is reinforced by the laws, for example, in the case of the old USA mortgage law.

Technically, we can offer the same opportunities to people under the very same conditions. However, the number of people in those specific conditions might be very different across groups. Consequently, it is not only a matter of being fair; it involves costs for society that must be discussed.

Technical action has boundaries. Going beyond these boundaries to deal with structural discrimination may call for government affirmative action and/or laws. The American FHA and the ECOA are examples of government actions to address structural discrimination, and laws have a positive effect on combating discrimination. As Dillbary and Edwards (2019) showed in their study of mortgage data from 2005 to 2015, discrimination against homosexual mortgage applicants was much lower in USA states that had passed laws against sexual-orientation discrimination. In 2017, the USA passed a federal law prohibiting sexual discrimination in mortgage decisions, which changed the scenario.

5.6 Data privacy versus data sources’ widening

New loan applicants have been discriminated against for lacking the information that would help lenders assess the risk of approving a loan. Even when a credit history is not available, applicants may have cell phone payment records; Knight (2019) did a preliminary study using this type of information. But where does an individual's privacy stand? Who does it help: lenders or applicants? Should it be used indiscriminately? What are the consequences? Have these data been used in a real scenario? There are international laws, such as the GDPRFootnote 27 and the CCPA,Footnote 28 restricting the use of personal data without consent. What are the ethical concerns here? These are some of the questions that must be addressed before widening the data sources used to evaluate loan applicants.

5.7 Final remarks

This paper presents a comprehensive review of the literature on discrimination in the credit domain, using a systematic review method and considering five data sources (ACM Digital Library, IEEE Digital Library, Scopus, Springer Link, and Google Scholar). The review was conducted by three researchers who examined and categorized the existing literature. Out of the 1,320 research papers initially located in the data sources, 78 papers were selected.

The main threats to the validity of our SLR relate to the way we selected, extracted, and filtered the papers used in our analyses. We mitigated these threats by defining a search strategy that uses expressions, synonyms, and alternative spellings, and by using two well-known tools, PARSIF and Publish or Perish, to avoid unconsciously skipping papers.

Our analyses bring evidence of US dominance in research concerning discrimination in the credit domain. They also show that most current research is conducted via econometric analysis of existing mortgage datasets, mostly the public HMDA. Results show the existence of discrimination in many countries against women, blacks, Latinos, and male homosexuals, in the form of both loan rejection and price differentiation. Research on algorithmic discrimination has been taking a more general view to avoid unfair results. Moreover, there is no consensual definition of algorithmic fairness, and the existing metrics can lead to contradictions. Our analyses also reveal open issues and opportunities for future research.

Based on our findings, we argue there is still ample room for further research. Moreover, the fast spread of machine learning in the credit domain, allied with stricter anti-discrimination laws, makes further research even more necessary. The open issues that emerged in this study may serve as input for researchers interested in developing more powerful techniques for combating perceived unfairness, identifying discrimination, preventing algorithmic discrimination, and acting upon decision outcomes to change the scenario in a future loan application.

As AI becomes more and more pervasive in the provision of many services, it is of utmost importance to have a code of conduct that assures both the users and the service providers that discrimination, in its several guises (race, gender, age, etc.), is being avoided as much as possible.

The principles and procedures reviewed in this paper are applicable to other areas. Discrimination exists in the provision of health, education, security, utilities (such as water, electricity, and phone services), and consumption goods and services in general. In all of these areas, discrimination takes place along one of three dimensions: the intensive margin, the extensive margin, and the quality margin. Not all forms of discrimination exist in all of them; the credit market is perhaps one of the few where all three exist.

The intensive margin refers to how discrimination affects the amount of service provided: in the case of credit, the size of the loan; in the case of security, the amount of policing that takes place; and in health care, the time a doctor devotes to certain types of patients. The extensive margin is not about a marginal change but about exclusion from access: whether or not a person has access to credit, education, or health care; whether certain groups of individuals suffer more disruption of services such as water, electricity, and phone; or whether a store decides to accept a person depending on who that person is or looks like. Finally, the quality margin refers to the quality of the service being related to individual characteristics in an unreasonable way. For example, certain individuals receive lower-quality health care and education, and in the credit market, credit management is worse (or the loan conditions are worse).

The advantage of studying the credit market is that all forms of discrimination are prevalent, and all three are extremely costly to citizens: not getting the quantity needed, not getting the loan at all, or receiving worse treatment once the loan is obtained. By concentrating on the credit market, we can extrapolate and understand discrimination in other economic activities.