1 Introduction

In the current European debate on the regulation of Artificial Intelligence there is a consensus that Artificial Intelligence (AI) systems should be developed in a human-centered way and should be “trustworthy” [23, 24, 31, 99]. According to these documents, one value that constitutes “trustworthiness” is fairness. Many current publications on AI fairness predominantly focus on avoiding or fixing algorithmic discrimination of groups or individuals and on data de-biasing, offering different metrics as tools to evaluate whether groups or individuals are treated differently [8, 71, 96]. Moreover, the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) TR 24028:2020, Information technology—Artificial Intelligence—Overview of trustworthiness in Artificial Intelligence lists fairness as an essential part of ensuring trustworthiness in AI (ISO/IEC 2020). However, the multitude of existing indicators allowing the labeling of an AI system as “(un)fair” and the lack of standardized, application-field-specific criteria to choose among the various fairness-evaluation methods make it difficult for potential auditors to arrive at a final, consistent judgment [24, 96, 98]. The increasing need for standardized methods to assess the potential risks of AI systems is also highlighted by the draft for an “Artificial Intelligence Act” suggested by the European Commission in April 2021, which, in accordance with the so-called “New Legislative Framework,” ascribes a major role to “Standards, conformity assessment, certificates [and] registration” (Chapter 5).

Focusing on a concrete use case in the application field of finance, the main goal of this paper is to define standardizable minimal ethical requirements for AI fairness evaluation. In Sect. 2, we explore different understandings of fairness from three perspectives and address the different vantage points of many stakeholders involved in the development, commercialization, and use of AI systems. In Sect. 3, we discuss the example of a risk scoring machine learning (ML) model for small personal loans. As a main contribution of the paper, we suggest ethical minimal requirements that should be complied with when evaluating fairness and highlight a preferred fairness metric for fairness-evaluation purposes in this specific application field. In Sect. 4, we investigate how to translate our research findings into standardization criteria to be used when assessing ML credit scoring systems for small personal loans.

2 Defining fairness

2.1 AI ethics

In the current AI ethics discussions, fairness is generally framed in accounts of distributive justice and is broadly referred to as unbiased distribution of access to services and goods—e.g., access to treatments in healthcare [39, 81] or access to credit [65]—and as absence of discrimination, understood as unjustified unequal treatment of groups or individuals [72, 76].Footnote 1

Concerning distributive justice and non-discrimination as equal treatment, one of the primary contemporary philosophical references is the Rawlsian idea of equality of opportunities. This idea requires that citizens having the same talents and being equally motivated should receive the same educational and economic opportunities regardless of their wealth or social status [83] (p. 44). Since, in social practice, basic rights and liberties are not accessible or enjoyable in the same way by all citizens, society should take adequate measures to enable all citizens to enjoy their rights and liberties. Rawls builds on this intuition, stating that “the worth of liberty to persons and groups depends upon their capacity to advance their ends within the framework the system defines. […] Some have greater authority and wealth, and therefore greater means to achieve their aims” [82] (p. 179). Consequently, to avoid exacerbating the unequal enjoyment of basic rights and liberties, a fair society must enact compensation mechanisms to maximize their worth to the least advantaged [82] (ibid.). It is essential to avoid the development of vicious circles of (un)privilege polarization in society because of the moral harm they produce: removing the opportunities of the less privileged to truly benefit from their rights and liberties results in a harmful form of negative discrimination that amplifies economic inequalities and undermines the chances for the less advantaged to live an autonomous life and set self-determined goals. This is a form of disrespect toward their personhood [29, 68] and can amplify social resentment.

The philosophical debate about Rawls’ theory and other forms of “egalitarianism” could help clarify current emerging issues concerning algorithmic fairness [11]. Egalitarianism in this sense means that “human beings are in some fundamental sense equal and that efforts should be made to avoid and correct certain forms of inequality” [11] (p. 2). Many approaches try to determine the kind of equality that should be sought and which inequalities should be avoided in civil society to uphold the fundamental equality of human beings: among others, equality of preference-satisfaction [19], equality of welfare, income, and assets [28], and equality of capabilities to achieve one’s goals [87]. However, when applying these views to algorithmic decisions, defending an equal opportunity approach rather than an equal outcome approach is not always the most effective solution. While, for candidate selection or the calculation of insurance, focusing on equal opportunity might increase “economic justice,” in other contexts, such as airport security checks, equality of outcome in the form of “parity of impact” could help establish a sense of social solidarity by avoiding the over-examination of certain groups [11] (p. 7). Thus, the choice of a specific approach to evaluate (in)equality depends on the specific application context. As Balayn and Gürses claim, the regulation of AI must go “beyond de-biasing” [6]. Mere data-based or outcome-based solutions that try to solve local distributive issues of a system, such as addressing a racial bias in image recognition software by enlarging the data basis with pictures of people of all ethnic backgrounds, are not sufficient on their own to address structural inequality issues at the root [60].Footnote 2 If decision-making processes that influence people’s access to opportunities are biased, the intersection between algorithmic fairness and structural (in)justice requires investigation [51].

In addition, other fairness issues can occur that are related not directly to the algorithmic outcome but rather to the entire system design and application processes, as well as to their consequences for society. These aspects of fairness also overlap with other human rights and societal values. For instance, a structural fairness issue of many ML systems is the phenomenon of “digital labor” [33], referring (among other things) to the precarious work conditions and the very low pay of many click workers generating training data for ML systems. In addition, the commodification of privacy associated with the use of many digital services raises fairness issues, since users are often kept unaware of the exact use of their data and are not always in a position to defend their right to privacy [74, 91, 95]—leaving them as a disadvantaged stakeholder compared with the service providers. Finally, addressing the sustainability concerns that emerged during the so-called “third wave” of AI ethics, global and intergenerational justice can be highlighted as fairness issues [37, 93]. Considering intergenerational justice means adding a temporal, anticipatory dimension to our understanding of fairness and extending the claim for the equity of human living conditions—as, for instance, expressed in the UN’s Sustainable Development Goals (SDG)—to future generations rather than limiting it to present ones. In practice, fairness toward future generations means acting sustainably.

These considerations lead us to the following preliminary understanding of fairness in the context of an AI ethics assessment. First, focusing on the unbiased distribution of access to services and goods and on the absence of discrimination of groups or individuals, fairness means the equal treatment of people regardless of their sex, race, color, ethnic or social origin, genetic features, language, religion or belief, political or any other opinion, membership of a national minority, property, birth, disability, age or sexual orientation,Footnote 3 when it comes to granting or denying access to products, services, benefits, professional or educational opportunities, and medical treatments based on an automated evaluation and classification of individuals or groups. In addition, a fair system should not involve work exploitation or the violation of human rights of any of the involved stakeholders during its life cycle. Moreover, the real-world application of the system should not create or amplify power imbalances between stakeholders, nor place specific stakeholder groups in a disadvantaged position.

2.2 Data science

Fairness is discussed in the context of data and data-driven systems whose inherent patterns or statistical biasesFootnote 4 can be interpreted as “unfair.” Here, it is important to emphasize that the evaluation of whether certain patterns are “fair” or “unfair” transcends the specific expertise of data scientists and requires further legal, philosophical, political, and socioeconomic considerations. What is being explored in data science under the term “fairness” are quantitative concepts to identify patterns or biases in data, in addition to technical methods to mitigate them.

Data analysis and data-based modeling of real-world relationships have progressed in recent years especially through Machine Learning (ML). ML is a subdiscipline of AI research in which statistical models are fitted to so-called training data, recognize patterns and correlations in this data, and generate predictions for new (input) data on this basis. ML methods have become a particular focus of fairness research, as they provide everyday applications using personal data, e.g., employment decisions, credit scoring, and facial recognition [71]. Furthermore, they pose the challenge that bias within the training data might lead to biased model results.

2.2.1 Short introduction to machine learning

ML-based applications have enabled technological progress which can particularly be attributed to the fact that their functionality is based on learning patterns from data. By this means, ML methods provide approaches to solving tasks that could not be effectively addressed by “traditional” software fully specified by human rules. In particular, deep neural networks, a type of ML method involving vast amounts of data, have significantly advanced areas, such as image [26] and speech recognition [13], in addition to predicting complex issues, for instance medical diagnostics [102] and predictive maintenance [17].

ML methods are designed to learn from data to improve performance on a given task [41]. A task can be viewed as finding a mapping that, for an input x, assigns an output y which is useful for a defined purpose. One ML task that is particularly relevant for fairness is classification. The purpose of classification is to identify to which of a set of categories a given input belongs, for instance, whether a person is creditworthy or not. ML is about finding a model f that solves a task effectively via y = f(x). To achieve this, a learning algorithm adjusts parameters within the model. The fitness of the model for the given task can be evaluated using quantitative measures. Such quantitative indicators of model or data properties are generally referred to as “metrics.” For example, a typical performance metric for classification tasks is precision, which measures the proportion of the model’s assignments to a certain category that were correct.
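As a minimal illustration of such a metric, the following sketch computes precision for a hypothetical binary creditworthiness classifier; the labels and predictions are invented and serve demonstration purposes only.

```python
# Illustrative only: ground-truth labels and model outputs are invented.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = creditworthy, 0 = not creditworthy
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # outputs of a hypothetical classifier f

true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
predicted_positives = sum(y_pred)

# Precision: the proportion of the model's positive classifications that were correct.
precision = true_positives / predicted_positives
print(f"precision = {precision:.2f}")  # 0.75 for these invented values
```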

The data which ML methods use to build a model, called “training data,” is a collection of input examplesFootnote 5 that the model is expected to handle as part of the task. A single example in the data is called a datapoint. For classification tasks, a datapoint in the training data of a ML model contains, in addition to the example x, a “ground-truth” label that specifies how the ML model should process the respective input x. Following the example of creditworthiness classification, the training data for a ML model addressing this task may be drawn from previous credit applications, and the individual datapoints could include features, such as income, age, or category of work activity (e.g., self-employed, employed). Moreover, each datapoint should also contain a ground-truth label that could be derived from manual processes or, if possible, from observing whether the loans in the given examples were repaid in full.
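To make this concrete, one possible structure for such a datapoint is sketched below; the feature names, values, and label encoding are assumptions chosen for illustration, not attributes prescribed here.

```python
# Hypothetical training datapoint for the creditworthiness example.
datapoint = {
    "features": {
        "income": 2800,            # assumed: net monthly income in EUR
        "age": 34,
        "employment": "employed",  # e.g., self-employed, employed, unemployed
    },
    "label": 1,                    # ground truth: 1 = loan repaid in full, 0 = not repaid
}
```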

When building a ML model, the training data is used to adjust the internal model parameters which determine the mapping through f. For instance, in a neural network, the weights assigned to the network’s edges are adjusted by the learning algorithm during model building, a phase which is also called “training.” Overall, ML is an optimization procedure that finds internal model parameters such that they optimize a defined performance metric on a training dataset. In this case, the performance metric specified as optimization objective is referred to as the “loss function.” For example, in a classification task, a quantitative measure of the distance between ground truth and model output could be used as a loss function. Consequently, training generates a model that optimally approximates the relationships between x and y provided in the training data.
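The following toy sketch illustrates this optimization idea on the smallest possible scale: a single model parameter is adjusted by gradient descent to reduce a mean-squared-error loss. The data and the one-parameter model are invented and stand in for a real training procedure.

```python
# Toy example (not a real credit model): a learning algorithm adjusts a single
# internal parameter w of the model f(x) = w * x so that the mean squared error
# (the loss function) on the training data decreases.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]  # invented ground-truth values, roughly y = 2x

w = 0.0                    # internal model parameter before training
learning_rate = 0.01
for step in range(200):
    # Gradient of the mean squared error with respect to w.
    gradient = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * gradient  # parameter update toward lower loss

print(f"learned parameter w = {w:.2f}")  # close to 2 for these invented values
```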

Underlying the ML approach of fitting a model to training data is the idea that the model infers patterns which help produce valuable outputs when applied to new data. The term “generalizability” is used to describe the aim that the model performs well on data not seen during training. Thus, for model evaluation, an additional test dataset different from the training data is used. Given training and test data, according to Goodfellow et al., model quality is indicated by two quantities: (i) the training error measured by the loss function, and (ii) the difference between training and test error [41]. A model with a large training error is said to be “underfitting,” while one with a low training error but a large difference between training and test error is “overfitting” the training data.
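The sketch below illustrates how these two quantities can be read off in practice; it uses a synthetic regression problem rather than credit data, and the polynomial degrees are arbitrary choices representing models of increasing flexibility.

```python
# Synthetic illustration (not credit data): models of different flexibility are fit
# to noisy training data; training vs. test error indicates under- or overfitting.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)
x_train, y_train = x[::2], y[::2]  # every other point for training
x_test, y_test = x[1::2], y[1::2]  # the remaining points for testing

for degree in (1, 4, 15):          # too rigid / reasonable / too flexible
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    train_err = np.mean((model(x_train) - y_train) ** 2)
    test_err = np.mean((model(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train error {train_err:.3f}, "
          f"test error {test_err:.3f}, gap {test_err - train_err:.3f}")
```

A large training error indicates underfitting, while a low training error combined with a much larger test error indicates overfitting of the training data.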

2.2.2 Meaning and challenges of “fairness”

Data has a crucial impact on the quality of a ML model. In computer science, data quality has already been researched for “classical” information systems, where it is considered especially regarding large amounts of stored operational and warehousing data, e.g., a company’s client database. Numerous criteria for data quality have been proposed, which can be mapped within four dimensions: “completeness, unambiguity, meaningfulness, and correctness” [100]. Only recently has the operationalization of data quality specifically for ML been explored [46]. The issue of data completeness, relating to the training and test data sufficiently capturing the application domain, is particularly relevant in this context. There is a high risk that an ML model performs poorly on data that is either not included or only sparsely represented in its training set. To prevent strong declines in a model’s productive performance, measures are being researched for dealing with missing or underrepresented inputs [46] as well as for detecting distribution skews, e.g., between training and production data [12].

In certain tasks and application contexts, individuals are affected by the outputs of a ML model. For example, ML models are being used to support recruiting processes, decisions on loan approval, and facial recognition [71]. Consequently, it is essential that the model performs equally well for all individuals. The research direction in data science that addresses related issues from a technical perspective is referred to under the term “fairness.” Clearly, the motivation for “fairness” in data or ML models does not derive from a technical perspective, nor does data science as a scientific discipline provide a sufficient basis for evaluating under which circumstances data or models should be classified as “fair.” The approaches and methods researched in this area are, in themselves, use-case neutral, as they can equally be applied to structurally similar scenarios that do not involve individuals.

Regarding model quality, the data are a central object of study in fairness from two perspectives. First, aspects of data quality should not differ regarding particular groups of people. Regarding the dimension of completeness, for instance, certain population groups could be underrepresented in the training data resulting in a lower performance of the ML model with respect to these groups [14]. Another example is that the ML model might infer biased patterns from the training data if their representativeness is compromised by non-random sampling, for example, if predominantly positive examples are selected from one population group but negative examples are selected from another. Second, even if data are of high quality from a technical perspective, they may (correctly) reflect patterns that one would like to prevent from being reproduced by the ML model trained on it. For instance, data might capture systemic bias rooted in (institutional) procedures or practices favoring or disadvantaging certain social groups [86]. The technical challenge that arises here is fitting a model to the training data but simultaneously preventing inferring certain undesirable patterns that are present. Overall, proceeding from the variety of biasesFootnote 6 identified to date, both measures that “detect” and measures that “correct” (potentially unfair) patterns in datasets and models are being explored [48] (p. 1175).

2.2.3 Measures that “detect”

Aiming to “detect,” one research direction is concerned with developing technical approaches to disclose and quantify biases in the first place. Numerous “fairness metrics” have been presented [96], particularly in light of providing statistical evidence for unequal treatment in classification tasks. Corresponding to the approach of comparing groups to identify bias, so-called “group fairness metrics” constitute a large part of the fairness metrics presented to date. These metrics compare statistical quantities regarding groups defined on the basis of certain attributes in a dataset (e.g., a group could be defined by means of age, gender, or location if these attributes are provided in the data). Among the group fairness metrics, one can further distinguish between two types: i) metrics which compare the distribution of outputs, and ii) metrics which compare the correctness of the outputs with respect to different groups. An example of the first type is to measure the discrepancy to which a certain output is distributed by percentage among two different groups. This quantification approach is called “statistical parity,” and Sect. 3.3 provides a detailed elaboration. The second type of metric focuses on model quality and compares performance-related aspects with respect to different groups (e.g., specific error rates or calibration). For instance, the metric “equal opportunity” [96] calculates the difference between the true-positive rates of a model on the respective data subsets representing two different groups. Such metrics can highlight model weaknesses by providing insight on where the model quality may be inconsistent.
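The following sketch computes one metric of each type on an invented toy dataset (group labels, ground truth, and predictions are all assumptions for illustration); it also shows that the two types can diverge on the same data.

```python
# Toy illustration of the two metric types. Each record is (group, ground_truth, prediction).
records = [
    ("A", 1, 1), ("A", 1, 0), ("A", 0, 0), ("A", 0, 1),
    ("B", 1, 1), ("B", 1, 1), ("B", 0, 0), ("B", 0, 0),
]

def positive_rate(group):
    preds = [p for g, _, p in records if g == group]
    return sum(preds) / len(preds)

def true_positive_rate(group):
    preds = [p for g, t, p in records if g == group and t == 1]
    return sum(preds) / len(preds)

# Type i): compare the distribution of (positive) outputs across groups.
statistical_parity_difference = abs(positive_rate("A") - positive_rate("B"))
# Type ii): compare the correctness of outputs, here via true-positive rates
# ("equal opportunity" difference).
equal_opportunity_difference = abs(true_positive_rate("A") - true_positive_rate("B"))

print(statistical_parity_difference, equal_opportunity_difference)  # 0.0 and 0.5 here
```

With these invented records, both groups receive positive outputs at the same rate, yet the model is correct for them to very different degrees, which is exactly the distinction between the two metric types.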

Besides group fairness metrics, further measures have been developed to disclose biases. Two examples are “individual fairness” [27] and “counterfactual fairness” [63]. “Individual fairness” is based on comparing individuals. To this end, a distance metric is defined that quantifies the similarity between two datapoints. The underlying idea of this approach is that similar model outputs should be generated for similar individuals. In addition, measurable indicators for an entire dataset have been derived using such a distance metric, for example, “consistency” [104]. Similarly, inequality indices from economics, such as the generalized entropy index, have also been proposed as bias indicators for datasets [89]; these require a definition of individual preferences. “Counterfactual fairness” considers individual datapoints, similar to “individual fairness”; however, it examines the effect of changing certain attribute values on model outputs. This can be used to uncover whether the model would have generated a different output for an individual if they had a different gender, age, or ethnicity, for example. Many of the presented bias quantification and detection approaches have been implemented in (partially) open-source packages and tools [3, 9, 40] and are likewise applicable to input–output-mappings not based on ML.
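A toy probe in the spirit of “counterfactual fairness” is sketched below; the decision rule is an invented stand-in for a trained model, and the applicant data is fictitious.

```python
# Toy counterfactual probe: flip the protected attribute and compare the outputs.
def toy_model(applicant):
    # invented scoring rule standing in for a trained ML model
    score = 0.004 * applicant["income"] - (5 if applicant["gender"] == "f" else 0)
    return 1 if score >= 10 else 0  # 1 = credit granted

applicant = {"income": 3000, "gender": "f"}
counterfactual = {**applicant, "gender": "m"}  # identical except for the protected attribute

if toy_model(applicant) != toy_model(counterfactual):
    print("Output depends on the protected attribute: counterfactual fairness is violated.")
```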

Different fairness metrics might pursue different target states, e.g., balanced output rates between groups (statistical parity) versus balanced error rates (equal opportunity). Therefore, they also differ greatly in their potential conflict with other performance goals. For instance, consider a dataset in which Group A contains 30% positive ground-truth labels and Group B contains 60%. If the model is to reach a low value for a fairness metric that measures the discrepancy in the distribution of positive labels across groups, its outputs must deviate from ground-truth. In addition to sacrificing accuracy, this could also result in unbalanced error rates. Thus, depending on the nature of the data, fairness metrics may be mutually exclusive [7].

2.2.4 Measures that “correct”

Another direction of research is striving to develop technical measures which can “correct” or mitigate detected bias. To this end, approaches along the different development stages of ML models are being explored [35]. The underlying technical issue, especially when facing systemic or historical bias, is to train a model that performs well on a given task by inferring correlations in the data—but to simultaneously prevent the learning of certain undesirable patterns that are present in the data. An important starting point for addressing this apparent contradiction is the data itself. A basic pre-processing method that has been proposed is “Fairness through Unawareness,” meaning that those (protected) attributes are removed from the dataset for which correlation with model output values is to be avoided [63], or whose inclusion could be perceived as “procedurally unfair” [44]. However, this method alone is not recognized as sufficiently effective, as correlated “proxies” might still be contained in the data [61], and many mitigation methods actively incorporate the protected attributes to factor out bias [63]. Further examples of pre-processing methods range from targeted reweighing, duplication, or deletion of datapoints to modifying the ground truth [57] or creating an entirely new (synthetic) data representation [104]. The latter are usually based on an optimization in which the original datapoints are represented as a debiased combination of prototypes. While these methods primarily aim at equalizing ground-truth values among different groups, some optimization approaches for generating data representations also include aspects of individual fairness [64]. Furthermore, to mitigate representation or sampling bias, over-sampling measures to counteract class imbalance [16] are being researched [15]. In addition to algorithmic methods, documentation guidelines have been developed to support adherence to good standards, e.g., in data selection [36].
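As an example of one pre-processing idea mentioned above, the sketch below computes reweighing weights so that group membership and ground-truth label become statistically independent in the weighted data; the (group, label) pairs are invented, and the weighting scheme shown is a common textbook formulation rather than the specific method of any single cited work.

```python
# Toy reweighing sketch on invented (group, label) pairs.
from collections import Counter

data = [("f", 1), ("f", 0), ("f", 0), ("m", 1), ("m", 1), ("m", 0)]
n = len(data)
group_counts = Counter(g for g, _ in data)
label_counts = Counter(y for _, y in data)
joint_counts = Counter(data)

# Weight = expected joint probability under independence / observed joint probability.
weights = {
    (g, y): (group_counts[g] / n) * (label_counts[y] / n) / (joint_counts[(g, y)] / n)
    for (g, y) in joint_counts
}
print(weights)  # under-represented (group, label) combinations receive weights > 1
```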

While the data centrally influences the model results and pre-processing methods offer the advantage that they can typically be selected independently of the model to be trained, research is also being conducted on so-called in- and post-processing measures. In-processing measures are those that intervene in modeling. This can be realized, for example, by supplementing the loss function with a regularization term that reduces the correlation between model output and certain attributes [59], or using optimization constraints to align certain error rates among different groups [103]. Another in-processing approach, which affects the entire model architecture, is to include an adversarial network in the training that attempts to draw an inference about protected attributes from the model outputs [105]. The model and its adversary are trained simultaneously, where the optimization goal of the original model is to keep the performance of the adversary as low as possible. In contrast, post-processing refers to those measures that are applied to fully trained models. For example, corresponding methods comprise calibration of outputs [78] and targeted threshold setting (with thresholds per group, if applicable) to equalize error rates [49]. Many of the in- and post-processing measures are researched primarily for classification tasks, and the methods developed are typically tailored to one type of model for technical reasons.
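To illustrate the post-processing idea of group-specific thresholds mentioned above, the following simplified sketch tunes a decision cut-off per group so that both groups receive positive decisions at the same target rate; the scores and the target rate are invented.

```python
# Simplified per-group threshold setting on invented model scores.
scores = {"A": [0.9, 0.7, 0.55, 0.3], "B": [0.8, 0.5, 0.4, 0.2]}
target_rate = 0.5  # desired share of positive decisions in each group

thresholds = {}
for group, group_scores in scores.items():
    ranked = sorted(group_scores, reverse=True)
    k = round(target_rate * len(ranked))           # number of positive decisions to allow
    thresholds[group] = ranked[k - 1] if k > 0 else float("inf")

print(thresholds)  # the group with lower scores receives a lower cut-off
```

In practice such thresholds would typically be chosen to equalize error rates, as in the cited work; the positive-rate variant is used here only to keep the sketch short.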

2.2.5 Outlook

In the fairness research field, a variety of approaches have been developed to better understand and control bias. Beyond these achievements, open research questions remain from both a technical and an interdisciplinary perspective. Regarding the former, many metrics, in addition to the methods that work toward their fulfillment, are applicable to specific tasks only and impose strong assumptions. Here, one challenge is to adapt specific measures from one use case to another. Regarding the latter, a central issue is that different fairness metrics pursue different target states for an ML model (see Sect. 2.2.3); therefore, a choice must be made when assessing fairness in practice. Furthermore, the concrete configuration of specific metrics, for example, how a meaningful similarity metric for assessing “individual fairness” should be defined, remains unresolved. The question of which fairness metrics and measures are desirable or useful in practice must be addressed in interdisciplinary discussion. A concrete example is provided in Sect. 3.3.

2.3 Management science

Management literature [4, 21, 38, 77, 88] divides fairness into four types: distributive, procedural, interpersonal, and informational fairness.

Distributive fairness refers to the evaluation of the outcome of an allocation decision [34]. Equity is inherent to distributive fairness [79]. Hence, to achieve distributive fairness, participants must be convinced that the expected value created by the organization is proportionate to their contributions [4]. Procedural fairness refers to the process of decision right allocation (i.e., how do the parties arrive at a decision outcome? [62]). To achieve procedural fairness, a fair assignment of decision rights is required [4, 70]. The decision rights must ensure fair procedures and processes for future decisions that influence value creation [4, 70].

Interpersonal and informational fairness refer to interactional justice, which is defined by the interpersonal treatment that people experience in decision-making processes [10]. Interpersonal fairness reflects the degree of respect and integrity shown by authority figures in the execution of processes. Informational fairness is specified by the level of truthfulness and justification during the processes [20, 43].

To ensure a differentiated assessment of AI systems from a socioeconomic perspective, these four dimensions should be included in the evaluation of fairness. In particular, procedural and distributive fairness should be emphasized, as the credit scoring assessment concentrates primarily on the credit-granting decision process. Table 1 provides an overview of fairness measurement scales in management science based on Colquitt and Rodell’s [22] and Poppo and Zhou’s [79] work.

Table 1 Procedural and distributive fairness measurement scales [22, 79]

Within the data science perspective, inequality indices such as the generalized entropy (GE) index or the Gini coefficient are widely accepted [25]. Both measures aim to evaluate income inequality from an economic perspective. However, they differ in their meaning, with the GE index providing more detailed insights by capturing the impact of inequality across different parts of the income spectrum. The GE index is also used in interdisciplinary contexts. For example, in computer science, it is used to measure redundancy in data, which serves to assess the disparity within the data. In addition to the economic level, approaches to fairness measurement also exist at the corporate level.
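For illustration, the sketch below computes both indices for an invented income vector: the GE index with parameter α = 1 (the Theil index) and the Gini coefficient.

```python
# Two standard inequality indices computed on invented incomes.
import math

incomes = [18_000, 25_000, 32_000, 48_000, 120_000]
mean = sum(incomes) / len(incomes)

# Generalized entropy index with alpha = 1 (Theil index).
theil = sum((x / mean) * math.log(x / mean) for x in incomes) / len(incomes)
# Gini coefficient via the mean absolute difference.
gini = sum(abs(a - b) for a in incomes for b in incomes) / (2 * len(incomes) ** 2 * mean)

print(f"Theil (GE, alpha = 1): {theil:.3f}, Gini: {gini:.3f}")
```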

However, the operationalizability of the fairness types described, especially procedural and distributive fairness, remains unclear. This paper aims to address this issue. A typical instrument in management practice is price discrimination, which aims to exploit the market potential. Banks’ business model is to spread risks and to price them according to their default risk in order to achieve the best possible return on investment. For example, banks price the default risk of loans differently and derive differentiated prices. Here, procedural fairness is crucial in the overall fairness assessment, since procedurally unfair price settings lead to higher overall price unfairness [32]. Ferguson, Ellen, and Bearden highlight that random pricing is assessed as more unfair than possible cost-plus pricing (price is the sum of product costs and a profit margin) within the procedural fairness assessment [32]. Furthermore, they provide evidence that procedural and distributive fairness interact positively and thus, if implemented accordingly, can maximize the overall fairness. As described in Table 1, the presented six procedural components and three distributive components should be considered in the pricing process to achieve strong overall fairness. From an organizational perspective, financial institutions should ensure that their credit ratings are neutral and unbiased and based on accurate information [30]. Regarding erroneous data, customers should be able to review and correct the data if necessary. Pricing should be consistent to avoid the impression of random pricing. Therefore, people with the same attribute characteristics should always receive the exact same credit pricing. Furthermore, the possible use of algorithms should not disadvantage certain marginalized groups. Moreover, to maximize the overall fairness, banks should include the distributive components in their fairness assessment. Haws and Bearden emphasize that customers perceive high prices as unfair and vice versa [50]. Thus, Ferguson, Ellen, and Bearden argue that distributive fairness is given when customers receive an advantageous price [32]. Consequently, when pricing loans, banks should always adhere to market conditions to achieve a maximum overall fairness.

2.4 Summary

Fairness can be generally considered as the absence of unjustified unequal treatment. This broad understanding takes on different specific connotations that require consideration when evaluating AI systems. In our interdisciplinary overview, two main aspects of fairness were highlighted. Distributive fairness is one of these. It concerns how automated predictions impacting the access of individuals to products, services, benefits, and other opportunities are allocated. This algorithmic outcome can be analyzed through statistical tools to detect a possible unequal distribution of certain predictions and to assess whether this is justified by the individual features of group members, or whether it is due to biases or other factors. Considering procedural fairness is also fundamental for the evaluation of AI systems. To this end, it should be considered how a decision is reached for different stakeholder groups and how members of these groups are treated in the different stages of the product life cycle.

3 Evaluating fairness

3.1 The use case: creditworthiness assessment scoring for small personal loans

Based on the empirical evidence of the perpetuation of preexisting discriminatory bias and of the discrimination risks for specific demographic groups, recent literature on credit scoring algorithms has investigated gender-related [96] and race-related [65] fairness issues of ML systems, looking for suitable tools to detect and correct discriminatory biases and unfair prediction outcomes. Here we consider the case of small personal loans. These are small-volume credits to finance, for instance, the purchase of a vehicle or pieces of furniture, or to cover expenses such as a wedding or a holiday. They typically range from 1.000 EUR to 80.000 EUR—in some cases they can be up to 100.000€Footnote 7—and are granted without a comprehensive check-up by the credit institute.

During the credit application process, as a preliminary step, a bank requests customer information, such as address, income, employment status, and living situation, which it feeds into its own (simple) credit scoring algorithm.Footnote 8 As opposed to the application process for higher volume credits, extensive information on the overall assets and wealth of the applicant is not required. Regarding particularly small lending, account statements might not even be necessary.Footnote 9 In some cases, the authorization to conduct a solvency check through a credit check agency might be requested. If so, the credit check agency will process additional information concerning, among other things, the credit history of the applicant and other personal information to produce a credit rating.Footnote 10 Finally, based on the creditworthiness assessment, a bank clerk decides whether the small personal loan is granted. In some instances, the rates might be raised in order to compensate the credit institutes for the potential illiquidity of individual customers.Footnote 11

In the European framework, guidelines to improve institutions’ practices in relation to the use of automated models for credit-granting purposes have been produced [2, 5, 30]. In the report Guidelines on loan origination and monitoring, the European Banking Authority (EBA) recommends that credit institutions should “understand the quality of data and inputs to the model and detect and prevent bias in the credit decision-making process, ensuring that appropriate safeguards are in place to provide confidentiality, integrity and availability of information and systems” (53.e, see also 54.a and 55.a), take “measures to ensure the traceability, auditability, and robustness and resilience of the inputs and outputs” (54.b, see also 53.c), and have in place “internal policies and procedures ensuring that the quality of the model output is regularly assessed, using measures appropriate to the model’s use, including backtesting the performance of the model” (54.c, see also 53.f and 55.b) [30]. In the white paper Big data and artificial intelligence, the German Federal Financial Supervisory Authority (BaFin) also recommends principles for the use of algorithms in decision-making processes. These include: preventing bias; ruling out types of differentiation that are prohibited by law; compliance with data protection requirements; ensuring accurate, robust, and reproducible results; producing documentation to ensure clarity for both internal and external parties; using relevant data for calibration and validation purposes; putting the human in the loop; and having ongoing validation, overall evaluation, and appropriate adjustments [5].

The present work follows these recommendations and contributes to the regulatory discussion by highlighting use case-specific operationalizable requirements that address the issues emphasized by European financial institutions. We focus specifically on small volume credit for two main reasons. First, the pool of potential applicants is significantly larger than the one for higher volume credits such as mortgage lending. While high volume credits usually require the borrower to pledge one or more assets as collateral and to be able to make a down payment covering a portion of the total purchase price of an expensive good, these conditions do not apply to small personal loans, making them accessible also for citizens without substantial savings or other assets. Since the overall personal wealth should not influence the decision outcome in small personal loans, this makes it a particularly interesting scenario to evaluate potential discrimination of individuals belonging to disadvantaged groups that are not eligible for higher volume credits but could be granted a small personal loan. Second, the amount of applicant information processed for creditworthiness assessment is significantly lower than in the case of higher volume credits, allowing a clearer analysis of the relevant parameters and their interplay.

3.2 Preliminary ethical analysis

Regarding credit access, structural injustice severely afflicts women and demographic minorities. Although in contemporary liberal democracies explicitly preventing credit access based on gender, race, disability or religion is illegal because it represents a violation of basic human rights,Footnote 12 for structural reasons, many individuals belonging to disadvantaged groups still struggle to access credit. The gender pay gap is a concrete example: a woman working full-time in the same position as a male colleague might earn less [42], and be less creditworthy from the bank’s perspective.

Since ethics is supposed to contribute to shaping a fairer society, the definition of minimal ethical requirements for a fair ML system in a specific application field should consider whether and how technologies can help prevent unfairness, and should aim at assisting disadvantaged groups. Considering our understanding of fairness as the absence of unjustified unequal treatment of individuals or groups, the first question leading to a definition of minimal ethical requirements is the following: Are there individuals belonging to certain groups that are not granted loans although they share the same relevant parameters with successful applicants belonging to other groups? This should not be mistaken for the claim that every group should have the same share of members being granted a loan (group parity), since it is not in the interest of the person applying for a loan to be granted one if they are unable to pay it back. This would also be unethical because it would further compromise the financial stability and creditworthiness of the person, causing legal trouble and moral harm. To answer this question, fairness metrics can be a useful tool to detect disparities among groups.

3.2.1 Which metric(s) to choose?

The choice of one metric in particular is not value-neutral. Several factors should be considered when investigating fairness metrics. Among others, there is an ethical multi-stakeholder consideration to be performed [47]. Certain metrics can better accommodate businesses’ needs and goals, while others will better safeguard the rights of those being ranked or scored by a software. For instance, when evaluating the accuracy of a credit scoring system across protected groups, financial institutes will be primarily interested in minimizing, for all groups, the number of loans granted to people who will not repay the debt (the false positive rate). However, it is in the interest of solvent credit applicants to minimize, for all groups, the number of credit applicants to whom credit is denied although they could have repaid the debt (the false negative rate). Therefore, if asked to choose a metric to evaluate fairness, the former could opt for the predictive equality fairness metric, which compares across groups the probability of a subject in the negative class receiving a positive prediction [96]. Someone representing the latter could rather choose the equal opportunity metric, which compares across groups the probability of a subject in the positive class receiving a negative prediction [96]. Therefore, the choice of a specific metric is not value-neutral, since it could better serve the interests of certain stakeholder groups. Among other things, the role of AI ethics and AI regulation should be to prevent a minority of advantaged stakeholders from receiving the majority of advantages at the expense of those who are less advantaged. In this specific use case, this goal should be reached by considering the equal treatment of all applicants as the actual chance of getting a loan when the financial requirements are met, irrespective of the applicant’s demographic group.
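The sketch below makes the two stakeholder perspectives concrete on an invented toy dataset: predictive equality compares false-positive rates, while equal opportunity compares false-negative rates across groups; all records are fictitious.

```python
# Each record is (group, would_have_repaid, loan_granted); values are invented.
records = [
    ("f", 1, 0), ("f", 1, 1), ("f", 0, 0), ("f", 0, 1),
    ("m", 1, 1), ("m", 1, 1), ("m", 0, 1), ("m", 0, 0),
]

def false_positive_rate(group):  # loans granted to applicants who would not have repaid
    negatives = [p for g, t, p in records if g == group and t == 0]
    return sum(negatives) / len(negatives)

def false_negative_rate(group):  # loans denied to applicants who would have repaid
    positives = [p for g, t, p in records if g == group and t == 1]
    return 1 - sum(positives) / len(positives)

print("predictive equality gap:", abs(false_positive_rate("f") - false_positive_rate("m")))
print("equal opportunity gap:  ", abs(false_negative_rate("f") - false_negative_rate("m")))
```

With these invented numbers, the bank-oriented metric signals no disparity while the applicant-oriented metric does, which illustrates why the choice of metric is not value-neutral.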

In their paper “Why fairness cannot be automated,” Sandra Wachter, Brent Mittelstadt, and Chris Russell highlight “conditional demographic parity” as a standard baseline statistical measurement that aligns with the European Court of Justice’s “gold standard” for the assessment of prima facie discrimination. Wachter, Mittelstadt, and Russell argue that, if adopted as an evidential standard, conditional demographic parity will help answer two key questions concerning fairness in automated systems:

  1. Across the entire affected population, which protected groups could I compare to identify potential discrimination?

  2. How do these protected groups compare to one another in terms of disparity of outcomes? [98]

Here we follow their general proposal and suggest using this specific metric as an evaluation tool for the specific case of creditworthiness assessment for small personal loans. In Sect. 3.3, we show how this metric can be used to evaluate the algorithmic outcome in our application field.

3.2.2 De-biasing is not enough

A fairness metric alone is insufficient to address fairness issues. If it becomes clear that group inequality has a structural cause, both the algorithmic outcome and the parameters and steps behind the decision process require questioning [45]. Different examples of checklists addressing procedural aspects of AI systems’ design, development, and application can be found in recent reports, white papers, and standard proposals. The Assessment List for Trustworthy AI (ALTAI) for self-assessment by the High-Level Expert Group on Artificial Intelligence of the EU includes “mechanisms to inform users about the purpose, criteria and limitations of the decisions generated by the AI system,” “educational and awareness initiatives,” “a mechanism that allows for the flagging of issues related to bias discrimination or poor performance of the AI system,” and an assessment taking “the impact of the AI system on the potential end-users and/or subjects into account” [53]. The VDE SPEC 90012 (2022) “VCIO-based description of systems for AI trustworthiness characterization” recommends auditing working and supply chain conditions, data processing procedures, ecological sustainability, and the adequacy of the explanations of the system’s outcomes for informing the affected persons [94]. The NIST Special Publication “Towards a Standard for Identifying and Managing Bias in Artificial Intelligence” recommends considering “human factors, including societal and historic biases within individuals and organizations, participatory approaches such as human centered design, and human-in-the-loop practices” when addressing bias in AI [86]. Madaio et al. designed a checklist intended to guide the design of fair AI systems, including “solicit input on definitions and potential fairness-related harms from different perspectives,” “undertake user testing with diverse stakeholders,” and “establish processes for deciding whether unanticipated uses or applications should be prohibited” among the to-dos [69].

In the credit lending scenario, certain applicant groups have been structurally disadvantaged in their history of access to credit and could still experience obstacles in successfully participating in the application process. Consequently, it should be ensured that only parameters which are relevant to assess the applicant’s ability to repay the loan are processed — e.g., bank statements or monthly income — and that parameters that may lead to direct or indirect discrimination and bias perpetuation — e.g., postal code, gender, or nationality — are excluded. On this point, we follow the privacy-preserving principle of “data minimization” as expressed in Art. 5.1.(c) and 25.1. GDPR. Assuming that there are different computing methods to optimize the algorithmic outcome in order to avoid unjustified unequal treatment of credit applicants, those methods processing less data should be preferred over those requiring a larger dataset containing more information on additional applicant attributes.

Moreover, to empower credit applicants from all groups, the decision process should be made explainable so that rejected applicants can understand why they were unsuccessful. This would prevent applicants from facing black box decisions that cannot be contested, thereby diminishing the imbalance in bargaining power between applicants and credit institutes. The decision process can be questioned through “counterfactual” explanations stating how the world would have to be different for a desirable outcome to occur [73, 97]. As remarked by Wachter et al., in certain cases, knowing “the smallest change to the world that can be made to obtain a desirable outcome” is crucial for the discussion of counterfactuals and can help understand the logic of certain decisions [97] (p. 845). In our specific case, to provide applicants with this knowledge, the decisive parameters or parameter combination (e.g., insufficient income and/or being unemployed) that led to credit denial should be made transparent, and the counterfactual explanation should state how these parameters would have had to differ for the credit application to be approved. This would provide the applicant with the opportunity to contest the algorithmic decision, to provide supplementary information relevant to supporting their application or, where applicable, to reapply successfully for a smaller loan.
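The following toy sketch illustrates what such a counterfactual explanation could look like computationally; the decision rule and the applicant data are invented stand-ins for the bank’s actual scoring model.

```python
# Toy counterfactual explanation: find the smallest income increase that flips
# an invented decision rule standing in for the bank's scoring model.
def toy_decision(income, employed):
    return income >= 2500 and employed  # hypothetical rule, not a real scoring model

applicant = {"income": 2100, "employed": True}

if not toy_decision(**applicant):
    for extra in range(0, 2001, 100):  # try income increases in steps of 100 EUR
        if toy_decision(applicant["income"] + extra, applicant["employed"]):
            print(f"Counterfactual: with a monthly income higher by {extra} EUR, "
                  "the loan would have been granted.")
            break
```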

This transparency requirement also relates to the issue of processing data that could result in direct or indirect discrimination, since credit scoring might be performed based on data held by credit check agencies which are not made available to private citizens.

3.3 The “conditional demographic parity” metric

In order to conduct the statistical calculation concerning the potential existence of indirect discrimination, the so-called “conditional demographic parity” metric has been proposed in the interdisciplinary literature [98] (p. 54 ff.). This metric mirrors a statistical approach which can be applied to examine potential discrimination in the context of the European anti-discrimination laws [98]. This technique should not be confused with the second step, i.e., the question of the justification of a particular disadvantage. Instead, it only concerns the first step, which deals with the question of whether a particular disadvantage within the meaning of the definition of indirect discrimination is present. From a computer science perspective, numerous approaches for measuring fairness have been presented which fit into the fundamental conceptions of “individual fairness” or “group fairness.” Individual fairness relates to the idea of comparing two persons who can be classified as similar apart from the sensitive attribute; it is infringed if these two persons are not treated correspondingly [48] (p. 1175). Group fairness, in contrast, statistically compares two groups of persons [48] (p. 1175). A case of direct discrimination constitutes a breach of individual fairness; a case of indirect discrimination contravenes group fairness [48] (p. 1175).

Group fairness metrics generally compare statistical quantities regarding defined groups in a dataset, e.g., the set of data samples with annual income over 50.000€ and the group with income less than or equal to 50.000€. “Statistical parity” metrics [96] in addition to “demographic parity” [98], which are equivalent under certain circumstances,Footnote 13 constitute basic representatives of group fairness metrics that compare the (distribution of) outputs. These metrics are best applied to scenarios where there is a commonly preferred output from the perspective of the affected individuals (e.g., “credit granted” in case of credit scoring or “applicant accepted” in case of automated processing of job or university applications), and they compare how this output is distributed. “Statistical parity” serves to compare the proportions to which different groups, defined by a sensitive/protected attribute, are assigned a (preferred) output. Let us illustrate this with the example of credit scoring: Denote by c = 1 the prediction/outcome that a credit is granted (c = 0 if the credit is not granted), and by S the sensitive attribute sex, with S = m denoting a male applicant and S = f a female applicant (for now, we reduce this example to the binary case both for the output and the sensitive attribute). The “statistical parity” metric (with respect to the groups of female and male applicants) is defined as the difference between the proportion to which male applicants are granted a loan and the proportion to which female applicants are granted a loan. As a formula:

$$\frac{\left|applicants\, with \,c=1 \,and\, S=m\right|}{\left|applicants\, with\, S=m\right|} - \frac{\left|applicants\, with\, c=1 \,and \,S=f\right|}{\left|applicants\, with\, S=f\right|}$$
(1)

For instance, if 80% of male applicants and 60% of female applicants are granted a loan, the statistical (dis-)parity is |80%–60%|= 20%.

Rather than contrasting the protected group (here meaning the people falling under the sensitive feature in question and being examined in the specific case) with the non-protected group (which is what the “statistical parity” metric does), one could also consider solely the protected group and compare group proportions along preferred and non-preferred outputs. “Demographic parity” as described by [98] follows the latter approach. This metric compares to what proportion the protected group is represented among those who received the preferred output and among those who received the non-preferred output. According to the description in [98], demographic disparity exists if a protected group is to a larger extent represented among those with the non-preferred output than among those with the preferred output.

Returning to our credit scoring example, “demographic parity” here measures the difference between the proportion of females within the group of persons to whom a credit is granted, and the proportion of females within the group of persons to whom a credit is not granted. As a formula, demographic disparity exists if

$$\frac{\left|applicants\, with\, c=0 \,and \,S=f\right|}{\left|applicants\, with\, c=0\right|} > \frac{\left|applicants\, with\, c=1 \,and \,S=f\right|}{\left|applicants\, with\, c=1\right|}$$
(2)

For instance, if 36% of the applicants being granted a loan are female, but 51% of those not being granted a loan are female, demographic disparity exists with a discrepancy of |36%–51%|= 15%.
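Both fictitious examples can be recomputed directly from the stated percentages, as the short sketch below shows; the numbers are those used in the text and describe two separate illustrative scenarios.

```python
# Formula (1): statistical parity compares acceptance rates across groups
# (80% of male and 60% of female applicants are granted a loan).
acceptance_rate = {"m": 0.80, "f": 0.60}
statistical_disparity = abs(acceptance_rate["m"] - acceptance_rate["f"])

# Formula (2): demographic parity compares the share of the protected group among
# the rejected and among the accepted applicants (36% of accepted and 51% of
# rejected applicants are female).
female_share = {"accepted": 0.36, "rejected": 0.51}
demographic_disparity = female_share["rejected"] - female_share["accepted"]

print(f"statistical disparity: {statistical_disparity:.0%}")   # 20%
print(f"demographic disparity: {demographic_disparity:.0%} "   # 15%
      f"(disparity exists: {demographic_disparity > 0})")
```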

Both the “statistical parity” and “demographic parity” metrics provide a first indication of the “particular disadvantage” within the definition of indirect discrimination presented above. Moreover, they can be easily calculated independent of the (potentially biased) ground-truth data.

However, groups defined by only one sensitive attribute (e.g., sex) can be large. Thus, the metrics presented might be coarse and unable to capture potential disparity within a group. For example, within the group of females, single applicants from the countryside might have a far lower approval rate than the average female applicant, while married applicants from the city might be granted credits almost as often as men.

Considerations such as the previous, which aim to understand given statistical or demographic disparity more deeply (e.g., by finding correlated attributes to explain the existing bias), should be informed by statistical evidence. One approach to provide more granular information on potential biases is to include (a set of) additional attributes A, which do not necessarily need to be sensitive/protected. In particular, the statistical quantities which are subject to the metrics presented can be calculated on subgroups which are characterized by attributes A (additional to the sensitive attribute). This enables a comparison of (more homogeneous) subgroups. Following this approach, an extension of the demographic parity metric has been presentedFootnote 14:

“Conditional demographic parity” [98] is defined in the same way as “demographic parity” but restricted to a data subset characterized by attributes A. In other words, “conditional demographic parity” is violated if, for a (set of) attributes A, the protected group is to a larger extent represented among those with non-preferred output and attributes A than among those with preferred output and attributes A.

Returning to the credit scoring example, let A = (“annual income” < 50.000€) be the attribute characterizing a person as having an annual income lower than 50.000€. For this configuration of A, c, and S, “conditional demographic parity” compares the proportion to which successful applicants satisfying A are female with the proportion to which unsuccessful applicants satisfying A are female. As a formula:

$$\frac{\left|applicants\, with \,c=0, S=f\, and\, A\right|}{\left|applicants \,with \,c=0 \,and \,A\right|} > \frac{\left|applicants \,with\, c=1, S=f \,and \,A\right|}{\left|applicants \,with \,c=1 \,and\, A\right|}$$
(3)

Let us assume that female loan applicants have an income under 50.000€ statistically more often than non-female applicants. Using fictitious numbers, let 90%Footnote 15 of the female applicants satisfy A but only 70% of the non-female applicants. “Conditional demographic parity” can now help us better understand whether this gender pay gap provides an explanation why female applicants are being granted a loan less frequently, or whether there is additional discrimination not resulting from unequally distributed income. Using fictitious numbers again, let us assume that 60% of the applicants who are being granted a loan and satisfy A are female and 62% of the unsuccessful applicants satisfying A are female. Thus, analyzing small incomes only (where female applicants represent a higher percentage), female and non-female applicants are not treated significantly differently. Complementarily, one could examine whether there is unequal treatment in the high-income group. Let B = (“annual income” ≥ 50.000€). Following our example, 30% of the non-female applicants fall in category B but only 10% of the female applicants. Let us now apply conditional demographic parity with respect to “high income.” Using fictitious numbers, let 27% of the applicants who are being granted a loan and satisfy B be female and 25% of the unsuccessful applicants satisfying B be female. Again, the representation of females in the high-income group is fairly equal among successful and unsuccessful applicants. Overall, analyzing the subsets of applicants with small income and high income separately, female and non-female applicants seem to be treated equally within these groups. Thus, our fictitious numbers indicate that the bias in overall acceptance rates results from female applicants being represented to a larger extent in the small-income group than non-female applicants.
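The conditional comparison of Formula (3) can be written down directly with these fictitious shares, as the sketch below illustrates; the percentages are taken from the example above and are not empirical data.

```python
# Conditional demographic parity check per income stratum (fictitious shares).
strata = {
    "annual income < 50.000 EUR":  {"female_share_rejected": 0.62, "female_share_accepted": 0.60},
    "annual income >= 50.000 EUR": {"female_share_rejected": 0.25, "female_share_accepted": 0.27},
}

for name, shares in strata.items():
    gap = shares["female_share_rejected"] - shares["female_share_accepted"]
    print(f"{name}: rejected-minus-accepted female share = {gap:+.2f}")
# Both gaps are small (about two percentage points), matching the reading above that,
# within each income stratum, female and non-female applicants are treated roughly equally.
```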

Regarding the examination under the European non-discrimination laws, the following remarks can be made with regard to the example above: In principle, trying to avoid the non-repayment of a credit constitutes a legitimate aim of the banking institute [85] (p. 310). As to this, the financial capability of the applicant in question is decisive. In this respect, it is conceivable to consider the applicants’ income — making it a suitable means to foster the legitimate aim. With this in mind, one can see that, when considering the criterion “income,” the percentage of women among the unsuccessful and the successful applicants is more or less equal within the two constructed income (sub-)groups (yearly salary over and under 50.000€). Considering only the sensitive attribute (sex), the result might be that the percentage of unsuccessful female applicants is higher than the percentage of successful female applicants (demographic disparity). Comparing the two results, it is possible to make the assumption that income was crucial for the decision whether a credit is granted or not. If one considers the orientation toward the income as an indicator for the financial capability to be necessary and appropriate, the particular disadvantage might be justified.

While achieving “conditional demographic parity” in general seems unlikely given the variety of possible choices for A, calculating the metric for different configurations of A can still provide valuable evidence for detecting the relevant “particular disadvantage” (pertaining to the first step in determining indirect discrimination). In particular, because conditional metrics are more fine-grained than their non-conditional counterparts, which measure bias only in the model’s overall results, they can be used to explain bias by analyzing additional attributes that might provide further relevant information. One possible way to scan different conditioning attributes is sketched below.
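
The following sketch operationalizes such a scan under assumed column names (e.g., "income_bracket", "employment_status", "sex", "granted"); it is an illustration rather than a prescribed procedure. For every value of each candidate conditioning attribute it computes the two proportions of Eq. (3) and reports the gap.

```python
import pandas as pd

def cdp_gaps(df, condition_col, sensitive_col="sex",
             sensitive_value="f", decision_col="granted"):
    """For each value of `condition_col`, compare the share of the sensitive
    group among granted and among rejected applicants (cf. Eq. 3).
    `decision_col` is assumed to hold booleans (True = loan granted)."""
    rows = []
    for value, group in df.groupby(condition_col):
        granted = group[group[decision_col]]
        rejected = group[~group[decision_col]]
        if granted.empty or rejected.empty:
            continue  # the comparison is undefined if one side is empty
        p_granted = (granted[sensitive_col] == sensitive_value).mean()
        p_rejected = (rejected[sensitive_col] == sensitive_value).mean()
        rows.append({condition_col: value,
                     "share_granted": p_granted,
                     "share_rejected": p_rejected,
                     "gap": p_granted - p_rejected})
    return pd.DataFrame(rows)

# Hypothetical usage, assuming `applications` is a DataFrame with the
# columns named above:
# print(cdp_gaps(applications, "income_bracket"))
# print(cdp_gaps(applications, "employment_status"))
```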

3.4 Ethical minimal requirements

We claim that the following minimal requirements must be standardized to address discrimination and procedural fairness issues concerning the application of ML systems in credit scoring for small personal loans:

  1. Regular check of the algorithmic outcome through a fairness metric. We follow Wachter, Mittelstadt, and Russell in suggesting that the conditional demographic parity fairness metric should be used to detect unfair outcomes [98]. For our simple case study, the conditionals will be income and employment status. If the bank requires an external credit rating, then other parameters that influence the rating, such as past loan defaults or the number of credit cards, must also be considered. However, in our case, comprehensive information on living costs and on total wealth and assets is not required; these can therefore not be considered as conditionals.

  2. Ensure the relevance of the chosen indicators. Parameters which are not directly relevant to assessing the applicant’s ability to repay the loan shall not be processed. These include attributes such as postal code, nationality, marital status, gender, disability, age (within the fixed age limits to apply for a loan), and race.Footnote 16 Some of these, such as gender, race, and disability, are characteristics protected by national and international anti-discrimination acts such as European anti-discrimination law or the German General Equal Treatment Act (AGG). Others, even if not protected by anti-discrimination laws, might facilitate the deduction of one or more protected attributes.

  3. Provide transparency for credit applicants and other actors involved. The following shall be made transparent for the applicants:

     o Which data are processed (no personal data are processed without the informed consent of the applicant).

     o Why an application is eventually rejected and which applicant features should be improved to obtain the loan. The algorithmic decision must therefore be counterfactually explainable, e.g., if the applicant had a higher income or if she/he were not unemployed, she/he would have received the loan (a minimal illustrative sketch follows this list).

     This does not mean disclosing the entire computing process, which might be protected by trade secrets, but guaranteeing transparency regarding the criteria applicants need to fulfill.
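
To illustrate the counterfactual explainability required in point (3), the sketch below searches for a small change to a rejected application that would flip the decision. The scoring rule, acceptance threshold, and searched feature changes are hypothetical stand-ins; a real deployment would query the bank's actual model and feature space.

```python
from itertools import product

def toy_score(applicant):
    """Hypothetical scoring rule standing in for the bank's ML model."""
    score = applicant["income"] / 10_000
    if applicant["employed"]:
        score += 3
    return score

def counterfactual(applicant, score_fn, threshold=5.5,
                   income_steps=(5_000, 10_000, 20_000)):
    """Search for a small change (more income and/or employment) that would
    lead to acceptance; return the changed application or None."""
    if score_fn(applicant) >= threshold:
        return applicant  # already accepted, nothing to explain
    for extra_income, employed in product((0,) + income_steps,
                                          (applicant["employed"], True)):
        candidate = dict(applicant,
                         income=applicant["income"] + extra_income,
                         employed=employed)
        if score_fn(candidate) >= threshold:
            return candidate
    return None  # no counterfactual found within the searched range

rejected = {"income": 28_000, "employed": False}
print(counterfactual(rejected, toy_score))
# -> {'income': 28000, 'employed': True}: in this toy setting, employment
#    status alone would flip the decision.
```

In line with the trade-secret caveat above, only the resulting criteria (e.g., the income level or employment status needed for acceptance) would be communicated to the applicant, not the scoring rule itself.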

4 Standardizing minimal ethical requirements to evaluate fairness

Standardization can connect different perspectives on “fairness” and establish a common understanding in the context of AI. It can reduce trade barriers, support interoperability, and foster trust in a system or application. Within the realm of standardization, existing definitions of fairness are rather generic and currently not tailored to AI systems and applications; however, this may change soon with the development of new AI-dedicated standards.

Several documents are pushing in that direction:

  • ISO/IEC TR 24028:2020, Information technology — Artificial intelligence — Overview of trustworthiness in artificial intelligence, which lists fairness as an essential part of ensuring trustworthiness in AI [55].

  • ISO/IEC TR 24027:2021, Information technology — Artificial intelligence (AI) — Bias in AI systems and AI aided decision-making addresses bias in relation to AI systems [54].

  • ISO/IEC TR 24368:2022, Information technology — Artificial intelligence — Overview of ethical and societal concerns aims to provide an overview of AI ethical and societal concerns, as well as International Standards that address issues arising from those concerns [56].

In addition, many other AI-specific projects are published or under development within ISO and IEC on topics such as ML, AI system life cycle processes, functional safety, quality evaluation guidelines, explainability, data life cycle frameworks, concepts and terminology, risk management, bias in AI systems and AI-aided decision-making, robustness assessment of neural networks, an overview of ethical and societal concerns, and a process management framework for big data analytics. The focus of these projects is to develop a framework of requirements for the development and operation of safe, robust, reliable, explainable, and trustworthy AI systems and applications. Once this general AI requirement framework is established, the focus will likely shift to more use case-specific standardization topics such as “fairness,” which is clearly needed in AI standardization but cannot be generalized across use cases.

Based on our interdisciplinary analysis, standardizing “fairness” in the context of AI with the aim of enabling an assessment requires multiple relevant, measurable, and quantifiable parameters and/or attributes that feed into state-of-the-art, use case-specific fairness metrics such as the conditional demographic parity metric discussed above. Such fairness metrics can be developed and standardized on an independent, consensus-driven platform open to expertise from all use case-related stakeholders, including views from the perspectives of philosophy, industry, research, and legislation. This platform can be a national standards body, ISO, IEC, or the European Committee for Standardization (CEN); the most appropriate option for this topic is subcommittee 42 of the ISO/IEC Joint Technical Committee 1 (ISO/IEC JTC 1/SC 42, “Artificial intelligence”).

To begin a standardization process for a use case-specific fairness metric, the scope, outline, and justification of the proposed standardization project must be submitted to the respective standardization committee. To increase the chances of approval, a first draft with the proposed fairness metric should also be included. The standardization process within national standards bodies, ISO, IEC, and CEN gives all participating members an equal right to vote on, comment on, and work on a standardization project. At the international or European level, this means that all interested registered experts can work on the project; however, during the mandatory votes (project proposal, drafts, and finalization) each participating country (represented by delegated experts) has one vote, facilitating a fair consensus process. The outcome of this process is a recognized standard that enables mutual understanding based on agreed requirements, thus fostering trade and the development of new, high-quality AI products and services, whether nationally, in Europe, or internationally, depending on the standardization platform used.

A standard can be used for a quality assessment in order to promote a product’s or service’s quality, trustworthiness, and user acceptability. In the assessment process of an AI system or application, the related standardized fairness metric can be used to attest the system’s or application’s ability to make fair decisions. Consequently, a fairness-related attestation based on corresponding standards (e.g., certification) can increase the user acceptability and trustworthiness of the AI system or application, which can in turn result in increased sales figures.

5 Conclusion

Evaluating the fairness of an AI system requires analyzing the algorithmic outcome and observing the consequences of the system’s development and application for individuals and society. Regarding the applied case of creditworthiness assessment for small personal loans, we highlighted specific distributive and procedural fairness issues inherent either to the computing process or to the system’s use in a real-world scenario: (1) the unjustified unequal distribution of predictive outcomes; (2) the perpetuation of existing bias and discrimination practices; (3) the lack of transparency concerning the processed data and the lack of an explanation of the algorithmic outcome for credit applicants. We addressed these issues by proposing minimal ethical requirements for this specific application field: (1) regularly checking the algorithmic outcome through the conditional demographic parity metric; (2) excluding from the processed parameters those that could lead to discriminatory outcomes; (3) guaranteeing transparency about the processed data, in addition to counterfactual explainability of algorithmic decisions. Defining these minimal ethical requirements represents a starting point toward standards specifically addressing fairness issues in AI systems for creditworthiness assessments. These requirements aim to prevent unfair algorithmic outcomes, as well as unfair practices related to the use of these systems.