1 Introduction

In many areas of private and social life, efficient and (supposedly) objective decisions are prepared or made by so-called algorithmic decision-making systems (ADM systems) (Fry, 2018; König, 2019; Saurwein et al., 2015). These systems learn their (decision) rules from previous decisions; they are used to systematically replicate previous decision patterns. This means that if the training data contains an unjustified bias against a particular group, e.g. in the form of discrimination against women or elderly people, the system will probably adopt it.

Undesirable bias has been an issue for decades, but the use of automated systems creates new and bigger problems (Noble, 2013; Pasquale, 2015). This is partly because ADM systems can be used on a broad scale, applying the same decision structure to everyone exposed to the system. When a human makes biased decisions, in most cases only a limited number of people are affected, since a single human cannot make the same number of decisions that a machine can.

If the bias concerns sensitive attributes protected by lawFootnote 1 and the legislator prohibits unequal treatmentFootnote 2 according to this attribute in the respective context, it is considered discrimination. From a legal point of view, therefore, bias based on certain characteristics in certain decision contexts must be prevented.

While software engineers already try to detect and avoid undesirable bias before system release, numerous biased systems can still be found (Angwin et al., 2016; Datta et al., 2015; O’Neil, 2016; Orwat, 2019). Therefore, it is also important to enable the affected people or society as a whole to detect and prove bias in such systems based on tests and audits.

However, one of the central problems in this context is that such ADM systems (e.g., in finance or insurance, but also online platforms like Facebook) are often opaque to external entities. Furthermore, even if access to those so-called black-box systems (Diakopoulos, 2015; Rudin, 2019) is available, they only provide limited insights into their functionality and are, therefore, difficult to examine. This limits the selection of appropriate bias testing methods and auditing concepts. At the same time, there are no best practices, standards, or guidelines that help affected people and society choose specific methods.

Which methods are appropriate or provide sufficient insight can only be determined by looking at the details of a specific application. Therefore, this paper discusses testing methods suitable for assessing bias in black-box systems and for which black-box auditing concepts they are applicable.

After a clarification of relevant terms (Sect. 2), we discuss a collection of applicable testing methods for which we present a taxonomy to help practitioners to make good and, above all, conscious decisions when choosing or implementing black-box tests (Sect. 3). Additionally, we discuss auditing concepts suitable for black-box audits and which specific test methods are applicable for which kind of audit concepts (Sect. 4). Finally, we discuss the key findings and main challenges of this work (Sect. 5).

2 Definitions

A number of related terms have emerged in the context of black-box testing methods and auditing concepts for biases in ADM systems. Since some terms are used in a vague manner or with similar but different meanings, they are defined below to eliminate ambiguities.

2.1 Algorithmic Decision-Making Systems (ADM Systems)

ADM systems are software components that are used to classify or score persons or objects (Saurwein et al., 2015). Classifications and scorings are based on properties or information available about a data subject. There are two types of systems: In the first, the decision rules that combine these data are learned from past decisions (Watt et al., 2020). For this, each past decision is operationalized into a set of parameters that describe a subject, the so-called input vector \(X=X_1,\ldots X_n,\) and the resulting output Y. In the case of classification the output represents a categorization, the so-called label; in the case of scoring the output is a number. Together, this information forms a dataset that is assumed to contain correct decisions (outputs) to be replicated, the so-called ground truth (GT). This ground truth can be used to train a statistical model, in which case it is referred to as training data. The process referred to as learning represents the training process. The actual ADM system output is denoted as \({\hat{Y}}.\)
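To make the notation concrete, the following minimal sketch (with hypothetical data and scikit-learn as an assumed library) illustrates how a ground truth of operationalized input vectors X and labels Y is used to train a statistical model whose outputs are denoted \({\hat{Y}}\):

```python
# Minimal sketch of the first type of ADM system: decision rules are learned
# from a ground truth of past decisions (hypothetical data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Input vectors X = X_1, ..., X_n (here n = 3 operationalized parameters) and
# labels Y: the past decisions assumed to be correct, i.e. the ground truth.
X = rng.normal(size=(1000, 3))
Y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# "Learning" = the training process on the training data.
model = LogisticRegression().fit(X_train, Y_train)

# The actual ADM system output, denoted Y_hat in the text.
Y_hat = model.predict(X_test)
print("Share of replicated ground-truth decisions:", (Y_hat == Y_test).mean())
```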

There is a second type of ADM systems, so-called rule-based expert systems (Buchanan & Shortliffe, 1984). In these systems, the decision rules are derived by experts from experience and data. If a person affected by such systems suspects unequal treatment, the experts can justify the decision rules they implemented. This option does not exist for systems that derive their decision rules automatically from data, not even for the developers themselves. However, all following considerations in this paper can in principle be applied to all ADM systems, regardless of how the decision rules have been derived.

2.2 Bias

As ADM systems have a large impact on many people’s lives due to their widespread use, it is important to ensure that they do not carry undesirable effects like bias.

There are many definitions for the term bias (e.g., Delgado-Rodriguez & Llorca, 2004; Ntoutsi et al., 2020; Steineck & Ahlbom, 1992). Generally, it refers to a “deviation from a standard” (Danks & London, 2017). In the context of ADM systems, the standard to be compared to is usually purely statistical; therefore, the term bias in this paper refers to a statistical bias that indicates a deviation from a statistical standard. According to this definition, bias may also be an actual, but skewed distribution. Especially in the context of ADM systems, bias is often understood more specifically as a deviation that does not represent the actual or wanted distribution and is therefore considered unfair with respect to a specific person or group (Angwin et al., 2016; Datta et al., 2015; O’Neil, 2016).

When designing a machine learning component for algorithmic decision-making, its goal is by definition to find a statistical correlation between the elements of the input vector and the label.Footnote 3 When using an ADM system, for example, to decide which job applicant gets an invitation for a job interview, feedback on the applicant’s career (e.g., school grades, employer references, ...) is used to estimate their future performance. Decisions based on this kind of information are biased with respect to some sort of qualification attestation and are deliberate, socially intended, and accepted.

Next to this justified and meaningful usage of input data, there are also a number of features that a society explicitly wishes to be disregarded in certain contexts, so-called protected or sensitive attributes,Footnote 4 which is why a bias with respect to those features is considered discrimination. Which features are considered protected highly depends on the applicable laws in the context of a decision (see Table 1 for some relevant examples of German laws and European directives). To differentiate between protected and unprotected parameters of an input vector X, \(X_1 \ldots X_n\) are considered to denote the unprotected attributes and A is the protected attribute, which means \(X=X_1 \ldots X_n,A.\) There can be multiple protected attributes \(A=A_1 \ldots A_n.\) However, for the sake of simplicity, we limit our considerations to a single protected attribute that can only take on one of two values \(\alpha \) and \(\beta.\)

Table 1 Protected attributes based on various German laws and European directives, according to Zweig et al. (2021, p. 240) who extend (Orwat, 2019, p. 25)

Regulations on which characteristics are protected by law have evolved over time according to experiences of what forms of unequal treatment a society cannot or should not tolerate. Due to the rise of ADM systems, some applications that led to societal problems or even scandals in the past already suggest that this list might grow in the future. For example, a French ADM system that manages student applications considered the current residence of a student (Wenzelburger & Hartmann, 2021). In and of itself, the ZIP-code is not a protected feature. However, since only rather wealthy people can afford to live in the direct residential area of many renowned universities, this feature is correlated with financial status rather than intelligence or ability to study. In his column at Le Monde, Thomas Piketty claims that the way these features are used in the system should be made transparent and points out that proper use could, e.g., improve social diversity and equality of opportunity.Footnote 5 This discussion indicates shifts in the public understanding of automated decision-making which might lead to further or adapted legislation in the near future.

However, even if some features are already protected by law, this does not mean that they can just be dismissed. As shown by Haeri and Zweig (2020), learned decision rules can be made more robust against bias if the biased information is at least available in the training data. In any case, the information regarding protected features is necessary to identify unwanted bias, i.e., discrimination, in a system (Hoffmann et al., 2022).

For the scope of this paper, we focus on legally protected variables and bias that results in any form of unwanted or illegal discrimination.

The next section defines our understanding of the terms black-box testing and auditing to analyze the behavior of algorithmic black boxes.

2.3 Black-Box Testing and Auditing

In contrast to so-called white-box ADM systems, where insights into the inner mechanics and decision-making processes of the ADM are available, there are systems where these insights are missing, called black-box systems. Four different kinds of black-box systems can be distinguished with regard to the examinability of opaque systems, depending on the knowledge about the input and output of the system:

  1. Black-box systems for which only the output can be observed (Diakopoulos, 2015).

  2. Black-box systems for which only some inputs can be observed and manipulated, e.g. during testing, while other inputs are unknown and/or cannot be manipulated.

  3. Black-box systems for which all inputs and resulting outputs can be observed (Diakopoulos, 2015).

  4. Systems that are transparent in principle, but too complex to be humanly comprehensible (Rudin, 2019, Appendix A).

Testing the first category provides only limited insights, since an extreme (but unknown) input, rather than errors or bias in the system, may be responsible for unexpected or undesired outputs. Consequently, the results of tests for the second category of systems are also of limited use. A correlation between inputs and outputs may be determined, but there is always a chance that unknown inputs, like previous user behaviour, could influence the system behavior in a way that cannot be fully determined with tests. Black-box systems of the third category can be analyzed based on the relation of inputs and corresponding outputs. The following considerations focus on this type of system.

All testing methods and auditing concepts discussed in this work are also suitable for the fourth type of black-box systems. However, since such systems also allow insights that are not available in actual black-box systems, additional testing methods and auditing concepts can be applied [like testing for neuron value and coverage criteria (Sun et al., 2018)]. In the remainder of this paper we will not focus on these methods.

When analyzing white-box systems, it can be determined what exactly should/must be tested based on the information available about the decision structure (Nidhra & Dondeti, 2012, p. 30). As the decision structure of black-box systems is opaque, test definition can only be based on precisely formulated requirements.

As the requirement “make no discriminatory decisions” is rather vague and neither legislation nor standardization provide thorough guidance, a great variety of distinct testing methods and auditing concepts suitable for detecting bias in black-box systems have been developed.

The IEEE glossary defines software testing as “an analytical quality assurance activity in which systems, subsystems, or components are executed under specified conditions, the results are observed or recorded, and an evaluation is made of some aspect of the system or component” (IEEE, 1990). So testing is applied to ensure that the system to be tested complies to previously specified requirements. A data set used for performing testing activities is called test data set.

In classical software engineering, the term black-box testing has already been coined for testing activities that are intended to treat the system to be tested as a black box, regardless of whether it actually is one. A detailed discussion of this kind of black-box testing can be found under the term “functional testing” (Nidhra & Dondeti, 2012). Especially in the context of testing it does not matter whether an ADM system had been constructed as white-box or black-box system, because the testing person is not able to perceive this information. The system is treated as a black box in any case. Therefore, in the context of this work, we focus on test methods that are suitable for testing actual black-box systems in Sect. 3.

While testing is an important part of software development processes, it can also be used as evidence base for an assessment in the context of an audit.

The social science background of the term audit and the development of the term up to the present day was well summarized by Gaddis (2018). An audit’s goal is to ensure the integrity and reliability of financial and other information, as well as to increase public trust in a company or organization. According to ISO 19011 (2018), an audit is a “systematic, independent and documented process for obtaining objective evidence and evaluating it objectively to determine the extent to which the audit criteria are fulfilled” (ISO 19011, 2018, p. 11).

In corporate environments, the concept of an audit is typically perceived as a comprehensive evaluative procedure. The scope of such audits extends beyond merely business-related facets to encompass technical dimensions, depending on the specific objectives of the audit. This is mirrored in the ongoing discourse surrounding algorithm/AI audits (Lucaj et al., 2023; Metaxa et al., 2021), which advocates for a thorough scrutiny of both the AI system and its integration within the broader framework [e.g. the “Task Environment” (see Boer et al., 2023)]. Consequently, this has led to calls for comprehensive AI audits that span the entire organizational spectrum. The subsequent step in dissecting the technical component involves pinpointing the precise elements that are subject to examination or verification, commonly referred to as the audit criterion in the literature. This criterion serves as the benchmark against which the AI system’s performance, ethics, and compliance are evaluated. Hallensleben et al. (2020) have presented a framework for operationalizing these properties, and Brown et al. (2021) have gone a step further and drawn up an initial list of metrics for making complex properties measurable. This manuscript predominantly concentrates on the technology-oriented aspect of these evaluations with regard to bias, offering an in-depth exploration of the technical elements inherent in AI audits.

In the field of computer science, there are multiple definitions of the term audit. In the context of this paper, we relate the term algorithm audit to the explanations of Sandvig et al. (2014). They understand algorithm (software) audits as some sort of field experiments tailored to a specific platform, containing multiple tests and, most likely, an additional software apparatus for experimentation. They categorize audits by the process of how the inputs are provided. Therefore, Sect. 4 describes the different forms of black-box audits and explains which test methods are applicable for which kind of audit.

To visualize the relationship between tests and audits we use and extend the structural model of a black-box analysis by Krafft et al. (2020). Here, a black-box analysis is modelled as a five-step process. It starts with the type of access to the black box, the chosen audit form (1), which results in a collection of data (2) that needs to be cleaned in a preprocessing step (3). Then, the actual analysis can be performed on the now structured and verified data sets (4). This step relates to test methods. The last step is the processing and communication of the test results (5).

Fig. 1 Conceptualized process of analyzing a black-box system [extended from Krafft et al. (2020)]. The numbers represent the different steps. Figure by Algorithm Accountability Lab [Prof. Dr. K. A. Zweig]/CC BY

In order to be able to test a system, it must be clear which output for an input can be interpreted as correct or incorrect. This is called an oracle, as described in the following section.

2.4 The Oracle Problem

With the help of a so-called test oracle, a test result can be checked to determine whether the system has delivered the correct output (Howden, 1978). A ground truth, for example, is a frequently used special form of a test oracle, as the correct output for a limited set of inputs is known. While there are probabilistic test oracles for which a certain error rate is tolerated, a ground truth usually assumes absolutely correct values (Barr et al., 2014). However, especially in complex black-box systems the information about what can be considered as “correct” output for most of the possible inputs is often missing, due to, for example, the huge input space (Marijan et al., 2019). This is the so-called oracle problem.

Since there is often no clearly definable right or wrong when it comes to decisions about people or human behavior, there can hardly be an oracle. However, for many ADM systems that make such decisions, not only the prediction quality (in terms of right or wrong) but also whether people with different protected attributes are treated equally, is a major concern.

To counter these problems, there are also test methods that do not require an oracle at all. They test for a certain ratio or distribution in the outputs, regardless of whether the respective outputs are correct or not. Depending on the definition, this desired distribution can also be considered an oracle; the literature is not clear regarding this classification (Barr et al., 2014). To prevent confusion, we explicitly do not consider such specifications of accepted output distributions as an oracle.

In order to test for a specified distribution, it is necessary that information about these protected attributes is collected, otherwise it is impossible to check on their influence on a decision (Žliobaitė & Custers, 2016). If the respective information is not available, it seems reasonable to assume that an algorithm cannot take them into account. At first glance, this is correct, but bias on the basis of these protected variables can hide behind other, seemingly acceptable parameters, so-called proxy variables, which is why corresponding tests are absolutely necessary (Žliobaitė & Custers, 2016).

With these terms clarified, the next section gives an overview on the different ways to test a black box for unwanted bias.

3 Testing a Black Box

We identified three major categories for testing a given black-box system for unwanted bias: data set tests, explainability methods to detect (and assess) bias, and test methods based on system outputs.

Since bias in the training data is the major cause of bias in a system, some approaches test the training data for biases directly (e.g. Hynes et al., 2017; Kim et al., 2019; Krishnan et al., 2016; Polyzotis et al., 2017). However, an external testing entity will rarely get the chance to inspect the training data used. Therefore, we decided to exclude this topic from our discussions.

In the context of black-box systems it is often difficult to understand how exactly an output is generated, i.e., which factors have a particularly large influence on the result in individual cases or in general. Methods for making models “explainable” (or interpretable) try to provide insights, e.g., by providing white-box models that mimic the behaviour of the black-box model (up to a certain degree) or by calculating the influence of a specific parameter on a decision. In general, those insights make it easier to assess whether a system is biased (with regard to a specific parameter). Therefore, they are often considered tests, but they have a fundamentally different goal. The discussion around explainability methods differs substantially from that around test methods and represents a separate, frequently discussed field of research. Since we do not want to focus on explainability methods in this paper, we refer to the extensive works of Burkart and Huber (2021) and Kraus et al. (2021), where comprehensive lists of explainability methods are discussed.

We divided the test methods based on system outputs into test methods that require an oracle (Sect. 3.2) and test methods that do not require an oracle (Sect. 3.3). As multiple methods to select appropriate test cases can be considered for each test method, we additionally discuss test case selection methods (Sect. 3.4).

For a better comparison of the test methods that do or do not require an oracle, we additionally have developed a taxonomy according to which we discuss the respective methods.

3.1 Taxonomy

Since the number and variety of bias test methods for black-box systems is too large to implement them all, each entity performing quality assurance procedures must make an appropriate selection of test and/or audit procedures. To support the process of test selection we provide a thorough taxonomy of properties, including information regarding:

  • The necessity to directly interact with the black-box system (no, yes). In this item, we discuss whether the test method principally allows the involvement of a mediator or not. In general, there needs to be some way to provide input data to the system. Ideally, the external testing entity can provide the data and inspect the resulting outputs on its own. This would provide the highest credibility and flexibility for testing purposes. For those test methods in which the input queries can be fully defined before the test starts, it does not matter who provides the inputs; it could also be an internal mediator that performs the queries on behalf of an external tester. Thereby, the operator can avoid the effort of creating a secure interface for external parties and may also see business secrets as better protected. On the other hand, it must be noted that the testing entity can hardly check whether the tests have been carried out conscientiously and whether the correct results are provided for verification. Test processes thus become more credible if both the construction of a test data set and the execution of the tests based on it are carried out by an external, independent entity (e.g. via API). This requires the testing entity to have enough access to the model to submit input data and inspect the respective outcomes. Additionally, some methods are not suitable to be mediated: especially those that require modifying queries depending on previous results.

  • The type of evaluation (qualitative, quantitative): When a system under test is to be evaluated in terms of bias, there is a difference between an evaluation of its general, statistical behaviour and its behaviour in individual cases (Binns, 2020). The type of evaluation used gives an indication of whether the discrepancies detected by this method are qualitative or quantitative in nature.

  • The effort (medium, high): The effort to implement a specific test method depends on its details, as for each method there are multiple variations that may strongly differ in the necessary effort. Prior to actual implementation, in many cases various design decisions must be made, which poses additional effort. We assume that the required possibility to feed inputs and the required test data are provided. Here, we only discuss additional efforts and specific challenges not already discussed in the previous aspects, such as the effort to pass test data to a mediator or to develop secure APIs that provide a tester direct access to a model.

  • The automation capability of the test (medium, fully): As software nowadays is never finished but subject to changes and updates, and software systems are too complex to be tested by hand, test automation is indispensable. Still, there are test methods that require manual intervention, which may limit the automation capability.

  • The need to check for thresholds (yes, no): Depending on the test method, the test result is a number indicating an abstract level of bias. To decide whether this level is acceptable or not, it needs to be compared with a specified minimum target value. If the test result is below that target value, the test is considered failed. In most cases, neither legislation nor standardization provide explicit thresholds to test for or methods to define such thresholds; therefore, their specification may pose an important design decision.

Wherever possible, we provide a rough estimation in terms of discrete values. This operationalization is not suitable for direct comparisons between methods, but is intended to support the consideration of what may or may not be suitable for a specific situation. The resulting taxonomy for each test method is summed up in Table 2.

Table 2 Classification of the test methods with respect to the presented taxonomy

3.2 Test Methods that Require an Oracle

Test methods that require an oracle are more intuitive for bias tests, because they build on existing, labeled test data and can therefore actively test against a true statement. The test data set needs to be sufficiently large to allow for statistically significant results. How much test data is needed for a useful conclusion, which data is used for testing and how it should be distributed is a challenge of its own (Taskesen et al., 2021). The statistical evaluation of such tests can be converted into measurements of fairness, which are discussed in this section.

3.2.1 Fairness Test Based on an Oracle

Fairness measures can be used to test for various concepts of fairness. The aim is to treat subgroups with certain protected properties the same way as any other subgroup. They can roughly be divided into measures that are based on an oracle and those that do not require an oracle (see Sect. 3.3.1).

Measures based on an oracle have a comparative character. They compute the relation between outputs of the system and the facts represented by the oracle and, therefore, make use of so-called quality measures (Haeri & Zweig, 2020). With quality measures the reliability of predictions or the correctness of decisions (based on an oracle) can be tested. This provides an anchor on which bias in the system can be assessed, but it also means that bias of the system that corresponds with the oracle might be left uncovered.

Consider, for example, the fairness measure of separation (Barocas et al., 2019) to decide whether a binary classification system, for which \(Y= 0\) represents the negative (undesired) class and \(Y=1\) represents the positive (desired) class, is fair. Separation requires two conditions to be fulfilled:

  • The probability that an instance is assigned to the positive class under the condition that the instance actually belongs to the positive class (according to the oracle) is the same for both groups (equal true-positive rates for the groups).

    $$\begin{aligned} Pr\{ {\hat{Y}}=1 \vert Y=1, A=\alpha \} = Pr\{ {\hat{Y}}=1 \vert Y=1, A=\beta \} \end{aligned}$$
  • The probability that an instance is assigned to the positive class under the condition that the instance actually belongs to the negative class (according to the oracle) is the same for both groups (equal false-positive rates for the groups)

    $$\begin{aligned} Pr\{ {\hat{Y}}=1 \vert Y=0, A=\alpha \} = Pr\{ {\hat{Y}}=1 \vert Y=0, A=\beta \} \end{aligned}$$

As in practice it is almost impossible to completely fulfill fairness measures, a threshold \(\tau \) is needed that states what target value should be reached by the measure in order for the test to pass on a given test dataset (see Sect. 3.4). To consider a threshold the equations could be converted to compute the difference between the respective probabilities, which means for the two conditions:

$$\begin{aligned} 1- \Big \vert Pr\{ {\hat{Y}}=1 \vert Y=1, A=\alpha \} - Pr\{ {\hat{Y}}=1 \vert Y=1, A=\beta \} \Big \vert&\ge \tau \\ 1- \Big \vert Pr\{ {\hat{Y}}=1 \vert Y=0, A=\alpha \} - Pr\{ {\hat{Y}}=1 \vert Y=0, A=\beta \} \Big \vert&\ge \tau \end{aligned}$$
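A minimal sketch of how these two thresholded conditions could be checked on a labeled test dataset is given below; the arrays and the threshold \(\tau \) are hypothetical placeholders, not data from any real system:

```python
# Sketch of a separation-based fairness test with threshold tau (hypothetical
# test data; A encodes the protected attribute with values "alpha"/"beta").
import numpy as np

def rate(Y_hat, Y, A, group, y_true):
    """Estimate P(Y_hat = 1 | Y = y_true, A = group) from the test data."""
    mask = (Y == y_true) & (A == group)
    return Y_hat[mask].mean()

def separation_test(Y_hat, Y, A, tau):
    # Equal true-positive and equal false-positive rates, up to threshold tau.
    tpr_gap = abs(rate(Y_hat, Y, A, "alpha", 1) - rate(Y_hat, Y, A, "beta", 1))
    fpr_gap = abs(rate(Y_hat, Y, A, "alpha", 0) - rate(Y_hat, Y, A, "beta", 0))
    return (1 - tpr_gap >= tau) and (1 - fpr_gap >= tau)

# Hypothetical oracle labels Y, system outputs Y_hat and protected attribute A:
Y     = np.array([1, 1, 0, 0, 1, 0, 1, 0])
Y_hat = np.array([1, 0, 0, 1, 1, 0, 1, 0])
A     = np.array(["alpha", "beta", "alpha", "beta", "beta", "alpha", "beta", "alpha"])
print("Separation test passed:", separation_test(Y_hat, Y, A, tau=0.8))
```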

Requires possibility to query inputs: No.

The tester only needs the system outputs for the selected inputs. A predefined set of test cases can be handed over to a mediator to perform the tests.


Evaluation: Quantitative.

The evaluation is based on the distribution of results for a large amount of test data.


Effort: Medium.

While computing fairness measures involves only little effort, an appropriate selection of measures and test data needs to be specified in advance, which might result in considerable effort. There is an ongoing discussion about which measures are deemed appropriate under which conditions (Hauer et al., 2021).


Automation capability: Fully.

Once measures, test data, and respective thresholds are available, calculation of the chosen measures can be fully automated.


Threshold: Yes.

As already explained, a threshold is needed that states what target value should be reached by the measure in order for the test to pass. Depending on the chosen fairness measure(s), the number of protected attributes and the number of groups per protected attribute, there might be multiple thresholds to define.

3.2.2 Testing Bias by Computing Counterfactual Fairness

A counterfactual is a duplicated data point that differs from its original in only one parameter (Pearl, 2009, p. 120). Thus, counterfactuals can be used for various bias tests, by comparing the model output for an original test input and its counterfactual. If they result in the same output, counterfactual fairness is fulfilled for this input (Wu et al., 2019).

The direct use of counterfactuals as test data has a weakness, however, because it ignores the fact that changes in one parameter in real data are usually accompanied by changes in other parameters as well due to causal dependencies. Having a data point that lists gender and shoe size, a counterfactual would be to simply change the gender of the data point and ignore that there is a causal dependency between shoe size and gender.

Therefore, a different kind of counterfactual fairness according to Kusner et al. (2017) goes one step further and tries to consider such dependencies. In their example, not only the gender but also the shoe size would be adjusted to represent a realistic data point. To transform this consideration into a procedure, a so-called causal graph is created in advance. A causal graph tries to model all causal relationships of parameters to each other (Wu et al., 2019). Which parameters are causally related to each other can only be extracted from the data to a limited extent, which is why expert or domain knowledge is particularly relevant here, but the result also depends on subjective decisions. Once the causal graph has been created, the extent of the mutual dependencies is determined in the form of coefficients, based on the available data. There are various procedures for this, as, for example, described by Di Stefano et al. (2020). After the causal graph and the influencing factors have been determined, new counterfactuals can be constructed by creating a duplicate of a data point for which one parameter is specifically changed (e.g. gender) and all causally dependent parameters are adjusted according to the causal graph and the coefficients (Makhlouf et al., 2020)Footnote 6:

$$\begin{aligned} \Pr ({\hat{Y}}_{A\leftarrow \alpha }=y\vert X=x, A=\beta )=\Pr ({\hat{Y}}_{A\leftarrow \beta }=y\vert X=x, A=\beta ) \end{aligned}$$

The bias test consists of checking, for every data point in the test dataset and its counterfactual, whether counterfactual fairness is fulfilled. It might be unlikely that an ADM system can completely satisfy counterfactual fairness; therefore, a threshold needs to be specified that states, for example, what percentage of the test dataset needs to fulfill counterfactual fairness for the test to pass (Kusner et al., 2017).
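The following minimal sketch illustrates this procedure on a toy causal graph in which gender influences shoe size; the generated data, the predict function standing in for the black box, and the chosen threshold are hypothetical assumptions for illustration only:

```python
# Sketch of a counterfactual fairness test in the spirit of Kusner et al. (2017),
# using a toy causal graph gender -> shoe_size (data, the `predict` placeholder
# for the black box and the threshold are hypothetical).
import numpy as np

rng = np.random.default_rng(1)
n = 500
gender = rng.integers(0, 2, size=n)                           # protected attribute A
shoe_size = 38 + 4 * gender + rng.normal(scale=1.0, size=n)   # causally depends on A

def predict(gender, shoe_size):
    # Placeholder for the black-box ADM system (a deliberately biased toy classifier).
    return (0.3 * gender + 0.05 * shoe_size > 2.2).astype(int)

# Step 1: estimate the coefficient of the causal edge A -> shoe_size from the data.
coef = np.polyfit(gender, shoe_size, deg=1)[0]

# Step 2: construct counterfactuals by flipping A and adjusting all causally
# dependent parameters according to the causal graph and the coefficient.
gender_cf = 1 - gender
shoe_size_cf = shoe_size + coef * (gender_cf - gender)

# Step 3: a data point is counterfactually fair if the output does not change;
# a threshold states which fraction of the test data must fulfill this.
fair = predict(gender, shoe_size) == predict(gender_cf, shoe_size_cf)
threshold = 0.95
print(f"Counterfactually fair fraction: {fair.mean():.2f}",
      "(test passed)" if fair.mean() >= threshold else "(test failed)")
```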


Requires possibility to query inputs: No.

The tester only needs the system outputs for the selected inputs. A predefined set of test cases can be handed over to a mediator to perform the tests, though this has to be done (at least) two times: Once to get the information necessary to compute the coefficients of the causal graph and once for computing counterfactual fairness based on the counterfactual examples constructed with the causal graph.


Evaluation: Quantitative.

The evaluation is based on the distribution of results for a large amount of test data and is therefore statistical in nature.


Effort: High.

While computing this specific fairness measure involves only little effort, an appropriate selection of test data and the construction of a suitable causal graph based on domain knowledge most likely results in considerable effort. Finding a suitable causal graph might be an iterative process (Wu et al., 2019). Additionally, there is a lot of discussion about which fairness measures are deemed appropriate under which conditions (Hauer et al., 2021).


Automation capability: Fully.

Once the causal graph, test data and a threshold are available, calculation of the measure can be fully automated.


Threshold: Yes.

Since in practice it is almost impossible to completely fulfill counterfactual fairness, a threshold is needed that states what target value should be reached by the measure in order for the test to pass.

3.3 Test Methods that Do Not Require an Oracle

Since an oracle is not always at hand, test methods that do not require an oracle have been developed as well. They address the oracle problem either by assuming a similar system as oracle or by only searching for deviations based on protected attributes alone, independent of what could be considered correct. For such methods, test data might be useful, but is not necessary. In many cases, unlabeled test data can be generated for testing purposes and does not need to be available beforehand. The selection or generation of appropriate test cases will be discussed separately in Sect. 3.4.

3.3.1 Fairness Test Not Based on an Oracle

There are fairness measures that are only based on an expected or socially desirable distribution of outcomes. The fairness measure of independence, for example, is fulfilled, if for a given test data set the same percentage of both groups results in the positive class (Barocas et al., 2019):

$$\begin{aligned} Pr\{{\hat{Y}}=1\vert A=\alpha \}=Pr\{{\hat{Y}}=1\vert A=\beta \} \end{aligned}$$

Testing with such measures is analogous to testing with fairness measures based on an oracle (see Sect. 3.2.1). The classification based on our taxonomy is also analogous.
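As with the oracle-based measures, a threshold is needed in practice. A minimal sketch of an independence-based test (with hypothetical system outputs and a hypothetical threshold) could look as follows:

```python
# Sketch of a test based on the independence measure: the positive rate must be
# (approximately) equal for both groups, no oracle required (hypothetical data).
import numpy as np

def independence_gap(Y_hat, A):
    """Estimate |P(Y_hat = 1 | A = alpha) - P(Y_hat = 1 | A = beta)| from the outputs."""
    return abs(Y_hat[A == "alpha"].mean() - Y_hat[A == "beta"].mean())

# Hypothetical system outputs and protected attribute values:
Y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 0])
A = np.array(["alpha", "alpha", "alpha", "alpha", "beta", "beta", "beta", "beta"])
tau = 0.8  # hypothetical threshold
print("Independence test passed:", 1 - independence_gap(Y_hat, A) >= tau)
```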

3.3.2 Differential Testing and Back-to-back Testing

In differential testing, the system under test S is compared to another system A that has the same or very similar functionality (Evans & Savoia, 2007), for example, two systems that predict a credit score. A could also be an older version \(S'\) of system S; in this specific case, the method is called back-to-back testing (Vouk, 1988). System A could also be that of a competitor. Or, if system S makes predictions about the future behavior of individuals—for which there is not yet a ground truth—it might even be compared to the judgement of human experts. In any of these cases, the answers by system A (or the human experts) are assumed to be an oracle.

The test method consists of feeding a large set of inputs to both systems and comparing the respective outputs (McKeeman, 1998). Depending on the specific implementation, different variations of the procedure are possible (see Fig. 2). There could be a purely statistical evaluation of the number of cases in which A and S contradict each other. If there are too many discrepancies for a given type of input, the test fails (Petsios et al., 2017).

However, discrepancies can also come from the fact that the systems do not have exactly the same, but only very similar functionality. Thus, the test method can only be used to detect discrepancies in behavior, which then have to be investigated by hand (Groce et al., 2007). This approach is particularly interesting when multiple (potentially also black-box) systems are used as A for comparison (Pei et al., 2017). The advantage of such an approach is that the effort to manually label test data can be limited to those cases where the results contradict each other. Furthermore, these edge cases can then be added to the training data to improve the ADM system S. The disadvantage lies in the danger that all compared systems could have been created by developers who tend to make the same mistakes or have the same bias (Knight & Leveson, 1986). This may produce faulty outputs which cannot be detected based on this kind of test.
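A minimal sketch of both evaluation variants is given below; the functions system_s and system_a are hypothetical placeholders standing in for S and A:

```python
# Sketch of differential (or back-to-back) testing: the same inputs are fed to
# the system under test S and an alternative system A, and discrepancies are
# evaluated (`system_s` and `system_a` are hypothetical placeholders).
import numpy as np

rng = np.random.default_rng(2)
inputs = rng.normal(size=(1000, 4))

def system_s(x):   # system under test S (placeholder)
    return (x[:, 0] + 0.5 * x[:, 1] > 0).astype(int)

def system_a(x):   # alternative system A or older version S' (placeholder)
    return (x[:, 0] + 0.45 * x[:, 1] > 0).astype(int)

disagree = system_s(inputs) != system_a(inputs)

# Variant 1: purely statistical evaluation -- fail if too many outputs contradict.
max_discrepancy_rate = 0.05  # hypothetical threshold
print("Statistical check passed:", disagree.mean() <= max_discrepancy_rate)

# Variant 2: treat A only as an approximate basis for comparison -- collect the
# contradicting inputs for manual inspection (and possible labeling).
cases_to_inspect = inputs[disagree]
print("Cases to inspect manually:", len(cases_to_inspect))
```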

Fig. 2 There are at least two largely different ways to compare the output of the tested system with the output of the alternative system. Figure by Algorithm Accountability Lab [Prof. Dr. K. A. Zweig]/CC BY


Requires possibility to query inputs: No.

The results of the tests on S are compared with the tests already performed on A and \(S'\) respectively. The predefined set of test cases can be handed over to a mediator to perform the tests.


Evaluation: Qualitative or quantitative.

If the alternative system is considered as oracle, the evaluation can be based on the distribution of results for a large amount of test data (quantitative evaluation). If the alternative system is only assumed to be an approximate basis for comparison, discrepancies have to be investigated by hand which leads to an individual evaluation of results (qualitative evaluation).


Effort: High or low.

The effort required for differential testing depends on whether an alternative system is already available or not. If there is a system with very similar functionality or if the system is to be compared with an older version (back-to-back testing), the effort is relatively low. However, if there is no suitable alternative system, the only option is to develop an alternative system as a basis for comparison [which is usually called a pseudo oracle (Davis & Weyuker, 1981)]. This procedure only makes sense under certain circumstances. The alternative system must be confirmed to have the property to be tested, for example, by being more testable, otherwise there is no added benefit. If it is possible to develop a more testable system of the same quality, there is no reason to use the actual system at all. A better testable alternative system will usually have a poorer prediction quality in practice, which makes it of only limited use as an oracle. In the best case, therefore, it is a procedure that supports the search for bias and errors at the expense of a high level of effort.


Automation capability: Medium or fully.

The test method can be fully automated if the alternative system is assumed to be a perfect oracle as any deviations are directly evaluated as errors of the system to be checked. If this assumption is not given, deviations must be examined more closely and investigated manually.


Threshold: Yes and no.

In general, no thresholds are required for the test method. The only exception is the variant in which the alternative system is assumed to be a perfect oracle and the test is considered failed if there are a certain number of deviations.

3.3.3 A/B Testing

For A/B testing, two or more systems (or variants of the same system) are compared by performing the same test activities on them (Kohavi & Longbotham, 2017; Young, 2014). This procedure is also known as bucket testing, split testing, or controlled experiment (Xu et al., 2015). A/B testing does not assess “what would happen if a data point had a different value”, but “what would happen if the system had a different decision structure”, which is why it is also referred to as counterfactual reasoning (Gilotte et al., 2018).

A/B testing is frequently used to improve different variations of configurations of online platforms in terms of user experience (Cruz-Benito et al., 2017; Siroker & Koomen, 2013). It is a meta-testing framework that builds on other test methods to evaluate a property. Breck et al. (2017) also discuss the concept of testing a complex model against a simple one to assess the need for a complex model. Thus, A/B testing is also useful for testing whether a white-box model would be sufficient. The abstract character of this framework is shown by the fact that different test methods can be applied in the process of an A/B test (Kohavi & Longbotham, 2017). While the previously discussed taxonomy depends on the respective test method, each of them can be used for A/B testing.

Differential testing can be understood as a special form of A/B testing in which one system is compared to a predefined reference system.

3.3.4 Adversarial Testing

Adversarial testing is an approach based on the success rate of finding so-called adversarial examples. An adversarial example is a slightly modified example input that was created specifically to result in an incorrect model output (Goodfellow et al., 2014). When this method is used with maleficent intention, it is also called an adversarial attack. Artificial neural networks in particular are considered to be especially susceptible to this type of attack. There are different methods for finding adversarial examples depending on how much information a tester (attacker) has. In the case of black-box systems, the tester only has information about the system input and system output. To the best of our knowledge, this leaves only the method of decision-based attacks for finding adversarial examples (Brendel et al., 2017). In this case, the tester queries the system to be tested with synthetic inputs selected by a heuristic process to train a substitute model that mimics the decision boundaries. To find such decision boundaries, the system needs to be probed with input data that is iteratively adjusted depending on the respective system output. As the internal mechanisms of this substitute model can be observed, various algorithms can be applied to find adversarial examples that also work on the original model (Papernot et al., 2017). According to Brendel et al. (2017), this method works well for systems with low intra-class variability; for other systems, there is a lack of experience.

While with classical adversarial testing the goal is to identify inputs that lead to wrong outputs, this method can also be used in the context of bias detection by identifying input vectors for which a change of a protected parameter results in a change of the system output. Based on the assumption that the protected property should not change the result, such a change is an indicator of a mistake independent of which of the two decisions was correct.

Based on Rice’s theorem (Rice, 1953), Raghunathan et al. (2018)Footnote 7 argue that it is algorithmically impossible to verify whether an ADM system is completely robust; therefore, there can be no absolute protection against wrong individual cases. A measure is required that states under which circumstances robustness in the sense of adversarial tests (and thus against different results in consequence of a change of a protected parameter) is regarded as sufficient; this is called an exit criterion (Felderer et al., 2019). It may be the duration until an erroneous output has been found or the number of attempts necessary to find an adversarial example.
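The following strongly simplified sketch illustrates such a bias-oriented search as a random local search rather than a full decision-based attack; the query function standing in for the black box, the mutation step size and the iteration budget (the exit criterion) are hypothetical assumptions:

```python
# Strongly simplified sketch of a bias-oriented adversarial search: inputs are
# mutated until flipping the protected attribute A changes the system output
# (the black-box `query` function is a hypothetical placeholder).
import numpy as np

rng = np.random.default_rng(3)

def query(x, a):
    # Placeholder for the black-box classifier (a: protected attribute value).
    return int(x.sum() + 0.4 * a > 0.5)

def find_bias_example(n_features=3, max_iterations=1000, step=0.1):
    x = rng.normal(size=n_features)
    for i in range(max_iterations):
        if query(x, a=0) != query(x, a=1):
            return i, x                                       # output flips with A alone
        x = x + rng.normal(scale=step, size=n_features)       # mutate the input
    return None                                               # exit criterion reached

result = find_bias_example()
if result is None:
    print("No bias example found within the iteration budget (exit criterion)")
else:
    iterations, example = result
    print(f"Output flips with the protected attribute after {iterations} iterations")
```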


Requires possibility to query inputs: Yes.

Since the inputs to be made depend heavily on the respective outputs previously received, it is impossible to predefine the inputs. In order to carry out the test method efficiently [potentially hundreds or thousands of iterations are necessary (Papernot et al., 2017, p. 515)], the test code must be able to make the requests itself, i.e., it needs the possibility to query inputs directly.


Evaluation: Qualitative.

To evaluate whether the robustness against adversarial attacks is considered adequate, various measures can be used, e.g., the time passed or the number of iterations until an adversarial example has been found. In either case the evaluation is based on individual results and, therefore, of qualitative nature.


Effort: High.

Depending on the method used, the effort required to implement and improve the test is high to very high (Qiu et al., 2019).


Automation capability: Fully.

Manual execution is hardly possible which is why automation is necessary. Even though adversarial tests can be fully automated for a specific problem, the respective adaptation to such a problem, i.e. the test preparation, must be done manually. This includes selection and adaptation of the method (Papernot et al., 2017, p. 512).


Threshold: Yes.

As the evaluation of adversarial testing is a matter of implementation, the selection of thresholds may vary greatly. In any case, though, a threshold is needed that defines the boundary between successful and unsuccessful.

3.3.5 Metamorphic Testing

Another approach to alleviate the test oracle problem is metamorphic testing (Chen et al., 1998; Segura et al., 2016). The basic concept is to develop assumptions about the input–output relationship of the system, so-called metamorphic relations, and then to adapt test cases in order to test these assumptions. The metamorphic relations describe how changes in (individual) inputs should affect the output of the investigated system as formally as possible (Chen et al., 2003).

For example, in a system for selecting applicants for a job, if two applicants A and B differ only in gender and years of experience, the person with more years of experience is expected to get a higher score. The metamorphic relation in this case states: If A and B are equal but A has more years of experience, A should get an equal or higher outcome, independent of gender. Ma et al. (2020) use the method to inspect the influence of changing gender-specific words (e.g. actor vs. actress) in a text corpus by comparing the respective system outputs of various NLP services.

In both examples, the test method can be considered a variation of counterfactual testing, but more sophisticated metamorphic relations can be defined as well, which would not fall under the term counterfactual testing, e.g.: If A has more years of experience than B but is younger, A should generally have a higher outcome, independent of any other input information, including gender. In this case, the test does not need to be performed on duplicated and modified test data which do not resemble any real data. It is also possible to define multiple (non-contradictory) metamorphic relations that all have to hold. This way, the design flexibility of metamorphic relations allows investigating more complex questions around bias.
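A minimal sketch of how the first metamorphic relation could be checked automatically is given below; the score function standing in for the black box, the tolerance and the pass threshold are hypothetical assumptions:

```python
# Sketch of a check for the metamorphic relation "if A equals B but has more
# years of experience, A's score should not be lower, independent of gender"
# (the black-box `score` function, tolerance and pass threshold are hypothetical).
import numpy as np

rng = np.random.default_rng(4)

def score(experience, grade, gender):
    # Placeholder for the black-box scoring system.
    return 2.0 * experience - grade + 0.3 * gender + rng.normal(scale=0.01)

def relation_holds(experience, grade, gender, tolerance=0.0):
    base = score(experience, grade, gender)
    # Follow-up test case: more experience, different gender, all else equal.
    follow_up = score(experience + 1, grade, 1 - gender)
    return follow_up >= base - tolerance

# Evaluate the relation on a set of (unlabeled) source test cases.
results = [relation_holds(exp, grd, gen)
           for exp, grd, gen in zip(rng.integers(0, 20, 100),
                                    rng.uniform(1.0, 4.0, 100),
                                    rng.integers(0, 2, 100))]
violation_rate = 1 - np.mean(results)
print("Metamorphic test passed:", violation_rate <= 0.05)  # second threshold
```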


Requires possibility to query inputs: No.

The tester only needs the system outputs for the selected inputs. A predefined set of test cases can be handed over to a mediator to perform the tests.


Evaluation: Quantitative.

The evaluation is based on the number of results that do or do not fulfill the metamorphic relation for a large amount of test data.


Effort: Medium.

While performing metamorphic testing approaches involves only little effort, appropriate metamorphic relations need to be specified in advance. This task is usually performed by a domain expert or experienced programmer and may result in considerable effort (Kanewala & Bieman, 2013; Segura et al., 2016).


Automation capability: Fully.

Once a metamorphic relation and a function to compare the results have been specified, the test method can be completely automated.


Threshold: Yes.

As in practice it is almost impossible to completely satisfy metamorphic relations addressing bias, a threshold is needed that states how much results may differ to be considered as fulfilling the metamorphic relation. Additionally, a second threshold might be necessary that states the maximum number of test instances for which the metamorphic relations may not be fulfilled for the test to pass.

As all of the explained test methods require data to perform the test on, the following section introduces methods for test case selection.

3.4 Selection of Test Cases

There are two general kinds of test case selection. For statistical evaluations (which is the case for all fairness measures) the goal is to make a selection of test data with respect to a certain criterion or distribution. For example, the selected data shall reflect real world distributions as well as possible, or different subgroups shall be equally represented, regardless of the real world distribution (Moser, 1952).

When a test method aims to find single erroneous results (which is the case for all tests that make qualitative evaluations, but also for metamorphic testing), however, the goal is either to achieve the best possible test coverage, i.e., testing for all relevant eventualities, or to systematically search for inputs which result in unwanted outputs, as adversarial testing does. For real-world applications, it is usually not possible to test the entire input space, regardless of which test is to be performed, because the number of possible input combinations is simply too large to consider every case (Gotlieb & Marijan, 2014). Therefore, there are various methods to reduce the number of test cases by limiting them to certain input parameter combinations, which are discussed under the broader term of combinatorial testing (Kuhn et al., 2013; Nie & Leung, 2011). The name is somewhat misleading: in our categorization, it is a data selection method and not a stand-alone test method.

3.4.1 Exploratory Testing

Exploratory testing is a fancy name for generating inputs based on human intuition. Itkonen and Rautiainen review various nuances of the concept (Itkonen & Rautiainen, 2005, Chap. 3).

3.4.2 Pairwise Testing and Orthogonal Array Testing

Pairwise testing relies on covering each combination of values for any two parameters at least once. If there are three input parameters that can each take 3 discrete values, 27 unique value combinations are possible. In this full set, however, each pairwise combination occurs three times; for example, the pair 1A occurs in 1A\(\alpha,\) 1A\(\beta,\) and 1A\(\gamma.\) By selecting inputs in a way that any pairwise combination of parameter values occurs at least once (exactly once if possible), the number of value combinations to test for is reduced to 9 (see Fig. 3).

Fig. 3 Considering an input size of 3 parameters \(x_1, x_2, x_3\) for which each can take on 3 discrete values \(x_1=1,2,3, x_2=A,B,C\) and \(x_3=\alpha ,\beta ,\gamma,\) the complete input space consists of 27 combinations. With pairwise testing this space is reduced to 9 combinations. The complete input space contains each input pair three times, while with pairwise testing the same pair occurs only once. Figure by Algorithm Accountability Lab [Prof. Dr. K. A. Zweig]/CC BY

In case of continuous values, there needs to be some sort of discretization that divides the value space into ranges. The number of tests to be run can thus be drastically reduced while still providing a well-distributed coverage of the underlying decision rules (Cohen et al., 1996). Orthogonal array testing presents a special form of pairwise testing that allows the combination of any n factors (instead of only 2) and demands that each possible value combination of these factors is tested equally often (Hedayat et al., 2012, p. 2). The selection of n is a test design decision.
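In practice, covering arrays are usually generated by combinatorial testing tools; the following sketch simply hardcodes a standard L9 orthogonal array for the example from Fig. 3 and verifies that the nine selected test cases indeed cover every pairwise value combination:

```python
# Sketch of the pairwise reduction from Fig. 3: a standard L9 orthogonal array
# yields 9 test cases for 3 parameters with 3 values each; the check below
# verifies that every pairwise value combination is covered.
from itertools import combinations, product

values = [["1", "2", "3"], ["A", "B", "C"], ["alpha", "beta", "gamma"]]

# L9 orthogonal array (rows are indices into the value lists).
l9 = [(0, 0, 0), (0, 1, 1), (0, 2, 2),
      (1, 0, 1), (1, 1, 2), (1, 2, 0),
      (2, 0, 2), (2, 1, 0), (2, 2, 1)]
test_cases = [tuple(values[p][i] for p, i in enumerate(row)) for row in l9]

# For every pair of parameters, all 9 pairwise value combinations must occur.
for p1, p2 in combinations(range(3), 2):
    covered = {(tc[p1], tc[p2]) for tc in test_cases}
    assert covered == set(product(values[p1], values[p2]))

print(f"{len(test_cases)} of 27 test cases cover all pairwise combinations")
```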

3.4.3 Fuzz Testing

Classic fuzz testing is used to find test cases that result in system crashes. For this purpose, a collection of initially random input vectors is generated and passed to the system for execution. In each iteration the queried input vector is mutated, which means that the initial random input is changed within a certain range (for example, on a limited number of parameters). If the system exhibits an interesting behavior, the corresponding input is stored. In the context of debugging, an interesting behavior could be, for example, an error message that does not lead to a system crash, or longer delays in the computation. The generation of further input vectors focuses increasingly on the mutation of inputs that have previously triggered interesting behavior (Klees et al., 2018).

This method of systematic test case generation can be used for bias tests on a scoring system. First, a random input vector is generated and duplicated with different values of the protected attribute (\(X=X_1,\ldots X_n,A\) and \(X^{\prime} = X_1,\ldots X_n,A^{\prime}\)), for which the system computes the outputs (Y and \(Y^{\prime}\)). The procedure is repeated with mutated inputs (\(X_\epsilon =X_1+\epsilon _1,\ldots X_n+\epsilon _n\)) until the difference between outputs for inputs that only differ in the protected attribute exceeds a certain threshold (\(\vert Y_\epsilon - Y^{\prime}_\epsilon \vert > \tau \)), even if this deviation would not yet lead to a different categorization based on the score.

These input vectors can now be used as test cases or as inputs that result in interesting behaviour. Based on these inputs, additional, randomly mutated input vectors that further increase the deviation of the system outputs can be constructed.
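A strongly simplified sketch of this fuzzing loop for a scoring system is given below; the score function standing in for the black box, the mutation scale and the threshold \(\tau \) are hypothetical assumptions:

```python
# Sketch of the bias-oriented fuzzing loop: a random input vector and a
# duplicate that differs only in the protected attribute A are mutated until
# the score difference exceeds a threshold tau (the black-box `score` function
# is a hypothetical placeholder).
import numpy as np

rng = np.random.default_rng(5)

def score(x, a):
    # Placeholder for the black-box scoring system (x: unprotected inputs, a: protected attribute).
    return float(0.5 * x[0] - 0.2 * x[1] + 0.1 * x[2] + 0.05 * a * x[0])

def fuzz_for_bias(tau=0.2, max_iterations=10000, step=0.05):
    best_x = rng.normal(size=3)                              # initial random input vector
    best_diff = abs(score(best_x, 0) - score(best_x, 1))
    for _ in range(max_iterations):
        candidate = best_x + rng.normal(scale=step, size=3)  # mutate the input
        diff = abs(score(candidate, 0) - score(candidate, 1))
        if diff > best_diff:                                 # "interesting behavior": larger deviation
            best_x, best_diff = candidate, diff
        if best_diff > tau:
            break                                            # store as interesting test case
    return best_x, best_diff

x_interesting, deviation = fuzz_for_bias()
print(f"Largest score deviation found for a protected-attribute flip: {deviation:.3f}")
```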

Such methods of test case generation are discussed especially in the context of differential testing and metamorphic testing (Petsios et al., 2017; Zhu, 2015).

4 Auditing a Black Box

According to Sandvig et al. (2014) there are five general concepts for algorithm audits, of which four are applicable for black-box systems.Footnote 8 In this section, we discuss these four audit forms and which of the test methods presented are suitable for analyzing the resulting data.

Table 3 Compatibility of test methods and auditing concepts for black-box analysis. With the term “fairness measures” we refer to both types of fairness measures: those based on a ground truth and those not based on a ground truth (see Table 2)

4.1 Noninvasive User Audit

For a noninvasive user audit (Fig. 1A), the users are observed while using a system and the output of the system is then analyzed (Sandvig et al., 2014). The auditing entity has no influence on the inputs made by the user; thus, the evaluation of the data (in the context of bias) is limited to a manual examination of individual cases and statistical evaluations, such as fairness measures (see Table 3). This method could be used, for example, to inspect whether posts from women appear as often in the Facebook News Feed as posts from men. The NYU Ad ObservatoryFootnote 9 from the NYU Online Political Transparency Project serves as an example of a noninvasive user audit study. In this initiative, during September 2020, participants were invited to install a browser add-on to gather and forward political advertisements encountered while navigating Facebook.

4.2 Crowdsourced Audit

A crowdsourced audit (Fig. 1B) is based on the participation of real users. Instead of making use of real user behaviour, the participants enter predefined queries or let a program use their profile and interface to automatically enter predefined inputs (Sandvig et al., 2014). Reber et al. (2020), for example, used the technique to assess whether patients are actively targeted with advertisements for unproven stem cell therapies by checking advertisements after keywords such as ‘Parkinson’ were typed into Google. Another example is the evaluation of personalization in search engines performed by Krafft et al. (2019).

Since the influence of the user profiles is not known in advance and can only be determined statistically, it is hardly possible to adapt the test setup to account for this influence or to test in a more targeted manner. On the other hand, as long as users are selected uniformly at random, bias with respect to the user profiles can still be detected. This means that here, too, the primary options are to evaluate individual cases or to perform statistical evaluations.

The applicable test methods in such an audit depend on whether an oracle is available or not (see Table 2). Obviously, fairness measures based on a ground truth can only be used if an oracle is available. To perform differential testing, the users must submit their input to both systems and retrieve both outputs. The results can then be used for further investigation.
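A minimal sketch of how such a differential test could be organized on top of crowdsourced data is shown below; `system_a` and `system_b` are hypothetical query functions standing in for the two comparable systems the participants interact with.

```python
def differential_test(user_inputs, system_a, system_b, tolerance=0.0):
    """Collect all crowdsourced inputs on which two comparable systems
    disagree by more than `tolerance`; the disagreements are candidates
    for a subsequent manual, qualitative inspection."""
    disagreements = []
    for x in user_inputs:
        y_a, y_b = system_a(x), system_b(x)
        if abs(y_a - y_b) > tolerance:
            disagreements.append((x, y_a, y_b))
    return disagreements
```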

4.3 Sock Puppet Audit

For a sock puppet audit (Fig. 1C), human interaction is simulated by programming bots that act as if they were human (Sandvig et al., 2014). The behavior of the bots can be modified at any time, which in principle allows any of the previously discussed test methods to be applied (see Table 3). This audit methodology has been used to examine personalization practices on the internet, particularly in the context of differential pricing strategies (see Hannák et al., 2013; Mikians et al., 2012), Google Ads (Datta et al., 2015) and Google Search (Krafft et al., 2019). Still, the selection is limited according to Table 2. What this restriction looks like can be determined with the help of a few questions (see Fig. 4):

  • Is an oracle available? If no oracle is available, it is not possible to calculate fairness measures based on a ground truth or to compute counterfactual fairness (F).

  • Can interaction with the system only take place via a mediator? If the system can only be reached via a mediator, it is not possible to make the necessary adjustments to the input at runtime for adversarial testing (A).

  • Is the test supposed to be fully automated? If the tests are to be performed on a regular basis and therefore must be automated, the qualitative evaluation based on differential testing (D) is not an option.

Consequently, metamorphic testing (M) or fairness measures not based on a ground truth (N) can always be applied. In general, a sock puppet audit is useful when the influence of human interaction (such as typing or clicking) is to be part of the evaluation, or when there is no API through which the system can be systematically tested (see, for example, Krafft et al., 2020). The challenge of creating bots that are not recognized as bots by the system under test should not be underestimated (Krafft et al., 2020).
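The decision logic behind these questions (cf. Fig. 4) can be summarized in a small helper, sketched below under the assumption of boolean answers; the letter G is our shorthand for fairness measures based on a ground truth, the remaining letters follow the list above.

```python
def applicable_test_methods(oracle_available, mediator_only, fully_automated):
    """Map the three audit questions to the applicable test methods:
    F counterfactual fairness, A adversarial testing, D differential testing,
    M metamorphic testing, N/G fairness measures without/with a ground truth."""
    methods = {"M", "N"}              # always applicable
    if oracle_available:
        methods |= {"F", "G"}         # both require an oracle / ground truth
    if not mediator_only:
        methods.add("A")              # needs input adjustments at runtime
    if not fully_automated:
        methods.add("D")              # relies on qualitative evaluation
    return methods

# Example: no oracle, direct access, manual evaluation -> {'M', 'N', 'A', 'D'}
# print(applicable_test_methods(False, False, False))
```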

Fig. 4 Venn diagram showing how the questions of whether an oracle is available, whether interaction with the system can only take place via a mediator, and whether the test must be fully automated limit the selection of possible test methods in the context of a scraping or sock puppet audit. Figure by Algorithm Accountability Lab [Prof. Dr. K. A. Zweig]/CC BY

4.4 Scraping Audit

Scraping audits (Fig. 1D) are based on previously defined queries that are automatically transferred to the system, for example, via an API or a browser control system (Sandvig et al., 2014). Since the queries can be modified as desired, also depending on previous responses, the same test methods as for a sock puppet audit can be applied (see Fig. 4).

Access to an API may be limited to a given number of requests per time unit or to a certain maximum amount of input data. This problem can be addressed if the audit is extended to include a crowdsourced or sock puppet approach by having many participants who either make their own queries and donate the results for evaluation, or transmit prepared queries. In the case of a crowdsourced approach, the applicable tests are limited to those listed in Sect. 4.2. This form of audit is not appropriate if specific user behavior, like mouse cursor movement or delay times due to a user reading and processing information, is part of the system input. Hauer et al. performed a scraping audit to retrieve the data when comparing the h-indices of hundreds of the world's most renowned scientists as provided by various platforms (Hauer et al., 2020).
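The following is a minimal sketch of a scraping audit loop that respects a request budget and applies the metamorphic-style check described above (each query is duplicated with a different value of the protected attribute); `query_system` is a hypothetical stand-in for the API or browser-control call, and the rate limit is an assumption.

```python
import time

def scraping_audit(inputs, query_system, protected_values=("A", "A'"),
                   requests_per_minute=30, threshold=0.1):
    """Submit predefined query pairs that differ only in the protected
    attribute and report pairs whose outputs deviate by more than `threshold`."""
    findings = []
    for features in inputs:
        y = query_system(features, protected_values[0])
        y_prime = query_system(features, protected_values[1])
        if abs(y - y_prime) > threshold:
            findings.append((features, y, y_prime))
        time.sleep(120.0 / requests_per_minute)  # budget: two requests per pair
    return findings
```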

5 Discussion

Finding and proving bias in the decisions computed by a black-box system is a difficult task, as a testing or auditing authority only has limited information about the system under test and the scientific field of black-box testing is still developing.

This paper addresses black-box testing (according to the definition in Sect. 2.3) and auditing concepts relevant for bias testing and auditing of ADM systems. While we aimed to provide a full overview, the literature on this topic is vast and presents multiple fine-grained variants of basically the same ideas; we have tried to focus on these abstract ideas here. Additionally, different techniques are often combined with each other in practice, e.g., by Tramer et al. (2017) and Udeshi et al. (2018).

The elaboration of methods and concepts suitable for bias testing and auditing of black-box systems in this paper was intended to provide support when selecting a test method for a real-life bias investigation. We have shown that several factors have to be considered for this selection; the suitable class of test methods can be identified with the help of the taxonomy presented (see Table 2), as can the likely challenges of a given class. Based on the fact that auditing concepts obtain information in different ways from a black-box system, we have also provided a mapping between test methods and audit concepts.

The research for this paper has shown that the terminology regarding black-box testing is generally rather ambiguous, which makes finding information and matching similar or equal concepts difficult. One reason for this is the lack of established textbooks addressing the kind of black-box testing and auditing needed for ADM systems. Another is that there are not yet dedicated conferences or journals that collect and condense the scientific work revolving around black-box testing and auditing. Scientists have not only worked on these topics in different domains and on different publication platforms, they have also published some similar ideas under different names and some different ideas under the same name (see, e.g., the definition of black-box testing in Sect. 2.3).

It has to be noted that all test methods and auditing concepts are fit to prove bias, but none of them is fit to disprove it. This is a general problem that results from the fact that it is not possible to test the complete input space (Salem et al., 2004).

Last but not least, the taxonomy still leaves room for subjective decisions on the selection of audit concepts and test methods, their concrete implementation, and the setting of various parameters within each of the test methods. For example, in 2017, the authors T. Krafft and K. A. Zweig (together with M. Gamer) analyzed the degree of personalization of Google search results during a German federal election (Krafft et al., 2019). The external conditions, for example, precluded the use of an API approach. Using the taxonomy would have left us with multiple options: among others, we could have used either a sock puppet audit or a crowdsourced audit. We chose the latter, which required deciding which German voters to sample; we opted for self-selection. For the test, we designed and used many different ways to measure the personalization of the obtained search results. This is just a small subset of all the decisions we needed to make or that were imposed on us by multiple external constraints. If the results of testing and auditing are to be used to validate or refute legal or social claims, the choices made in selecting testing methods and auditing concepts, as well as in setting all other parameters and conditions, should be carefully documented and justified.

The next years will have to show how to best tackle the persisting problem of biased decision-making by black-box systems; this paper provides a first basis for the selection of appropriate auditing concepts and suitable test methods in that process.