1 Introduction

Machine learning (ML) models are increasingly leveraged to assist in consequential decision-making across various societal domains. They are used to predict defendants' recidivism risk, allocate healthcare services, forecast student success, inform social workers about children at risk, and identify suitable applicants in hiring decisions. In each of these domains, cases in which the respective models have shown bias against marginalized social groups are well established [7, 17, 58]. Hence the concern that the involvement of ML models reinforces existing inequalities.

This concern has animated an interdisciplinary research program, named fair ML. At its core, fair ML can be characterized by two goals: first, it aims at developing suitable formal fairness criteria through which it can be evaluated whether, based on observable statistical properties, individuals are treated equally—regardless of group membership. Second, it seeks to identify means for constructing ML models that meet these criteria—e.g., by remedying model bias (for review, see [8, 54]). However, while it is a merit of formal fairness notions to make disparities tangible, the belief that merely focusing on statistical properties is insufficient for evaluating fairness in ML models is gaining traction [28].

The shortcomings of formal fairness notions become apparent once we shift toward a sociotechnical view of ML models. Crucially, such a view entails that ML models are not treated as isolated entities, but rather as tools used for specific purposes and potentially impacting their social environment in manifold ways. Taking a sociotechnical perspective facilitates identifying a new problem space and rethinking criteria for model evaluation. Four issues stand out in particular:

  • Social Objectives: what is the overarching goal that an ML model is supposed to achieve?

  • Measurement: does an ML model capture meaningful statistical relationships in the target system?

  • Social Dynamics: what are the broader effects on the target system resulting from the deployment of an ML model?

  • Utility for Decision-Makers: to what extent does an ML model enable humans to make fairer final decisions?

To understand their importance, consider how each of these factors can turn into a source of unfairness. Regarding Social Objectives, the involvement of ML models may affect fairness-promoting policies such that they end up penalizing marginalized social groups—e.g., when the target variable cannot account for contextual factors in selecting students based on ‘merit’ in the college admission process. A problem for Measurement is that many of the constructs to be predicted by ML models (merit, health, child welfare) cannot be directly measured. Consequently, they must be derived from observable properties, used as target variables [42]. If an inappropriate target variable has been selected—such as when health needs are predicted via healthcare costs—this can cause the model to be miscalibrated for certain demographics [58]. Furthermore, the deployment of an ML model may unfold a Social Dynamic, in that the prediction of crime locations may affect the allocation of police forces. For members of marginalized groups, this may lead to an increase in stop-and-frisk [62, 63]. Finally, regarding Utility for Decision-Makers, the very same algorithmic risk score has been shown to be interpreted differently by judges, depending on a defendant’s race. The result was that Black defendants were likely to receive harsher punishment than White defendants [32].

This paper sets out to develop an account of fairness that is based on a sociotechnical perspective on ML models. To this end, the paper draws on an increasingly prominent approach to model evaluation in the philosophy of science, namely the adequacy-for-purpose view. Its basic idea is that a model should not be evaluated according to how accurately it represents a given target system, but according to the extent to which it is adequate for achieving certain purposes [22, 60]. In combining pragmatism with a holistic perspective on models, adequacy-for-purpose is well suited to tackle fairness concerns triggered by the interplay of ML models and the social environment. I assert that an adequacy-for-purpose view on fairness makes two novel contributions to the existing debate: first, it provides a framework that unifies different strands in the fair ML literature concerned with an embedded or social perspective on fairness. Second, while there are by now different proposals in the ML community for holistic model audits, these largely shy away from prescriptive methodological considerations (see [66]). The advantage of an adequacy-for-purpose view is that it forces us to be explicit about the goals that we seek to achieve by implementing ML models in social environments, while also taking into account what guarantees are obtainable to assess whether these goals are met.

The theoretical considerations in this paper will be underpinned by the example of an ML model used to assist in the college admission process. Overall, my hope is to lay the grounds for a mid-level theory of model evaluation that can be contrasted with both mere statistical assessments of fairness and social structural critiques—which take a macroscopic view on how the deployment of ML models promotes (in)justice [14, 26, 56]. The paper proceeds as follows: Section 2 establishes conceptual common ground by outlining the debate regarding the scope of formal fairness notions and discussing the basic assumptions of the adequacy-for-purpose view. The subsequent sections carve out the constituents of a sociotechnical evaluation of fairness in ML models, focusing on Social Objectives (Sect. 3), Measurement (Sect. 4), Social Dynamics (Sect. 5), and Utility for Decision-Makers (Sect. 6). The paper concludes by addressing the implications of this account for auditing ML models.

2 Evaluating ML models through formal fairness criteria

This section discusses the role of formal fairness criteria for evaluating ML models. After introducing the basic concepts, it will be argued that formal notions are insufficient to establish fairness in a meaningful way. The alternative proposed amounts to shifting the locus of evaluation from the model’s statistical properties toward the effects that it has on its social environment.

The goal in training ML models is to find a decision function that minimizes error when predicting a target variable Y over covariates X within a joint distribution D. Predictive performance is evaluated by showing the model a test set of unseen data, using various classification metrics—e.g., accuracy, precision, or recall. Fairness requirements usually enter as additional constraints on the initial optimization problem. Accordingly, the standard method to evaluate fairness is to compare an ML model’s predictive performance between a disadvantaged group and a privileged group. The composition of the groups is thereby determined by a sensitive attribute A, relating to a salientFootnote 1 social category (gender, race, age, and so on). If the predictive performance is equal across groups, the model is deemed fair. Equal treatment is in turn determined by competing formal fairness criteria (see [8] for review). What they all have in common is that they are oblivious: the emphasis lies on statistical properties, while the model’s inner workings are disregarded. This matters insofar as even if data regarding a person’s gender, race, or age are protected, the ML model might still treat different groups unequally by using proxy variables as predictors for sensitive attributes [36, 54]. Call this ‘formal fairness’, since it is confined to statistical model properties.

Various statistical fairness criteria have been proposed in the fair ML literature, most notably Calibration Within Groups (stating that a given score should have the same evidential value for admitting students, regardless of their group membership) and Classification Parity (stating that a given classification error should be equal across groups). However, the prospect of ensuring fairness in ML models by relying on formal notions has been criticized on the grounds that they are, among other things, unable to account for social biases in the training data or unequal risk distributions within groups [9, 18, 27, 54]. It lies outside this paper’s scope to present a comprehensive overview of the debate on formal fairness notions. After all, while they may be a necessary condition for ensuring fairness in ML models (especially since they make inequities tangible), the point of this paper is to argue that to meaningfully evaluate fairness in ML models, we need to shift toward a sociotechnical perspective.
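
To make the oblivious character of these checks concrete, the following minimal Python sketch computes group-wise error rates (classification parity) and within-bin outcome rates (calibration within groups) from nothing but scores, predictions, outcomes, and a binary sensitive attribute. The synthetic data, the 0.5 threshold, and the score bins are illustrative assumptions, not details from the literature cited here.

```python
# Sketch of oblivious fairness checks on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
A = rng.integers(0, 2, n)                                      # sensitive attribute (two groups)
y = rng.binomial(1, 0.4 + 0.1 * A)                             # observed outcomes
scores = np.clip(0.4 + 0.1 * A + rng.normal(0, 0.2, n), 0, 1)  # model's probability scores
y_hat = (scores >= 0.5).astype(int)                            # thresholded predictions

for group in (0, 1):
    m = A == group
    # Classification parity: error rates should be equal across groups.
    fpr = y_hat[m][y[m] == 0].mean()
    fnr = (1 - y_hat[m][y[m] == 1]).mean()
    # Calibration within groups: within a score bin, the observed outcome
    # rate should match the mean score for every group.
    bins = np.digitize(scores[m], np.linspace(0.1, 0.9, 9))
    calibration = [(round(float(scores[m][bins == b].mean()), 2),
                    round(float(y[m][bins == b].mean()), 2))
                   for b in range(10) if (bins == b).sum() > 100]
    print(f"group {group}: FPR={fpr:.3f}, FNR={fnr:.3f}, (mean score, outcome rate)={calibration}")
```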

2.1 Social structural critiques of fair ML

Rather than being concerned with internal methodological disputes about statistical fairness criteria, the emphasis in social structural critiques of fair ML lies on the tensions between formal and substantive notions of fairness. While the concept of substantive fairness has been used rather loosely so far, one way to understand it is “formal fairness + x”, where x designates an expanded scope of analysis, involving awareness of social relationships and institutional arrangements. Theories of substantive fairness are often driven by broader justice-theoretic agendas [10, 34].

Intuitively, one might assume that formal and substantive fairness notions are closely intertwined. After all, formal fairness notions are useful instruments to evaluate whether a given ML model treats different groups equally. However, they can also drift apart: an e-commerce provider may want to ensure that its classifier satisfies formal fairness constraints for different demographics (e.g., to ward off bad press or to make more accurate product recommendations), but it is unreasonable to assume that the company pursues substantive egalitarian goals. Likewise, instead of formally fair treatment, achieving equality may under certain conditions call for affirmative action policies [6].

One key concern in social structural critiques of fair ML is that the deployment of ML models for consequential decision-making inherently reinforces social inequality—even if the model satisfies fairness constraints. This is because ML models are used in social environments that have been shaped by structurally unjust background conditions.Footnote 2 For example, when models are trained on data pervaded by social biases, the prospect of treating like cases alike is jeopardized from the onset [82]. Another issue relates to the inability of formal fairness notions to account for intersectional discrimination.Footnote 3 For example, with regard to educational settings, first-generation students will—on average—grow up under worse economic conditions than continuing-generation students, and low economic/educational status is in turn correlated with certain racial or ethnic demographics. The ways in which these demographic memberships intersect can produce distinct discriminatory effects that prevalent formal fairness criteria are unable to capture (see Kofi-Bright et al. [11] for a causal analysis of intersectionality).

Even worse, there is a rich history of risk prediction models in criminal justice and public policy that, despite being deployed for the purpose of remedying social injustice, ultimately had the opposite effect. The main reason is that the social problems to be solved were too complex to be tackled by risk prediction models ([14], pp. 222–23). Since the use of risk prediction models to combat social problems misses its mark, social structural critics argue that the appropriate target of justice-promoting interventions is fixing the unjust background conditions by way of social reforms or affirmative action policies [26, 56].

These critiques forcefully show how ML models can reinforce social inequality if developers ignore the background conditions in the social environment. That said, whereas social structural critiques highlight the blind spots in fair ML, there is certainly a need for a middle-ground theory that accounts for many of the problems they raise, while providing more specific guidance for authorities on how to satisfy fairness demands in model auditing processes. Here, a thorough understanding of the goals that ML models are supposed to achieve, as well as of the specifics of the social environment, is a necessary precondition. Just as important is understanding why and when ML models may not be the right tools for achieving certain goals.

2.2 Adequacy-for-purpose

Adequacy-for-purpose has become an increasingly popular view in the model evaluation literature [3, 22, 60]. Its starting point is that the quality of models does not result from an accurate and complete representation of the target system, but rather from whether they prove to be appropriate tools for achieving a purpose of interest. For many purposes, representational accuracy is not what is at stake. As a case in point, highly idealized models may trump more complete models for the aims of scientific understanding [80]. This instrumentalism makes adequacy-for-purpose a natural fit for evaluating ML models—where predictive performance is traditionally prioritized over representational accuracy or interpretability.Footnote 4

The basic strategy for evaluating a model’s adequacy is to define a problem space and then assess whether the model has the properties that are likely to achieve a purpose P within a given context of use—composed of a certain type of user U, one or more type(s) of methodology W, and a range of circumstances of use B. Importantly, while a model can be adequate, it is ultimately the user who achieves or fails at P ([60], p. 462). In this respect, adequacy-for-purpose is a holistic approach to model evaluation, going beyond a model’s statistical properties to also include users, methodologies, and the target system.

One finding from the preceding sections is that whenever ML models are used to guide consequential decision-making, they are typically used in environments with complex socio-structural background conditions. Unlike scientific scenarios, where the main criterion for an adequacy-for-purpose view is whether the model enables researchers to reliably make inferences, we therefore cannot rely on purely instrumentalist adequacy conditions. Rather, any assessment of adequacy needs to be accompanied by a broader moral and democratic reflection on the legitimacy of the purposes governing the deployment of ML models. This requires establishing procedures and protocols that allow ML models to be subject to public scrutiny. The opacity of ML models [20] makes it challenging to explain the inner logic of the model with sufficient precision and in a way that meets the requirements of public reason. However, much is already gained by providing detailed descriptions of the modeling choices—e.g., why a particular target variable was chosen, which training data were used, how the model was validated, and so on. Indeed, it might even be argued that these external factors are more crucial for facilitating public discourse than the particular arrangement of variables used by the model to make predictions. Of course, things look different when it comes to the justification of individual decisions (see also [19, 52, 72]).

The account of fairness that I develop in this paper falls squarely into the category of substantive fairness. Ultimately, its idea is that the deployment of ML models should lead to fair outcomes in the real world. In that respect, the adequacy conditions that I discuss in the rest of the paper can be understood as stabilizers, ensuring that the ML model registers appropriate features and that its predicted output results in fair decisions. Even though we use different methodologies, the closest comparison is Green [34]. That being said, I am sympathetic to the idea that, at least in domains like the legal system, it is also necessary to account for additional criteria of procedural fairness, such as explainability, answerability, or reciprocity [75].

To showcase the systematicity of an adequacy-for-purpose account of fairness, I use in the following the running example of an ML model that predicts which (prospective) students are likely to succeed and that is used by administrators in the college admission process.Footnote 5 The model has been trained on data from previous students and uses SAT scores, GPA, extra-curricular activities, financial status, and so on as input variables. For each student, it provides a probability score. If the probability score reaches a pre-defined threshold, the student falls in the positive class (of students likely to succeed); if the probability score is below the threshold, she falls in the negative class (predicted to drop out). In the following, I turn to the individual components of a sociotechnical view of model evaluation. In each case, I begin by articulating the underlying problem and then discuss adequacy conditions.
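
To fix ideas, the following self-contained sketch implements such a model as a logistic classifier fitted on synthetic student records and thresholded into a positive and a negative class; the feature set, the synthetic data, and the 0.5 threshold are illustrative assumptions rather than details of the running example.

```python
# Self-contained sketch of the hypothetical admission model (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2_000
# Synthetic records of previous students: SAT, GPA, extracurriculars, financial aid indicator.
X = np.column_stack([
    rng.normal(1100, 150, n),   # SAT score
    rng.normal(3.2, 0.4, n),    # GPA
    rng.poisson(2, n),          # number of extracurricular activities
    rng.integers(0, 2, n),      # received financial aid (1) or not (0)
])
# Synthetic graduation labels, loosely tied to SAT and GPA.
logits = 0.004 * (X[:, 0] - 1100) + 1.5 * (X[:, 1] - 3.2) + 0.1 * X[:, 2] - 0.3 * X[:, 3]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

model = LogisticRegression(max_iter=1000).fit(X, y)

# Scoring a new applicant: at or above the threshold -> positive class.
THRESHOLD = 0.5
applicant = np.array([[1250, 3.6, 3, 0]], dtype=float)
p_success = model.predict_proba(applicant)[0, 1]
label = ("positive class (likely to succeed)" if p_success >= THRESHOLD
         else "negative class (predicted to drop out)")
print(f"predicted probability of success: {p_success:.2f} -> {label}")
```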

3 Social objectives

The rationale for deploying ML models for which fairness is a concern is tied to the achievement of a socially impactful goal Y, such as increasing social welfare, enhancing public security, or maximizing profits [54]. However, these goals are abstract, and it is often left obscure how they are operationalized into prediction tasks. The reason for deploying ML models for such purposes is typically a fairness argument, most pointedly expressed by [47, 48]: comparing algorithmic risk predictions with data on judges’ decisions about recidivism risk, they argue that ML models are fairer than humans, which is why their implementation in socially impactful settings should be deemed a justice-promoting intervention. Moreover—so the argument goes—potentially unfair algorithmic behavior can be detected through audits, provided that certain transparency conditions hold. By contrast, given that biases oftentimes evade conscious control and since human decision-making does not explicitly follow any sort of formula, pinning down discriminatory behavior in humans proves to be challenging. Hence the view that ML models leave less space for implicit biases or hidden motives to impact the decision-making process.

While I find this line of argument to some degree compelling, this section will also highlight that the operationalization of socially impactful goals as prediction tasks entails trade-offs, possibly giving rise to fairness concerns. Here, the main challenge is that socially impactful goals are unobservable theoretical constructs and therefore cannot be directly measured. In turn, ML developers need to select some proxy Y* as the target variable [55, 61]. The target variable must be narrowly confined and reducible to observable properties. However, the discrepancy between Y and Y* gives rise to two corresponding problems: first, the socially impactful goal Y and the utility optimized by Y* can be misaligned. Second, there can be a mismatch regarding the generalizability of Y* across different demographics. Both of these problems entail fairness concerns, with this section addressing the first one.

3.1 Adequacy conditions for social objectives

When assessing the fairness of an ML model, one important adequacy-condition is whether the social utility that ought to be maximized can be aligned with the functionality of the ML model in the first place. However, an ML model may fail to be adequate here for two reasons. First, its underlying logic may be incompatible with the desired social objective. Second, the model may not be sufficiently fine-grained to capture all the nuances that need to be considered when trying to maximize a given social objective.

Recall the ML model for estimating student success. For the college administration, possible goals Y are: (i) lowering student dropout rates to maximize revenue from tuition fees, (ii) increasing the diversity of the student body, and (iii) selecting students based on merit. Arguably, for human decision-makers, all these goals play a motivational role in the student selection process, although the weighting of the individual factors may vary intra- and interpersonally. In contrast to human decision-making, deploying an ML model requires choosing one particular goal to be optimized [46].

Basically, (i) can be considered a paradigmatic ML-based prediction task, whereby the target variable Y* might be whether a given student drops out within the first two years. On this basis, the model estimates how likely it is that a student applicant will graduate. What then matters is that the model satisfies formal fairness constraints. That said, even for (i) there are various caveats to take into account. First, there is the selective labels problem, meaning that the outcomes observed are themselves consequences of existing choices by human decision-makers [48]. When applied to predicting student success, one issue is that the ML model only has access to data from admitted students, while data from students who were not admitted are missing. Second, some students from disadvantaged groups may not drop out due to poor grades, but because of economic hardship. Hence, to ensure that the student selection is also procedurally fair, ML developers will have to assess to what extent socioeconomic data should (not) be taken into account (see [82] on procedural fairness).
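
The selective labels problem can be illustrated with a short simulation (all numbers made up): dropout labels exist only for students whom past human decision-makers admitted, so the labelled training population systematically differs from the applicant pool that the model will later score.

```python
# Sketch of the selective labels problem with made-up data.
import numpy as np

rng = np.random.default_rng(2)
n_applicants = 5_000
ability = rng.normal(0, 1, n_applicants)                      # unobserved
sat = 1100 + 100 * ability + rng.normal(0, 80, n_applicants)  # observed feature

# Past human admission decisions favored high SAT scores.
admitted = sat > np.quantile(sat, 0.6)

# Dropout outcomes exist only for admitted students; for everyone else
# the label is simply never observed (encoded here as -1).
dropout = np.where(admitted, rng.binomial(1, 1 / (1 + np.exp(ability))), -1)

print(f"labelled share of the applicant pool: {admitted.mean():.0%}")
print(f"mean SAT, labelled subpopulation vs. full pool: "
      f"{sat[admitted].mean():.0f} vs. {sat.mean():.0f}")
```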

Matters become significantly more complicated with (ii), starting with the fact that there are both distinct concepts of diversity and distinct rationales for why a more diverse student body is deemed desirable. In this vein, Steel et al. [70] distinguish between three diversity concepts, namely egalitarian, representative, and normic. Roughly, egalitarian diversity relates to a uniform distribution of social groups over a relevant attribute, while representative diversity is reached once the distribution is similar to that of the reference population. By contrast, normic diversity tries to capture the deviation of persons from the norm within a group defined by a relevant attribute. Normic diversity is closely linked to membership in marginalized social groups, whereby it is possible for one person to diverge from the norm along multiple axes of difference.

Increasing the diversity of the student body is not an end in itself, but a proxy for other overarching aims. On the one hand, a diverse student body might provide epistemic benefits, insofar as it broadens the cognitive repertoire and enhances problem-solving strategies [59, 81]. On the other hand, increasing student diversity is seen as a prerequisite for educational justice, since it facilitates a patterned fair distribution of educational outcomes [1, 68]. Both rationales can overlap to some extent. A diversity of standpoints (from members of marginalized social groups) can more accurately depict unjust social structures. However, they differ concerning their normative implications. For an epistemically motivated version of representative diversity, it can be sufficient if the percentage of admitted first-generation students is similar to that in the applicant pool. Hence, presuming that their academic achievements are roughly equal, a well-calibrated ML model may prove to be sufficient for achieving this goal. By contrast, since facilitating educational justice is concerned with ameliorating gaps in the attainment of educational outcomes, it entails affirmative treatment for those less well off. In turn, a formally fair model may not be the adequate tool.

Consider another issue for the goal of increasing diversity through ML models, namely that the groups evaluated through formal fairness notions are defined by a single sensitive attribute. Hence, the composition of the relevant groups is oftentimes based on unstated assumptions concerning the ontological status of social categories—most notably that a racial or sex group is just a collection of individuals who share a given trait [41]. Due to these simplistic assumptions, such group concepts fail to register intragroup diversity, which is pivotal for normic diversity concepts. As an upshot, when assessing the adequacy of an ML model designed to guide college admission decisions, model authorities will have to clarify in detail which type of diversity they want to optimize (and for what exact reasons).

Similar concerns arise with regard to (iii), where merit can be interpreted as academic performance, intellectual ability, or academic performance in view of possible socioeconomic hardships. Again, while academic performance can be measured well, provided that GPA and other test scores can be deemed reliable proxies, identifying appropriate target variables Y* for the latter two requires thorough theoretical and normative assumptions. In particular, predicting intellectual ability involves making inferences about latent psychological states, while assessing socioeconomic hardships requires practical judgment, as opposed to (mere) prediction. Hence, while one purported benefit of ML models is to reduce noise in decision-making processes—noise that can potentially culminate in unfair treatment of members of disadvantaged groups (see also [44])—this may also turn into a weakness when too much nuance is lost to include a wider range of justice-related considerations. Again, the gist of evaluating adequacy-for-purpose for ML models from a fairness perspective is to clarify the normative aspirations that motivated the use of ML models in the first place. One practical way to do this is to create a wide range of (fictional) student profiles—with varying social backgrounds—and examine how they would fare in the admission process when examined by the ML model. Should the ML model reject many of these edge cases, model developers may either try to use more fine-grained reference classes to better account for different student populations or decide that the model is not adequate for the purpose of college admission.
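
Such a stress test might look as follows: a handful of fictional applicant profiles with varying social backgrounds are scored by the admission model and the resulting decisions are inspected. The profiles, the feature layout, and the quickly fitted placeholder model are illustrative assumptions, included only so that the sketch runs end-to-end.

```python
# Sketch of an edge-case stress test for an admission model (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in model fitted on synthetic data purely so the sketch is self-contained;
# in practice this would be the admission model under audit.
rng = np.random.default_rng(3)
X_train = np.column_stack([rng.normal(1100, 150, 1000), rng.normal(3.2, 0.4, 1000),
                           rng.poisson(2, 1000), rng.integers(0, 2, 1000)]).astype(float)
y_train = rng.binomial(1, 1 / (1 + np.exp(-2 * (X_train[:, 1] - 3.2))))
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Fictional profiles: SAT, GPA, extracurriculars, financial aid indicator.
profiles = {
    "high SAT, affluent":               [1450, 3.9, 4, 0],
    "high GPA, first-generation":       [1150, 3.9, 1, 1],
    "average scores, many activities":  [1100, 3.2, 6, 0],
    "strong grades, economic hardship": [1200, 3.7, 0, 1],
}
for name, features in profiles.items():
    p = model.predict_proba(np.array([features], dtype=float))[0, 1]
    print(f"{name:35s} p(success)={p:.2f} -> {'admit' if p >= 0.5 else 'reject'}")
```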

4 Measurement

To motivate the problem of measurement, consider a study by Obermeyer et al. [58], in which the investigators examined an ML model estimating which patients will derive the most benefit from high-risk care management programs. The study’s objective was to investigate differences in the predictions for White and Black patients. As input variables, the model used a large set of insurance claims data, containing demographic information, insurance type, medical history, medications, and detailed costs. Using calibration within groups as the fairness criterion, the study compared how well the risk scores are calibrated across races with regard to health outcomes. The result was that, for the same level of predicted risk, Black patients were significantly sicker than White patients.

The miscalibration between Black and White patients could be attributed to the fact that the model took patients’ total medical expenditures as a proxy for their health needs. Healthcare costs do correlate strongly with the severity of illness, because the sicker a patient is, the more care she typically needs. However, medical expenditure tracks both health and wealth: the ability to pay varies and is bound to socioeconomic factors. Since Black patients in this study were, on average, poorer than White patients, they tended to spend less on healthcare for a given level of illness. Effectively, the model confused health status with wealth status. At the same time, Obermeyer et al. [58] found that the racial disparities in the prediction of health needs can be mitigated by selecting a different prediction target—e.g., health or avoidable costs.
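
The logic of this audit can be reproduced on simulated data: if spending is suppressed for one group at any given level of illness, a cost-trained risk score will assign the same risk to patients who are in fact unequally sick. The numbers below are invented to mimic the qualitative pattern, not to reproduce the study's data.

```python
# Sketch of an Obermeyer-style audit on simulated data (illustrative only).
import numpy as np

rng = np.random.default_rng(4)
n = 20_000
black = rng.integers(0, 2, n)                  # group indicator (1 = Black, 0 = White)
illness = rng.gamma(2.0, 1.0, n)               # actual health need
# Cost proxy: sicker patients cost more, but unequal access/wealth suppresses
# spending for the disadvantaged group at any given level of illness.
costs = illness * np.where(black == 1, 0.7, 1.0) + rng.normal(0, 0.2, n)
risk_score = costs                             # a model trained on costs effectively predicts costs

# Audit: within each risk-score decile, compare mean actual illness by group.
decile = np.digitize(risk_score, np.quantile(risk_score, np.linspace(0.1, 0.9, 9)))
for d in range(10):
    sel = decile == d
    print(f"risk decile {d}: mean illness Black={illness[sel & (black == 1)].mean():.2f}, "
          f"White={illness[sel & (black == 0)].mean():.2f}")
```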

The logic of the study can be expressed in simple terms: if there is a mismatch between Y and Y*, this leads to unfair treatment of disadvantaged groups. Conversely, potentially unfair treatment can be mitigated by an appropriate operationalization of Y*. However, unlike the statistical evaluation of ML models, there is no straightforward approach for identifying appropriate target variables. Instead, a necessary precondition is having a thorough understanding of how structural biases in the social environment can affect the measurement of a property of interest [55].

4.1 Adequacy conditions for measurement

In light of this, the adequacy condition is whether we can operationalize Y* such that it generalizes well across different demographics. For the purpose of identifying appropriate target variables, model authorities are well advised to seek inspiration from social science methodology—especially measurement theory and psychometrics. In brief, a measurement model refers to a statistical model linking the (unobservable) latent variable to observable properties (henceforth: the operationalized construct). The key task in measurement modeling is to ensure that the operationalized construct captures the latent variable, with the success condition being defined by the coherence of correlational evidence and background assumptions. More precisely, an operationalized construct will be evaluated along the axes of reliability and validity. The former assesses whether a given construct yields similar outputs when tested at different points in time. The latter tries to ensure that the measurements obtained are meaningful and predictively useful [42, 74].

The process of construct validation involves three steps, beginning with the conceptual clarification of the latent variable (e.g., defining the scope of student success). To this end, psychometricians draw on existing scientific theories or folk-psychological notions. This is followed by the selection of a measurement method (e.g., a questionnaire), items (the relevant questions or tasks), and a scoring method. Crucially, in choosing the measurement method, modelers need to make assumptions about statistical relationships between the construct of interest and other variables. Likewise, it is a matter of identifying possible confounders. Finally, the validity of the construct is tested through various quantitative and qualitative procedures. These range from subjective sanity checks to more thorough evaluations of whether the operationalized construct coheres with theoretical assumptions, does not overlap with other constructs, converges with other measurements of the same latent variable, is predictively valid, and so on [4, 42].
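
Two of the standard quantitative checks can be sketched in a few lines on made-up questionnaire data: internal-consistency reliability via Cronbach's alpha and convergent validity via the correlation with an independent measure of the same latent variable. The data-generating assumptions are purely illustrative.

```python
# Sketch of two basic psychometric checks on made-up questionnaire data.
import numpy as np

rng = np.random.default_rng(5)
n_respondents, n_items = 300, 6
latent = rng.normal(0, 1, n_respondents)                                # unobserved construct
items = latent[:, None] + rng.normal(0, 0.8, (n_respondents, n_items))  # questionnaire items
other_measure = latent + rng.normal(0, 0.5, n_respondents)              # independent instrument

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Internal consistency: k/(k-1) * (1 - sum of item variances / variance of the sum score)."""
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1).sum()
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

scale_score = items.mean(axis=1)
print(f"Cronbach's alpha (reliability): {cronbach_alpha(items):.2f}")
print(f"correlation with independent measure (convergent validity): "
      f"{np.corrcoef(scale_score, other_measure)[0, 1]:.2f}")
```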

However, while there is little doubt that measurement modeling can guide more precise modeling choices, it also raises epistemological questions regarding the level of detail and the kind of evidence required to establish a construct’s validity. More profoundly, disciplines like psychometrics presume that psychological attributes (depression, intelligence, intellectual ability) are operationalizable through quantifiable properties, in the same way that physical attributes are. These issues have sparked controversy among philosophers of science [29, 74, 79].

Turning to ML models used for college admission, one difficulty is that latent variables like merit are thick concepts (Väyrynen [78]), combining descriptive and evaluative content, while also being context-dependent. In this manner, Alexandrova and Haybron [4] argue that while construct validation is methodologically sound in theory, matters look different in practice. Using the measurement of well-being as an example, they highlight that legitimate sources of evidence—such as philosophical theorizing—are either ignored or overridden by statistical considerations when validating constructs. Furthermore, it can be difficult to establish the superiority of one operationalized construct over another, despite their very different statistical properties. The main reason is that it is unclear how to weed out implausible correlations. As a possible amelioration strategy, Jacobs and Wallach [42] stress the need to be exceedingly precise about the normative assumptions underlying a latent variable, in the hope that this will culminate in more nuanced operationalized constructs. The idea is basically a reductionist approach: if developers think hard enough, then it might be possible to eliminate the messy evaluative parts and come up with a proper technical operationalization.

However, the turn to measurement modeling also allows rethinking how adequate operationalizations are identified. For example, Alexandrova and Fabian [5] argue that one way to tackle thick concepts is to incorporate participatory design methods in which all stakeholder groups potentially affected by a given policy jointly deliberate on what constitutes appropriate measures. One advantage of this approach is that it ensures perspectival diversity and political legitimacy. After all, model authorities are likely to be predominantly middle class (and White)—and may therefore not be sufficiently aware of all the trade-offs involved in the selection of target variables.

To what extent involving stakeholder groups translates into epistemically better outcomes than reductionist approaches is, of course, speculative with regard to ML models, since I am not aware of any study in ML that has incorporated this approach. Furthermore, this approach too is subject to constraints. Some apply to participatory design methods in general: there is always the threat that certain interest groups hijack the design process or that it becomes a box-ticking exercise. Even worse, the gap in the statistical and technical knowledge necessary to meaningfully engage with the development of ML models might be too vast for laypeople (see also [5], p. 19).Footnote 6 Specifically for ML, another obstacle in measurement modeling is the need to identify an exceedingly narrow target variable, whose connection to the background assumptions (the variables in the input data) is murky. Finally, for reasons that will be discussed in the next section, it may not be in the best interest of many institutions to make the model’s functioning transparent.

For an adequacy-for-purpose view, there are two insights that follow from the previous discussion of measurement modeling. The first one is simple: ML developers are well advised to ensure that the measured properties are meaningful and predictively useful—while not disadvantaging certain demographics. Here, it is particularly useful to draw on the methodology of scientific disciplines that already have a great deal of experience with similar problems.

Relatedly, when there is no way to constrain plausible modeling assumptions and when disagreement concerning the appropriate operationalization persists, the ML model might turn out to be an inadequate tool for a given purpose. The second insight is more sobering: when ML models are used for social purposes, the development and evaluation process inevitably turns into a social science project. And while the turn to measurement modeling suggests that there is a great deal for ML developers to learn by engaging with psychometricians to improve their methodology, the fairness guarantees that can be achieved this way will necessarily be imperfect. This does not mean that we should abandon the project of using ML models for social purposes. Rather, what is at stake is that institutions should be wary of how much evidential weight they assign to ML predictions when making consequential decisions.

5 Social dynamics

The prevailing method to evaluate fairness is to test the predictive performance of an ML model on a test set that shares the same idiosyncrasies as the training data. Implicit in this is the assumption that the distribution in the data is stable. However, it has become apparent that evaluations within the same distribution do not provide any guarantees for predictive performance when deploying an ML model in novel settings. Particularly in healthcare, distribution shifts are a major source of malfunction. These might result from a hospital using different medical devices or from changes in the patient population compared to the training data [30]. Correspondingly, an ML model used to predict successful students in the United States is bound to fail badly when deployed in countries with a different school system (say, Scandinavian countries). Moreover, social environments are not static. Education systems undergo reforms, the content and difficulty of tests may vary, and demographics change, as do students’ attitudes toward education. All of these effects potentially contribute to a distribution drift, changing the probability distribution over time. While social dynamics are not specific to ML, they highlight a fundamental problem regarding the assurances that can be provided for the formal fairness of an ML model over an extended period of time (and across domains).
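
In practice, such drift can at least be monitored: the sketch below compares each feature's distribution in an incoming applicant cohort against the training data with a two-sample Kolmogorov–Smirnov test and flags significant shifts. The features, sample sizes, and significance threshold are illustrative assumptions.

```python
# Sketch of routine drift monitoring with two-sample Kolmogorov-Smirnov tests.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(6)
train = {"sat": rng.normal(1100, 150, 5000),
         "gpa": rng.normal(3.2, 0.4, 5000),
         "extracurriculars": rng.poisson(2, 5000).astype(float)}
# Incoming cohort after, say, a test-format reform: the SAT distribution shifts.
current = {"sat": rng.normal(1180, 140, 1500),
           "gpa": rng.normal(3.2, 0.4, 1500),
           "extracurriculars": rng.poisson(2, 1500).astype(float)}

for feature in train:
    stat, p_value = ks_2samp(train[feature], current[feature])
    flag = "DRIFT" if p_value < 0.01 else "ok"
    print(f"{feature:18s} KS={stat:.3f} p={p_value:.4f} [{flag}]")
```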

Most intriguingly, the deployment of ML models for consequential decision-making can itself trigger changes in the social environment. Consider the case of a bank estimating that a loan applicant has an elevated risk of default. Since the bank will assign the applicant a high interest rate, her risk of default is increased [51]. The social dynamic caused by the interplay of the ML model and the social environment gives rise to fairness concerns. If the applicant is unable to pay back her debts, then the interest rate for people with a similar profile is likely to increase in the future. Likewise, the prediction of crime locations can affect the allocation of police forces—culminating in an increase in stop-and-frisk for members of marginalized social groups ([62]; see also [63]).

Another way in which ML models impact their social environment is by provoking people to alter their behavior such that it aligns with the preferences or utilities maximized by the model. This behavior modification may be motivated by strategic reasons, especially since it gives individuals a competitive edge. If admission into a prestigious college is made contingent on a high SAT score, then the strategic response is to optimize preparation for the test—e.g., by taking preparation courses or by sitting the test multiple times. Similarly, even creditworthy individuals might engage in artificial practices to boost their credit score [53]. However, strategic adaptation requires material and informational resources that are not evenly distributed across society. Students from underserved social groups may simply find themselves unable to pay for SAT preparation courses. The upshot is that some undeserving students from privileged groups will be admitted to college, while some students from disadvantaged groups will erroneously be excluded [40].

5.1 Adequacy conditions for social dynamics

Building on this, there are two adequacy conditions to consider: first, is there a way to ensure that the model’s performance remains stable across different demographics over time? Second, how is the model going to impact the behavior of users—and how desirable are these effects?

Concerning the first issue, fairness concerns arising from strategic adaptation are increasingly examined by ML researchers [39, 51, 53]. Guided by modeling assumptions from microeconomics (e.g., that agents maximize a utility function based on perfectly accurate information) and statistical decision theory, this work shows how advantaged agents reinforce their position—compared to their disadvantaged counterparts—by being able to learn better decision rules for achieving a desirable outcome. Also explored in the literature are possible strategies to counteract these inequality-reinforcing effects. For example, Hu et al. [40] consider interventions in which the costs for disadvantaged groups are lowered for the purpose of improving their classification performance in a strategic manipulation game. They also highlight the brittleness of such interventions: at least under certain conditions, they can backfire and reduce the welfare of both the advantaged and the disadvantaged groups.

Perdomo et al. [62] develop a risk minimization framework in which the model is repeatedly retrained to find performatively stable points, thereby accounting for undesirable distribution shifts induced by the deployment of the previous model (parameters). Here, a performatively stable point is a decision rule that is optimal for the very distribution its own deployment induces, so that it serves as a fixed point of model retraining. Again, this solution has its caveats, since a prerequisite for the existence of performatively stable points is that all agents behave perfectly rationally—which cannot be presumed in externally valid settings. Accordingly, Jagadeesan et al. [43] emphasize that many of the modeling assumptions in the literature on strategic behavior in algorithmic classification tasks are stylized. As an alternative framework, they propose a noisy response model. D’Amour et al. [23] raise similar concerns, while claiming that simulation models are superior to the standard game-theoretic approach in exploring the system dynamics, feedback loops, and long-term effects arising from the deployment of ML models for consequential decision-making. One takeaway from the research literature is that once we move beyond idealized conditions, it is difficult to obtain meaningful statistical guarantees that a model will perform stably over time. However, a pragmatic way to counteract the risk of performative instability is to regularly retrain (and re-evaluate) the model.
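
The idea of retraining toward a performatively stable point can be conveyed with a toy loop, loosely in the spirit of Perdomo et al. [62]: deploying a threshold changes how applicants present their feature, the rule is refit on the induced distribution, and the process stops once the rule reproduces itself. The response model, the quantile-based 'retraining', and all numbers are simplifying assumptions, not the authors' framework.

```python
# Toy sketch of repeated retraining toward an (approximately) performatively
# stable decision rule; the response model and all numbers are simplifications.
import numpy as np

rng = np.random.default_rng(7)
true_skill = rng.normal(0, 1, 5_000)
noise = rng.normal(0, 0.3, 5_000)

def induced_feature(theta: float) -> np.ndarray:
    """Applicants inflate their observed feature in proportion to the deployed threshold."""
    return true_skill + 0.5 * theta + noise

def refit(feature: np.ndarray) -> float:
    """'Retraining' here simply sets the new admission threshold to the 80th percentile."""
    return float(np.quantile(feature, 0.8))

theta = 0.0
for step in range(1, 51):
    new_theta = refit(induced_feature(theta))  # retrain on the distribution the deployed model induces
    if abs(new_theta - theta) < 1e-3:          # the rule reproduces itself: approximately stable
        print(f"approximately performatively stable after {step} retrainings: theta={new_theta:.3f}")
        break
    theta = new_theta
```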

Setting these internal methodological constraints aside, another potential flaw relates to the scope of the relevant literature: it only captures those changes to the social environment that are immediately reflected in changes to the distribution. This view ignores externalities caused by changes to cultural frames or institutions. If the deployment of ML models has a significant impact on college admission, then, in the long term, this entails wider-reaching effects on the educational system—possibly culminating in novel fairness concerns. The day Harvard makes its admission algorithm transparent will profoundly impact how private schools across the globe operate. In light of this, it might be argued that the strategic behavior of individual agents is the wrong reference class for capturing the relevant interdependencies in the social environment. Therefore, it is integral to anticipate a broader range of negative effects on the social environment.

Along these lines, Fazelpour et al. [28] argue that accounting for social dynamics requires a change of perspective, from a static evaluation of model properties toward a moral evaluation of the trajectories induced by the deployment of ML models. Here, the locus of evaluation is the good-making properties of the trajectory as a whole, defined by its temporal dynamics (how does the deployment of an ML model promote justice over time?), robustness (does the ML model promote justice under a wide range of background conditions?), and perspectival diversity in the evaluation of morally relevant modeling choices.

The moral evaluation of trajectories holds many valuable insights for assessing the adequacy of ML models in a way that registers their temporal dynamics. That said, unlike Fazelpour et al. [28], I do not see the moral evaluation of trajectories as a categorial shift, but rather as a supplementary approach to model evaluation. Moreover, evaluating the moral trajectories of ML-based interventions poses various epistemic and morally normative challenges. Most profoundly, many assumptions about the social environment have to be made over which one has little control. This problem might best be compensated for by incorporating the perspectival diversity of stakeholders—with those who are most affected by a given ML-based intervention figuring prominently ([28], p. 14). Just as important is gaining externally valid evidence through empirical studies on the impact of ML models on the social environment.Footnote 7

That said, it is well established that the ability to effectively forecast social phenomena is fairly limited [57, 67]. Hence, the problem of social dynamics invites pessimism about the extent to which we can evaluate the fairness of ML models in social environments over the long term. Another value-based question is how to draw the line between morally (un)desirable trajectories resulting from the deployment of an ML model and those that can be attributed to other factors, such as institutions acting in bad faith or structurally unjust background conditions. In sum, a profound problem for an adequacy-for-purpose view regarding the social dynamics of ML models is that even the best methodological tools face significant limitations, which is why we are bound to pragmatic solutions, such as continuous model retraining and post-deployment monitoring.

6 Utility for decision-makers

It is typically claimed that the intended role of ML models is not to replace humans but to provide assistance in consequential decision-making. At the same time, it is left unanalyzed how this assistive role ought to manifest. This section sets out to examine how decisions made by ML models and humans can drift apart, how this culminates in fairness concerns, and what the implications are for an adequacy-for-purpose view on ML model evaluation.

Again, consider the college admission scenario to understand the complexities of the interaction between ML models and human decision-makers. The decision-making procedure might look as follows: a college administrator and the ML model start by reviewing the student documents independently. The administrator makes an initial assessment of a given student and then considers the probability score provided by the ML model. If the probability score significantly diverges from the initial assessment, the administrator might update her judgment or re-read the submitted documents. The upshot will likely be a fairer assessment of prospective students.

6.1 Adequacy conditions for utility for decision-makers

A problem with the outline just provided is that it ignores many of the stumbling blocks in the interaction of human decision-makers and ML models that get in the way of epistemic success. When evaluating the adequacy of ML models as decision support tools, the locus of attention should be whether the involvement of the ML model enables policymakers to make fairer (final) decisions. Here, one crucial factor is how the ML model impacts the epistemic environment.

Consider this in more detail: one obstacle relates to the evidential mismatch between the ML model and the college administrator (see also [46]). Besides the data used as input variables by the model, the college administrator may also have access to other kinds of evidence that the model cannot meaningfully assess—including essays or recommendation letters. From a fairness perspective, this creates a twofold problem: first, it skews the comparability of assessments. Second, the evidence not available to the ML model is precisely of the kind that might turn the tide in favor of privileged students, as its evaluation (at least partially) depends on subjective factors. While the problem can be mitigated by restricting the kinds of evidence that the administrator has access to (see also [47]), this arguably leads to a loss of valuable information and threatens to undermine the administrator’s epistemic authority.

Let us turn to another set of problems, namely biases emerging during the interaction. Suppose that the administrator does not make her assessment independently of the ML model at first. Instead, the probability score is shown to her right at the beginning. The difficulty here is that the probability score turns into a reference point, likely to influence the administrator’s judgment [77]. But other arrangements also have their pitfalls. Provided that the administrator makes an initial assessment, she may be unwilling to let the ML model’s prediction override her judgment. The corresponding problem of algorithmic aversion has been investigated in an influential study by Dietvorst et al. [24], according to which people avoid assistance from risk-prediction models—especially after seeing them err. Algorithmic aversion, however, is only one side of the coin, with automation bias being its counterpart. In particular, Tschandl et al. [76] find that, within the context of skin cancer diagnosis, the default for junior clinicians is to defer to the ML model, whereas expert clinicians are prone to stick to their own diagnoses. Consequently, when evaluating the adequacy of ML models for the purpose of assisting in consequential decision-making, the task is to ensure that human decision-makers assign appropriate epistemic weight to ML-based recommendations.

Even more critical, it might turn out that, due to the racial or gender biases of college administrators, the same probability score loses its evidential value when assessing students from disadvantaged social groups. A series of online experiments by Green and Chen [32, 33] shows that—given the same risk score—Black people were judged more harshly than White people by the research participants within the context of pre-trial or lending decisions. This puts any justice-promoting effects that result from the assistance of the ML model at risk.

Further complications arise from a potential value misalignment between the college administrator and the ML model, owing to them trying to maximize different fairness notions [47]. As an example, the ML model might be well calibrated across different social groups, whereas the administrator pursues affirmative goals. In that case, she is left with three options. First, she might ignore the estimates made by the ML model altogether. Second, she might use them as a heuristic—yet opt for preferential treatment of students from disadvantaged groups in edge cases. Third, the administrator might adapt her fairness standards toward the ML model. The latter, in particular, has been documented in a study by Stevenson and Doleac [71], investigating the impact of risk-prediction models on judicial decision-making. A striking finding is that while, overall, the involvement of risk-prediction models neither decreased nor exacerbated racial bias, it led to higher risk scores for young defendants—and, in turn, harsher sentencing decisions for them. In non-algorithmic settings, by contrast, they were judged more leniently. One way to look at this issue is that the assistance of risk-prediction models reduces the variance in consequential decisions. After all, defendants across the age spectrum are treated more consistently. However, it is also evident that the involvement of ML models exerts normative force by narrowing down the set of available policies (see also [65]).

To add yet another layer of complexity, the opacity of ML models—whether due to epistemic reasons or business secrets [13, 50, 73]—makes it challenging for human decision-makers to localize potential reasoning errors. The college administrator might not even have a high-level understanding of how the ML model works and which utility function it tries to maximize. It does not help that empirical studies paint a pessimistic picture of the extent to which explainable AI/ML methods enable overcoming opacity-induced fairness concerns [25, 64].

Returning to an adequacy-for-purpose view, a lesson to draw is that policymakers need to carefully specify how the interaction between the human decision-maker and the ML model should function. One way to approach this is to distinguish between different types of interaction between human decision-makers and ML models. Along these lines, Chesterman [16] distinguishes between cases in which decision processes are fully automated (the human is ‘out of the loop’), cases in which the human decision-maker surveils the decision-making process (human ‘over the loop’), and cases in which human authorities are meaningfully involved in the decision-making process (human ‘in the loop’). In a similar vein, Grote and Keeling [35] distinguish between application scenarios—with a view to clinical medicine—in which the ML model is a peer to the clinician, in which they divide up the labor, as well as upstream and downstream applications of ML models for decision support. Specifying the type of interaction and the resulting challenges is a precondition for establishing success criteria for the utility of a given model for decision-makers.

This includes various epistemic and morally normative aspects—particularly with an emphasis on the epistemic properties that need to be made transparent, the evidential status of ML model predictions, and the value that should be assigned to variance/arbitrariness in the decision-making process (see also Creel and Hellman [21]). Put differently, is the deployment of an ML model motivated by the aim of enhancing a human’s decision-making capacities, or is it meant to constrain them—e.g., by counteracting potential racial biases?

Either way, it should be evident that randomized controlled trials (RCTs) and prospective studies on human–machine interaction are indispensable for determining whether an ML model is indeed used adequately by policymakers [31]. Important desiderata here are a proper documentation of when and why human decision-makers deviate from algorithmic predictions and of how their attitudes toward the decision support evolve over time. As it stands, the existing literature is still dominated by one-shot online studies with laypeople (e.g., [32, 33]) or the analysis of retrospective data [46]. Despite their merits, both study designs are susceptible to research biases and therefore raise concerns regarding the external validity of the evidence they yield. When testing human–ML interaction in realistic settings, one complication is that it might be impossible to detect erroneous decisions in real time, given that there is oftentimes no stable ground truth. One way to overcome this problem is to shift the focus from accuracy metrics toward meaningful outcome measures. For instance, in a medical setting, one might evaluate whether the physician–ML tandem led to better patient outcomes in terms of quality-adjusted life years (see also [2]).

It may then be necessary to supplement population-level studies with qualitative studies that seek to track the psychological mechanisms of decision-makers.Footnote 8 One crucial aspect here is to investigate whether decision-makers at varying levels of seniority—and possibly with more conservative or liberal attitudes—use a given ML model in the same way. In case of stark differences, it might be important to explore how disparate interactions can be remedied through best-practice guidelines, interventions on the decision architecture, and so on. Either way, addressing the manifold pitfalls in the interaction of human decision-makers and ML models will prove to be a key task in ensuring that algorithmic assistance leads to socially meaningful outcomes.

7 Conclusion

The aim of this paper was to develop an account of model evaluation—with an emphasis on fairness concerns—that takes the social situatedness of ML models as its starting point. By drawing on the adequacy-for-purpose view, epistemic norms and desiderata for an adequate deployment of ML models were identified along the dimensions of Social Objectives, Measurement, Social Dynamics, and Utility for Decision-Makers. In brief, the adequacy conditions involve (i) ensuring alignment between the social objective and the prediction task that operationalizes it, (ii) a proper validation of the target variable by way of psychometric testing, (iii) taking measures to ensure stable performance over time, whilst also trying to anticipate the broader moral repercussions induced by the deployment of the model in the social environment, and (iv) an evaluation of how the input provided by the ML model is used by human decision-makers. In that respect, the objective was to develop a unified framework that helps tackle issues resulting from the interplay of the ML model and the social environment more systematically. One critical insight that has emerged in the process is that any auditing of ML models that are meant to provide assistance in consequential decision-making cannot be limited to an assessment of statistical properties. Rather, it is necessary to incorporate a variety of methods from the social sciences. At the same time, turning to social science methodology highlights why the guarantees that can be obtained will necessarily be imperfect and why some underlying normative issues will haunt the fair ML research program for the foreseeable future.