1 Introduction

In recruitment, promotion, admission, and other forms of apportionment of wealth and power, an evaluator typically ranks a set of candidates in terms of their perceived competence. If the evaluator is prejudiced, the resulting ranking will misrepresent the candidates’ actual ranking. This constitutes not only a moral and a practical problem, but also an epistemological one, which raises the question of what we should do, epistemologically, to mitigate it.

The overwhelmingly most common approach in the literature on prejudice is to advocate some kind of preventive intervention to handle problems of prejudice (cf. Madva, 2020). Such interventions aim to stop the prejudiced behavior—the misrepresentation, in the case at hand—from happening. Applicable preventive interventions include both structural interventions that change the circumstances under which the ranking takes place (e.g. the introduction of anonymization, or the requirement that criteria–based decisions are made) and thereby decrease prejudiced behavior, and individual interventions that attempt to change the evaluator and thus make them less prejudiced. Applied to the ranking problem, such interventions thus aim to make sure that the veracity of future rankings is higher than it would have been if the intervention had not been carried out.

A novel, very different, approach is advocated by Jönsson & Sjödahl (2017) (and refined by Jönsson and Bergman forthcoming). On their approach, we can increase the veracity of prejudiced rankings directly, by updating them in light of available statistical information. Evaluators are thus not prevented from producing prejudiced rankings. Instead, the intervention aims to improve the veracity of these rankings after they have been produced (but before they have led to discriminatory behavior). This kind of post hoc intervention thus aims to make sure that rankings have greater veracity after the intervention has been carried out, than they had before.

The intervention advocated by Jönsson and Sjödahl has been demonstrated to work as suggested (cf. Jönsson and Bergman forthcoming) given that we have access to an evaluator’s history of evaluations, and a number of additional presuppositions are satisfied (see Sect. 3). It is thus quite clear when the intervention can and cannot be used as a solution to the ranking problem (e.g. it cannot be used if the relevant population means are not known). The situation is different with respect to preventive interventions, where corresponding presuppositions haven’t been spelled out. It is thus unclear which interventions (or which combinations of interventions) are at our disposal when we face the ranking problem in different situations.

In what follows, I will clarify this by spelling out the background assumptions that need to hold in order for individual interventions, a prominent subclass of the preventive interventions, to likely yield positive epistemological effects in ranking situations (Sect. 2). I will then describe the corresponding presuppositions of the post hoc intervention advocated by Jönsson and Sjödahl (Sect. 3). With these two sets of presuppositions in place, I will reflect on the limitations imposed by each presupposition on its intervention, compare presuppositions across the two kinds of interventions, and conclude that the two kinds of interventions importantly complement each other by having fairly disjoint, but non–conflicting, presuppositions (Sect. 4). The post hoc intervention can thus complement an individual intervention (and vice versa) both in situations where both are applicable (by adding further increases in veracity) and by applying to situations where the individual intervention is not applicable (thereby increasing veracity in situations beyond the reach of that intervention).

2 The Presuppositions of Individual Interventions

2.1 Preliminaries

What follows will be concerned with situations where an evaluator evaluates a set of people with respect to some goal (e.g. hiring, admission, promotion, proposal evaluation etc.), and how such evaluation can be improved. Even if these situations by no means exhaust the situations where prejudice is manifested, they do apply to a large range of situations which are tremendously important from the perspective of the distribution of wealth and power in society and the overall integration of its members. So even though a focus on evaluation situations involves a considerable restriction (excluding as it does, for instance, manifestations of hostility and violence), it still targets tremendously important situations. It also bears emphasis that evaluation situations are often characterized by prejudice, as is demonstrated particularly clearly by so-called CV–studies (See e.g. Steinpreis et al., 1999; Bertrand & Mullainathan, 2004; Correll et al., 2007; Moss–Racusin et al. 2012; Döbrich et al., 2014; Agerström & Rooth, 2011; Rooth, 2010) which demonstrate that identical, or comparable, CVs give rise to differential evaluation when they signal membership in different irrelevant social groups (e.g. male/female or Black/White).

Most evaluation situations involve rankings, i.e. orders which relate all members of a set of candidates, but allow for two or more members to have the same rank (formally, a total preorder over a set of people). Some evaluation situations just involve ranking a single candidate above the rest, but are ranking situations in this sense nonetheless. Thinking about evaluation in terms of ranking is particularly useful when we are concerned with the epistemological problems of prejudice, since it affords an easily measurable notion of misrepresentation, in terms of the difference (e.g. in terms of a rank order correlation) between the correct ranking of candidates in terms of their competence, and the ranking produced by an evaluator.Footnote 1 Had we instead focused on, for instance, people’s possibly erroneous general beliefs about particular social groups (which is commonplace in the philosophy of prejudice), the notion of a misrepresentation, and the related notion of increased veracity, would have been harder to operationalize, in particular since these beliefs are typically expressed in terms of generic generalizations, a form of generalization with very unclear truth conditions (cf. Leslie and Lerner 2016). The question of what we should do to improve the veracity of rankings thus seems clearer than other related questions.Footnote 2

With these preliminaries in place, we can define an individual prejudice intervention (‘individual intervention’, for short) broadly as an attempt to change an individual so that they become less prejudiced. It can thus be seen as a function from one set of features of an individual (the features that influence, or are influenced by, the outcome of the intervention) to another such set (what the individual is like after the intervention). Thus understood, individual interventions encompass everything from increasing intergroup contact (Allport, 1954; Pettigrew & Tropp, 2006), learning counterstereotypical ‘if–then plans’ (so-called ‘implementation intentions’) (Rees et al., 2019), and just being told to avoid prejudice in the right way (Döbrich et al., 2014), to playing a softball game with teammates mostly from the social group you are prejudiced towards, in order to reduce implicit bias (Lai et al., 2014).

Since our focus is on evaluation situations, we are concerned with such functions where ‘become less prejudiced’ is understood in terms of ‘evaluate more accurately’.

2.2 The Presuppositions of Individual Interventions

Consider now the question of what conditions need to hold in order for an individual intervention to be epistemologically beneficial with respect to a prejudiced evaluator E and future ranking situations of sort s (e.g. hiring new baristas at a particular coffee shop).

First, in order for it to be (conceptually) possible for an intervention to be successful, E’s evaluations have to be (conceptually) possible to improve. This presupposes some performance on the part of E that can be evaluated, and this seems to minimally involve that E is able to order the people E is evaluating, i.e. that E’s evaluations can be captured by an ordinal scale (equivalently, that the evaluations have the structure of a ranking). Although this assumption is likely often satisfied in practice, it is not a completely trivial assumption, since it rules out transitivity violations and cases of non–comparability, so an evaluator who ranks a before b before c before a, or is unable to rank two or more people, does not satisfy this assumption.

With this in place, we can (following Jönsson & Sjödahl, 2017 and Jönsson and Bergman, forthcoming) measure the veracity of a ranking in terms of the rank order correlation between an evaluator’s ranking and the correct ranking (e.g. in terms of Spearman’s rank correlation coefficient, a number between 1—which signifies a perfect positive correlation—and –1—which signifies a perfect negative correlation).Footnote 3 Given this measure of veracity, E’s degree of accuracy (with respect to s–situations) can be identified with the average veracity of rankings that E is disposed to produce (in s–situations).
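The veracity measure just described can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by Jönsson & Sjödahl: the function names (`average_ranks`, `spearman`) and the example scores are hypothetical, and ties are handled by one common convention, namely giving tied candidates the mean of the positions they occupy and then computing a Pearson correlation on the resulting ranks.

```python
from math import sqrt
from statistics import mean


def average_ranks(scores):
    """Rank 1 = highest score; tied candidates share the mean of their positions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next candidate has the same score.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        # The coefficient is undefined when one ranking puts everyone on the same rank.
        raise ValueError("correlation undefined for a ranking without variation")
    return cov / sqrt(var_x * var_y)


# Hypothetical scores: the evaluator swaps the two best candidates.
correct_scores = [9, 7, 6, 4]
evaluator_scores = [7, 9, 6, 4]
veracity = spearman(correct_scores, evaluator_scores)
print(veracity)
```

With these hypothetical scores the veracity is 0.8. An evaluator's accuracy in s–situations would then be the average of such values over the rankings the evaluator is disposed to produce.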

We can then understand an intervention being successful in terms of it bringing about an improvement in accuracy.

An improvement in accuracy can in turn be understood in two main ways: either (1) asynchronously, in terms of the contrast between the veracity of E’s rankings before and after the intervention (which involves comparing two different sets of rankings), or (2) synchronously, in terms of the contrast between the veracity of E’s rankings after the intervention, and the veracity of those rankings if E had not undergone the intervention (i.e. the contrast between E’s accuracy with and without the intervention). Both notions are reasonable, but (2) is slightly better than (1) since it avoids possibly confounding differences between past and future rankings (e.g. that later rankings are easier, or feature fewer people that E is prejudiced towards). Moreover, (2) is in the same spirit as a (between factors) experimental design, which is common in intervention studies (see below).

Second, the aforementioned measure of veracity presupposes that there are objective rankings that E’s rankings can be evaluated in terms of. Although such rankings are seldom readily available in real life selection situations (due to time pressure and various uncertainties concerning the candidates) it seems reasonable to assume that there are such rankings in many situations. Moreover, it seems reasonable to think that we can come to know these rankings (given enough time, information and ingenuity), at least in testing situationsFootnote 4. Such objective rankings are readily available, for instance, in the aforementioned CV–studies. In these studies, individual interventions are often tested using pairs of CVs that differ only with respect to which irrelevant social group a candidate belongs to. In these situations, the objective ranking of the two CVs—qua them being relevantly identical—is just an assignment of the same rank to both CVs. An evaluator who assigns them different ranks thus deviates from the objective ranking, something which can be captured by the suggested veracity measure.Footnote 5

Third, in order for the intervention to be epistemologically beneficial to E, it needs to be accuracy increasing in s–situations featuring people from the social groups which E is prejudiced towards. We can distinguish between two situations. First, an individual intervention is unconditionally accuracy increasing (in s–situations, and with respect to certain social groups) if it tends to increase the accuracy of the individual that undergoes it independently of their features (over and above their prejudice). Second, an individual intervention is conditionally accuracy increasing on f, where f is a feature (or set of features), if it tends to increase the accuracy of the individual that undergoes it only if they are f.

At the very least, showing that an individual intervention is unconditionally accuracy increasing in s–situations, typically involves measuring the accuracy of a large, diverse, randomly selected group of people by having them rank individuals (or representations of individuals) in s–situations, then administering an intervention to some portion of these people (the intervention group), measuring everyone’s accuracy again, and finding a significant improvement for the intervention group but not for the control group with some suitable statistical test, e.g. a test for an interaction effect in a mixed measures analysis of variance (ANOVA). Given a sufficient number of such studies across different conditions, it has been shown that the intervention is unconditionally accuracy increasing.
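The logic of such a test can be sketched as follows. To keep the example dependent only on the standard library, it swaps in a one-sided exact permutation test on gain scores (accuracy after minus accuracy before) in place of the ANOVA interaction test mentioned above. All data and function names are hypothetical, and real studies would use far larger groups.

```python
from itertools import combinations


def gains(pre, post):
    """Per-participant change in accuracy (after minus before)."""
    return [after - before for before, after in zip(pre, post)]


def mean_difference(values, chosen):
    """Mean of the chosen positions minus mean of the remaining positions."""
    inside = [values[i] for i in chosen]
    outside = [v for i, v in enumerate(values) if i not in chosen]
    return sum(inside) / len(inside) - sum(outside) / len(outside)


def permutation_p_value(treated_gains, control_gains):
    """Share of all group assignments whose gain difference is at least the observed one."""
    pooled = treated_gains + control_gains
    k = len(treated_gains)
    observed = mean_difference(pooled, tuple(range(k)))
    splits = list(combinations(range(len(pooled)), k))
    at_least_as_large = sum(
        mean_difference(pooled, split) >= observed - 1e-12 for split in splits
    )
    return at_least_as_large / len(splits)


# Hypothetical accuracy values (mean rank correlations) per participant.
intervention_pre = [0.40, 0.35, 0.50, 0.45, 0.30]
intervention_post = [0.70, 0.60, 0.75, 0.80, 0.55]
control_pre = [0.42, 0.38, 0.48, 0.33, 0.44]
control_post = [0.45, 0.36, 0.50, 0.35, 0.43]

p_value = permutation_p_value(
    gains(intervention_pre, intervention_post),
    gains(control_pre, control_post),
)
print(p_value)
```

In this made-up data set every participant in the intervention group gains more than every participant in the control group, so only one of the 252 possible group assignments is as extreme as the observed one.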

If it is instead found that the intervention only leads to increased accuracy for people that are f, or if testing has hitherto only involved people that are f, it has only been shown that the intervention is conditionally accuracy increasing on f.

Fourth, if the intervention is only conditionally accuracy increasing on f it also needs to be the case that E is f in order for the intervention to epistemologically benefit E. In some cases this will be fairly trivial to ascertain, (e.g. if an intervention only increases accuracy for women), but other times it will not be (e.g. if the intervention only increases accuracy for people having a certain level of implicit bias) and E has to be shown to satisfy f through additional testing.

Fifth, most individual interventions change various psychological states of the individual in order to bring about a particular change in ranking accuracy (e.g. if the intervention involves E learning certain ‘if—then plans’, this changes the evaluator’s beliefs and/or associations). Given that these changes influence the evaluator’s accuracy, it is also likely that they will bring about other changes. This raises the possibility that these other changes might not be epistemological improvements. So in order for an individual intervention to be epistemologically overall beneficial with respect to E it is important that the gains in accuracy the intervention gives rise to, are not offset by unintended detrimental epistemological side–effects. Sometimes an absence of such side effects will be fairly obvious, but in other circumstances things will be less clear (as will be exemplified shortly).

Sixth, assumptions 1–5 are, strictly speaking, all the assumptions that are needed in order for an individual intervention to be epistemologically beneficial with respect to a prejudiced evaluator E. However, in order for it to be reasonable to believe that an individual intervention is epistemologically beneficial, it needs to be shown that the third assumption (and possibly the fourth and fifth) obtains, and this presupposes that the presuppositions of the statistical tests used to show this are also satisfied.Footnote 6 For instance, if an ANOVA is used, it is presupposed that the involved populations are normally distributed and have homogenous variance.Footnote 7

In order to sum up the aforementioned assumptions succinctly (and to frame them in a way that facilitates comparison with the corresponding assumptions of the post hoc intervention in Sect. 3), let E be an evaluator who evaluates people in s–situations, let g1 … gp be social groups towards which E is substantially prejudiced (i.e. prejudiced enough for this to distort E’s rankings), let r1 … rn be future rankings produced by E (in s–situations) that feature members of g1 … gp, and let m1 … mo be rankings used in empirical tests of a particular intervention. We can then say that it is reasonable to believe that an individual intervention will be epistemologically beneficial with respect to E and future s–situations r1 … rn, if it is reasonable to believe the following six assumptions.

  1. I1.

    E’s evaluations (r1 … rn and m1 … mo) are carried out using, minimally, an ordinal scale.

  2. I2.

    There are objective rankings for r1 … rn and m1 … mo.

  3. I3.

    The intervention is accuracy increasing in s–situations featuring members of g1 … gp, either unconditionally or conditionally on f.

  4. I4.

    (E is f.)

  5. I5.

    The intervention is without negative epistemological side–effects.

  6. I6.

    The statistical assumptions underlying the demonstration of I3 (and possibly I4, and I5) are satisfied.

For the purpose of illustrating a case where I think that these conditions are satisfied, consider the simple individual intervention (AGE) due to Döbrich et al. (2014) aimed at reducing ageism in performance appraisals and hiring decisions. AGE just involves briefly informing evaluators of prevailing age biases and the negative consequences of age discrimination, and telling them to disregard age when carrying out their evaluations. To test AGE, Döbrich et al. (2014, study 2) had participants evaluate job applications that were identical save for the indicated age of the fictitious applicant. The participants had no difficulty evaluating the applications, and to the extent that E is like the participants, I1 is thus reasonable. As argued above, CV–studies induce an objective ranking, and I2 is thus satisfied (at least with respect to the testing situation).

Döbrich et al. demonstrated a substantial age–based prejudice among the participants, and that AGE could successfully remove this prejudice (or at least the influence of this prejudice) to the point of non–significance, using a between measures analysis of co–variance (ANCOVA). This demonstration lends some credence to the idea that AGE is accuracy increasing, conditionally on an evaluator being like the participants in the study (people currently or previously working in HR with experience of hiring decisions), i.e. that I3 is satisfied. Given that our imagined evaluator E is like the participants in the study, I4 is also satisfied. AGE just warned participants about the existence and effects of ageism and prompted them to not base their evaluation on age. Refraining from an irrelevant factor in this way seems unlikely to have negative side–effects, and I5 is thus reasonable as well.Footnote 8 Finally, no reasons are reported by Döbrich et al. to doubt the statistical assumptions underlying the ANCOVA (e.g. similarity of variance and distributional shape of the underlying populations and some additional linearity assumptions), so it seems reasonable to assume that I6 holds as well. Since all of I1–I6 seem reasonable, it seems tentatively reasonable to think that AGE will be epistemologically beneficial with respect to an evaluator E (who is like the participants in the study) and ranking situations like those in the study.

Before we move on, let us also consider an example of a kind of intervention where it is currently more questionable that I1–I6 are satisfied. For this purpose, we can consider an implicit bias intervention, an individual intervention that attempts, in the first instance, to reduce a measure of implicit bias, such as the IAT (Greenwald et al., 1998). One such intervention (endorsed by Madva, 2020, and found to be among the most effective implicit bias interventions tested by Lai et al., 2014 in a comparison of 17 such interventions) encourages participants to form implementation intentions, simple if–then plans of the form ‘If I see a Black person, I will respond by thinking “good”’.Footnote 9 This kind of intervention has repeatedly been shown to lower implicit bias (as measured by the IAT and other measures, see e.g. Gollwitzer & Schaal, 1998; Gollwitzer, 1999; Stewart & Payne, 2008; Mendoza et al., 2010; Webb et al., 2012; Lai et al., 2014; Wieber et al., 2014; Rees et al., 2019). However, it is questionable if it is epistemologically beneficial in the sense we are concerned with in the present context.

First, it is doubtful that the intervention really is accuracy increasing, since demonstrations of it decreasing implicit bias measures do not establish this, even if these are coupled with negative correlations between implicit bias and degree of accuracy. Unless implicit bias causally contributes to impaired accuracy, changing implicit bias will not improve accuracy. If implicit bias and prejudiced behavior are both determined by some third factor, then lowering implicit bias will not reduce biased behavior. This in itself should be enough for us to doubt I3, but this doubt is further strengthened by a meta study by Forscher et al. (2019), who reported that even successful reductions in implicit bias do not generally seem to substantially reduce degree of prejudiced behavior.

Second, implementation intentions, as they have been studied so far, are also questionable from the perspective of I5, since they seem more susceptible to side–effects than something like AGE, given that it is less clear what cognitive effects they have. Strengthening an association between Black faces and positive feelings need not have overall epistemologically beneficial consequences. For instance, Stewart & Payne’s (2008) successful intervention to reduce the degree to which participants misidentify tools as weapons after seeing a Black face by way of an implementation intention (‘If I see a Black face, I will think safe.’), also had the consequence of having participants misidentify guns as harmless tools after seeing a Black face (cf. Stewart & Payne, 2008: 1337, Fig. 1) to a greater extent. As long as the intervention is abstracted away from the task at hand, as is the case with many implementation intentions, the presence of unintended side-effects is more difficult to rule out.Footnote 10

Given these doubts about two of the conditions listed above, it seems—given the evidence currently available—to be questionable whether an implementation intention is epistemologically beneficial with respect to an evaluator. This is—of course—not meant as a forward–looking critique of implementation intentions—future research might reveal them to be excellent ways to reduce prejudice.Footnote 11 It is just an illustration of what doubts about the six conditions might look like, given our current evidential state.

3 The Presuppositions of a Post hoc Intervention

I will now momentarily leave the individual interventions and shift focus to the presuppositions of the post hoc intervention advocated by Jönsson & Sjödahl (2017) (refined by Jönsson and Bergman forthcoming).Footnote 12 Jönsson & Sjödahl’s (2017) intervention—GIIU—does not attempt to change evaluators, but attempts to change the rankings they produce after the fact (but before the rankings give rise to discriminatory effects).Footnote 13 The basic idea is that if the population means—with respect to some relevant form of competence—are known for the social groups (e.g. men and women) one is interested in, and a sufficiently large history of evaluations (a set of prior rankings) featuring these groups has been produced by E, E’s degree of prejudice can be estimated by comparing E’s past evaluations with what is to be expected from these if E had not been prejudiced. GIIU then updates future rankings in light of this estimated bias, via a set of corrective functions, and thereby attempts to increase the veracity of the rankings.

To make this more vivid, consider the following illustration from Jönsson & Sjödahl (2017), where the history of evaluations is assumed to contain three sets of competence evaluations, h1, h2 and h3, and we are considering whether to update the target ranking r0.

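| Rank | h1 | Score | Rank | h2 | Score | Rank | h3 | Score | Rank | r0 | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | John | 8 | 1 | Mike | 7 | 1 | Brittney | 6 | 1 | Mike | 8 |
| 2 | Luke | 7 | 2 | Catherine | 6 | 2 | Jamal | 5 | 2 | Mark | 7 |
| 3 | Sarah | 4 | | Billy | 6 | 3 | Richard | 3 | 3 | Felicia | 6 |
| | Amber | 4 | 4 | Richard | 5 | | Susan | 3 | 4 | Gordon | 4 |
| 5 | Isa | 3 | 5 | Jennifer | 4 | | Aaliyah | 3 | 5 | Sarah | 3 |
| | Jenny | 3 | | | | 6 | Molly | 2 | | Latifah | 3 |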

If we assume that the target groups are men and women, we can note that the mean competence scores in the history of evaluations for these groups are 5.9 and 3.8 respectively. Given that this is a statistically significant difference (if it is not, GIIU does nothing), and that there is no difference between men’s and women’s population means, GIIU concludes that E is biased and that r0 needs to be updated. One way to do this (which would remove the mean difference between men and women in the history of evaluations) is to add 2.1 to the score of each woman in r0 (Felicia, Sarah and Latifah), so one corrective function f1 corresponds to ‘f1(x) = x + 2.1’. Another is to multiply each woman’s score by 1.55, so ‘f2(x) = x × 1.55’ corresponds to another corrective function f2. If either of these functions is applied to the women’s scores in the history of evaluations, the difference in men’s and women’s means would disappear. Other corrective functions might have the same result. GIIU then individually applies each corrective function to the values in r0, resulting in n different rankings f1(r0), f2(r0), etc., one for each corrective function fi. To the extent that these rankings converge ordinally on the same results (e.g. Felicia being ranked higher than Mark) GIIU recommends that r0 be updated accordingly. If we assume that f1 and f2 are the only relevant corrective functions in this example, GIIU thus recommends that r0 be updated in accordance with the ranking that they agree on. As it happens, they completely agree on the rank order of the candidates, and GIIU thus recommends that r0 be replaced with the new ranking.

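| f1(r0) | Score | f2(r0) | Score | Updated r0 |
|---|---|---|---|---|
| Felicia | 8.1 | Felicia | 9.3 | Felicia |
| Mike | 8 | Mike | 8 | Mike |
| Mark | 7 | Mark | 7 | Mark |
| Sarah | 5.1 | Sarah | 4.65 | Sarah |
| Latifah | 5.1 | Latifah | 4.65 | Latifah |
| Gordon | 4 | Gordon | 4 | Gordon |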
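The worked example can be retraced in code. The following is a minimal sketch, not Jönsson & Sjödahl's implementation. It leaves out the significance test, it only handles the case where all corrective functions agree completely, and the group labels are an assumption inferred from the candidates' names (they do reproduce the reported means of 5.9 and 3.8). The constants 2.1 and 1.55 are taken from the example, while the helper name `ranks_after` is hypothetical.

```python
from statistics import mean

# History of evaluations as (name, group, score). Group labels ("m"/"w") are
# inferred from the names and are an assumption of this sketch.
history = [
    ("John", "m", 8), ("Luke", "m", 7), ("Sarah", "w", 4),
    ("Amber", "w", 4), ("Isa", "w", 3), ("Jenny", "w", 3),          # h1
    ("Mike", "m", 7), ("Catherine", "w", 6), ("Billy", "m", 6),
    ("Richard", "m", 5), ("Jennifer", "w", 4),                       # h2
    ("Brittney", "w", 6), ("Jamal", "m", 5), ("Richard", "m", 3),
    ("Susan", "w", 3), ("Aaliyah", "w", 3), ("Molly", "w", 2),       # h3
]
r0 = [("Mike", "m", 8), ("Mark", "m", 7), ("Felicia", "w", 6),
      ("Gordon", "m", 4), ("Sarah", "w", 3), ("Latifah", "w", 3)]

mean_men = mean([score for _, group, score in history if group == "m"])
mean_women = mean([score for _, group, score in history if group == "w"])

# Corrective functions from the example (applied to women's scores only).
corrective_functions = {
    "f1": lambda x: x + 2.1,
    "f2": lambda x: x * 1.55,
}


def ranks_after(correction, ranking):
    """Correct the women's scores, then rank: 1 + number of strictly higher scores."""
    corrected = {
        name: correction(score) if group == "w" else score
        for name, group, score in ranking
    }
    return {
        name: 1 + sum(other > score for other in corrected.values())
        for name, score in corrected.items()
    }


rankings = {label: ranks_after(f, r0) for label, f in corrective_functions.items()}

# Update r0 only if every corrective function yields the same ordinal result.
converged = all(r == rankings["f1"] for r in rankings.values())
updated_r0 = sorted(rankings["f1"], key=rankings["f1"].get) if converged else None

print(round(mean_men, 1), round(mean_women, 1))
print(updated_r0)
```

Ties are represented here by shared rank numbers, so Sarah and Latifah both receive rank 4 under f1 and f2, which matches the table above.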

In order for GIIU to reliably improve accuracy (e.g. for it to be reasonable to think that the updated r0 is better than the old one) a number of presuppositions must be satisfied. Jönsson and Sjödahl originally suggested five different presuppositions, but this list has been refined by Jönsson and Bergman (forthcoming) and extended into a slightly longer, more accurate list, which we will use here.

To partly repeat, and partly extend, the notation introduced to describe the presuppositions of individual interventions above, let E be an evaluator who evaluates people in s–situations in terms of some property c (e.g. competence), let h be E’s history of evaluations (i.e. the set of rankings that E has previously produced), let g1 … gp be social groups (populations) towards which E is substantially prejudiced, and let r1 … rm be additional, possibly future, rankings produced by E featuring members of g1 … gp.

With this in place, we can say that GIIU will tend to increase the veracity of r1 … rm, and thus be epistemologically beneficial, if the following seven assumptions obtain.

  1. G1.

    E’s evaluations (those in h and r1 … rm) are carried out using, minimally, an interval scale.

  2. G2.

    h is large enough to reliably find substantial prejudices against members of g1 … gp, with a suitable statistical test.

  3. G3.

    The mean values of c in (the populations) g1 … gp are known, or are known to be the same.

  4. G4.

    For the purposes of estimating and correcting prejudice, GIIU makes use of appropriate subsets of g1 … gp in h and r1 … rm.

  5. G5.

    Any fluctuations in E’s prejudice within each of the groups g1 … gp are small compared to the size of the corresponding prejudice.

  6. G6.

    E’s prejudice operates in an approximately linear way.

  7. G7.

    E’s prejudice operates on discrete groups.

These presuppositions will become clearer during the discussion in the next section, but to give the reader the basic gist of why they are presupposed, here are brief explanations in terms of the example above.Footnote 14

G1 is presupposed since it doesn’t make sense to apply arithmetic functions like f1 and f2 to ordinal numbers (cf. Jönsson & Sjödahl, 2017: 507). G2 is obvious. Without G3 there is no statistical basis for forming a hypothesis concerning what the average scores of men and women in h would have been, had E not been prejudiced, and E’s prejudice can thus not be estimated from the contrast between these scores and E’s actual scores. If G4 is violated, and E is not really prejudiced towards women, but towards some subset of women (e.g. only very feminine women), GIIU might correct scores in r0 for persons who E is not prejudiced against (women that are not very feminine). If E’s prejudice fluctuates greatly from woman to woman, G5 is violated, and GIIU might not improve veracity, since the corrective functions correct for prejudice which is supposed to be (fairly) stable. Finally, the corrective functions presuppose that an evaluator’s prejudiced evaluation is a function of the evaluated person’s real score (and thus presuppose something like G6) and that E’s prejudice is directed to members of certain social groups (women) categorically, rather than continuously, e.g. as a function of degree of femininity (and thus presuppose G7).
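The point about G1 can be illustrated with the scores from r0. In the sketch below, squaring the scores is a hypothetical relabelling that preserves all ordinal information, yet applying f1 to the relabelled scores no longer moves Felicia to the top. The outcome of the correction thus depends on information beyond rank order, which is only available if scores are on (at least) an interval scale. The helper names are illustrative only.

```python
def rank_order(scores):
    """Candidate names from highest to lowest score (ties keep their listed order)."""
    return sorted(scores, key=scores.get, reverse=True)


r0 = {"Mike": 8, "Mark": 7, "Felicia": 6, "Gordon": 4, "Sarah": 3, "Latifah": 3}
women = {"Felicia", "Sarah", "Latifah"}


def apply_f1(scores, shift=2.1):
    """The corrective function f1 from the example, applied to women's scores only."""
    return {name: s + shift if name in women else s for name, s in scores.items()}


# Hypothetical order-preserving relabelling of the same ordinal information.
squared = {name: s ** 2 for name, s in r0.items()}

same_ordinal_content = rank_order(r0) == rank_order(squared)
top_after_f1_original = rank_order(apply_f1(r0))[0]
top_after_f1_squared = rank_order(apply_f1(squared))[0]

print(same_ordinal_content, top_after_f1_original, top_after_f1_squared)
```

Both score assignments encode the same ranking, but the correction puts Felicia first in one case and leaves Mike first in the other.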

4 On the Presuppositions of Individual Interventions and the Post hoc Intervention

Let’s now discuss the presuppositions of individual interventions which were listed in Sect. 2, and the corresponding presuppositions of GIIU, as well as the limits imposed by each on the intervention to which they belong, and how the presuppositions of the two kinds of interventions relate.

First, it should be noted that the two sets of presuppositions are not in conflict, and that GIIU can be combined with an individual intervention if this is done with some care.Footnote 15 What should be avoided is administering an individual intervention between the time when the rankings in h take place and the time when those in r1 … rm take place. For if that intervention turns out to be successful, GIIU might overestimate E’s degree of prejudice when E carries out r1 … rm, and overcompensate, which might lead to a decrease in veracity. However, an individual intervention can complement GIIU without the risk of overcompensation, if the intervention is administered before h. Then, that intervention might lower E’s prejudice to some degree, and GIIU can monitor and correct for any residual prejudice.
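The risk of overcompensation can be made concrete with a small arithmetic sketch. It assumes, purely for illustration, that E's prejudice is additive, that GIIU estimated a gap of 2.1 from h (the figure from the earlier example), and that an intervention administered after h has reduced the prejudice to 1.0. The latter number and the true score are hypothetical.

```python
estimated_bias = 2.1  # gap estimated from h, before the individual intervention
residual_bias = 1.0   # hypothetical prejudice remaining after the intervention

true_score = 6.0                        # hypothetical true competence of a candidate
reported = true_score - residual_bias   # score E assigns after the intervention
corrected = reported + estimated_bias   # GIIU still applies the outdated estimate
overshoot = corrected - true_score      # positive value = overcompensation

print(round(overshoot, 2))
```

Under these assumptions the corrected score exceeds the true score by about 1.1. If the intervention is instead administered before h, the estimate is based on the residual prejudice and no such overshoot arises.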

Second, the only place where the two sets of presuppositions obviously overlap is with respect to G1 and I1, where the former assumption entails the latter, and both interventions thus presuppose I1. This assumption is so widely true, though, that it is not very restrictive. Moreover, G1 will seldom restrict GIIU’s applicability further either, since evaluations that presuppose an interval scale are independently very common in contexts such as hiring, admission and project funding (cf. Gatewood et al., 2015: 210–212). For instance, many selection procedures involve grading candidates on a number of criteria, and then summing the resulting grades for an overall comparative score. Doing this presupposes an interval scale by itself.

Third, it might be worth pointing out that G2, although clearly restrictive to some extent, is less demanding than it might seem. According to Jönsson and Bergman's (forthcoming) estimates, only between ten and twenty evaluations are needed from each of g1 … gp for reasonable statistical power to reliably find a prejudice when it obtains. So the toy example described above is not that far off, quantity–wise.

Moreover, it can be mentioned that even if GIIU builds on the idea of estimating bias from actual past evaluations, a post hoc intervention quite like GIIU–call it GIIUfc–could be devised where one uses fictitious candidates in order to get a prejudice estimate. This would broaden the scope of post hoc interventions to cases where G2 (or G3, or both) is not satisfied.Footnote 16

Fourth, we have reason to believe in G3 in two different situations: either when a particular mean difference in competence between two social groups (or an absence of such a difference) has been demonstrated empirically, or when we are licensed to believe that no such difference exists in the absence of positive corroboration. The prevalence of the latter situations depends on how we think about the burden of proof concerning claims about absences of differences. Although there is no space to discuss this here, it can be mentioned that the assumption that there is no mean difference in competence between two social groups is very similar to what one frequently assumes in CV–studies when one assumes that the manipulated variable is irrelevant to the evaluation at hand. Even if this is not a part of the presuppositions of individual interventions per se, it is thus presupposed by a common procedure to demonstrate I3.

It can also be mentioned that there is some affinity between G3 and I6, since both involve population level assumptions (even though the assumptions of I6 are typically about variance and distributional shape of evaluators, rather than means for the social groups being evaluated by evaluators).

Fifth, G4 amounts to the claim that the evaluator and GIIU categorize people into roughly the same categories. This is very plausible in many situations, since people are largely in agreement on how to categorize people into many social groups such as men, women, Black people, White people, obese people, etc. There is thus likely a correspondence in categorization between E and the person administering GIIU with respect to these groups.

Things are more problematic in situations where there is greater variability in categorization, i.e. where social group membership is identified via less clear physical attributes, as is the case, for instance, with sexual orientation or religious conviction. In these cases, categorization is sometimes due to an inference from known features that are merely associated with group membership (e.g. inferring that someone is of Muslim faith from their having a certain name, complexion and facial hair), and the willingness to make this inference might differ between different people. GIIU can still be used in these cases if it is used conservatively, so that it is reasonable to think that the categories it uses are subsets of those used by the evaluator. But if it is unclear whether or not this is the case, it is better to use another kind of intervention, to avoid the risk of overcompensation.

Another problem bears mention in this connection as well. Every person being evaluated belongs to multiple social groups. For instance, a person can belong to all of the following groups: man, Black, Christian, Black man, Christian man, Christian Black man. Depending on which group GIIU considers, it might recommend different updates (or no update at all). The question then arises: which of these updates should be used? This problem is both a version of the reference class problem and the problem of intersectional prejudice.Footnote 17 One strategy is to consider only the update pertaining to the smallest relevant social group (i.e. Christian Black man in this example). This could mean, though, that there will not be enough members of that group featured in the learning history of E for GIIU to reliably find a bias. If this is the case, one might need to switch either to GIIUfc or to another kind of intervention entirely. However, even if GIIU is unable to find and correct for all intersectional prejudice, it might still find and correct for more general forms of prejudice.
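One way to make the strategy concrete is sketched below in Python. The sketch is my own illustration and not part of GIIU: it represents each person as a set of attributes, tries the most specific group first, and falls back to a more general group only when h contains too few members of the specific one. The fallback rule, the function name and the threshold of ten evaluations (borrowed from the lower end of the estimate mentioned above) are all assumptions, and ties between equally specific groups are broken arbitrarily.

```python
from itertools import combinations


def pick_reference_group(person, history, min_n=10):
    """Return the most specific group of `person` with at least `min_n` members in h.

    person: set of attributes, e.g. {"man", "Black", "Christian"}.
    history: one attribute set per person evaluated in h.
    Returns (group, count), or (None, 0) if no group is covered well enough.
    """
    attributes = sorted(person)
    for size in range(len(attributes), 0, -1):  # most specific groups first
        for group in combinations(attributes, size):
            count = sum(1 for past in history if set(group) <= past)
            if count >= min_n:
                return frozenset(group), count
    return None, 0


# Hypothetical history: too few Christian Black men, but enough Black men.
h = (
    [{"man", "Black"}] * 12
    + [{"man", "Black", "Christian"}] * 3
    + [{"woman", "White"}] * 20
)
group, count = pick_reference_group({"man", "Black", "Christian"}, h)
```

With this history, the three Christian Black men are too few, so the sketch settles on the group of Black men, which has fifteen members in h. Any prejudice specific to the narrower group would then go uncorrected, which is the limitation noted above.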

It should be noted that parallel intersectional problems also arise for many individual interventions: which kinds of interventions (in terms of the pertinent social groups) should be researched, and which well–researched interventions should be deployed with respect to a particular evaluator? These problems do not, however, pertain to all interventions. There are interventions (such as anonymization, or making sure that the evaluator is well rested and takes frequent breaks) that remove conditions that enable prejudiced evaluation quite generally. If applicable to a particular situation, these seem best from the perspective of intersectionality. Administering them does not presuppose any commitment to prejudices being directed towards particular social groups.Footnote 18

Sixth, G5 requires that E’s prejudice towards members of a social group doesn’t fluctuate too much from member to member, to make sure that the constant update that GIIU prescribes will lead to improvement. The fluctuation at issue here is not only that within h and within each of the rankings r1 … rm, but also that between them. If G5 is not satisfied, an application of GIIU might decrease the veracity of a ranking rather than increase it.

G5 can thus be seen as a stability assumption, and as such can be seen as a stronger version of an assumption built into the definition of improved accuracy for individual interventions. In Sect. 1, improved accuracy was spelled out in terms of the contrast in accuracy between the actual ranking situations after an intervention, and the corresponding counterfactual situations where the evaluator didn’t undergo the intervention. On this definition, improved accuracy presupposes that E’s accuracy would have remained somewhat stable, and not have increased as much, if E hadn’t undergone the intervention.

Seventh, assumptions G5–G7 are all assumptions that might fail to hold for particular forms of prejudice, in which case a preventive intervention is likely better.Footnote 19 G6 in particular might not be satisfied in some likely cases of prejudice. For instance, consider situations where E assigns competence scores to members of a certain group at least partly with disregard for their actual competence. An extreme example of this is an evaluator who fails all students of a certain ethnicity regardless of their actual competence. A less extreme case is when E, in the grip of something like the stereotype that there are no female geniuses, makes use of a lower maximum score for women than for men. If the top competence score is, say, 10, this would be the case if E assigns an 8 to all women with real competence scores between 8 and 10. Both these situations involve prejudice that violates G6.Footnote 20
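The capped-score case can be worked through in a short Python sketch. As before, the constant additive update estimated from group means is a simplification of my own, not GIIU as specified by Jönsson and Bergman, and the candidates and scores are hypothetical. The sketch also assumes that the two groups have equal mean competence.

```python
from statistics import mean

CAP = 8  # hypothetical maximum score that E is willing to give a woman


def score_given_by_e(true_score, group):
    """E reports true competence, except that women's scores are capped."""
    return min(true_score, CAP) if group == "w" else true_score


# (group, true competence) for six hypothetical candidates.
candidates = [("m", 10), ("m", 8), ("m", 6), ("w", 10), ("w", 8), ("w", 6)]
observed = [(g, score_given_by_e(t, g)) for g, t in candidates]

# Constant update: the observed gap in group means.
gap = mean(s for g, s in observed if g == "m") - mean(s for g, s in observed if g == "w")
corrected = [(g, (s + gap) if g == "w" else s) for g, s in observed]

women_corrected = [s for g, s in corrected if g == "w"]
```

The estimated gap is two thirds of a point. Adding it to every woman's score leaves the women with true scores of 10 and 8 tied, since the cap has erased the difference between them, and it lifts the woman with a true score of 6 above the equally competent man, who was scored accurately. A constant update thus cannot undo this kind of prejudice.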

Fortunately, there will often be clear signs in h if one of these assumptions is violated (e.g. in terms of the range of values used for some social group in h). It might not always be obvious which one it is; a linear but wildly fluctuating bias might be confused with a non–linear stable prejudice in some histories of evaluation. But regardless of which it is, we can to some extent use h to gauge whether we have reason to prefer another kind of intervention over GIIU.
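A simple check of this kind might look as follows in Python. It is a heuristic sketch under my own assumptions, with hypothetical function names and data: it records the range of scores E has used for each group in h and flags any group whose highest score falls short of the highest score E has given to anyone.

```python
def score_ranges(history):
    """Map each group to the (lowest, highest) score E has given it in h."""
    ranges = {}
    for group, score in history:
        low, high = ranges.get(group, (score, score))
        ranges[group] = (min(low, score), max(high, score))
    return ranges


def groups_with_lower_ceiling(history):
    """Return the groups whose top score in h is below the overall top score."""
    ranges = score_ranges(history)
    overall_top = max(high for _, high in ranges.values())
    return sorted(g for g, (_, high) in ranges.items() if high < overall_top)


# Hypothetical history in which E never gives a woman more than 8.
h = [("men", 4), ("men", 7), ("men", 10), ("men", 9),
     ("women", 3), ("women", 8), ("women", 8), ("women", 8)]
flagged = groups_with_lower_ceiling(h)
```

A flagged group is only a warning sign. With a short history, a lower ceiling may reflect nothing more than sampling variation, so the check is a reason to look more closely at h rather than proof that an assumption is violated.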

Eighth, as we now turn our primary focus to individual interventions, we can note that I3 is the most restrictive assumption out of I1–I6. Many kinds of ranking situations have not been studied, and it is thus unclear whether particular individual interventions will increase accuracy in these. The results of meta studies of individual prejudice interventions (cf. Paluck & Green, 2009 and Paluck et al., 2021) are quite humbling. Paluck et al.’s (2021: 553) recent study concludes that ‘much research effort is theoretically and empirically ill-suited to provide actionable, evidence–based recommendations for reducing prejudice’. Where I3 is uncertain, we are better off choosing another kind of intervention, such as a post hoc intervention.

Ninth, as was illustrated by the implicit bias interventions, I5 is not an obvious assumption. It seems the most likely to obtain when the individual intervention aims to increase objectivity, either by making the evaluator learn how to rank people more accurately for a specific purpose or helping the evaluator discount irrelevant information—like the simple intervention due to Döbrich et al. (2014). In many situations, where counterstereotypical interventions are used that try to moderate a deeper form of prejudice—which might manifest itself in many kinds of situations—it is less clear if the overall consequences are epistemologically beneficial. In cases where I5 is doubtful, we are better off using a different intervention. GIIU, in particular, has a very precise effect (since it doesn’t change evaluators at all), and does not run the risk of changing anything over and above the veracity of r1 … rm.Footnote 21

Tenth, I6 will often be true. Typically, ANOVAs are applied to randomized intervention and control groups to establish I3. Such analyses presuppose, among other things, that the populations corresponding to the relevant groups (before and after an intervention) have the same variance. This in particular is often reasonable to assume, barring incredible accidents during randomization and variance-changing interventions. And the latter will at least sometimes be revealed in the relevant samples, so that the ANOVA can be replaced with a more suitable, possibly non-parametric, test.

Eleventh, the individual interventions and GIIU have very different means to ensure increased veracity. Individual interventions rely on the availability of objective rankings to test interventions (I2). Post hoc interventions rely on known differences in population means, and a history of evaluation providing samples of these means from which a bias can be estimated (G3). Neither of these assumptions is unproblematic.

For instance, it seems that the former assumption involves something like a pragmatic inconsistency. For if there are objective rankings that can be used in measuring situations, why are these not used in real–life situations? Why rely on an evaluator’s imperfect ranking there? The answer will likely be in terms of the messiness of the real–life situations, the limited time and resources available to evaluators in them, or in terms of the greater control one might have in measuring situations (e.g. the possibility to manipulate variables of interest, as is done in CV–studies).

Post hoc interventions on the other hand might seem overly reliant on histories of evaluation featuring representative samples of very wide populations. Why think that the male and female applicants for a certain kind of job represent men and women in general? The answer here is to point out that there is no such expectation (cf. Jönsson and Bergman, forthcoming, Sect. 4.1). Instead, the population means mentioned by G3 are those corresponding to the populations sampled in the history of evaluations, populations like men and women with a certain training or academic background, rather than men and women in general.

Twelfth, individual interventions and GIIU are quite different in terms of their generality (i.e. the size and content of r1 … rm and r1 … rn respectively) and this has consequences in terms of their epistemological cost–effectiveness (i.e. their likely epistemic reward given a particular set of resources).

Individual interventions are primarily constrained by which social groups and kinds of situations have been studied. If only prejudice interventions featuring Black people have been studied in s–situations, a new set of studies has to be carried out to test interventions against other forms of prejudice, unless it is clear that the previous intervention generalizes. Since the generalizability of e.g. distraction-inhibiting interventions (and other objectivity-raising interventions) is likely higher than that of counterstereotypical interventions, we have here an additional argument in favor of the former in terms of likely epistemic payoff.

GIIU on the other hand has to be applied anew to each ranking in r1 … rm, but since each application is not very laborious (and can be automated in part) this will not restrict its generality substantially. Instead, GIIU is limited by the size and diversity of E’s history of evaluations, and can only be used to correct for prejudice against social groups that have been featured repeatedly in h.

It follows from these limitations taken together that if E has a rich history of evaluations, post hoc interventions might be an easy way to achieve ample epistemic gains for little cost. If E is less experienced, employing existing interventions or even researching new individual interventions might be a good way to provide epistemic gains with respect to E’s initial evaluations, rather than waiting for E to produce enough prejudiced rankings for GIIU to kick in.

Thirteenth, as we have seen, GIIU and individual interventions involve quite different presuppositions. Since there is little overlap between the presuppositions of the two interventions (the overlap mostly consists in the widely plausible I1, which is entailed by G1), they each apply to a range of situations where the other doesn’t apply, and can thus complement each other by increasing veracity in situations beyond the other’s reach.

5 Concluding Remarks

In this paper, I have spelled out the conditions that need to obtain in order for it to be reasonable to believe that an individual intervention is epistemologically beneficial with respect to an evaluator E and ranking situations of a certain sort. I’ve compared these presuppositions with those of GIIU (a post hoc intervention) and argued that the two approaches can complement each other, in part by the possibility of applying (with caution) both to the same situations, and in part by the possibility of applying GIIU to situations where the presuppositions of individual interventions are not satisfied (and vice versa). Having both kinds of interventions in our intervention repertoire thus improves our ability to deprejudice. The complementarity of the two kinds of interventions also extends to more practical matters. Individual interventions can help make policymakers more inclined to adopt something like GIIU. Conversely, by adopting something like GIIU, an organization might end up with a more diverse work–force and this might act as a form of individual intervention (by increasing intergroup contact) which can make members of the organization less prejudiced and more inclined to accept additional interventions.

Since post hoc interventions are novel, I want to end the paper with a few more reflections pertaining to some pertinent, more practical, issues. These all require a more extended treatment than there is room for here, and thus correspond to suggestions for future research.

It is normally not obvious to the prejudiced person that they are prejudiced, since being prejudiced involves falling short of some norm, which the prejudiced person is normally not aware of, e.g. having false negative beliefs, unmotivated negative feelings or exhibiting unjustified negative behavior (cf. Jönsson submitted, b). Since it may not be obvious to the prejudiced person that they are prejudiced, it is thus not obvious to them that they are in need of an intervention. This can be problematic from the perspective of individual interventions in at least two ways: the prejudiced person might resist undergoing the intervention at all, and, equally damaging, they might not put in the work required by the intervention to become non–prejudiced. But even a prejudiced person typically believes that other people can be prejudiced (at least according to their own standards). This means that they might see a use for introducing general interventions (e.g. something like GIIU) into organizations they are members of, to keep others in check, even if this might affect them too. So the asymmetry inherent in prejudice (that we can see it in others, but not in ourselves) is something that might make it easier to introduce general interventions (structural, or post hoc interventions) than individual interventions.Footnote 22

This is not to say, of course, that introducing something like GIIU into an organization will be easy. People might, for instance, reject GIIU since it violates “colorblindness” (cf. Anderson, 2010: ch. 8) in the sense that it requires knowledge of the race, gender, sexual orientation etc. of the people being evaluated.Footnote 23 Or they might see GIIU as illegitimately overriding individual information with base rate information.Footnote 24 Or people might see GIIU as an expression of paternalism, and resist it for this reason. Perhaps this last problem can be overcome by pointing out to them that GIIU doesn’t require any additional work at the individual level, or that it won’t do anything if no prejudice is detected. Or perhaps there is comfort in the fact that no actual person will be responsible for overriding an evaluation when prejudice is detected (i.e. there is no actual pater enforcing the paternalism). Or perhaps paternalism can be entirely avoided if GIIU is not employed to override the evaluations of evaluators, but just to advise the evaluators.Footnote 25 But perhaps not. How something like GIIU will be received is an open empirical question.

This is also true of two additional potential shortcomings of GIIU, which might also be worth mentioning. First, there is the worry that a prejudiced person who is subjected to GIIU might attempt to “game” GIIU by becoming increasingly prejudiced, and thus try to stay one step ahead of GIIU’s corrections. Second, there is the worry that introducing something like GIIU might make people less motivated to become unprejudiced, since they might come to rely on GIIU to correct their prejudice post hoc.Footnote 26 Both of these worries are open empirical hypotheses. But the converse hypothesis, that introducing GIIU will motivate people to accord with GIIU’s mandates and become less prejudiced, is equally open.