Attitudes, beliefs, and motivations concerning race are central to many prominent theoretical perspectives on prejudice and discrimination. Accordingly, researchers have developed and used scales to measure the effect of race-related attitudes on a wide variety of outcomes. However, the capacity of these theories to explain behavior hinges on how accurately researchers measure these latent constructs. Measuring a construct poorly introduces error, leaving one unable to test hypotheses with precision. Just as an old metal detector will undoubtedly find some rings and coins but leave other treasure undiscovered, so too will an outdated or poorly designed scale reveal some effects but leave many others undiscovered or poorly estimated. Similarly, just as an old metal detector might falsely signal the presence of gold when there are actually only iron oxides beneath the surface, the extent to which a scale fails to capture its intended construct will lead researchers to draw erroneous conclusions about the theoretical meaning of observed effects.

A variety of scales have been developed and used by researchers to capture various facets of explicit racial attitudes, beliefs, and motivations. Approaches include: asking people directly about their level of racial prejudice (Axt, 2018), their race-related political attitudes (Henry & Sears, 2002), whether race contributes to the accuracy of various judgments (Uhlmann et al., 2010), whether they are motivated to control their own prejudice (Plant & Devine, 1998), their knowledge of cultural stereotypes (Ghavami & Peplau, 2013), how much conflict they perceive between groups in society (Sidanius et al., 2004), and their endorsement of racism-adjacent attitudes such as right-wing authoritarianism (Altemeyer, 1988) and social dominance orientation (Pratto et al., 1994). Additionally, some scales were constructed to capture variation in attitudes toward Black people more generally, rather than to measure a specific race-related attitude (e.g., the American National Election Survey scale; Payne et al., 2010). Notably, some scales “cluster” together such that they are related—sharing similar items, origins, or theoretical motivations—yet are still somewhat distinct. For example, the Symbolic Racism 2000, Modern Racism, and Racial Resentment Scales can all be understood as offspring of broader theorizing about symbolic racism (see Sears, 1988).

We broadly refer to this cluster of scales in the literature as “race-related scales”, not because they were all designed to specifically capture racial attitudes, beliefs, and motivations, but because they are either functionally used for this purpose (see Axt, 2018 for discussion) or used to explain racism-related attitudes and outcomes (e.g., right-wing authoritarianism; see Duriez & Soenens, 2009; Hiel & Mervielde, 2005; Nicol & Rounding, 2013). Many of these race-related scales have had a marked influence on psychological research: the papers introducing 14 of the scales have amassed over 500 citations, and four have amassed over 2500 citations (see Table 1). Furthermore, from a practical perspective, these scales are linked to race-related outcomes via their inclusion in Project Implicit data collection alongside measures of implicit racial prejudice. This dataset constitutes one of the richest and most influential sources of information on racial attitudes.

Table 1 Information about the 25 race-related scales

Measuring constructs as well as possible and with minimal error is key to hypothesis testing (Flake & Fried, 2020). Good measurement is not merely a concern for the replicability or reproducibility of results, but a key element of precisely connecting data to theory: if you do not know what you are measuring, or you are measuring it poorly, any results are dubious. Indeed, some scholars have argued that there is a “theory crisis” in psychology that partially stems from invalid measurement of latent constructs (Eronen & Bringmann, 2021). Any researcher hoping to tap racial attitudes, beliefs, or motivations must choose carefully between numerous measurement options, and it is difficult to holistically consider the multi-faceted evidence about the quality of many scales. Here, we address this important concern using a large dataset and modern methodology to evaluate the validity properties (i.e., construct validity in general with a greater focus on structural validity) of 25 race-related scales.

Our intention is not to show the invalidity of any given scale. Indeed, the scales that we evaluate have played essential roles in decades of research on the nature of racial stereotyping and discrimination. Instead, we aim to identify which scales currently have the best psychometric properties and highlight opportunities to renovate existing scales to better capture the underlying latent factors they are designed to measure.

The ongoing process of construct validation

Loevinger (1957) described the process of construct validation in three phases: substantive, structural, and external. The substantive phase outlines the theoretical underpinnings of a construct. The structural phase involves quantitative analyses, examining psychometric properties of the measure such as reliabilities and factor structure. Finally, the external phase assesses whether the scale relates to attitudes and outcomes one would expect it to predict, such as other measures of similar constructs as well as relevant judgments and behaviors.

Researchers who develop and use scales often overlook or downplay the structural phase of construct validation. For instance, in a broad examination of construct validation in social and personality psychology, no validity information at all was reported for 57 of 301 reviewed scales, and for another 205 scales the only information reported concerned internal reliability (largely via Cronbach’s α; Flake et al., 2017). In our targeted review of the race-related scales evaluated in this manuscript, we were only able to locate clear reliability information for 20 of the 25 scales (see Table 1).

Downplaying structural validity can appear innocuous when the substantive and external phases of construct validation appear to yield good evidence of a scale’s functionality. However, a scale with good substantive and external validity can still lead to incorrect conclusions about the nature of the latent construct. For example, consider Fig. 1, which depicts a hypothetical “true” model of two distinct but related factors (Factor 1 and Factor 2). With perfect measurement of both factors (top panel), results only find evidence that each factor predicts the outcomes to which it is truly related. The same is true when only one of the two factors is measured well (middle panel). However, with poor measurement (bottom panel), the substantive and external phases could still yield seemingly good evidence; because the scale now captures both factors, and does so with considerable error, accurate conclusions are jeopardized through higher rates of type I errors (i.e., incorrectly concluding that a factor predicts an outcome it does not) and type II errors (i.e., incorrectly concluding a factor does not predict an outcome that it does).

Fig. 1

Illustration of the importance of structural validity. Note. Arrow width represents the size of the relation between variables. In the color figure, variables related to Factor #1 are blue, variables related to Factor #2 are red, and variables related to both factors are purple
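To make this scenario concrete, the minimal R simulation below (all values hypothetical) generates two correlated factors in which only Factor 1 truly predicts an outcome; a noisy composite that mixes items from both factors nevertheless correlates with the outcome, so a researcher who interprets that composite as a pure measure of Factor 2 commits a type I error.

```r
# Minimal simulation of the scenario in Fig. 1 (hypothetical values).
# Only Factor 1 truly predicts the outcome, but a composite whose items mix
# both factors still correlates with it, inviting a type I error about Factor 2.
set.seed(1)
n  <- 5000
f1 <- rnorm(n)                                  # latent Factor 1
f2 <- 0.3 * f1 + sqrt(1 - 0.3^2) * rnorm(n)     # latent Factor 2 (r = .3 with Factor 1)
outcome <- 0.5 * f1 + rnorm(n)                  # outcome caused by Factor 1 only

# A poorly specified scale: half its items load on Factor 1, half on Factor 2,
# all with substantial measurement error.
make_item   <- function(f) 0.6 * f + 0.8 * rnorm(n)
mixed_scale <- rowMeans(cbind(replicate(4, make_item(f1)),
                              replicate(4, make_item(f2))))

cor(outcome, f2)           # small (only the indirect path through Factor 1)
cor(outcome, mixed_scale)  # substantially larger; interpreting the scale as a
                           # pure measure of Factor 2 would yield a type I error
```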

Even when these race-related scales included more rigorous evaluations of structural validity, the passage of time since their creation still poses a threat to scale validity. Construct validation is an ongoing process (Cronbach & Meehl, 1955), and how much information a given scale provides about the underlying latent construct is context-dependent. Some scales that may have been highly reliable and valid in past decades may no longer be so due to cultural changes in society that have rendered their items less informative (Fabrigar & Wegener, 2016; Kane, 2013). For example, the question “Interracial marriage should be discouraged to avoid the 'who-am-I?' confusion that the children feel” from the Attitudes Toward Blacks scale (Brigham, 1993) might be interpreted differently three decades later. Furthermore, modern research now places a greater focus on whether findings generalize beyond a given sample (Henrich et al., 2010). Six of the 25 race-related scales we evaluated were validated only with White participants and 12 of the 25 scales were validated only with college students (see Table 1); these scales may therefore not be suitable for capturing the attitudes of non-White participants or of populations beyond college students.

Modern developments in evaluating structural validity

The ongoing process of construct validation does not just concern the shifting meaning of items and populations of interest. The specific tools used to evaluate aspects of validity and reliability have also improved considerably in the past few decades. Furthermore, existing but underused tools have become far more accessible thanks to advances in open-source statistical software. We incorporate four new or underused tools in social psychological measurement in the present work: McDonald’s ω to evaluate global internal consistency; dynamic fit indices to better evaluate model fit in confirmatory factor analysis; item response theory to evaluate the distribution of latent factors and the local reliability of scales; and nomological nets to evaluate the convergent and discriminant validity of scales by considering each scale’s relation to all other scales. None of these valuable modern tools were used in the validation of the 25 race-related scales that we review (see Table 1).

In the following sections, we discuss each of these tools, contrasting them with traditional methods when appropriate, and highlighting their advantages and unique contributions. We also describe the corresponding data analysis plan for evaluating the 25 race-related scales considered in this paper.

McDonald’s ω

Internal reliability refers to the extent to which the items in a scale are consistent with one another. Cronbach’s α is the most commonly used measure of global internal reliability (Cronbach & Meehl, 1955). Researchers typically report only coefficient α as a measure of internal consistency in social psychology (73%; Flake et al., 2017). However, Cronbach’s α relies on a handful of assumptions that are rarely if ever met, such as complete unidimensionality and essential tau-equivalence (i.e., the equal loading of all items onto the latent factor; Dunn et al., 2014; Hayes & Coutts, 2020). Violating these assumptions biases Cronbach’s α as an estimate of reliability. For this reason, researchers have encouraged the use of McDonald’s ω as a more accurate measure of global internal reliability (McDonald, 2013), as McDonald’s ω eschews assumptions of unidimensionality and essential tau-equivalence. We compare Cronbach’s α and McDonald’s ω for all scales.
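As an illustration, the R sketch below contrasts the two coefficients using the psych package; items_df is a hypothetical data frame of item responses, and the output components named are those documented for that package.

```r
# Contrast Cronbach's alpha with McDonald's omega for a single scale using the
# psych package; 'items_df' is a hypothetical data frame with one column per item.
library(psych)

alpha_out <- psych::alpha(items_df)                 # assumes tau-equivalence
omega_out <- psych::omega(items_df, nfactors = 1)   # congeneric (unequal loadings)

alpha_out$total$raw_alpha   # Cronbach's alpha
omega_out$omega.tot         # McDonald's omega total
```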

Dynamic fit indices

Although measures of internal consistency such as Cronbach’s α and McDonald’s ω are related to evaluations of factor structure such as confirmatory factor analysis (CFA), they are not equivalent. In CFA, researchers impose a model on the data in which one or more underlying latent factors are theorized to “cause” the responses to the items in a survey. Various model fit indices, such as the Comparative Fit Index, Root Mean Square Error of Approximation, and Standardized Root Mean Square Residual, attempt to capture the model’s alignment with the data, allowing researchers to make informed decisions about whether a model is a “good” fit for the data. Poor model fit indicates some imprecision about the structure of the model, which essentially translates to not measuring the construct that one thinks one is measuring. If the poor-fitting scale is then used to predict outcomes, any conclusions rendered using this scale are more likely to be wrong, due to uncertainty about what exactly the scale is really measuring.

While creating guidelines allows for wider adoption of these methods, exactly what constitutes a “good” or “bad” model fit can be unclear. In their seminal paper, Hu and Bentler (1999) provided “rule-of-thumb” model fit thresholds against which researchers could evaluate their models. Researchers took full advantage of these concrete guidelines—the paper now has over 90,000 citations. However, the model fit thresholds defined by Hu and Bentler are based on models with very specific characteristics on many dimensions, such as factor loadings, number of latent factors, and correlation between latent factors. For example, the “reliability paradox” describes how a scale with less measurement error can actually yield worse values on these fit indices than a scale with high measurement error, even when the two scales are equally misspecified (Hancock & Mueller, 2011; McNeish et al., 2018). Although Hu and Bentler warned against blanket use of their model fit thresholds across all CFA contexts (Hu & Bentler, 1998, p. 446; see Barrett, 2007; Millsap, 2007), researchers have typically applied them without considering the constraints of the original simulations.

Dynamic fit indices address this shortcoming in evaluating model fit (McNeish & Wolf, 2021). This approach conducts simulations to generate appropriate model fit thresholds on a model-by-model basis, taking into account specific characteristics of the model including: loadings, item intercepts, number of items, sample size, error variance, number of latent factors, and the correlation between latent factors. Doing so generates model-specific fit thresholds that match the intended use of the static model fit thresholds to effectively balance type I and type II errors when evaluating factor structure. We evaluate each race-related scale using both Hu and Bentler cutoffs and dynamic fit indices to contrast these results.
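As an illustration of the conventional approach, the R sketch below (with hypothetical item names and data) fits a one-factor CFA in lavaan and checks the observed indices against the Hu and Bentler rule-of-thumb cutoffs; a corresponding sketch for dynamic cutoffs appears in the Method section.

```r
# Minimal lavaan sketch: fit a one-factor CFA (hypothetical item names) and
# compare observed fit indices to the Hu & Bentler (1999) rule-of-thumb cutoffs.
library(lavaan)

model <- 'prejudice =~ item1 + item2 + item3 + item4 + item5 + item6'
fit   <- cfa(model, data = items_df)

observed <- fitMeasures(fit, c("cfi", "rmsea", "srmr"))
observed

# Conventional cutoffs: CFI >= .95, RMSEA <= .06, SRMR <= .08
c(cfi_ok   = unname(observed["cfi"])   >= .95,
  rmsea_ok = unname(observed["rmsea"]) <= .06,
  srmr_ok  = unname(observed["srmr"])  <= .08)
```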

Item response theory: Local reliability and the distribution of latent scores

Evaluations of construct validity go beyond factor structure. Construct validity also concerns whether the theorized distribution of latent factor scores (i.e., the expected distribution of the latent factor in the population) can be adequately captured by the scale items. Imagine a racial prejudice scale that assumes a normal distribution of racial prejudice across the population (Fig. 2). The latent scores show that the lowest score on the scale is also the mode, far from the theorized normal distribution. This pattern would suggest that the scale is not adequately capturing the theorized distribution of the latent factor in the population due to floor effects. As a result of this floor effect, any researcher interested in capturing sample variation at the lower end of the latent construct would be unable to do so. For example, imagine researchers were interested in correlating prejudice scores with an outcome within a population low in anti-Black prejudice (e.g., students at historically Black colleges and universities). If these researchers used a scale with a floor effect, they would erroneously find very little variation in anti-Black prejudice and would likely draw incorrect conclusions from their data, because the scale does a poor job of separating those low in prejudice from those very low in prejudice.

Fig. 2

Theorized versus actual latent score distributions. Note. These hypothetical data illustrate how the theorized distribution of latent factor scores may not be matched by the actual distribution of the measured latent factor scores

Item response theory (IRT) can provide insight into the distribution of latent scores. After fitting a model, latent factor scores are predicted for each individual in the sample and plotted in a distribution to identify potential levels of the latent factor not well captured by the scale.

Of course, the idea that internal reliability is global (i.e., stable across levels of the latent factor) is itself a major assumption that typically goes unexamined in social psychology. The internal reliability of a scale can vary as a function of the level (i.e., high to low) of the latent factor, and this localized variation in reliability can be examined using IRT (Baker, 2001). For example, the Need for Cognition scale shows high internal reliability overall, but reliability decreases at the positive end of the scale (i.e., for people high in Need for Cognition; Edwards, 2009). As another example, the Affect Scale (Zanon et al., 2013) shows better reliability at lower levels of positive affect (Zanon et al., 2016). For researchers targeting populations with lower or higher levels of the relevant racial attitudes, considering localized internal reliability is critical. Here, we used IRT to examine local reliability for all scales.
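As a sketch of how local reliability can be examined, the following R code fits a graded response model with the ltm package (items_df is a hypothetical data frame of ordinal item responses) and plots the test information curve across levels of the latent trait.

```r
# Sketch of local reliability via a graded response model (ltm package);
# 'items_df' is a hypothetical data frame of ordinal (Likert-type) item responses.
library(ltm)

grm_fit <- grm(items_df)

# Test information curve: information (and hence reliability) as a function of
# the latent trait; items = 0 requests the whole test rather than a single item.
plot(grm_fit, type = "IIC", items = 0)
```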

Nomological nets

Finally, we consider the relationships among the 25 race-related scales by constructing a nomological net. This nomological net allows us to broadly evaluate both the convergent and discriminant validity of each scale (Cronbach & Meehl, 1955). It is important to note that a nomological net does not tell you what you are measuring, only the extent to which any two scales are correlated. Yet if two scales are located very closely in a nomological net, and one purports to measure construct X and the other a distinct construct Y, we can correctly infer that at least one of these claims is likely incorrect. Highly correlated scales located in a similar space suggest that the latent constructs measured may be similar, even if they purport to measure something distinct. Furthermore, the nomological net provides more global information about which areas of the latent factor “space” are more densely populated with scales. Similarly, this latent factor space also identifies areas that are sparsely populated, highlighting scales that are capturing something unique.

The present study

In this study, we evaluated the validity properties of 25 race-related scales. We used a Project Implicit dataset (Axt, 2018) with over 1 million participants, each of whom completed two of the 25 scales—the dataset contains over 40,000 responses to each of the 25 scales. This evaluation is the most thorough to date, featuring sample sizes several times larger than those used to validate the scales in the original papers. Additionally, despite the limited and non-representative nature of the Project Implicit sample, the sample is still far more representative than the samples used in the initial validation of nearly all of these scales (see Table 1 for comparison). Compared to the original work, which overwhelmingly used (mostly White) undergraduates in psychology classes, the present sample is larger and has greater racial, ethnic, and age diversity. Finally, because the Project Implicit dataset includes all 25 scales in the same sample, it is uniquely suited to creating a nomological net of these scales, a key aspect of establishing construct validity (Cronbach & Meehl, 1955; Flake et al., 2017). To our knowledge, this is the first time such a broad network in the prejudice domain has been established. Nevertheless, one concern about a Project Implicit sample is its representativeness, given that participants self-selected into the study. To better assess the generalizability of our findings, we also examined a smaller supplementary sample of paid, online participants.

Method

All data, scripts, figures, analysis markdowns, and other supplementary materials are available at https://osf.io/zg6fr/. Markdowns are recommended as the most accessible way to evaluate details of the methodology. We did not complete preregistrations for this project.

Participants

We used data provided by 1,396,234 Project Implicit respondents (60.1% female, 68.5% White, 9.7% Black, Mage = 27.3 years, SDage = 12.2, 82.8% US residents) between October 23, 2014 and September 27, 2016. These data were originally analyzed in Axt (2018). Each respondent was asked to complete a demographics questionnaire, the Race Implicit Association Test, and two randomly selected race-related scales (to avoid participant fatigue). The order of measures was randomized to account for any possible order effects; 365,027 participants dropped out of the study before completing the explicit race-related scales. We also excluded respondents outside of North America (US and Canada), given the unique North American racial context that many of these scales were originally intended to capture (analyses including these respondents are available on the OSF page). With these exclusions, we conducted analyses on datasets including both White respondents only (N = 569,414) and all respondents (N = 910,066). We center the analyses including only White North Americans in our figures and reporting.

Materials

We evaluated 25 scales in this study, and thus refrain from providing an in-depth description of each scale. See Table 1 for information about each scale. The wording for each individual scale item is provided on the OSF page, as are any deviations from the original wording.

Analytic approaches

Analyses were completed in R using lavaan (Rosseel, 2012) for CFA, dynamic to generate dynamic fit indices for CFA (McNeish & Wolf, 2021), ltm (Rizopoulos, 2006) for IRT, and igraph (Csardi & Nepusz, 2006) for creating the nomological network.

Alpha and omega

Although CFA can provide more in-depth information about internal consistency, Cronbach’s α and McDonald’s ω are still commonly reported metrics throughout the literature. Further, we believe our analyses use the largest validation sample size to date for all scales. Accordingly, we considered it valuable to report and compare how these scales performed on each of these metrics, enabling easy contrasts by researchers in their own work. To allow a more direct comparison between Cronbach’s α and McDonald’s ω, we use McDonald’s ωu, which assumes that the latent factor is unidimensional and that the indicators are continuous (Flora, 2020).
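For readers who wish to reproduce this type of estimate, the sketch below computes an omega coefficient for a unidimensional, continuous-indicator model from a one-factor lavaan CFA using the standard congeneric formula; the item names and data are hypothetical, and the implementation details may differ from those in Flora (2020).

```r
# Sketch of an omega estimate for a unidimensional model with continuous
# indicators, computed from a one-factor lavaan CFA as
# (sum of loadings)^2 / [(sum of loadings)^2 + sum of residual variances].
library(lavaan)

model <- 'f =~ item1 + item2 + item3 + item4 + item5 + item6'   # hypothetical items
fit   <- cfa(model, data = items_df, std.lv = TRUE)             # factor variance fixed to 1

pars      <- parameterEstimates(fit)
loadings  <- pars$est[pars$op == "=~"]
resid_var <- pars$est[pars$op == "~~" & pars$lhs == pars$rhs & pars$lhs != "f"]

omega_u <- sum(loadings)^2 / (sum(loadings)^2 + sum(resid_var))
omega_u
```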

Confirmatory factor analysis

Evaluations of model fit

We focused on three commonly reported indices: the Standardized Root Mean Square Residual (SRMR), the Root Mean Square Error of Approximation (RMSEA), and the Comparative Fit Index (CFI). For each of these statistics, we compared the actual fit to both the commonly used Hu and Bentler (1999) rule-of-thumb thresholds and the dynamic fit thresholds. Dynamic fit thresholds were calculated as a function of model factor loadings, item intercepts, number of items, sample size, error variance, number of latent factors, and the correlation between latent factors. These thresholds correctly reject misspecified models 95% of the time, while incorrectly rejecting correctly specified models 5% of the time. For full details, see McNeish and Wolf (2021).
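A sketch of how such dynamic cutoffs can be generated is given below; it assumes that the cfaOne() function in the CRAN dynamic package accepts a fitted lavaan object, so the exact interface should be checked against the package documentation.

```r
# Sketch of generating dynamic fit cutoffs for a one-factor CFA (hypothetical
# item names and data). Assumes the 'dynamic' package's cfaOne() accepts a
# fitted lavaan object; consult the package documentation for the exact interface.
library(lavaan)
library(dynamic)

model <- 'prejudice =~ item1 + item2 + item3 + item4 + item5 + item6'
fit   <- cfa(model, data = items_df)

dfi_cutoffs <- cfaOne(fit)   # simulates from the fitted model to derive
dfi_cutoffs                  # SRMR/RMSEA/CFI cutoffs at increasing misspecification
```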

Methods of estimation

For our CFAs, we used two different estimation approaches. First, we used maximum likelihood, traditionally used for the estimation of latent constructs in social and personality psychology. Currently, the estimation of dynamic fit indices is only fully compatible with maximum likelihood models and other models that treat latent factor indicators as interval, making the use of maximum likelihood necessary for the estimation of dynamic model fit thresholds.

Second, we used a robust diagonal weighted least squares estimator, which treats the latent factor indicators as ordinal instead of interval (the latent factor itself is still on an interval scale). Diagonal weighted least squares provides less biased factor loadings and fit statistics for scale items under most conditions (Li, 2016a, 2016b). Furthermore, because Likert-type items are more accurately characterized as ordinal rather than interval, these models better represent the data. Although dynamic fit indices are not currently designed for use with diagonal weighted least squares models, we nevertheless considered both dynamic thresholds and Hu and Bentler thresholds to comprehensively evaluate these scales. These results are provided on the OSF page.
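The following sketch shows how such a model can be specified in lavaan with hypothetical item names, declaring the indicators as ordered and requesting the robust (WLSMV) diagonal weighted least squares estimator.

```r
# Sketch: refit the CFA treating Likert-type items as ordinal with lavaan's
# robust diagonal weighted least squares estimator (WLSMV); names are hypothetical.
library(lavaan)

model <- 'prejudice =~ item1 + item2 + item3 + item4 + item5 + item6'
fit_dwls <- cfa(model, data = items_df,
                ordered = names(items_df),   # declare indicators as ordinal
                estimator = "WLSMV")

fitMeasures(fit_dwls, c("cfi.scaled", "rmsea.scaled", "srmr"))
```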

Item response theory: Latent factor distribution and local reliability

We used IRT to evaluate the distribution of latent factor scores alongside the local reliability for each race-related scale, with an emphasis on evaluating those that either show reasonably good model fit or are prominent in the social and personality psychology literature. The latent factor distribution describes how respondents are spread across levels of the latent construct; we examined it by predicting latent factor scores using IRT models and then plotting a density function for these scores. Local reliability is captured by the test information function, which illustrates the amount of information provided by the test items across levels of the latent distribution (Baker, 2001; Edwards, 2009). For interpretability, we used the formula \(\sqrt{1/\mathrm{Information}}\) to convert information to the standard error of measurement, which describes the extent to which an observed score likely differs from the true score (Dudek, 1979; Edwards, 2009; Tighe et al., 2010).
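As a sketch of this pipeline, the R code below fits a graded response model with the ltm package to a hypothetical items_df data frame, extracts the predicted latent scores, and plots their density; the standard error of measurement can then be read off the test information curve via the \(\sqrt{1/\mathrm{Information}}\) conversion.

```r
# Sketch: distribution of predicted latent factor scores (theta) from a graded
# response model (ltm package); peaks at the edges suggest floor or ceiling
# effects. The 'z1' column of factor.scores() output holds the predicted scores
# as documented for ltm; 'items_df' is a hypothetical data frame of item responses.
library(ltm)

grm_fit <- grm(items_df)
fs      <- factor.scores(grm_fit, resp.patterns = items_df)
thetas  <- fs$score.dat$z1

plot(density(thetas), main = "Predicted latent factor scores (theta)")

# The standard error of measurement is sqrt(1 / information), where information
# comes from the test information curve: plot(grm_fit, type = "IIC", items = 0).
```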

Latent factor distribution and local reliability are related: scales that show steep declines in reliability at certain values of a latent factor also tend to show “peaks” in the latent score distribution, indicating floor or ceiling effects in the scale’s ability to measure the latent factor. For this set of analyses, we focused primarily on latent factor distributions. “Peaks” in the observed distribution at the edges of the scale indicate ceiling or floor effects and are typically accompanied by local reliability issues at the scale extremes.

Complete output from the IRT analyses is available on the OSF page, including discrimination parameters for each item in each scale, as well as the graded-response-model extremity scores for the outermost responses to each scale item.

Nomological networks

Nomological networks are a representation of constructs and the relationships between them (Cronbach & Meehl, 1955). These nets help assess whether a construct is “behaving as theorized” within a broader constellation of other constructs; in other words, it should be closer in space to similar constructs and farther from those theoretically posited to be dissimilar. The distance between concepts is a function of some measure, such as the correlation between any two constructs.

In the present research, we created nomological nets using the Pearson correlation between diagonal weighted least squares latent scores. A force-directed algorithm determined the positioning of all the constructs relative to one another (Kamada & Kawai, 1989).
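A minimal sketch of this procedure is given below; latent_scores is a hypothetical data frame with one column of latent factor scores per scale, and edges are thresholded at r = .3 to mirror Fig. 6.

```r
# Sketch: build a nomological net from a scale-by-scale correlation matrix.
# 'latent_scores' is a hypothetical data frame with one column of latent factor
# scores per scale; correlations below .3 are dropped, as in Fig. 6.
library(igraph)

r <- cor(latent_scores, use = "pairwise.complete.obs")
r[r < .3] <- 0      # keep only correlations above the plotting threshold
diag(r)   <- 0      # no self-loops

net <- graph_from_adjacency_matrix(r, mode = "undirected", weighted = TRUE)
lay <- layout_with_kk(net)   # Kamada-Kawai force-directed layout

plot(net, layout = lay,
     edge.width = 5 * E(net)$weight,   # thicker edges for stronger correlations
     vertex.label.cex = 0.7)
```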

Results

Alpha and omega

Cronbach’s α and McDonald’s ω yielded similar internal reliability scores, with some exceptions. Most scales, but not all, showed adequate global internal reliability by commonly used standards, with 24 of the 30 scales and subscales showing both Cronbach’s α and McDonald’s ω values over .70. Reliability for many of these scales is higher for White participants than for all participants, consistent with the fact that many of these scales were developed to measure the attitudes of White people (Fig. 3). See the OSF page for full tables listing Cronbach’s α and McDonald’s ω scores.

Fig. 3

Cronbach’s α and McDonald’s ω scores. Note. Scales are arranged such that scores progress from left (best score) to right (worst score). This practice is maintained throughout the paper

When Cronbach’s α and McDonald’s ω do diverge, it is likely because the assumption underlying Cronbach’s α, that the item variances of the true scores are constant across items (a tau-equivalent model), has been violated. McDonald’s ω makes no such assumption, allowing the item variances of the true scores to differ from item to item (a congeneric model, which is more consistent with CFA; see Dunn et al., 2014).

Confirmatory factor analysis

In this section, we considered the extent to which maximum likelihood CFA models for each of the scales adequately fit the observed data. For each of these models, we calculated the difference score between the observed SRMR, RMSEA, and CFI fit statistics and the dynamic fit cutoff produced using McNeish and Wolf’s (2021) methodology. We also calculated the difference score between the observed SRMR, RMSEA, and CFI fit statistics and the Hu & Bentler (1999) traditional cutoffs to illustrate the difference between the traditional and dynamic cutoffs. We examined the model fit of entire scales rather than individual subscales. For every scale with multiple factors, we fit the models theoretically proposed by the authors.

Results were similar for White participants only and for all participants, and were also similar when comparing maximum likelihood results to diagonal weighted least squares results. See Fig. 4 for fit statistics from maximum likelihood models using only White participants. See the OSF page for fit statistics from maximum likelihood models using all participants and from all diagonal weighted least squares models.

Fig. 4

Model fit information for both traditional and dynamic cutoffs. Note. Values indicate difference scores subtracting the fit statistic from the cutoff statistic. Scales are arranged from the best SRMR fit statistics (top left) to the worst SRMR fit statistics (bottom right)

For the majority of scales, the theoretical latent factor poorly fit the observed data. In the case of SRMR, whereas 20 of the 25 scales pass the traditional Hu and Bentler cutoff, only six of the 25 scales pass the dynamic fit cutoff necessary to correctly reject misspecified models 95% of the time. On the other hand, differences between the two types of cutoffs were more modest for RMSEA and CFI, with certain scales (most noticeably Prejudice Index and Bayesian Racism) showing better evidence of good model fit with the dynamic cutoffs than with the Hu and Bentler cutoffs. Given that the evaluation of model fit typically takes all of these indices into consideration (e.g., Hussey & Hughes, 2020), the differences in results between the dynamic and traditional cutoffs change the conclusions researchers might reach about the model fit of a given scale. Overall, these dynamic cutoffs illustrate that many scales commonly used by social psychologists are likely misspecified to some degree and that some scales show poor enough model fit that they include a substantial amount of error.

Distribution of latent scores and local reliability

Here, we evaluated the extent to which a particular scale is more or less able to capture attitudes at certain values of the latent factor. In Fig. 5, we visualize these results for six race-related scales, selected either for their good dynamic model fit (Bayesian Racism, Modern Racism, Perceived Group Conflict, Prejudice Index) or for their importance as theoretical constructs in the literature (Racial Resentment, Social Dominance Orientation). The density plots depict the distribution of latent factor scores and the lines represent the standard error of measurement as a function of latent factor level. Graphs for all other scales are available on the OSF page.

Fig. 5

IRT latent factor distributions and standard errors of measurement. Note. The purple density plot depicts the distribution of predicted latent factor scores for each participant. The black line indicates the standard error of measurement. The x-axis unit is the standardized latent factor in IRT (i.e., theta)

Of note, Modern Racism and Perceived Group Conflict show severe floor effects, such that these scales fail to distinguish between individuals low in the latent factors. Bayesian Racism and Social Dominance Orientation show modest floor effects, and Racial Resentment shows a minor floor effect but follows a relatively normal distribution. Notably, a severe increase in the standard error of measurement accompanies these peaks, indicating low reliability of the scale items for participants whose “true score” on the latent factor is low.

The Prejudice Index shows a very large peak at the center of the distribution, consistent with the measure’s scoring being derived from a series of difference scores concerning the degree to which Black vs. White people have certain characteristics. This is less problematic than a concentration of scores at the edge of the distribution. The scale appears to distinguish between strong pro-Black attitudes, neutral attitudes, and strong pro-White attitudes, unlike the rest of the scales (perhaps excluding Racial Resentment). This interpretation is consistent with the stability of the standard error of measurement toward the center of the distribution.

Results were similar for analyses including all participants. An examination of the other scales shows floor effects for both General Intergroup Anxiety and Intergroup Anxiety and a large ceiling effect for Internal Motivation to Control Prejudice. See the OSF page for information regarding item-level discrimination and extremity parameters for all scales, which is useful for closely examining the contents of a specific scale.

Nomological net

We created a nomological net featuring all evaluated scales and subscales (30 total) using latent factor scores (Fig. 6). The most noticeable feature of this net is the tight clustering of the majority of the scales. This is consistent with an interpretation that these scales are all tapping similar and related constructs, even when designed to measure attitudes, motivations, or beliefs about groups in general rather than racial groups specifically. What those constructs are, exactly, cannot be determined from this analysis. Many of the clustered scales are theorized to measure racial prejudice, but the cluster also includes scales that are not theorized to directly measure racial prejudice (e.g., Internal Motivation to Control Prejudice, General Intergroup Anxiety, Intergroup Anxiety), scales theorized to be independent personality constructs (e.g., Social Dominance Orientation, Right-Wing Authoritarianism), and scales purposely constructed to tap variability in attitudes toward Black people in a variety of domains (e.g., American National Election Survey). This means that, even if a scale was not designed to measure prejudice per se but is highly correlated with another that was, at least one of the scales may be misinterpreted: both may be tapping prejudice, or both may be tapping something else.

Fig. 6

Nomological network of scales from White participants. Note. Line width and line color are both functions of the strength of the correlation. Only relationships above .3 are plotted. The proximity of nodes reflects the relative position of each scale given its correlation with all other scales

Notably, the two most straightforward measures of prejudice—a single seven-point measure of preference for White versus Black individuals (“OneItem”) and a difference score between ten-point thermometer ratings of White and Black individuals (“tDiff”)—are on the edge of the central cluster and are less strongly related to many of the other race-related scales, though they are more strongly correlated with implicit attitudes than the other scales (Axt, 2018). The network also illustrates that certain scales occupy less-populated theoretical spaces. While we cannot know what, exactly, these scales are capturing, this visualization makes clear the relative sameness or distinctiveness of each scale.

Measures of motivation to control prejudice (with the exception of External Motivation to Control Prejudice) and cultural knowledge of stereotypes occupy the area outside the main cluster, suggesting their relative distinctiveness as latent constructs. Finally, we note that Perceived Group Conflict is not strongly related to any of the other scales in the nomological net, in line with its intent to capture experiences of discrimination, rather than prejudiced attitudes (Sidanius et al., 2004). For researchers interested in more closely examining the connections in the dense central cluster, a nomological net depicting only correlations of .5 and above is available on the OSF page.

Robustness checks

Although ours are the largest and among the most representative samples used to examine the majority of these scales, one might be concerned that these results do not generalize beyond the volunteer Project Implicit sample. In particular, we considered the possibility that the floor and ceiling effects observed in the latent score distributions might be a unique characteristic of the Project Implicit sample, because participants self-selected into the study and as a result may have been more concerned about appearing unprejudiced. To explore this issue, we collected two separate samples from Mechanical Turk, an extremely common source of participants in modern psychology. For both theoretical and practical reasons, we opted to focus on the distribution of latent scores, which can easily and clearly be compared across samples despite the large difference in sample size. Alpha, omega, and model fit statistics are available on the OSF page.

To facilitate useful comparisons between the latent score distributions in the two samples, we selected scales with relatively good, moderate, or bad fit in the Project Implicit sample that were further characterized by either distinctive (i.e., large floor effects, ceiling effects, or central peaks) or relatively normal distributions. In the first sample (N = 308, NWhite = 280, 42.9% men, 56.8% women, .3% nonbinary, Mage = 42.5 years, SDage = 14.2 years), participants provided responses for Modern Racism and Prejudice Index (good fit but non-normal latent score distributions). In the second sample (N = 300, NWhite = 232, 53.8% men, 44.1% women, .7% nonbinary, Mage = 40.8 years, SDage = 13.4 years), participants provided responses for Racial Resentment, Racial Attitudes, Racial Arguments, and Social Dominance Orientation (moderate to bad fit with various latent score distributions). Both samples were collected in 2021.

The IRT latent score distributions for White North American participants (both Project Implicit and Mechanical Turk samples) are depicted in Fig. 7. Overall, results were extremely similar between the two samples. As in the Project Implicit sample, the distribution of latent scores in the Mechanical Turk sample showed floor effects for both the Modern Racism and Social Dominance Orientation scales. The Prejudice Index still demonstrates a noticeable peak in latent scores close to the center of the distribution. Finally, the other three scales’ distributions resembled those observed in the Project Implicit sample, with the exception of a slight floor effect for Racial Resentment that is present in the Project Implicit data but reduced in severity in the Mechanical Turk data. We interpret these findings as evidence that the results of the present research are not merely a function of the Project Implicit sample (or the Mechanical Turk sample).

Fig. 7

Note. Project Implicit (red) and Mechanical Turk (blue) latent score distributions are distributed in similar patterns. The density plot for the Mechanical Turk distribution is smoother because of the much smaller sample size. The Prejudice Index distribution is shifted in the Mechanical Turk sample because of the absence of any observations of certain scale values in the smaller sample (i.e., strong pro-Black attitudes), which changed the numeric center of the scale

General discussion

Historically, theoretical models have been a major pillar of social psychology, postulating the structure and consequences of people’s attitudes, motivations, or beliefs toward those in other groups. Thousands of papers have been devoted to this topic and numerous scales have been developed to tap the relevant constructs. Accurate measurement is at the core of this theoretical progress: those studying racial attitudes, motivations, and beliefs need to measure racial attitudes, motivations, and beliefs with precision. To the extent that measurement is poor, data cannot provide clear evidence for theoretical models, even in the presence of significant findings. Improved measurement is ultimately critical for knowledge accumulation, and without it the field is hindered (Flake & Fried, 2020).

To aid in this process, we performed the most comprehensive evaluation of race-related scales to date, evaluating the validity properties of 25 race-related scales using modern techniques. Using dynamic fit indices (McNeish & Wolf, 2021), we found that model fit of most scales ranged from unacceptable to highly unacceptable. We also found that some of the best-fitting race-related scales, such as Bayesian Racism and Modern Racism, exhibited problematic “peaks” at the floor of their distributions, indicating these scales are less adept at differentiating between individuals at lower values of the latent construct. Finally, we created a nomological net that helped to identify that prejudice measurement is a saturated space, an observation that informed our recommendations for scale use, which we describe below.

Recommendations

Simultaneously considering the results of our wide-ranging analyses, we provide four concrete recommendations along with three additional observations.

Recommendation 1

For researchers who do not have a strong a priori reason to use a specific measure of prejudice, we recommend using the Prejudice Index, Modern Racism, or Bayesian Racism scales to measure general anti-Black prejudice. This recommendation is grounded in both the superior model fit indices of these scales in the CFA analyses and their locations in the nomological net. Poor model fit indicates a mismatch between the theoretical structure of the model and the observed data, which makes it unclear whether a specific latent construct is being measured at all (as shown in Fig. 1). If the data do not fit the theorized structure, researchers are not measuring what they think they are measuring, and their conclusions are more likely to be wrong. Modern Racism, Bayesian Racism, and Prejudice Index are located in the central cluster of the nomological net, a cluster that we interpret as “general anti-Black prejudice”. Thus, when researchers do not have an interest in a specific race-related theoretical construct, we recommend these three scales.

Recommendation 2

We recommend using the Prejudice Index over Modern Racism and Bayesian Racism when researchers wish to differentiate between specific levels of pro-Black/anti-White sentiment. This recommendation is grounded in the IRT results for both the distribution of latent factor scores and the local reliabilities. Modern Racism and Bayesian Racism both demonstrate a floor effect, poorly capturing variation at the bottom of the scale, whereas the Prejudice Index, which uses difference scores between ratings of Black and White groups, has no such limitation. If a scale’s latent factor score does not cover a particular range of values, the scale is not sensitive to variation of the construct in that area. Conclusions hinging on sensitive measurement in that range are more likely to be wrong. Thus, the Prejudice Index is better suited for answering questions that pertain to variation in pro-Black/anti-White sentiments.

Recommendation 3

As a more general recommendation, we reiterate that simply reporting Cronbach’s α and McDonald’s ω as evidence of a scale’s validity is insufficient (see Flake et al., 2017). Many scales with high α and ω scores performed quite poorly in terms of model fit (e.g., Right-Wing Authoritarianism and General Intergroup Anxiety). Conversely, Prejudice Index, one of the best-performing scales in terms of model fit and latent score distribution coverage, had α and ω scores that were acceptable but relatively low compared to most other scales. We echo many others in cautioning against authors’ use of these scores as standalone justification for an existing or novel scale and correspondingly recommend that editors and reviewers push back against this practice.

Recommendation 4

Finally, we emphasize that researchers with strong motivation to measure a specific latent construct should not necessarily hesitate to use the appropriate scale. However, they should keep in mind the potential limitations that come with this decision (e.g., low confidence that the latent construct of interest is actually being captured). In this case, we recommend incorporating scale evaluation as part of the project and considering scale renovation (discussed below).

Additional observations

Researchers seeking to measure motivations to control prejudice would be reasonably well served by the Internal and External Motivation to Control Prejudice scales, with a couple of caveats: the scale overall shows decent but not good model fit, and although External Motivation appears to be a quite distinct latent construct, Internal Motivation appears to be part of the general anti-Black prejudice cluster of scales in the nomological net.

Researchers seeking to measure cultural knowledge might be better served by the items in the Cultural Attitudes Toward Black People or Perceptions of Others’ Prejudice scales if they regard them as separate indicators of cultural knowledge about specific traits (e.g., aggression, attractiveness, trustworthiness). These scales do not appear to capture a single underlying latent construct.

Finally, the high correlations between many of the scales in the nomological net suggest that, in general, the theoretical space related to racial stereotyping and prejudice is highly saturated. We recommend that researchers think carefully about the extent to which a given scale that purports to measure a specific kind of racial prejudice or race-related attitude actually does so, at least in a way that is theoretically distinct from other related attitudes. If some of these scales are indeed conceptually redundant, this further justifies selecting among scales based on their measurement properties.

What this work does not mean

Although we present concrete recommendations, we also wish to be clear about what we are not saying. First, all of the recommendations above are based solely on the scales’ psychometric properties and location in the nomological network. External validity evidence for these constructs was not the aim of the present research, and we cannot speak to how well these scales predict outcomes of interest (though, all else equal, scales with more measurement error are less likely to predict with precision). Some researchers might believe that a specific scale is particularly well-suited for predicting a certain outcome. Although a scale’s central position in the nomological net might cast some doubt on the unique ability of a specific scale to predict a certain outcome, scales centrally located in the nomological net nevertheless possess some variance that is unique from other scales. In these cases, researchers can look to theory and previous external validity evidence for guidance.

Second, we are agnostic regarding the historical structural validity of these scales. Many of the scales evaluated are more than 20 years old and may have shown different psychometric properties when initially developed. In fact, part of our justification for the present research is that construct validation is an ongoing and living process (Cronbach & Meehl, 1955), such that both the content validity of individual items and the research culture broadly shift over time. Researchers will continue to ask different kinds of questions about different populations in different contexts. Some of these scales were originally administered to samples of college students, who are a considerably more constrained population than that sampled here. Finally, actual racial attitudes and beliefs also shift over time (Charlesworth & Banaji, 2019; Devine & Elliot, 1995), which may explain why we observed floor effects for many of the scales.

Third, we are not claiming that any scales reviewed here are uninformative. Although we do find that many of the scales are “noisy” instruments for measuring latent factors, some signal is captured. Researchers have revealed myriad important findings regarding stereotyping, prejudice, and discrimination using many of the scales reviewed in this paper, and we certainly do not argue that these findings are invalid. Rather, we view these results through an optimistic lens, as a guide for both selecting current best scales and for identifying useful avenues for scale renovation. To this end, we hope that our analyses lead to future work seeking to create updated versions of these scales that address some of the measurement weaknesses identified here.

Finally, we want to note that issues with the structural validity of psychological scales are not unique or specific to race-related scales. Although we focus on evaluating these scales, it is likely the case that many scales across the social and personality literature exhibit similar issues (e.g., Hussey & Hughes, 2020).

Implications for scale development and renovation

This work highlights clear future directions for scale development in racial stereotyping and prejudice research. Some areas of the nomological network are relatively sparse and feature few or no scales that show good structural validity. Researchers interested in investigating effects of stereotype knowledge or motivation to control prejudice might see this as an opportunity to develop a new scale using modern methods, which would constitute a valuable methodological contribution.

Furthermore, the information provided by IRT about individual items (available on the OSF page) is an excellent resource for systematically renovating existing scales, providing two main benefits for scale renovation. First, IRT analyses identify weak items that provide limited information to the latent factor (similar to examining latent factor loadings in CFA). For example, the IRT results for Right-Wing Authoritarianism show that there are two items in particular that provide low information about the latent factor and could be removed with little loss. Second, IRT analyses identify the range of the latent factor at which each item is informative, allowing researchers to identify when introducing a “harder” item (i.e., one that discriminates between those very high in the latent factor) or an “easier” item (i.e., one that discriminates between those very low in the latent factor) would improve the coverage of a scale. For example, although the Modern Racism scale shows very good model fit, IRT results suggest that the addition of a few more extreme pro-Black items would improve the coverage of the scale and differentiate between the high percentage of individuals who hit the floor of the scale. A figure illustrating these examples is available on the OSF page.
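As a sketch of how such item-level diagnostics can be obtained, the R code below extracts discrimination parameters from a graded response model fit with the ltm package; items_df is a hypothetical data frame of item responses, and the "Dscrmn" column name follows ltm's printed output.

```r
# Sketch: inspect IRT item parameters to flag renovation candidates. For a graded
# response model, coef() returns extremity (threshold) parameters and a
# discrimination column (printed as "Dscrmn" by ltm) when all items share the
# same number of response categories; otherwise it returns a list.
library(ltm)

grm_fit <- grm(items_df)
params  <- coef(grm_fit)
head(params)

# Items with the smallest discrimination contribute the least information about
# the latent factor and are candidates for removal or rewording.
sort(params[, "Dscrmn"])
```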

We certainly do not suggest abandoning rich theoretical constructs such as Right-Wing Authoritarianism or Symbolic Racism; rather, we suggest that there is great opportunity for renovating these scales, which will improve future research on these topics. We suggest that researchers interested in scale renovation employ IRT to pinpoint uninformative items for removal and to identify the difficulty level at which new items should be introduced. Overall, we hope that this work motivates and rewards researchers who pursue scale renovation and believe that such work would be highly beneficial to the field and to the further development of theories that hinge on the accurate measurement of specific latent constructs.

Limitations

We note a few key limitations of the current work. First, some of the scales used in our sample already have recently renovated versions that were not collected in our analysis. We note two prominent cases here. The SDO7 (Ho et al., 2015) renovates the scale items and reconceptualizes Social Dominance Orientation as a two-dimensional construct. SDO6 was used in the present work (Pratto et al., 1994), but it is important to note that this scale is still regularly used; for example, between January and March 2021, we identified seven published papers using SDO6. Similarly, the Racial Resentment scale was renovated in 2011 (Wilson & Davis, 2011), and our analyses reflect the psychometric properties of an earlier version (Kinder et al., 1996). Future analyses might include these updated versions of the scales, but our findings here remain relevant to modern research even for these older but still used scales.

Our evaluation of race-related scales also does not capture the full “universe” of scales available in the literature. One notable exclusion (due to its absence from the Project Implicit dataset) is the Color-Blind Racial Attitudes Scale (Neville et al., 2000), which has been cited over 1100 times. Future work might collect or use data that includes important scales absent from the current investigation.

Furthermore, we used self-selected samples of individuals who chose to visit Project Implicit or Mechanical Turk. It is possible that these scales show different psychometric properties in different populations. However, we note that our analyses already draw on a far larger and more diverse population than the original scale development work, which used smaller and more homogeneous samples of American adults, White adults, and college students.

We have evaluated the construct validity of these scales with regard to measuring racial attitudes toward Black people among mostly U.S. participants. Because construct validation pertains to a specific use of a scale and can be context or population dependent (Kane, 2013; Messick, 1995), scales with good psychometric properties in this scenario will not necessarily have good properties when assessing attitudes toward other groups or when drawing from other populations. Researchers using these scales should first verify that their measures show properties similar to those in previous analyses, especially in any new context.

The ongoing theoretical and methodological importance of explicit bias

Finally, we suggest that it may be a suitable time to revitalize research on explicitly expressed prejudice. Beginning in the 1980s, social scientists were increasingly concerned that individuals were no longer honestly reporting their prejudices on explicit self-report measures, due to social desirability concerns and the idea that appearing prejudiced was no longer publicly acceptable. Accordingly, the field began developing indirect assessments of bias (Devine, 1989; Fazio et al., 1986; Gaertner & McLaughlin, 1983; Greenwald et al., 1998). This focus fueled nearly 40 years of intense research into indirectly measured implicit biases: what they are, their causes, and their consequences (Cameron et al., 2012; Dovidio et al., 2002; Greenwald et al., 2009; Hofmann et al., 2005; Kawakami et al., 2007; Kurdi et al., 2018; Nosek et al., 2007; Payne et al., 2005). This research has greatly informed our understanding of social cognition and bias, yet it has also revealed some of the limitations of indirectly measured biases. Like many cognitive tasks, they have high measurement error (Cunningham et al., 2001; Gawronski et al., 2017; Hedge et al., 2018) and only weak relationships with behavior (Greenwald et al., 2009; Kurdi et al., 2018; Oswald et al., 2013). In contrast, explicit measures of racial attitudes typically have less measurement error (Gawronski et al., 2017) and stronger or at least equivalent relationships with individual-level behavior (Oswald et al., 2013). Although we understand concerns about socially desirable responding, we do not believe there is a shortage in modern times of public expressions of prejudice (Crandall et al., 2018).

In all, explicit measures have superior measurement properties relative to implicit measures of racial attitudes. Implicit measures have, at best, equal associations with behavior, yet explicit biases are easier to measure. People also appear to be willing to explicitly express prejudice toward stigmatized groups. Accordingly, we believe the need for effective self-report measures of explicit bias is alive and well, and we encourage prejudice researchers to continue devoting empirical attention to explicitly endorsed measures of racial prejudice and to collect them alongside implicit measures. The analyses provided in the present research can aid this endeavor.

Conclusions

Before any deep-sea dive, researchers and engineers carefully test their equipment to make sure that every tool and instrument is functioning properly. Although psychologists are not faced with the same high-cost, life-threatening stakes, we can nevertheless benefit by following suit, carefully considering and testing the instruments we use to study racial attitudes and other latent factors. By closely evaluating the measurement scales we use to “dive” into the minds of others and reveal people’s thoughts and beliefs, we can come ever closer to actually observing these thoughts and beliefs, allowing us to draw stronger conclusions about their nature, meaning, and consequences.