Educational Testing Service (ETS) was founded with a dual mission: to provide high-quality testing programs that would enhance educational decisions and to improve the theory and practice of testing in education through research and development (Bennett 2005; Educational Testing Service 1992). Since its inception in 1947, ETS has consistently evaluated its testing programs to help ensure that they meet high standards of technical and operational quality, and where new theory and new methods were called for, ETS researchers made major contributions to the conceptual frameworks and methodology.

This chapter reviews ETS’s contributions to validity theory and practice at various levels of generality, including overarching frameworks (Messick 1988, 1989), more targeted models for issues such as fairness, and particular analytic methodologies (e.g., reliability, equating, differential item functioning). The emphasis will be on contributions to the theory of validity and, secondarily, on the practice of validation rather than on specific methodologies.

1 Validity Theory

General conceptions of validity grew out of basic concerns about the accuracy of score meanings and the appropriateness of score uses (Kelley 1927), and they have necessarily evolved over time as test score uses have expanded, as proposed interpretations have been extended and refined, and as the methodology of testing has become more sophisticated.

In the first edition of Educational Measurement (Lindquist 1951), which was released just after ETS was founded, Cureton began the chapter on validity by suggesting that “the essential question of test validity is how well a test does the job it is employed to do” (Cureton 1951, p. 621) and went on to say that

validity has two aspects, which may be termed relevance and reliability. … To be valid—that is to serve its purpose adequately—a test must measure something with reasonably high reliability, and that something must be fairly closely related to the function it is used to measure. (p. 622)

In the late 1940s and early 1950s, tests tended to be employed to serve two kinds of purposes: providing an indication of the test taker’s standing on some attribute (e.g., cognitive ability, personality traits, academic achievement) and predicting future performance in some context.

Given ETS’s mission (Bennett, Chap. 1, this volume) and the then current conception of validity (Cureton 1951), it is not surprising that much of the early work on validity at ETS was applied rather than theoretical; it focused on the development of measures of traits thought to be relevant to academic success and on the use of these measures to predict future academic performance. For example, the second Research Bulletin published at ETS (i.e., Frederiksen 1948) focused on the prediction of first-year grades at a particular college.

This kind of applied research designed to support and evaluate particular testing programs continues to be an essential activity at ETS, but over the years, these applied research projects have also generated basic questions about the interpretations of test scores, the statistical methodology used in test development and evaluation, the scaling and equating of scores, the variables to be used in prediction, structural models relating current performance to future outcomes, and appropriate uses of test scores in various contexts and with various populations. In seeking answers to these questions, ETS researchers contributed to the theory and practice of educational measurement by developing general frameworks for validation and related methodological developments that support validation.

As noted earlier, at the time ETS was founded, the available validity models for testing programs emphasized score interpretations in terms of traits and the use of scores as predictors of future outcomes, but over the last seven decades, the concept of validity has expanded. The next section reviews ETS’s contributions to the development and validation of trait interpretations, and the following section reviews ETS’s contributions to models for the prediction of intended “criterion” outcomes. The fourth describes ETS’s contributions to our conceptions and analyses of fairness in testing. The fifth section traces the development of Messick’s comprehensive, unified model of construct validity, a particularly important contribution to the theory of validity. The sixth section describes ETS’s development of argument-based approaches to validation. A seventh section, on validity research at ETS, focuses on the development of methods for the more effective interpretation and communication of test scores and for the control of extraneous variance. The penultimate section discusses fairness as a core validity concern. The last section provides some concluding comments.

This organization is basically thematic, with each section examining ETS’s contributions to the development of aspects of validity theory, but it is also roughly chronological. The strands of the story (trait interpretations, prediction, construct interpretations, models for fairness, Messick’s unified model of construct validity, models for the role of consequences of testing, and the development of better methods for encouraging clear interpretations and appropriate uses of test scores) overlap greatly, developed at different rates during different periods, and occasionally folded back on themselves, but there was also a gradual progression from simpler and more intuitive models for validity to more complex and comprehensive models, and the main sections in this chapter reflect this progression.

As noted, most of the early work on validity focused on trait interpretations and the prediction of desired outcomes. The construct validity model was proposed in the mid-1950s (Cronbach and Meehl 1955), but it took a while for this model to catch on. Fairness became a major research focus in the 1970s. In the 1970s and 1980s, Messick developed his unified framework for the construct validity of score interpretations and uses, and the argument-based approaches were developed at the turn of the century.

It might seem appropriate to begin this chapter by defining the term validity, but as in any area of inquiry (and perhaps more so than in many other areas of inquiry), the major developments in validity theory have involved changes in what the term means and how it is used. The definition of validity has been and continues to be a work in progress. Broadly speaking, validation has always involved an evaluation of the proposed interpretations and uses of test scores (Cronbach 1971; Kane 2006, 2013a; Messick 1989), but both the range of proposed interpretations and the evaluative criteria have gradually expanded.

2 Validity of Trait Interpretations

For most of its history from the late nineteenth century to the present, test theory has tended to focus on traits, which were defined in terms of dispositions to behave or perform in certain ways in response to certain kinds of stimuli or tasks, in certain kinds of contexts. Traits were assumed to be personal characteristics with some generality (e.g., over some domain of tasks, contexts, occasions). In the late 1940s and early 1950s, this kind of trait interpretation was being applied to abilities, skills, aptitudes, and various kinds of achievement as well as to psychological traits as such. Trait interpretations provided the framework for test development and, along with predictive inferences, for the interpretation of test scores (Gulliksen 1950a). As theory and methodology developed, trait interpretations tended to become more sophisticated in their conceptualizations and in the methods used to estimate the traits. As a result, trait interpretations have come to overlap with construct interpretations (which can be more theoretical in character), but in this section, we limit ourselves to basic trait interpretations, which involve dispositions to perform in some way in response to tasks of some kind.

Cureton (1951) summarized the theoretical framework for this kind of trait interpretation:

When the item scores of a set of test-item performances correlate substantially and more or less uniformly with one another, the sum of the item scores (the summary score or test score) has been termed a quasi-measurement. It is a quasi-measurement of “whatever,” in the reaction-systems of the individuals, is invoked in common by the test items as presented in the test situation. This “whatever” may be termed a “trait.” The existence of the trait is demonstrated by the fact that the item scores possess some considerable degree of homogeneity; that is, they measure in some substantial degree the same thing. We term this “thing” the “trait.” (pp. 647–648)

These traits can vary in their content (e.g., achievement in geography vs. anxiety), in their generality (e.g., mechanical aptitude vs. general intelligence), and in the extent to which they are context or population bound, but they share three characteristics (Campbell 1960; Cureton 1951). First, they are basically defined in terms of some relatively specific domain of performance or behavior (with some domains broader than others). Second, the performances or behaviors are assumed to reflect some characteristic of individuals, but the nature of this characteristic is not specified in any detail, and as a result, the interpretation of the trait relies heavily on the domain definition. Third, traits are assumed to be enduring characteristics of individuals, with some more changeable (e.g., achievement in some academic subject) than others (e.g., aptitudes, personality).

Note that the extent to which a trait is enduring is context dependent. Levels of achievement in an academic subject such as geography would be expected to increase while a student is studying the subject and then to remain stable or gradually decline thereafter. A personality trait such as conscientiousness is likely to be more enduring, but even the most stable traits can change over time.

An understanding of the trait (rudimentary as it might be) indicates the kinds of tasks or stimuli that could provide information about it. The test items are designed to reflect the trait, and to the extent possible nothing else, and differences in test scores are assumed to reflect mainly differences in level of the trait.

The general notion of a trait as a (somewhat) enduring characteristic of a person that is reflected in certain kinds of behavior in certain contexts is a basic building block of “folk psychology,” and as such, it is ancient (e.g., Solomon was wise, and Caesar was said to be ambitious). As they have developed over the last century and a half, modern theories of psychology have made extensive use of a wide variety of traits (from introversion to mathematical aptitude) to explain human behavior. As Messick (1989) put it, “a trait is a relatively enduring characteristic of a person—an attribute, process, or disposition—which is consistently manifested to an appropriate degree when relevant, despite considerable variation in the range of settings and circumstances” (p. 15). Modern test theory grew out of efforts to characterize individuals in terms of traits, and essentially all psychometric theories (including classical test theory, generalizability theory, factor analysis, and item response theory) involve the estimation of traits of one kind or another.

From a psychological point of view, the notion of a trait suggests a persistent characteristic of a person that is prior to and independent of any testing program. The trait summarizes (and, in that sense, accounts for) performance or behavior. The trait is not synonymous with any statistical parameter, and it is reasonable to ask whether a parameter estimate based on a particular sample of behavior is an unbiased estimate of the trait of interest. Assuming that the estimate is unbiased, it is also reasonable to ask how precise the estimate is. An assessment of the trait may involve observing a limited range of performances or behaviors in a standardized context and format, but the trait is interpreted in terms of a tendency or disposition to behave in some way or an ability to perform some kinds of tasks in a range of test and nontest contexts. The trait interpretation therefore entails expectations that assessments of the trait using different methods should agree with each other, and assessments of different traits using common methods should not agree too closely (Campbell 1960; Campbell and Fiske 1959).

Traits have two complementary aspects. On one hand, a trait is thought of as an unobservable characteristic of a person, as some latent attribute or combination of such attributes of the person. However, when one is asked to say what is meant by a trait, the answer tends to be in terms of some domain of observable behavior or performance. Thus traits are thought of both as unobservable attributes and in terms of typical performance over some domain. Most of the work described in this section focuses on traits as dispositions to behave in certain ways. In a later section, we will focus more on traits as theoretical constructs that are related to domains of behavior or performance but that are defined in terms of their properties as underlying latent attributes or constructs.

2.1 ETS’s Contributions to Validity Theory for Traits

Trait interpretations of test scores go back at least to the late nineteenth century and therefore predate both the use of the term validity and the creation of ETS. However, ETS researchers made many contributions to theoretical frameworks and specific methodology for the validation of trait interpretations, including contributions to classical test theory (encompassing reliability theory, standard errors, and confidence intervals), item response theory, equating, factor analysis, scaling, and methods for controlling trait-irrelevant variance. The remainder of this section concentrates on ETS’s contributions to the development of these methodologies, all of which seek to control threats to validity.

ETS researchers have been involved in analyzing and measuring a wide variety of traits over ETS’s history (Stricker, Chap. 13, this volume), including acquiescence (Messick 1965, 1967), authoritarian attitudes (Messick and Jackson 1958), emotional intelligence (Roberts et al. 2008), cognitive structure (Carroll 1974), response styles (Jackson and Messick 1961; Messick 1991), risk taking (Myers 1965), and social intelligence (Stricker and Rock 1990), as well as various kinds of aptitudes and achievement. ETS researchers have also made major contributions to the methodology for evaluating the assumptions inherent in trait interpretations and for ruling out factors that might interfere with the intended trait interpretations, particularly in classical test theory (Lord and Novick 1968), theory related to the sampling of target domains (Frederiksen 1984), and item response theory (Lord 1951, 1980).

2.2 Classical Test Theory and Reliability

Classical test theory (CTT) is based on trait interpretations, particularly on the notion of a trait score as the expected value over the domain of replications of a measurement procedure. The general notion is that the trait being measured remains invariant over replications of the testing procedure; the test scores may fluctuate to some extent over replications, but the value of the trait is invariant, and fluctuations in observed scores are treated as random errors of measurement. Gulliksen (1950b) used this notion as a starting point for his book summarizing psychometric theory as it stood in the late 1940s, although he used the term ability instead of trait:

It is assumed that the gross score has two components. One of these components (T) represents the actual ability of the person, a quantity that will be relatively stable from test to test as long as the tests are measuring the same thing. The other component (E) is an error. (p. 4)

Note that the true scores of CTT are expected values over replications of the testing procedure; they do not refer to an underlying, “real” value of the trait, which has been referred to as a platonic true score to differentiate it from the classical true score. Reliability coefficients were defined in terms of the ratio of true-score variance to observed-score variance, and the precision of the scores was evaluated in terms of the reliability or in terms of standard errors of measurement. Livingston (1972) extended the notion of reliability to cover the dependability of criterion-referenced decisions.
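
In symbols, the classical decomposition and the definitions just described can be summarized as follows (standard CTT notation, added here only as a compact restatement):

```latex
X = T + E, \qquad \mathrm{E}(E) = 0, \qquad \operatorname{Cov}(T, E) = 0
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2}, \qquad
\sigma_E = \sigma_X \sqrt{1 - \rho_{XX'}}
```

Here the reliability coefficient is the ratio of true-score variance to observed-score variance, and the standard error of measurement follows directly from the reliability and the observed-score standard deviation.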

Evidence for the precision of test scores (e.g., standard errors, reliability) supports validity claims in at least three ways. First, some level of precision is necessary for scores to be valid for any interpretation; that is, if the trait estimates have low reliability (i.e., they fluctuate substantially over replications), the only legitimate interpretation of the scores is that they mostly represent error or “noise.” Second, the magnitude of the standard error can be considered part of the interpretation of the scores. For example, to say that a test taker has an estimated score of 60 with a standard error of 2 is a much stronger claim than a statement that a test taker has an estimated score of 60 with a standard error of 20. Third, the relationships between the precision of test scores and the number and characteristics of the items in the test can be used to develop tests that are more reliable without sacrificing relevance, thereby improving validity.

Classical test theory was the state of the art in the late 1940s, and as ETS researchers developed and evaluated tests of various traits, they refined old methods and developed new methods within the context of the CTT model (Moses, Chaps. 2 and 3, this volume). The estimation of reliability and standard errors has been an ongoing issue of fundamental importance (Horst 1951; Jöreskog 1971; Keats 1957; Kristof 1962, 1970, 1974; Lord 1955; Novick and Lewis 1967; Tucker 1949). ETS’s efforts to identify the implications of various levels of reliability began soon after its inception and have continued since (Angoff 1953; Haberman 2008; Horst 1950a, b; Kristof 1971; Livingston and Lewis 1995; Lord 1956, 1957, 1959).

An important early contribution of ETS researchers to the classical model was the development of conditional standard errors (Keats 1957; Lord 1955, 1956) and of associated confidence intervals around true-score estimates (Gulliksen 1950b; Lord and Novick 1968; Lord and Stocking 1976). Putting a confidence interval around a true-score estimate helps to define and limit the inferences that can be based on the estimate; for example, a decision to assign a test taker to one of two categories can be made without much reservation if a highly conservative confidence interval (e.g., 99%) for a test taker does not include the cutscore between the two categories (Livingston and Lewis 1995). Analyses of the reliability and correlations of subscores can also provide guidance on whether it would be meaningful to report the subscores separately (Haberman 2008).
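
A minimal sketch of the cutscore example above, with hypothetical numbers and a generic normal-approximation confidence interval (not any particular ETS procedure):

```python
# Check whether a confidence interval around an estimated score clears a cutscore.
from scipy.stats import norm

def clears_cutscore(score, sem, cutscore, confidence=0.99):
    """Return True if the interval score +/- z*sem excludes the cutscore."""
    z = norm.ppf(1 - (1 - confidence) / 2)      # two-sided critical value (about 2.58 at 99%)
    lower, upper = score - z * sem, score + z * sem
    return not (lower <= cutscore <= upper)

# Hypothetical example: estimated score 60, conditional SEM 2, cutscore 53.
print(clears_cutscore(60, 2, 53))   # True: the classification can be made with little reservation
```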

Evaluations of the precision of test scores serve an important quality-control function, and they can help to ensure an adequate level of precision in the test scores generated by the testing program (Novick and Thayer 1969). Early research established the positive relationship between test length and reliability as well as the corresponding inverse relationship between test length and standard errors (Lord 1956, 1959). That research tradition also yielded methods for maximizing the reliability of composite measures (B.F. Green 1950).
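
The relationship between test length and reliability referred to above is conventionally expressed by the Spearman–Brown formula (a standard classical result, given here only as an illustration): if a test with reliability ρ is lengthened by a factor of k with comparable items, the projected reliability is

```latex
\rho_k = \frac{k\,\rho}{1 + (k - 1)\,\rho}
```

so that, other things being equal, longer tests have higher reliability and correspondingly smaller standard errors of measurement.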

One potentially large source of error in testing programs that employ multiple forms of a test (e.g., to promote security) is variability in content and statistical characteristics (particularly test difficulty) across different forms of the test, involving different samples of test items. Assuming that the scores from the different forms are to be interpreted and used interchangeably, it is clearly desirable that each test taker’s score be more or less invariant across the forms, but this ideal is not likely to be met exactly, even if the forms are developed from the same specifications. Statistical equating methods are designed to minimize the impact of form differences by adjusting for differences in operating characteristics across the forms. ETS researchers have made major contributions to the theory and practice of equating (Angoff 1971; Holland 2007; Holland and Dorans 2006; Holland and Rubin 1982; Lord and Wingersky 1984; Petersen 2007; Petersen et al. 1989; A.A. von Davier 2011; A.A. von Davier et al. 2004). In the absence of equating, form-to-form differences can introduce substantial errors, and equating procedures can reduce this source of error.
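
As a concrete illustration of the kind of adjustment equating performs, the sketch below implements simple linear (mean–sigma) equating of a new form onto the scale of a reference form under an equivalent-groups design; the data are hypothetical, and operational methods (e.g., kernel equating; A.A. von Davier et al. 2004) are considerably more elaborate.

```python
import numpy as np

def linear_equate(x_scores, y_scores):
    """Mean-sigma linear equating: return a function mapping scores on form X
    onto the scale of form Y, assuming randomly equivalent groups took the forms."""
    mu_x, sd_x = np.mean(x_scores), np.std(x_scores, ddof=1)
    mu_y, sd_y = np.mean(y_scores), np.std(y_scores, ddof=1)
    return lambda x: (sd_y / sd_x) * (x - mu_x) + mu_y

# Hypothetical example: form X turned out slightly harder than reference form Y,
# so raw scores on X are adjusted upward to the Y scale.
rng = np.random.default_rng(0)
form_x = rng.normal(48, 10, 2000)
form_y = rng.normal(50, 10, 2000)
to_y_scale = linear_equate(form_x, form_y)
print(round(float(to_y_scale(48.0)), 1))   # approximately 50 on the reference scale
```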

On a more general level, ETS researchers have played major roles in developing the CTT model and in putting it on firm foundations (Lord 1965; Novick 1965). In 1968, Frederick Lord and Melvin Novick formalized and summarized most of what was known about the CTT model in their landmark book Statistical Theories of Mental Test Scores. They provided a very sophisticated statement of the classical test-theory model and extended it in many directions.

2.3 Adequate Sampling of the Trait

Adequate sampling of the trait domain requires a clear definition of the domain, and ETS researchers have devoted a lot of attention to developing a clear understanding of various traits and of the kinds of performances associated with these traits (Ebel 1962). For example, Dwyer et al. (2003) defined quantitative reasoning as “the ability to analyze quantitative information” (p. 13) and specified that its domain would be restricted to quantitative tasks that would be new to the student (i.e., would not require methods that the test takers had been taught). They suggested that quantitative reasoning includes six more specific capabilities: (a) understanding quantitative information presented in various formats, (b) interpreting and drawing inferences from quantitative information, (c) solving novel quantitative problems, (d) checking the reasonableness of the results, (e) communicating quantitative information, and (f) recognizing the limitations of quantitative methods. The quantitative reasoning trait interpretation assumes that the tasks do not require specific knowledge that is not familiar to all test takers and, therefore, any impact that such knowledge has on the scores would be considered irrelevant variance.

As noted earlier, ETS has devoted a lot of attention to developing assessments that reflect traits of interest as fully as possible (Lawrence and Shea 2011). Much of this effort has been devoted to more adequately sampling the domains associated with the trait, and thereby reducing the differences between the test content and format and the broader domain associated with the trait (Bejar and Braun 1999; Frederiksen 1984). For example, the “in-basket” test (Frederiksen et al. 1957) was designed to evaluate how well managers could handle realistic versions of management tasks that required decision making, prioritizing, and delegating. Frederiksen (1959) also developed a test of creativity in which test takers were presented with descriptions of certain results and were asked to list as many hypotheses as they could to explain the results. Frederiksen had coauthored the chapter on performance assessment in the first edition of Educational Measurement (Ryans and Frederiksen 1951) and consistently argued for the importance of focusing assessment on the kinds of performance that are of ultimate interest, particularly in a landmark article, “The Real Test Bias: Influences of Testing on Teaching and Learning” (Frederiksen 1984). More recently, ETS researchers have been developing a performance-based program of Cognitively Based Assessment of, for, and as Learning (the CBAL® initiative) that elicits extended performances (Bennett 2010; Bennett and Gitomer 2009). For CBAL, and more generally for educational assessments, positive changes in the traits are the goals of instruction and assessment, and therefore the traits being assessed are not expected to remain the same over extended periods.

The evidence-centered design (ECD) approach to test development, which is discussed more fully later, is intended to promote adequate sampling of the trait (or construct) by defining the trait well enough up front to get a good understanding of the kinds of behaviors or performance that would provide the evidence needed to draw conclusions about the trait (Mislevy et al. 1999, 2002). To the extent that the testing program is carefully designed to reflect the trait of interest, it is more likely that the observed behaviors or performances will adequately reflect that trait.

Based on early work by Lord (1961) on the estimation of norms by item sampling, matrix sampling approaches, in which different sets of test tasks are taken by different subsamples of test takers, have been developed to enhance the representativeness of the sampled test performances for the trait of interest (Mazzeo et al. 2006; Messick et al. 1983). Instead of drawing a single sample of tasks that are administered to all test takers, multiple samples of tasks are administered to different subsamples of test takers. This approach allows for a more extensive sampling of content in a given amount of testing time. In addition, because it loosens the time constraints on testing, the matrix sampling approach allows for the use of a wider range of test tasks, including performance tasks that require substantial time to complete. These matrix sampling designs have proven to be especially useful in large-scale monitoring programs like the National Assessment of Educational Progress (NAEP) and in various international testing programs (Beaton and Barone, Chap. 8; Kirsch et al., Chap. 9, this volume).
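
A minimal sketch of the basic matrix-sampling idea follows; the block and booklet counts are hypothetical, and operational designs such as NAEP’s balanced incomplete block schemes are considerably more refined.

```python
from itertools import combinations

# Split a large item pool into blocks; each booklet, taken by one subsample of
# test takers, contains only a few blocks, so no one sees the entire pool.
items = [f"item_{i:03d}" for i in range(90)]
blocks = [items[i:i + 15] for i in range(0, 90, 15)]      # 6 blocks of 15 items each

# Pair every two blocks in exactly one booklet so that booklets overlap and can
# be linked onto a common scale while broad content coverage is preserved.
booklets = list(combinations(range(len(blocks)), 2))
print(len(booklets))     # 15 distinct booklets
print(booklets[0])       # the first subsample takes blocks 0 and 1
```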

2.4 Factor Analysis

Although a test may be designed to reflect a particular trait, it is generally the case that the test scores will be influenced by many characteristics of the individuals taking the test (e.g., motivation, susceptibility to distractions, reading ability). To the extent that it is possible to control the impact of test-taker characteristics that are irrelevant to the trait of interest, it may be possible to interpret the assessment scores as relatively pure measures of that focal trait (French 1951a, b, 1954, 1963). More commonly, the assessment scores may also intentionally reflect a number of test-taker characteristics that, together, compose the trait. That is, broadly defined traits that are of practical interest may involve a number of more narrowly defined traits or factors that contribute to the test taker’s performance. For example, as noted earlier, Dwyer et al. (2003) defined the performance domain for quantitative reasoning in terms of six capabilities, including understanding quantitative information, interpreting quantitative information, solving quantitative problems, and estimating and checking answers for reasonableness. In addition, most trait measures require ancillary abilities (e.g., the ability to read) that are needed for effective performance in the assessment context.

In interpreting test scores, it is generally helpful to develop an understanding of how different characteristics are related to each other. Factor analysis models have been widely used to quantify the contributions of different underlying characteristics, or “factors,” to assessment scores, and ETS researchers have played a major role in the development of various factor-analytic methods (Moses, Chaps. 2 and 3, this volume), in part because of their interest in developing a variety of cognitive and noncognitive measures (French 1951a, b, 1954).

Basic versions of exploratory factor analysis were in general use when ETS was formed, but ETS researchers contributed to the development and refinement of more sophisticated versions of these methods (Browne 1968; B.F. Green 1952; Harman 1967; Lord and Novick 1968; Tucker 1955). Exploratory factor analysis makes it possible to represent the relationships (e.g., correlations or covariances) among observed scores on a set of assessments in terms of a statistical model describing the relationships among a relatively small number of underlying dimensions, or factors. The factor models decompose the observed total scores on the tests into a linear combination of factor scores, and they provide quantitative estimates of the relative importance of the different factors in terms of the variance explained by each factor.
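
The sketch below illustrates this decomposition with a generic exploratory factor analysis routine; the data, the two-factor structure, and the use of scikit-learn’s FactorAnalysis are all assumptions made for illustration rather than a reconstruction of ETS methodology.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 500
factors = rng.normal(size=(n, 2))                  # two hypothetical latent factors

# Six observed test scores, each a linear combination of the factors plus noise.
loadings = np.array([[0.8, 0.1], [0.7, 0.2], [0.9, 0.0],
                     [0.1, 0.8], [0.2, 0.7], [0.0, 0.9]])
scores = factors @ loadings.T + rng.normal(scale=0.4, size=(n, 6))

fa = FactorAnalysis(n_components=2, rotation="varimax")
fa.fit(scores)
print(np.round(fa.components_, 2))                 # estimated loadings: rows are factors, columns are tests
```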

By focusing on the traits as latent dimensions or factors or as some composite of more basic latent factors, and by embedding these factors within a web of statistical relationships, exploratory factor analysis provided a rudimentary version of the kind of nomological networks envisioned by Cronbach and Meehl (1955). The utility of exploratory analyses for explicating appropriate interpretations of test scores was enhanced by an extended research program at ETS to develop sets of reference measures that focused on particular basic factors (Ekstrom et al. 1979; French 1954; French et al. 1963). By including the reference tests with a more broadly defined trait measure, it would be possible to evaluate the factor structure of the broadly defined trait in terms of the reference factors.

As in other areas of theory development, the work done on factor analysis by ETS researchers tended to grow out of and be motivated by concerns about the need to build assessments that reflected certain traits and to evaluate how well the assessment actually reflected those traits. As a result, ETS’s research on exploratory factor analysis has involved a very fruitful combination of applied empirical studies of score interpretations and sophisticated theoretical modeling (Browne 1968; French 1951a, b; Harman 1967; Lord and Novick 1968).

A major contribution to the theory and practice of validation that came out of research at ETS is confirmatory factor analysis (Jöreskog 1967, 1969; Jöreskog and Lawley 1967; Jöreskog and van Thillo 1972). As its name indicates, exploratory factor analysis does not propose strong constraints a priori; the analysis essentially partitions the observed-score variances by using statistical criteria to fit the model to the data. In a typical exploratory factor analysis, theorizing tends to occur after the analysis, as the resulting factor structure is used to suggest plausible interpretations for the factors. If reference factors are included in the analysis, they can help orient the interpretation.

In confirmatory factor analysis (CFA), a factor model is specified in advance by putting constraints on the factor structure, and the constrained model is fit to the data. The constraints imposed on the model are typically based on a priori theoretical assumptions, and the empirical data are used to check the hypotheses built into the models. As a result, CFAs can provide support for theory-based hypotheses or can result in refutations of some or all of the theoretical conjectures (Jöreskog 1969). This CFA model was extended as the basis for structural equation modeling (Jöreskog and van Thillo 1972). To the extent that the constraints incorporate theoretical assumptions, CFAs go beyond simple trait interpretations into theory-based construct interpretations.
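
In the factor-analytic notation Jöreskog used, the measurement model and its implied covariance structure can be written as follows (standard notation, included only as an illustration); the a priori constraints fix or free particular elements of the loading matrix Λ, the factor covariance matrix Φ, and the unique-variance matrix Θ:

```latex
\mathbf{x} = \boldsymbol{\Lambda}\,\boldsymbol{\xi} + \boldsymbol{\delta},
\qquad
\boldsymbol{\Sigma} = \boldsymbol{\Lambda}\,\boldsymbol{\Phi}\,\boldsymbol{\Lambda}' + \boldsymbol{\Theta}_{\delta}
```

Fit is then evaluated by comparing the covariance matrix implied by the constrained model to the sample covariance matrix.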

CFA is very close in spirit and form to the nomological networks of Cronbach and Meehl (1955). In both cases, there are networks of hypothesized relationships between constructs (or latent variables), which are explicitly defined a priori and which may be extensive, and there are proposed measures of at least some of the constructs. Given specification of the network as a confirmatory factor model (and adequate data), the hypotheses inherent in the network can be checked by evaluating the fit of the model to the data. If the model fits, the substantive assumptions (about relationships between the constructs) in the model and the validity of the proposed measures of the constructs are both supported. If the model does not fit the data, the substantive assumptions, the validity of the measures, or both are likely to be questioned. As is the case in the classic formulation of the construct validity model (Cronbach and Meehl 1955), the substantive theory and the assessments are initially validated (or invalidated) holistically as a network of interrelated assumptions. If the constrained model fails to fit the data, the data can be examined to identify potential weaknesses in the network. In addition, the model fit can be compared to the fit of alternate models that make different (perhaps stronger or weaker) assumptions.

2.5 Latent Traits

Two major developments in test theory in the second half of the twentieth century (the construct validity model and latent trait theory) grew out of attempts to make the relationship between observed behaviors or performances and the relevant traits more explicit, and ETS researchers played major roles in both of these developments (see Carlson and von Davier, Chap. 5, this volume). Messick (1975, 1988, 1989) elaborated the construct validity model of Cronbach and Meehl (1955), which sought to explicate the relationships between traits and observed assessment performances through substantive theories that would relate trait scores to the constructs in a theory and to other trait scores attached to the theory. Item response theory (IRT) deployed measurement models to specify the relationships between test performances and postulated latent traits and to provide statistical estimates of these traits (Lord 1951). Messick’s contributions to construct validity theory will be discussed in detail later in this chapter. In this section, we examine contributions to IRT and the implications of these developments for validity.

In their seminal work on test theory, Lord and Novick (1968) used trait language to distinguish true scores from errors:

Let us suppose that we repeatedly administer a given test to a subject and thus obtain a measurement each day for a number of days. Further, let us assume that with respect to the particular trait the test is designed to measure, the person does not change from day to day and that successive measurements are unaffected by previous measurements. Changes in the environment or the state of the person typically result in some day-to-day variation in the measurements which are obtained. We may view this variation as the result of errors of measurement of the underlying trait characterizing the individual, or we may view it as a representation of a real change in this trait. (pp. 27–28)

In models for true scores, the true score captures the enduring component in the scores over repeated, independent testing, and the “random” fluctuations around this true score are relegated to error.

Lord and Novick (1968) also used the basic notion of a trait to introduce latent traits and item characteristic functions:

Any theory of latent traits supposes that an individual’s behavior can be accounted for, to a substantial degree, by defining certain human characteristics called traits, quantitatively estimating the individual’s standing on each of these traits, and then using the numerical values obtained to predict or explain performance in relevant situations. (p. 358)

Within the context of the statistical model, the latent trait accounts for the test performances, real and possible, in conjunction with item or task parameters. The latent trait has model-specific meaning and a model-specific use; it captures the enduring contribution of the test taker’s “ability” to the probability of success over repeated, independent performances on different tasks.
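
One familiar example of such a model is the three-parameter logistic item response function associated with Lord’s work, shown here simply to illustrate how item parameters and the latent trait combine: the probability that a test taker with latent trait θ answers item i correctly is modeled as

```latex
P_i(\theta) = c_i + (1 - c_i)\,\frac{1}{1 + e^{-1.7\,a_i(\theta - b_i)}}
```

where a_i, b_i, and c_i are the item’s discrimination, difficulty, and lower asymptote, and 1.7 is the conventional scaling constant.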

Latent trait models have provided a richer and in some ways firmer foundation for trait interpretations than that offered by classical test theory. One motivation for the development of latent trait models (Lord 1951) was the realization that number-right scores and simple transformations of such scores would not generally yield the defining property of traits (i.e., invariance over measurement operations). The requirement that task performance data fit the model can also lead to a sharpening of the domain definition, and latent trait models can be helpful in controlling random errors by facilitating the development of test forms with optimal statistical properties and the equating of scores across different forms of a test.

A model-based trait interpretation depends on empirical evidence that the statistical model fits the data well enough; if it does, we can have confidence that the test scores reflect the trait conceived of as “whatever … is invoked in common by the test items” (Cureton 1951, p. 647). The application of a CTT or latent trait model to student responses to generate estimates of a true score or a latent trait does not in itself justify the interpretation of scores in terms of a construct that causes and explains the task performances, and it does not necessarily justify inferences to any nontest performance. A stronger interpretation in terms of a psychological trait that has implications beyond test scores requires additional evidence (Messick 1988, 1989). We turn to such construct interpretations later in this chapter.

2.6 Controlling Irrelevant Variance

As is the case in many areas of inquiry, a kind of negative reasoning can play an important role in validation of trait interpretations. Tests are generally developed to yield a particular score interpretation and often a particular use, and the test development efforts make a case for the interpretation and use (Mislevy et al. 2002). Once this initial positive case has been made, it can be evaluated by subjecting it to empirical challenge. We can have confidence in claims that have survived all serious challenges.

To the extent that an alternate proposal is as plausible as, or more plausible than, a proposed trait interpretation, we cannot have much confidence in the intended interpretation. This notion, which is a fundamental methodological precept in science (Popper 1965), underlies, for example, multitrait–multimethod analyses (D. T. Campbell and Fiske 1959) and the assumption that reliability is a necessary condition for validity. As a result, to the extent that we can eliminate alternative interpretations of test scores, the proposed interpretation becomes more plausible, and if we can eliminate all plausible rivals for a proposed trait interpretation, we can accept that interpretation (at least for the time being).

In most assessment contexts, the question is not whether an assessment measures the trait or some alternate variable but rather the extent to which the assessment measures the trait of interest and is not overly influenced by sources of irrelevant variance. In their efforts to develop measures of various traits, ETS researchers have examined many potential sources of irrelevant variance, including anxiety (French 1962; Powers 1988, 2001), response styles (Damarin and Messick 1965), coaching (Messick 1981b, 1982a; Messick and Jungeblut 1981), and stereotype threat (Stricker 2008; Stricker and Bejar 2004; Stricker and Ward 2004). Messick (1975, 1989) made the evaluation of plausible sources of irrelevant variance a cornerstone of validation, and he made the evaluation of construct-irrelevant variance and construct underrepresentation central concerns in his unified model of validity.

It is, of course, desirable to neutralize potential sources of irrelevant variance before tests are administered operationally, and ETS has paid a lot of attention to the development and implementation of item analysis methodology, classical and IRT-based, designed to minimize irrelevant variance associated with systematic errors and random errors. ETS has played a particularly important role in the development of methods for the detection of differential item functioning (DIF), in which particular items function differently across groups of test takers who have been matched on the ability of interest, thereby introducing systematic differences that may not reflect real differences in the trait of interest (Dorans 1989, 2004, Chap. 7, this volume; Dorans and Holland 1993; Holland and Wainer 1993; Zieky 1993, 2011).
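
One widely used statistic from this line of work is the Mantel–Haenszel approach, in which focal and reference groups are compared within levels of a matching variable (usually total score) and the common odds ratio is reported on ETS’s delta scale. The sketch below is a minimal illustration with a hypothetical data layout, not production DIF code.

```python
import numpy as np

def mh_d_dif(tables):
    """Mantel-Haenszel D-DIF for one item.
    `tables` holds one 2x2 array per matched score level:
        [[reference_right, reference_wrong],
         [focal_right,     focal_wrong]]
    """
    num = sum(t[0, 0] * t[1, 1] / t.sum() for t in tables)   # ref right * focal wrong / N
    den = sum(t[0, 1] * t[1, 0] / t.sum() for t in tables)   # ref wrong * focal right / N
    alpha_mh = num / den                                     # common odds ratio
    return -2.35 * np.log(alpha_mh)                          # delta-scale D-DIF

# Hypothetical counts at three score levels; values near zero suggest negligible DIF,
# and large negative values suggest DIF against the focal group.
tables = [np.array([[40.0, 10.0], [35.0, 15.0]]),
          np.array([[60.0, 20.0], [55.0, 25.0]]),
          np.array([[30.0, 30.0], [25.0, 35.0]])]
print(round(mh_d_dif(tables), 2))
```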

Trait interpretations continue to play a major role in the interpretation and validation of test scores (Mislevy 2009). As discussed earlier, trait interpretations are closely tied to domains of possible test performances, and these domains provide guidance for the development of assessment procedures that are likely to support their intended function. In addition, trait interpretations can be combined with substantive assumptions about the trait and the trait’s relationships to other variables, thus going beyond the basic trait interpretation in terms of a domain of behaviors or performances to an interpretation in terms of a theoretical construct (Messick 1989; Mislevy et al. 2002).

3 Validity of Score-Based Predictions

Between 1920 and 1950, test scores came to be used to predict future outcomes and to estimate concurrent criteria that were of practical interest but were not easily observed, and the validity of such criterion-based interpretations came to be evaluated mainly in terms of how well the test scores predicted the criterion (Angoff 1988; Cronbach 1971; Kane 2012; Messick 1988, 1989; Zwick 2006). In the first edition of Educational Measurement, which was written as ETS was being founded, Cureton (1951) associated validity with “the correlation between the actual test scores and the ‘true’ criterion score” (p. 623), which would be estimated by the correlation between the test scores and the criterion scores, with an adjustment for unreliability in the criterion.
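
The standard adjustment for unreliability in the criterion is the classical correction for attenuation, stated here only for illustration: if r_xy is the observed correlation between test and criterion scores and r_yy' is the reliability of the criterion measure, the estimated correlation with the “true” criterion score is

```latex
r_{xT_y} = \frac{r_{xy}}{\sqrt{r_{yy'}}}
```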

The criterion variable of interest was assumed to have a definite value for each person, which was reflected by the criterion measure, and the test scores were to “predict” these values as accurately as possible (Gulliksen 1950b). Given this interpretation of the test scores as stand-ins for the true criterion measure, it was natural to evaluate validity in terms of the correlation between test scores and criterion scores:

Reliability has been regarded as the correlation of a given test with a parallel form. Correspondingly, the validity of a test is the correlation of the test with some criterion. In this sense a test has a great many different “validities.” (Gulliksen 1950b, p. 88)

The criterion scores might be obtained at about the same time as the test scores (“concurrent validity”), or they might be a measure of future performance (e.g., on the job, in college), which was not available at the time of testing (“predictive validity”). If a good criterion were available, the criterion model could provide simple and elegant estimates of the extent to which scores could be used to estimate or predict criterion scores (Cureton 1951; Gulliksen 1950b; Lord and Novick 1968). For admissions, placement, and employment, the criterion model is still an essential source of validity evidence. In these applications, criterion-related inferences are core elements in the proposed interpretations and uses of the test scores. Once the criterion is specified and appropriate data are collected, a criterion-based validity coefficient can be estimated in a straightforward way.

As noted earlier, the criterion model was well developed and widely deployed by the late 1940s, when ETS was founded (Gulliksen 1950b). Work at ETS contributed to the further development of these models in two important ways: by improving the accuracy and generality of the statistical models and frameworks used to estimate various criteria (N. Burton and Wang 2005; Moses, Chaps. 2 and 3, this volume) and by embedding the criterion model in a more comprehensive analysis of the plausibility of the proposed interpretation and use of test scores (Messick 1981a, 1989). The criterion model can be implemented more or less mechanically once the criterion has been defined, but the specification of the criterion typically involves value judgments and a consideration of consequences (Messick 1989).

Much of the early research at ETS addressed the practical issues of developing testing programs and criterion-related validity evidence, but from the beginning, researchers were also tackling more general questions about the effective use of standardized tests in education. The criterion of interest was viewed as a measure of a trait, and the test was conceived of as a measure of another trait that was related to the criterion trait, as an aptitude is related to subsequent achievement. As discussed more fully in a later section, ETS researchers conducted extensive research on the factors that tend to have an impact on the correlations of predictors (particularly SAT® scores) with criteria (e.g., first-year college grades), which served as measures of academic achievement (Willingham et al. 1990).

In the 1940s and 1950s, there was a strong interest in measuring both cognitive and noncognitive traits (French 1948). One major outcome of this extensive research program was the finding that cognitive measures (test scores, grades) provided fairly accurate predictions of performance in institutions of higher education and that the wide range of noncognitive measures that were evaluated did not add much to the accuracy of the predictions (Willingham et al. 1990).

As noted by Zwick (2006), the validity of tests for selection has been judged largely in terms of how well the test scores can predict some later criterion of interest. This made sense in 1950, and it continues to make sense into the twenty-first century. The basic role of criterion-related validity evidence in evaluating the accuracy of such predictions continues to be important for the validity of any interpretation or use that relies on predictions of future performance (Kane 2013a), but these paradigm cases of prediction now tend to be evaluated in a broader theoretical context (Messick 1989) and from a broader set of perspectives (Dorans 2012; Holland 1994; Kane 2013b). In this broader context, the accuracy of predictions continues to be important, but concerns about fairness and utility are getting more attention than they got before the 1970s.

4 Validity and Fairness

Before the 1950s, the fairness of testing programs tended to be evaluated mainly in terms of equivalent or comparable treatment of test takers. This kind of procedural fairness was supported by standardizing test administration, materials, scoring, and conditions of observation, as a way of eliminating favoritism or bias; this approach is illustrated in the civil service testing programs, in licensure programs, and in standardized educational tests (Porter 2003). It is also the standard definition of fairness in sporting events and other competitions and is often discussed in terms of candidates competing on “a level playing field.” Before the 1950s, this very basic notion of fairness in testing programs was evaluated mainly at the individual level; each test taker was to be treated in the same way, or if some adjustment were necessary (e.g., due to a candidate’s disability or a logistical issue), as consistently as possible. In the 1950s, 1960s, and 1970s, the civil rights movement, legislation, and litigation raised a broader set of fairness issues, particularly issues of fair treatment of groups that had suffered discrimination in the past (Cole and Moss 1989; Willingham 1999).

With respect to the treatment of groups, concerns about fairness and equal opportunity prior to this period did exist but were far more narrowly defined. One of the goals of James Conant and others in promoting the use of the Scholastic Aptitude Test was to expand the pool of students admitted to major universities by giving all high school students an opportunity to be evaluated in terms of their aptitude and not just in terms of the schools they attended or the curriculum that they had experienced. As president of Harvard in the 1930s, Conant found that most of Harvard’s students were drawn from a small set of elite prep schools and that the College Board examinations, as they then existed, evaluated mastery of prep school curricula (Bennett, Chap. 1, this volume): “For Conant, Harvard admission was being based largely on ability to pay. If a student could not afford to attend prep school, that student was not going to do well on the College Boards, and wasn’t coming to Harvard” (p. 5). In 1947, when ETS was founded, standardized tests were seen as a potentially important tool for improving fairness in college admissions and other contexts, at least for students from diverse economic backgrounds. The broader issues of adverse impact and fairness as they related to members of ethnic, racial, and gender groups had not yet come into focus.

Those broader issues of racial, ethnic, and gender fairness and bias moved to center stage in the 1960s:

Hard as it now may be to imagine, measurement specialists more or less discovered group-based test fairness as a major issue only some 30 years ago. Certainly, prior to that time, there was discussion of the cultural fairness of a test and its appropriateness for some examinees, but it was the Civil Rights movement in the 1960s that gave social identity and political dimension to the topic. That was the period when hard questions were first asked as to whether the egalitarian belief in testing was justified in the face of observed subgroup differences in test performance. The public and test specialists alike asked whether tests were inherently biased against some groups, particularly Black and Hispanic examinees. (Willingham 1999, p. 214)

As our conceptions of fairness and bias in testing expanded between the 1960s and the present, ETS played a major role in defining the broader notions of fairness and bias in testing. ETS researchers developed frameworks for evaluating fairness issues, and they developed and implemented methodology to control bias and promote fairness. These frameworks recognized the value of consistent treatment of individual test takers, but they focused on a more general conception of equitable treatment of individuals and groups (J. Campbell 1964; Anne Cleary 1968; Cleary and Hilton 1966; Cole 1973; Cole and Moss 1989; Dorans, Chap. 7, this volume; Frederiksen 1984; Linn 1973, 1975, 1976; Linn and Werts 1971; Messick 1975, 1980, 1989; Wild and Dwyer 1980; Willingham and Cole 1997; Xi 2010).

4.1 Fairness and Bias

Although the terms fairness and bias can be interpreted as covering roughly the same ground, with fairness being defined as the absence of bias, fairness often reflects a broader set of issues, including the larger issues of social equity. In contrast, bias may be given a narrower and more technical interpretation in terms of irrelevant factors that distort the interpretation of test scores:

The word fairness suggests fairness that comes from impartiality, lacking in prejudice or favoritism. This implies that a fair test is comparable from person to person and group to group. Comparable in what respect? The most reasonable answer is validity, since validity is the raison d’etre of the entire assessment enterprise. (Willingham 1999, p. 220)

In its broadest uses, fairness tends to be viewed as an ethical and social issue concerned with “the justice and impartiality inherent in actions” (Willingham 1999, p. 221). Bias, conversely, is often employed as a technical concept, akin to the notion of bias in the estimation of a statistical parameter. For example, Cole and Moss (1989) defined bias as the “differential validity of a particular interpretation of a test score for any definable, relevant group of test takers” (p. 205).

Standardized testing programs are designed to treat all test takers in the same way (or if accommodations are needed, in comparable ways), thereby eliminating as many sources of irrelevant variance as possible. By definition, to the extent that testing materials or conditions are not standardized, they can vary from test taker to test taker and from one test administration to another, thereby introducing irrelevant variance, or bias, into test scores. Much of this irrelevant variance would be essentially random, but some of it would be systematic in the sense that some test scores (e.g., those from a test site with an especially lenient or especially severe proctor) would be consistently too high or too low. Standardization also tends to control some kinds of intentional favoritism or negative bias by mandating consistent treatment of all test takers. Test scores that consistently underestimate or overestimate the variable of interest for a subgroup for any reason are said to be biased, and standardization tends to control this kind of bias, whether it is inadvertent or intentional.

ETS and other testing organizations have developed systematic procedures designed to identify and eliminate any aspects of item content or presentation that might have an undue effect on the performance of some test takers: “According to the guidelines used at ETS, for example, the test ‘must not contain language, symbols, words, phrases, or examples that are generally regarded as sexist, racist, or otherwise potentially offensive, inappropriate, or negative toward any group’” (Zwick 2006, p. 656). Nevertheless, over time, there was a growing realization that treating everyone in the same way does not necessarily ensure fairness or lack of bias. It is a good place to start (particularly as a way to control opportunities for favoritism, racism, and other forms of more or less overt bias), but it does not fully resolve the issue. As Turnbull (1951) and others recognized from mid-century, fairness depends on the appropriateness of the uses of test scores, and test scores that provide unbiased measures of a particular set of skills may not provide unbiased measures of a broader domain of skills needed in some context (e.g., in an occupation or in an educational program). In such cases, those test scores may not provide a fair basis for making decisions about test takers (Shimberg 1981, 1982, 1990).

Over the last 65 years or so, ETS researchers have been active in investigating questions about bias and fairness in testing, in defining issues of fairness and bias, and in developing approaches for minimizing bias and for enhancing fairness. Many of the issues are still not fully resolved, in part because questions of bias depend on the intended interpretation and because questions of fairness depend on values.

4.2 Adverse Impact and Differential Prediction

Unless we are willing to assume, a priori, that there are no differences between groups in the characteristic being measured, simple differences between groups in average scores or the percentages of test takers achieving some criterion score do not necessarily say anything about the fairness of test scores or of score uses. In 1971, the U.S. Supreme Court, in Griggs v. Duke Power Co., struck down some employment practices at the Duke Power Company that had led to substantially different hiring rates between Black and White applicants, and in its decision, the Court relied on two concepts, adverse impact and business necessity, that have come to play an important role in discussions of possible bias in score-based selection programs. Adverse impact occurs if a protected group (defined by race, ethnicity, or gender, as specified in civil rights legislation) has a substantially lower rate of selection, certification, or promotion compared to the group with the highest rate. A testing program has business necessity if the scores are shown to be related to some important outcome (e.g., some measure of performance on the job). A testing program with adverse impact against one or more protected groups was required to demonstrate business necessity for the testing program; if there was no adverse impact, there was no requirement to establish business necessity. Employers and other organizations using test scores for selection would either have to develop selection programs that had little adverse impact or would have to demonstrate business necessity (Linn 1972). In Griggs, Duke Power’s testing program was struck down because it had substantial adverse impact, and the company had made no attempt to investigate the relationship between test scores and performance on the job.
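
In practice, “substantially lower” is often operationalized by comparing group selection rates, as in the four-fifths rule later adopted in federal enforcement guidelines (an operationalization not discussed in the text); the sketch below, with hypothetical counts, shows the basic computation.

```python
def impact_ratios(selected, applicants):
    """Selection rate of each group divided by the highest group rate.
    `selected` and `applicants` are dicts keyed by group label."""
    rates = {g: selected[g] / applicants[g] for g in applicants}
    highest = max(rates.values())
    return {g: rate / highest for g, rate in rates.items()}

# Hypothetical data: group B's selection rate is well below 80% of group A's,
# the threshold commonly used to flag possible adverse impact for further review.
print(impact_ratios({"A": 60, "B": 20}, {"A": 100, "B": 80}))
# {'A': 1.0, 'B': 0.416...}
```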

Although the terminology of adverse impact and business necessity was not in common use before Griggs, the notion that test scores can be considered fair if they reflect real differences in performance, even if they also suffer from adverse impact, was not new. Turnbull (1951) had pointed out the importance of evaluating fairness in terms of the proposed interpretation and use of the scores:

That method is to define the criterion to which a test is intended to relate, and then to justify inter-group equality or inequality of test scores on the basis of its effect on prediction. It is necessarily true that an equality of test scores that would signify fairness of measurement for one criterion on which cultural groups performed alike would signify unfairness for another criterion on which group performance differed. Fairness, like its amoral brother, validity, resides not in tests or test scores but in the relation of test scores to criteria . (pp. 148–149)

Adverse impact does not necessarily say much about fairness, but it does act as a trigger that suggests that the relationship between test scores and appropriate criteria be evaluated (Dorans, Chap. 7, this volume; Linn 1973, 1975; Linn and Werts 1971; Messick 1989) .

By 1971, when the Griggs decision was rendered, Cleary (1968) had already published her classic study of differential prediction, which was followed by a number of differential-prediction studies at ETS and elsewhere. The Cleary model stipulated that

a test is biased for members of a subgroup of the population if, in the prediction of a criterion for which the test is designed, consistent nonzero errors of prediction are made for members of the subgroup. In other words, the test is biased if the criterion score predicted from the common regression line is consistently too high or too low for members of the subgroup. (p. 115)

The Cleary criterion is simple, clear, and direct: if the scores underpredict or overpredict the relevant criterion for a subgroup, the predictions can be considered biased. Note that although Cleary talked about the test being biased, her criterion applies to the predictions based on the scores and not directly to the test or test scores. In fact, the predictions can be biased in Cleary's sense even when the test scores themselves are unbiased, and the predictions can be unbiased in Cleary's sense even when the test scores are biased (Zwick 2006). Nevertheless, assuming that the criterion measure is appropriate and unbiased (which can be a contentious assumption in many contexts; e.g., see Linn 1976; Wild and Dwyer 1980), the comparison of regressions made sense as a way to evaluate predictive bias.
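In practice, the Cleary criterion can be checked by fitting a regression of the criterion on the test scores for the combined group and then examining the average prediction error separately for each subgroup. The sketch below illustrates the logic with simulated data; the data-generating assumptions and variable names are ours, not Cleary's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: test scores (x) and criterion scores (y, e.g., first-year GPA)
# for two hypothetical subgroups.  In this simulation the same relationship
# holds in both groups, so the common regression should show little bias.
n = 2000
group = rng.integers(0, 2, size=n)           # 0 = reference, 1 = focal
x = rng.normal(loc=np.where(group == 0, 0.0, -0.5), scale=1.0)
y = 0.6 * x + rng.normal(scale=0.8, size=n)  # same true relation in both groups

# Common (pooled) regression of the criterion on the test score.
slope, intercept = np.polyfit(x, y, deg=1)
errors = y - (intercept + slope * x)         # positive mean => underprediction

for g in (0, 1):
    print(f"group {g}: mean prediction error = {errors[group == g].mean():+.3f}")
# Under the Cleary criterion, consistently nonzero mean errors for a subgroup
# would indicate biased predictions for that subgroup.
```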

However, as a criterion for evaluating bias in the test scores, the comparison of regression lines is problematic for a number of reasons. Linn and Werts (1971) pointed out two basic statistical problems with the Cleary model: the comparisons of the regression lines can be severely distorted by errors of measurement in the independent variable (or variables) and by the omission of relevant predictor variables. Earlier, Lord (1967) had pointed to an ambiguity in interpreting differential-prediction analyses for groups with different means on the two measures when the measures are less than perfectly reliable or relevant predictors have been omitted.
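The measurement-error problem can be illustrated with a small simulation: even when a test is an unbiased but unreliable measure of the relevant attribute and the same true relationship holds in both groups, a common regression on the observed scores will systematically misestimate the criterion for groups with different means. The setup below is a hypothetical sketch of that point, not a reanalysis of any of the studies cited here.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# True standing on the attribute differs in mean across two hypothetical groups.
group = rng.integers(0, 2, size=n)                  # 0 = higher-mean, 1 = lower-mean
true_score = rng.normal(loc=np.where(group == 0, 0.0, -1.0), scale=1.0)

# The criterion depends only on the true score (same relation in both groups),
# and the observed test score is the true score plus random measurement error.
criterion = true_score + rng.normal(scale=0.5, size=n)
observed = true_score + rng.normal(scale=0.8, size=n)   # unreliable predictor

slope, intercept = np.polyfit(observed, criterion, deg=1)
errors = criterion - (intercept + slope * observed)

for g in (0, 1):
    print(f"group {g}: mean prediction error = {errors[group == g].mean():+.3f}")
# The lower-scoring group is overpredicted (negative mean error) even though
# the simulated test measures the attribute without any group-related bias.
```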

In the 1970s, a concerted effort was made by many researchers to develop models of fairness that would make it possible to identify and remove (or at least ameliorate) group inequities in score-based decision procedures, and ETS researchers were heavily involved in these efforts (Linn 1973, 1984; Linn and Werts 1971; Myers 1975; Petersen and Novick 1976) . These efforts raised substantive questions about what we might mean by fairness in selection, but by the early 1980s, interest in this line of research had declined for several reasons.

First, a major impetus for the development of these models was the belief in the late 1960s that at least part of the explanation for the observed disparities in test scores across groups was to be found in the properties of the test. The assumption was that cultural differences and differences in educational and social opportunities caused minority test takers to be less familiar with certain content and to be less adept at taking objective tests, and therefore the test scores were expected to underpredict performance in nontest settings (e.g., on the job, in various educational programs). Many of the fairness models were designed to adjust for inequities (defined in various ways) that were expected to result from the anticipated underprediction of performance. However, empirical results indicated that the test scores did not underpredict the criterion performance of minority test takers but rather overpredicted the performance of Black and Hispanic students on standard criteria, particularly first-year grade point average (GPA) in college (Cleary 1968; Young 2004; Zwick 2006). The test scores did underpredict the criterion performance of women, but this difference was due in part to differences in the courses taken (Wild and Dwyer 1980; Zwick 2006).

Second, Petersen and Novick (1976) pointed out some basic inconsistencies in the structures of the fairness models and suggested that resolving these discrepancies would require explicitly incorporating assumptions about the relative utilities of different outcomes for different test takers. However, it was not clear how to specify such utilities, and it was even less clear how to get all interested stakeholders to agree on a specific set of them.

As the technical difficulties mounted (Linn 1984; Linn and Werts 1971; Petersen and Novick 1976) and the original impetus for the development of the models (i.e., underprediction for minorities) turned out to be wrong (Cleary 1968; Linn 1984), interest in the models proposed to correct for underprediction faded.

An underlying concern in evaluating fairness was (and is) the acknowledged weakness of the criterion measures (Wild and Dwyer 1980). The criterion measures tend to be less reliable than the tests being evaluated and are often proxy measures of success chosen in large part because of their ready availability; in addition, there is evidence that the criteria are themselves not free of bias (Wild and Dwyer 1980).

One major result of this extended research program is a clear realization that fairness and bias are very complex, multifaceted issues that cannot be easily reduced to a formal model of fairness or evaluated by straightforward statistical analyses (Cole and Moss 1989; Messick 1989; Wild and Dwyer 1980) : “The institutions and professionals who sponsor and use tests have one view as to what is fair; examinees have another. They will not necessarily always agree, though both have a legitimate claim” (Willingham 1999, p. 224) . Holland (1994) and Dorans (2012) suggested that analyses of test score fairness should go beyond the measurement perspective, which tends to focus on the elimination or reduction of construct-irrelevant variance (or measurement bias), to include the test taker’s perspective, which tends to view tests as “contests,” and Kane (2013b) has suggested adding an institutional perspective, which has a strong interest in eliminating any identifiable source of bias but also has an interest in reducing adverse impact, whether it is due to an identifiable source of bias or not.

4.3 Differential Item Functioning

ETS played a major role in the introduction of DIF methods as a way to promote fairness in testing programs (Dorans and Holland 1993; Holland and Thayer 1988) . These methods identify test items that, after matching on an estimate of the attribute of interest, are differentially difficult or easy for a target group of test takers, as compared to some reference group . ETS pioneered the development of DIF methodology, including the development of the most widely used methods, as well as investigations of the statistical properties of these methods, matching variables, and sample sizes (Dorans , Chap. 7, this volume; Holland and Wainer 1993) .

DIF analyses are designed to differentiate, across groups, between real differences in the construct being measured and sources of group-related construct-irrelevant variance. Different groups are not compared in terms of their overall differences in performance but rather in terms of differences in performance on each item, given test takers' standing on the construct being measured, as indicated by their total scores on the test (or some other relevant matching variable). DIF analyses provide an especially appealing way to address fairness issues, because the data required for DIF analyses (i.e., item responses and test scores) are readily available for most standardized testing programs and because DIF analyses provide a direct way to decrease construct-irrelevant differential impact (by avoiding the use of items with high DIF).

Zieky (2011) has provided a particularly interesting and informative analysis of the origins of DIF methodology. As noted earlier, from ETS's inception, its research staff had been concerned about fairness issues and had been actively investigating group differences in performance since the 1960s (Angoff and Ford 1973; Angoff and Sharon 1974; Cardall and Coffman 1964; Cleary 1968), but no fully adequate methodology for addressing group differences at the item level had been identified. The impetus to overcome the many obstacles to an effective, operational implementation of DIF came in the early 1980s:

In 1984, ETS settled a lawsuit with the Golden Rule Insurance Company by agreeing to use raw differences in the percentages correct on an item in deciding on which items to include in a test to license insurance agents in Illinois; if two items were available that both met test specifications, the item with the smallest black-white difference in percentage correct was to be used; any difference in the percentages was treated as bias “even if it were caused by real and relevant differences between the groups in average knowledge of the tested subject.” (Zieky 2011, p. 116)

The Golden Rule procedure was seen as causing limited harm in a minimum-competency licensing context but was seen as much more problematic in other contexts in which candidates would be ranked in terms of cognitive abilities or achievement, and concern grew that test quality would suffer if test developers were required to use only items “with the smallest raw differences in percent correct between Black and White test takers , regardless of the causes of these differences” (Zieky 2011, pp. 117–118) :

The goal was an empirical means of distinguishing between real group differences in the knowledge and skill measured by the test and unfair differences inadvertently caused by biased aspects of items. Test developers wanted help in ensuring that items were fair, but each method tried so far either had methodological difficulties or was too unwieldy to use on an operational basis with a wide variety of tests and several groups of test takers. The threat of legislation that would mandate use of the Golden Rule procedure for all tests further motivated ETS staff members to adopt a practical measure of DIF. (p. 118)

In response, ETS researchers (e.g., Dorans and Holland 1993; Holland and Thayer 1988) developed procedures that evaluated differential group performance, conditional on test takers’ relative standing on the attribute of interest. The DIF methodology developed at ETS is now widely used in testing programs that aid in making high-stakes decisions throughout the world.
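A minimal sketch of the core computation in one such procedure, the Mantel–Haenszel statistic described by Holland and Thayer (1988), appears below. In contrast to the Golden Rule reliance on raw percentage-correct differences, the statistic compares the odds of answering an item correctly for the two groups within strata of test takers matched on total score. The data are simulated and the implementation is simplified; operational versions involve refinements (e.g., refinement of the matching criterion and standard error estimation) that are not shown.

```python
import numpy as np

def mantel_haenszel_dif(correct, group, total_score):
    """Common odds ratio for one item, matching reference (0) and focal (1)
    group members on total test score.  Inputs are 1-D arrays of equal length:
    correct (0/1 item score), group (0/1), total_score (matching variable)."""
    num, den = 0.0, 0.0
    for s in np.unique(total_score):                 # one 2x2 table per stratum
        in_stratum = total_score == s
        a = np.sum(in_stratum & (group == 0) & (correct == 1))  # ref correct
        b = np.sum(in_stratum & (group == 0) & (correct == 0))  # ref incorrect
        c = np.sum(in_stratum & (group == 1) & (correct == 1))  # focal correct
        d = np.sum(in_stratum & (group == 1) & (correct == 0))  # focal incorrect
        n_s = a + b + c + d
        if n_s == 0:
            continue
        num += a * d / n_s
        den += b * c / n_s
    alpha = num / den                                # MH common odds ratio
    # ETS reports DIF on the delta scale; -2.35 * ln(alpha) is the usual transform.
    return alpha, -2.35 * np.log(alpha)

# Hypothetical usage with simulated, DIF-free data:
rng = np.random.default_rng(2)
n = 5000
group = rng.integers(0, 2, size=n)
total_score = rng.integers(0, 21, size=n)             # matching variable
p_correct = 0.2 + 0.03 * total_score                   # same relation in both groups
correct = (rng.random(n) < p_correct).astype(int)
alpha, mh_d_dif = mantel_haenszel_dif(correct, group, total_score)
print(f"MH odds ratio = {alpha:.2f}, MH D-DIF = {mh_d_dif:+.2f}")  # near 0 => little DIF
```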

E. Burton and Burton (1993) found that the differences in scores across groups did not narrow substantially after the implementation of DIF analyses. Test items are routinely screened for sensitivity and other possible sources of differential functioning before administration, and relatively few items are flagged by the DIF statistics. As Zwick (2006) noted,

even in the absence of evidence that it affects overall scores, … DIF screening is important as a precaution against the inclusion of unreasonable test content and as a source of information that can contribute to the construction of better tests in the future. (p. 668)

DIF screening addresses an issue that has to be confronted for psychometric and ethical reasons. That these checks on the quality of test items turn up relatively few cases of questionable item content is an indication that the item development and screening procedures are working as intended.

4.4 Identifying and Addressing Specific Threats to Fairness/Validity

As illustrated in the two previous subsections, much of the research on fairness at ETS, and more generally in the measurement research community, has focused on the identification and estimation of differential impact and potential bias in prediction and selection, a global issue, and on DIF, which addresses particular group-specific item effects that can generate adverse impact or bias. However, some researchers have sought to address other potential threats to fairness and, therefore, to validity.

Xi (2010) pointed out that fairness is essential to validity and validity is essential to fairness. If we define validity in terms of the appropriateness of proposed interpretations and uses of scores, and fairness in terms of the appropriateness of proposed interpretations and uses of scores across groups, then fairness would be a necessary condition for validity; if we define fairness broadly in terms of social justice, then validity would be a necessary condition for fairness. Either way, the two concepts are closely related; as noted earlier, Turnbull referred to validity as the “amoral brother” of fairness (Dorans, Chap. 7, this volume; Turnbull 1951).

Xi (2010) combined fairness and validity in a common framework by evaluating fairness as comparable validity across groups within the population of interest. She proposed a fairness argument that would identify and evaluate fairness-based objections to proposed interpretations and uses of the test scores, focusing on whether an interpretation is equally plausible for different groups and whether the decision rules are appropriate for those groups. Once the inferences and assumptions inherent in the proposed interpretation and use of the test scores have been specified, they can be evaluated in terms of whether they apply equally well to different groups. For example, it can be difficult to detect construct underrepresentation in a testing program by qualitatively evaluating how well the content of the test represents the content of a relevant domain, but empirical results indicating that there are substantial differences across groups in the relationship between performance on the test and more thorough measures of performance in the domain as a whole could raise serious questions about the representativeness of the test content. This argument-based approach can help to focus research on serious, specific threats to fairness/validity (Messick 1989).
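One simple form of the empirical check described above is to compare, across groups, the strength of the relationship between test scores and a more thorough measure of domain performance. The sketch below uses simulated data and hypothetical variable names; a real fairness argument would, of course, draw on a much wider range of evidence.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3000

group = rng.integers(0, 2, size=n)                    # two hypothetical groups
domain_score = rng.normal(size=n)                     # thorough measure of the domain
# Suppose the test captures the domain well for group 0 but less well for group 1
# (e.g., because of construct underrepresentation for that group).
noise_sd = np.where(group == 0, 0.5, 1.2)
test_score = domain_score + rng.normal(scale=noise_sd)

for g in (0, 1):
    r = np.corrcoef(test_score[group == g], domain_score[group == g])[0, 1]
    print(f"group {g}: test-domain correlation = {r:.2f}")
# A substantially weaker relationship for one group would raise questions about
# how well the test represents the domain for that group.
```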

Dorans and colleagues (Dorans, Chap. 7, this volume; Dorans and Holland 2000; Holland and Dorans 2006) have addressed threats to fairness/validity that can arise in scaling /equating test scores across different forms of a test:

Scores on different forms or editions of a test that are supposed to be used interchangeably should be related to each other in the same way across different subpopulations. Score equity assessment (SEA) uses subpopulation invariance of linking functions across important subpopulations to assess the interchangeability of the scores. (Dorans, Chap. 7, this volume)

If the different forms of the test are measuring the same construct or combination of attributes in the different subpopulations, the equating function should not depend on the subpopulation on which it is estimated, and

one way to demonstrate that two test forms are not equatable is to show that the equating functions used to link their scores are not invariant across different subpopulations of examinees. Lack of invariance in a linking function indicates that the differential difficulty of the two test forms is not consistent across different groups. (Dorans, Chap. 7, this volume)

SEA uses the invariance of the linking function across groups to evaluate the consistency of the proposed score interpretation across those groups and, thereby, the validity of that interpretation.
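As a concrete, much simplified illustration, a linking function can be estimated separately in each subpopulation and compared with the function estimated in the total population. The sketch below uses linear linking (matching means and standard deviations) on simulated single-group data; operational SEA work relies on more refined equating methods and invariance indices.

```python
import numpy as np

def linear_link(x_scores, y_scores):
    """Return a linear function mapping form-X scores onto the form-Y scale by
    matching means and standard deviations (a simplified linking function)."""
    mx, sx = x_scores.mean(), x_scores.std()
    my, sy = y_scores.mean(), y_scores.std()
    return lambda x: my + (sy / sx) * (x - mx)

rng = np.random.default_rng(4)
n = 4000
group = rng.integers(0, 2, size=n)                        # two subpopulations
ability = rng.normal(loc=np.where(group == 0, 0.0, -0.4), scale=1.0)

# Scores on two forms intended to be used interchangeably (hypothetical scales).
form_x = 50 + 10 * ability + rng.normal(scale=3, size=n)
form_y = 48 + 11 * ability + rng.normal(scale=3, size=n)

total_link = linear_link(form_x, form_y)
grid = np.linspace(form_x.min(), form_x.max(), 41)        # score points to compare

for g in (0, 1):
    sub_link = linear_link(form_x[group == g], form_y[group == g])
    rmsd = np.sqrt(np.mean((sub_link(grid) - total_link(grid)) ** 2))
    print(f"subpopulation {g}: RMSD from total-population linking = {rmsd:.2f}")
# Large differences between subpopulation and total-population linking functions
# would signal a lack of score equity across those groups.
```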

Mislevy et al. (2013) sought to develop systematic procedures for minimizing threats to fairness due to specific construct-irrelevant sources of variance in the assessment materials or procedures. To the extent that a threat to validity can be identified in advance or concurrently, the threat could be eliminated by suitably modifying the materials or procedures; for example, if it is found that English language learners have not had a chance to learn specific nontechnical vocabulary in a mathematics item, that vocabulary could be changed or the specific words could be defined. Mislevy et al. combined the general methodology of “universal design” with the ECD framework . In doing so, they made use of M. von Davier’s (2008) general diagnostic model as a psychometric framework to identify specific requirements in test tasks. Willingham (1999) argued that test uses would be likely to be fairer across groups if “the implications of design alternatives are carefully examined at the outset” (p. 235) but recognized that this examination would be difficult to do “without much more knowledge of subgroup strengths and weaknesses… than is normally available” (p. 236). Mislevy et al. (2013) have been working to develop the kind of knowledge needed to build more fairness into testing procedures from the design stage.

5 Messick’s Unified Model of Construct Validity

Samuel Messick spent essentially all of his professional life at ETS, and during his long and productive career, he made important contributions to many parts of test theory and to ETS testing programs, some of which were mentioned earlier. In this section, we focus on his central role in the development of the construct validity model and its transformation into a comprehensive, unified model of validity (Messick 1975, 1988, 1989). Messick’s unified model pulled the divergent strands in validity theory into a coherent framework, based on a broad view of the meaning of test scores and the values and consequences associated with the scores, and in doing so, he gave the consequences of score use a prominent role.

Messick got his bachelor’s degree in psychology and natural sciences from the University of Pennsylvania in 1951, and he earned his doctorate from Princeton University in 1954, while serving as an ETS Psychometric Fellow. His doctoral dissertation, “The Perception of Attitude Relationships: A Multidimensional Scaling Approach to the Structuring of Social Attitudes,” reflected his dual interest in quantitative methods and in personality theory and social psychology . He completed postdoctoral fellowships at the University of Illinois, studying personality dynamics, and at the Menninger Foundation, where he did research on cognition and personality and received clinical training. He started as a full-time research psychologist at ETS in 1956, and he remained there until his death in 1998. Messick also served as a visiting lecturer at Princeton University on personality theory, abnormal psychology, and human factors between 1956 and 1958 and again in 1960–1961.

Messick completed his doctoral and postdoctoral work and started his career at ETS just as the initial version of construct validity was being developed (Cronbach and Meehl 1955) . As noted, he came to ETS with a strong background in personality theory (e.g., see Messick 1956, 1972), where constructs play a major role, and a strong background in quantitative methods (e.g., see Gulliksen and Messick 1960; Messick and Abelson 1957; Schiffman and Messick 1963) . Construct validity was originally proposed as a way to justify interpretations of test scores in terms of psychological constructs (Cronbach and Meehl 1955) , and as such, it focused on psychological theory. Subsequently, Loevinger (1957) suggested that the construct model could provide a framework for all of validity, and Messick made this suggestion a reality. Between the late 1960s and the 1990s, he developed a broadly defined construct-based framework for the validation of test score interpretations and uses; his unified framework had its most complete statement in his validity chapter in the third edition of Educational Measurement (Messick 1989).

As Messick pursued his career, he maintained his dual interest in psychological theory and quantitative methods , applying this broad background to problems in educational and psychological measurement (Jackson and Messick 1965; Jackson et al. 1957; Messick and Frederiksen 1958; Messick and Jackson 1958; Messick and Ross 1962) . He had close, long-term collaborations with a number of research psychologists (e.g., Douglas Jackson , Nathan Kogan , and Lawrence Stricker ). His long-term collaboration with Douglas Jackson, whom he met while they were both postdoctoral fellows at the Menninger Foundation, and with whom he coauthored more than 25 papers and chapters (Jackson 2002) , was particularly productive.

Messick’s evolving understanding of constructs, their measurement, and their vicissitudes was, no doubt, strongly influenced by his background in social psychology and personality theory and by his ongoing collaborations with colleagues with strong substantive interest in traits and their roles in psychological theory. His work reflected an ongoing concern about how to differentiate between constructs (Jackson and Messick 1958; Stricker et al. 1969) , between content and style (Jackson and Messick 1958; Messick 1962, 1991), and between constructs and potential sources of irrelevant variance (Messick 1962, 1964, 1981b; Messick and Jackson 1958).

Given his background and interests, it is not surprising that Messick became an “early adopter” of the construct validity model. Throughout his career, Messick tended to focus on two related questions: Is the test a good measure of the trait or construct of interest, and how can the test scores be appropriately used (Messick 1964, 1965, 1970, 1975, 1977, 1980, 1989, 1994a, b)? For measures of personality, he addressed the first of these questions in terms of “two critical properties for the evaluation of the purported personality measure … the measure’s reliability and its construct validity” (Messick 1964, p. 111). Even in cases where the primary interest is in predicting behavior as a basis for decision making, and therefore, where it is necessary to develop evidence for adequate predictive accuracy, he emphasized the importance of evaluating the construct validity of the scores:

Instead of talking about the reliability and construct validity (or even the empirical validity) of the test per se, it might be better to talk about the reliability and construct validity of the responses to the test, as summarized in a particular score, thereby emphasizing that these test properties are relative to the processes used by the subjects in responding. (Messick 1964, p. 112)

Messick also exhibited an abiding concern about ethical issues in research and practice throughout his career (Messick 1964, 1970, 1975, 1977, 1980, 1981b, 1988, 1989, 1998, 2000). In 1965, he examined some criticisms of psychological testing and discussed the possibilities for regulation and self-regulation for testing. He espoused “an ‘ethics of responsibility,’ in which pragmatic evaluations of the consequences of alternative actions form the basis for particular ethical decisions” (p. 140). Messick (1965) went on to suggest that policies based on values reflect and determine how we see the world, in addition to their intended regulatory effects, and he focused on “the value-laden nature of validity and fairness as psychometric concepts” (Messick 2000, p. 4) throughout his career. It is to this concern with meaning and values in measurement that we now turn.

5.1 Meaning and Values in Measurement

Messick was consistent in emphasizing ethical issues in testing, the importance of construct validity in evaluating meaning and ethical questions, and the need to consider consequences in evaluating test use: “But the ethical question of ‘Should these actions be taken?’ cannot be answered by a simple appeal to empirical validity alone. The various social consequences of these actions must be contended with” (Messick and Anderson 1970, p. 86) . In 1975, Messick published a seminal paper that focused on meaning and values in educational measurement and explored the central role of construct-based analyses in analyzing meaning and in anticipating consequences. In doing so, he sketched many of the themes that he would subsequently develop in more detail. The paper (Messick 1975) was the published version of his presidential speech to Division 5 (Evaluation and Measurement) of the American Psychological Association . The title, “The Standard Problem: Meaning and Values in Measurement and Evaluation,” indicates the intended breadth of the discussion and its main themes. As would be appropriate for such a speech, Messick focused on big issues in the field, and we will summarize five of these: (a) the central role of construct-based reasoning and analysis in validation , (b) the importance of ruling out alternate explanations, (c) the need to be precise about the intended interpretations, (d) the importance of consequences , and (e) the role of content-related evidence in validation.

First, Messick emphasized the central role of construct-based reasoning and analysis in validation . He started the paper by saying that any discussion of the meaning of a measure should center on construct validity as the “evidential basis” for inferring score meaning, and he associated construct validity with basic scientific practice:

Construct validation is the process of marshalling evidence in the form of theoretically relevant empirical relations to support the inference that an observed response consistency has a particular meaning. The problem of developing evidence to support an inferential leap from an observed consistency to a construct that accounts for that consistency is a generic concern of all science. (Messick 1975, p. 955)

A central theme in the 1975 paper is the interplay between theory and data. Messick suggested that, in contrast to concurrent, predictive, and content-based approaches to validation , each of which focused on a specific question, construct validation involves hypothesis testing and “all of the philosophical and empirical means by which scientific theories are evaluated” (p. 956). He wrote, “The process of construct validation, then, links a particular measure to a more general theoretical construct, usually an attribute or process or trait, that itself may be embedded in a more comprehensive theoretical network” (Messick 1975, p. 955). Messick took construct validation to define validation in the social sciences but saw education as slow in adopting this view. A good part of Messick’s (1975) exposition is devoted to suggestions for why education had not adopted the construct model more fully by the early 1970s and for why that field should expand its view of validation beyond simple content and predictive interpretations. He quoted Loevinger (1957) to the effect that, from a scientific point of view, construct validity is validity, but he went further, claiming that content and criterion analyses are not enough, even for applied decision making, and that “the meaning of the measure must also be pondered in order to evaluate responsibly the possible consequences of the proposed use” (Messick 1975, p. 956). Messick was not so much suggesting the adoption of a particular methodology but rather encouraging us to think deeply about meanings and consequences.

Second, Messick (1975) emphasized the importance of ruling out alternate explanations in evaluation and in validation . He suggested that it would be effective and efficient to

direct attention from the outset to vulnerabilities in the theory by formulating counterhypotheses, or plausible alternative interpretations of the observed consistencies. If repeated challenges from a variety of plausible rival hypotheses can be systematically discounted, then the original interpretation becomes more firmly grounded. (p. 956)

Messick emphasized the role of convergent/divergent analyses in ruling out alternative explanations of test scores.

This emphasis on critically evaluating proposed interpretations by empirically checking their implications was at the heart of Cronbach and Meehl's (1955) formulation of construct validity, and it reflects Popper's (1965) view that conjecture and refutation define the basic methodology of science. Messick's insistence on the importance of this approach probably originates less in the kind of philosophy of science relied on by Cronbach and Meehl and more in his training as a psychologist and in his ongoing collaborations with psychologists, such as Jackson, Kogan, and Stricker. Messick had a strong background in measurement and scaling theory (Messick and Abelson 1957), and he maintained his interest in these areas and in the philosophy of science throughout his career (e.g., see Messick 1989, pp. 21–34). His writings, however, strongly suggested a tendency to start with a substantive problem in psychology and then to bring methodology and "philosophical conceits" (Messick 1989) to bear on the problem, rather than to start with a method and look for problems to which it can be applied. For example, Messick (1984, 1989; Messick and Kogan 1963) viewed cognitive styles as attributes of interest and not simply as sources of irrelevant variance.

Third, Messick (1975) recognized the need to be precise about the intended interpretations of the test scores. If the extent to which the test scores reflect the intended construct, rather than sources of irrelevant variance, is to be investigated, it is necessary to be clear about what is and is not being claimed in the construct interpretation, and a clear understanding of what is being claimed helps to identify plausible competing hypotheses. For example, in discussing the limitations of a simple content-based argument for the validity of a dictated spelling test, Messick pointed out that

the inference of inability or incompetence from the absence of correct performance requires the elimination of a number of plausible rival hypotheses dealing with motivation , attention, deafness, and so forth. Thus, a report of failure to perform would be valid, but one of inability to perform would not necessarily be valid. The very use of the term inability invokes constructs of attribute and process, whereas a content-valid interpretation would stick to the outcomes. (p. 960)

To validate, or evaluate, the interpretation and use of the test scores, it is necessary to be clear about the meanings and values inherent in that interpretation and use.

Fourth, Messick (1975) gave substantial attention to values and consequences and suggested that, in considering any test use, two questions needed to be considered:

First, is the test any good as a measure of the characteristic it is interpreted to assess? Second, should the test be used for the proposed purpose? The first question is a technical and scientific one and may be answered by appraising evidence bearing on the test’s psychometric properties, especially construct validity. The second question is an ethical one, and its answer requires an evaluation of the potential consequences of the testing in terms of social values. We should be careful not to delude ourselves that answers to the first question are also sufficient answers to the second (except of course when a test’s poor psychometric properties preclude its use). (p. 960)

Messick saw meaning and values as intertwined: “Just as values play an important role in measurement, where meaning is the central issue, so should meaning play an important role in evaluation, where values are the central issue” (p. 962). On one hand, the meanings assigned to scores reflect the intended uses of the scores in making claims about test takers and in making decisions. Therefore the meanings depend on the values inherent in these interpretations and uses. On the other hand, an analysis of the meaning of scores is fundamental to an evaluation of consequences because (a) the value of an outcome depends in part on how it is achieved (Messick 1970, 1975) and (b) an understanding of the meaning of scores and the processes associated with performance is needed to anticipate unintended consequences as well as intended effects of score uses .

Fifth, Messick (1975) recognized that content representativeness is an important issue in test development and score interpretation but that, in itself, it cannot establish validity. For one thing, content coverage is a property of the test rather than of the scores:

The major problem here is that content validity in this restricted sense is focused upon test forms rather than test scores, upon instruments rather than measurements. Inferences in educational and psychological measurement are made from scores, … and scores are a function of subject responses. Any concept of validity of measurement must include reference to empirical consistency. (p. 960)

Messick suggested that Loevinger’s (1957) substantive component of validity, defined as the extent to which the construct to be measured by the test can account for the properties of the items included in the test, “involves a confrontation between content representativeness and response consistency” (p. 961). The empirical analyses can result in the exclusion of some items because of perceived defects, or these analyses may suggest that the conception of the trait and the corresponding domain may need to be modified:

These analyses offer evidence for the substantive component of construct validity to the extent that the resultant content of the test can be accounted for by the theory of the trait (along with collateral theories of test-taking behavior and method distortion). (p. 961)

Thus the substantive component goes beyond traditional notions of content validity to incorporate inferences and evidence on response consistency as well as on the extent to which the response patterns are consistent with our understanding of the corresponding construct.

5.2 A Unified but Faceted Framework for Validity

Over the following decade, Messick developed his unified, construct-based conception of validity in several directions. In the third edition of Educational Measurement (Messick 1989), he proposed a very broad and open framework for validity as scientific inquiry. The framework allows for different interpretations at different levels of abstraction and generality, and it encourages the use of multiple modes of inquiry. It also incorporates values and consequences . Given the many uses of testing in our society and the many interpretations entailed by these uses, Messick’s unified model inevitably became complicated, but he wanted to get beyond the narrow views of validation in terms of content-, criterion-, and construct-related evidence:

What is needed is a way of cutting and combining validity evidence that forestalls undue reliance on selected forms of evidence, that highlights the important though subsidiary role of specific content- and criterion-related evidence in support of construct validity in testing applications, and that formally brings consideration of value implications and social consequences into the validity framework . (Messick 1989, p. 20)

Messick organized his discussion of the roles of different kinds of evidence in validation in a 2 × 2 table (see Fig. 16.1) that he had introduced a decade earlier (Messick 1980). The table has four cells, defined in terms of the function of testing (interpretation or use) and the justification for testing (evidence or consequences):

The evidential basis of test interpretation is construct validity. The evidential basis of test use is also construct validity, but as buttressed by evidence for the relevance of the test to the specific applied purpose and for the utility of the testing in the applied setting. The consequential basis of test interpretation is the appraisal of the value implications of the construct label, of the theory underlying test interpretation, and of the ideology in which the theory is embedded…. Finally, the consequential basis of test use is the appraisal of both potential and actual social consequences of the applied testing. (Messick 1989, p. 20, emphasis added)

Messick acknowledged that these distinctions were “interlocking and overlapping” (p. 20) and therefore potentially “fuzzy” (p. 20), but he found the distinctions and resulting fourfold classification to be helpful in structuring his description of the unified model of construct validity .

Fig. 16.1 Messick's facets of validity. From Test Validity and the Ethics of Assessment (p. 30, Research Report No. RR-79-10), by S. Messick, 1979, Princeton, NJ: Educational Testing Service. Copyright 1979 by Educational Testing Service. Reprinted with permission

5.3 The Evidential Basis of Test Score Interpretations

Messick (1989) began his discussion of the evidential basis of score interpretation by focusing on construct validity: “Construct validity, in essence, comprises the evidence and rationales supporting the trustworthiness of score interpretation in terms of explanatory concepts that account for both test performance and relationships with other variables” (p. 34). Messick saw convergent and discriminant evidence as “overarching concerns” in discounting construct-irrelevant variance and construct underrepresentation . Construct-irrelevant variance occurs to the extent that test score variance includes “excess reliable variance that is irrelevant to the interpreted construct” (p. 34). Construct underrepresentation occurs to the extent that “the test is too narrow and fails to include important dimensions or facets of the construct” (p. 34).

Messick (1989) sought to establish the “trustworthiness” of the proposed interpretation by ruling out the major threats to this interpretation. The basic idea is to develop a construct interpretation and then check on plausible threats to this interpretation. To the extent that the interpretation survives all serious challenges (i.e., the potential sources of construct-irrelevant variance and construct underrepresentation ), it can be considered trustworthy. Messick was proposing that strong interpretations (i.e., in terms of constructs) be adopted, but he also displayed a recognition of the essential limits of various methods of inquiry. This recognition is the essence of the constructive-realist view he espoused; our constructed interpretations are ambitious, but they are constructed by us, and therefore they are fallible. As he concluded,

validation in essence is scientific inquiry into score meaning—nothing more, but also nothing less. All of the existing techniques of scientific inquiry, as well as those newly emerging, are fair game for developing convergent and discriminant arguments to buttress the construct interpretation of test scores. (p. 56)

That is, rather than specify particular rules or guidelines for conducting construct validations , he suggested broad scientific inquiry that could provide support for and illuminate the limitations of proposed interpretations and uses of test scores.

Messick suggested that construct-irrelevant variance and construct underrepresentation should be considered serious when they interfere with intended interpretations and uses of scores to a substantial degree. The notion of “substantial” in this context is judgmental and depends on values , but the judgments are to be guided by the intended uses of the scores. This is one way in which interpretations and meanings are not value neutral.

5.4 The Evidential Basis of Test Score Use

According to Messick (1989), construct validity provides support for test uses. However, the justification of test use also requires evidence that the test is appropriate for a particular applied purpose in a specific applied setting: “The construct validity of score interpretation undergirds all score-based inferences, not just those related to interpretive meaningfulness but also the content- and criterion-related inferences specific to applied decisions and actions based on test scores” (pp. 63–64). Messick rejected simple notions of content validity in terms of domain representativeness in favor of an analysis of the constructs associated with the performance domain. “By making construct theories of the performance domain and of its key attributes more explicit, however, test construction and validation become more rational, and the supportive evidence sought becomes more attuned to the inferences made” (p. 64). Similarly, Messick rejected the simple model of predictive validity in terms of a purely statistical relationship between test scores and criterion scores in favor of a construct-based approach that focuses on hypotheses about relationships between predictor constructs and criterion constructs: “There is simply no good way to judge the appropriateness, relevance, and usefulness of predictive inferences in the absence of evidence as to what the predictor and criterion scores mean” (p. 64). In predictive contexts, it is the relationship between the characteristics of test takers and their future performances that is of interest. The observed relationship between predictor scores and criterion scores provides evidence relevant to this hypothetical relationship, but it does not exhaust the meaning of that relationship.

In elaborating on the evidential basis of test use, Messick (1989) discussed a number of particular kinds of score uses (e.g., employment, selection, licensure), and a number of issues that would need to be addressed (e.g., curriculum, instructional, or job relevance or representativeness; test–criterion relationships; the utility of criteria ; and utility and fairness in decision making), rather than relying on what he called ad hoc targets. He kept the focus on construct validation and suggested that “one should strive to maximize the meaningfulness of score interpretation and to minimize construct-irrelevant test variance. The resulting construct-valid scores then provide empirical components for rationally defensible prediction systems and rational components for empirically informed decision making” (p. 65). Messick (1989) was quite consistent in insisting on the primacy of construct interpretations in validity, even in those areas where empirical methods had tended to predominate. He saw the construct theory of domain performance as the basis for developing both the criterion and the predictor . Constructs provided the structure for validation and the glue that held it all together.

5.5 The Consequential Basis of Test Score Interpretation

Messick (1989) saw the consequential basis of test score interpretation as involving an analysis of the value implications associated with the construct label, with the construct theory, and with the general conceptual frameworks , or ideologies, surrounding the theory. In doing so, he echoed his earlier emphasis (Messick 1980) on the role of values in validity:

Constructs are broader conceptual categories than are test behaviors and they carry with them into score interpretation a variety of value connotations stemming from at least three major sources: the evaluative overtones of the construct labels themselves; the value connotations of the broader theories or nomological networks in which constructs are embedded; and the value implications of still broader ideologies about the nature of humankind, society, and science that color our manner of perceiving and proceeding. (Messick 1989, p. 59)

Neither constructs nor the tests developed to estimate constructs are dictated by the data, as such. We make decisions about the kinds of attributes that are of interest to us, and these choices are based on the values inherent in our views.

Messick (1989) saw values as pervading and shaping the interpretation and use of test scores and therefore saw the evaluation of value implications as an integral part of validation :

In sum, the aim of this discussion of the consequential basis of test interpretation was to raise consciousness about the pervasive consequences of value-laden terms (which in any event cannot be avoided in either social action or social science) and about the need to take both substantive aspects and value aspects of score meaning into account in test validation. (p. 63)

Under a constructive-realist model, researchers have to decide how to carve up and interpret observable phenomena, and they should be clear about the values that shape these choices.

5.6 The Consequential Basis of Test Score Use

The last cell (bottom right) of Messick’s progressive matrix addresses the social consequences of testing as an “integral part of validity” (Messick 1989, p. 84). The validity of a testing program is to be evaluated in terms of how well the program achieves its intended function or purpose without undue negative consequences:

Judging validity in terms of whether a test does the job it is employed to do … that is, whether it serves its intended function or purpose—requires evaluation of the intended or unintended social consequences of test interpretation and use. The appropriateness of the intended testing purpose and the possible occurrence of unintended outcomes and side effects are the major issues. (pp. 84–85)

The central question is whether the testing program achieves its goals well enough and at a low enough cost (in terms of negative consequences, anticipated and unanticipated) that it should be used.

Messick’s (1989) discussion of the consequences of testing comes right after an extended discussion of criterion-related evidence and analyses of utility, in terms of a specific criterion in selection, and it emphasizes that such utility analyses are important, but they are not enough. The evaluation, or validation , of a test score use requires an evaluation of all major consequences of the testing program and not simply evidence that a particular criterion is being estimated and optimized:

Even if adverse testing consequences derive from valid test interpretation and use, the appraisal of the functional worth of the testing in pursuit of the intended ends should take into account all of the ends, both intended and unintended, that are advanced by the testing application, including not only individual and institutional effects but societal or systemic effects as well. Thus, although appraisal of intended ends of testing is a matter of social policy, it is not only a matter of policy formation but also of policy evaluation that weighs all of the outcomes and side effects of policy implementation by means of test scores. Such evaluation of the consequences and side effects of testing is a key aspect of the validation of test use. (p. 85)

Messick used the term functional worth to refer to the extent that a testing program achieves its intended goals and is relatively free of unintended negative consequences. He seems to contrast this concept with test validity, which focuses on the plausibility of the proposed interpretation of the test scores. The approach is unified, but the analysis in terms of the progressive matrix is structured, complex, and nuanced.

Messick (1989) made several points about the relationship between validity and functional worth. First, to the extent that consequences are relevant to the evaluation of a testing program (in terms of either validity or functional worth), both intended and unintended consequences are to be considered. Second, consequences are relevant to the evaluation of test validity if they result from construct-irrelevant characteristics of the testing program. Third, if the unintended consequences cannot be traced to construct-irrelevant aspects of the testing program, the evaluation of consequences , intended and unintended, becomes relevant to the functional worth of the testing program, which is in Messick’s progressive matrix “an aspect of the validation of test use” (p. 85). Messick’s main concern in his discussion of functional worth was to emphasize that in evaluating such worth, it is necessary to evaluate unintended negative consequences as well as intended, criterion outcomes so as to further inform judgments about test use.

Construct meaning entered Messick’s (1989) discussion of the consequential basis of test use in large part as a framework for identifying unintended consequences that merit further study:

But once again, the construct interpretation of the test scores plays a facilitating role. Just as the construct meaning of the scores afforded a rational basis for hypothesizing predictive relationships to criteria , construct meaning provides a rational basis for hypothesizing potential testing outcomes and for anticipating possible side effects. That is, the construct theory, by articulating links between processes and outcomes, provides clues to possible effects. Thus, evidence of construct meaning is not only essential for evaluating the import of testing consequences, it also helps determine where to look for testing consequences. (pp. 85–86)

Messick’s unified framework for validity encourages us to think broadly and deeply, in this case in evaluating unintended consequences. He encouraged the use of multiple value perspectives in identifying and evaluating consequences. The unified framework for validity incorporates evaluations of the extent to which test scores reflect the construct of interest (employing a range of empirical and conceptual methods) and an evaluation of the appropriateness of the construct measures for the use at hand (employing a range of values and criteria ), but ultimately, questions about how and where tests are used are policy issues.

Messick (1989) summarized the evidential and consequential bases of score interpretation and use in terms of the four cells in his progressive matrix:

The process of construct interpretation inevitably places test scores both in a theoretical context of implied relationships to other constructs and in a value context of implied relationships to good and bad valuations, for example, of the desirability or undesirability of attributes and behaviors. Empirical appraisals of the former substantive relationships contribute to an evidential basis of test interpretation, that is, to construct validity . Judgmental appraisals of the latter value implications provide a consequential basis of test interpretation.

The process of test use inevitably places test scores both in a theoretical context of implied relevance and utility and in a value context of implied means and ends. Empirical appraisals of the former issues of relevance and utility, along with construct validity contribute to an evidential basis for test use. Judgmental appraisals of the ends a proposed test use might lead to, that is, of the potential consequences of a proposed use and of the actual consequences of applied testing, provide a consequential basis for test use. (p. 89)

The four aspects of the unified, construct-based approach to validation provide a comprehensive framework for validation, but it is a framework intended to encourage and guide conversation and investigation. It was not intended as an algorithm or a checklist for validation.

Messick’s (1989) chapter is sometimes criticized for being long and hard to read, and it is in places, but this perception should not be so surprising, because he was laying out a broad framework for validation; making the case for his proposal; putting it in historical context; and, to some extent, responding to earlier, current, and imagined future critics—not a straightforward task. When asked about the intended audience for his proposed framework , he replied, “Lee Cronbach” (M. Zieky , personal communication, May 20, 2014). As is true in most areas of scientific endeavor, theory development is an ongoing dialogue between conjectures and data, between abstract principles and applications, and between scholars with evolving points of view.

5.7 Validity as a Matter of Consequences

In one of his last papers, Messick (1998) revisited the philosophical conceits of his 1989 chapter, and in doing so, he reiterated the importance of values and consequences for validity:

What needs to be valid are the inferences made about score meaning, namely, the score interpretation and its action implications for test use. Because value implications both derive from and contribute to score meaning, different value perspectives may lead to different score implications and hence to different validities of interpretation and use for the same scores. (p. 37)

Messick saw construct underrepresentation and construct-irrelevant variance as serious threats to validity in all cases, but he saw them as especially serious if they led to adverse consequences:

All educational and psychological tests underrepresent their intended construct to some degree and all contain sources of irrelevant variance. The details of this underrepresentation and irrelevancy are typically unknown to the test maker or are minimized in test interpretation and use because they are deemed to be inconsequential. If noteworthy adverse consequences occur that are traceable to these two major sources of invalidity, however, then both score meaning and intended uses need to be modified to accommodate these findings. (p. 42)

And he continued, “This is precisely why unanticipated consequences constitute an important form of validity evidence. Unanticipated consequences signal that we may have been incomplete or off-target in test development and, hence, in test interpretation and use” (p. 43). Levels of construct underrepresentation and construct-irrelevant variance that would otherwise be acceptable would become unacceptable if it were shown that they had serious negative consequences.

5.8 The Central Messages

Messick’s (1975, 1980, 1981a, 1988, 1989, 1995) treatment of validity is quite thorough and complex, but he consistently emphasizes a few basic conclusions.

First, validity is a unified concept. It is “an integrated evaluative judgment” of the degree to which evidence and rationales support the inferences and actions based on test scores. We do not have “kinds” of validity for different score interpretations or uses.

Second, all validity is construct validity. Construct validity provides the framework for the unified model of validity because it subsumes both the content and criterion models and reflects the general practice of science in which observation is guided by theory.

Third, validation is scientific inquiry. It is not a checklist or procedure but rather a search for the meaning and justification of score interpretations and uses. The meaning of the scores is always important, even in applied settings, because meaning guides both score interpretation and score use. Similarly, values guide the construction of meaning and the goals of test score use .

Fourth, validity and science are value laden. Construct labels, theories, and supporting conceptual frameworks involve values, either explicitly or implicitly, and it is better to be explicit than implicit about those values and the assumptions underlying them.

Fifth, Messick maintained that validity involves the appraisal of social consequences of score uses . Evaluating whether a test is doing what it was intended to do necessarily involves an evaluation of intended and unintended consequences.

There were two general concerns that animated Messick’s work on validity theory over his career, both of which were evident from his earliest work to his last papers. One was his abiding interest in psychological theory and in being clear and explicit about the theoretical and pragmatic assumptions being made. Like Cronbach, he was convinced that we cannot do without theory and, more specifically, theoretical constructs, and rather than ignoring substantive, theoretical assumptions, he worked to understand the connections between theories, constructs, and testing.

The second was his abiding interest in values, ethics, and consequences , which was evident in his writing from the 1960s (Messick 1965) to the end of his career (Messick 1998). He recognized that values influence what we look at and what we see and that if we try to exclude values from our testing programs, we will tend to make the values implicit and unexamined. So he saw a role for values in evaluating the validity of both the interpretations of test scores and the uses of those scores. He did not advocate that the measurement community should try to impose any particular set of values, but he was emphatic and consistent in emphasizing that we should recognize and make public the value implications inherent in score interpretations and uses.

6 Argument-Based Approaches to Validation

Over a period of about 25 years, from the early 1960s to 1989, Messick developed a broad construct-based framework for validation that incorporated concerns about score interpretations and uses, meaning and values, scientific reasoning and ethics, and the interactions among these different components. As a result, the framework was quite complex and difficult to employ in applied settings.

Since the early 1990s, researchers have developed several related approaches to validation (Kane 1992, 2006, 2013a; Mislevy 2006, 2009; Mislevy et al. 1999; Mislevy et al. 2003b; Shepard 1993) that have sought to streamline models of validity and to provide more explicit guidelines for validation by stating the intended interpretation and use of the scores in the form of an argument. The argument would provide an explicit statement of the claims inherent in the proposed interpretation and use of the scores (Cronbach 1988).

By explicitly stating the intended uses of test scores and the score interpretations supporting these uses, these argument-based approaches seek to identify the kinds of evidence needed to evaluate the proposed interpretation and use of the test scores and thereby to specify necessary and sufficient conditions for validation.

Kane (1992, 2006) suggested that the proposed interpretation and use of test scores could be specified in terms of an interpretive argument. After coming to ETS, he extended the argument-based framework to focus on an interpretation/use argument (IUA), a network of inferences and supporting assumptions leading from a test taker’s observed performances on test tasks or items to the interpretive claims and decisions based on the test scores (Kane 2013a). Some of the inferences in the IUA would be statistical (e.g., generalization from an observed score to a universe score or latent variable, or a prediction of future performance); other inferences would rely on expert judgment (e.g., scoring, extrapolations from the testing context to nontest contexts); and many of the inferences might be evaluated in terms of several kinds of evidence.

Most of the inferences in the IUA would be presumptive in the sense that the inference would establish a presumption in favor of its conclusion, or claim, but it would not prove the conclusion or claim. The inference could include qualitative qualifiers (involving words such as “usually”) or quantitative qualifiers (e.g., standard errors or confidence intervals), as well as conditions under which the inference would not apply. The IUA is intended to represent the claims being made in interpreting and using scores and is not limited to any particular kind of claim.

The IUAs for most interpretations and uses would involve a chain of linked inferences leading from the test performances to claims based on these performances; the conclusion of one inference would provide the starting point, or datum, for subsequent inferences. The IUA is intended to provide a fairly detailed specification of the reasoning inherent in the proposed interpretation and uses of the test scores. Assuming that the IUA is coherent, in the sense that it hangs together, and complete, in the sense that it fully represents the proposed interpretation and use of the scores, it provides a clear framework for validation. The inferences and supporting assumptions in the IUA can be evaluated using evidence relevant to their plausibility. If all of the inferences and assumptions hold up under critical evaluation (conceptual and empirical), the interpretation and use can be accepted as plausible, or valid; if any of the inferences or assumptions fail to hold up under critical evaluation, the proposed interpretation and use of the scores would not be considered valid.
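To make the structure of an IUA concrete, the following sketch represents a hypothetical chain of inferences in code; the inference names, warrants, and qualifiers are invented for illustration and are not drawn from any published IUA.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Inference:
    """One link in a hypothetical interpretation/use argument (IUA)."""
    name: str                      # e.g., "scoring", "generalization"
    datum: str                     # what the inference starts from
    claim: str                     # what it concludes (the datum for the next link)
    warrant: str                   # the rule or rationale licensing the inference
    qualifier: str                 # hedging language or a quantitative bound
    backing: Optional[str] = None  # evidence supporting the warrant

# A hypothetical IUA for a placement test: each claim feeds the next inference.
iua = [
    Inference("scoring", "observed task performances", "observed score",
              "scoring rubric applied consistently", "subject to rater error"),
    Inference("generalization", "observed score", "universe score",
              "tasks sampled from the task universe", "plus or minus one standard error"),
    Inference("extrapolation", "universe score", "expected classroom performance",
              "test tasks resemble classroom tasks", "usually"),
    Inference("decision", "expected classroom performance", "placement decision",
              "cut score evaluated against course outcomes", "unless other information overrides"),
]

# Validation, on this view, amounts to evaluating each link's warrant and backing.
for link in iua:
    print(f"{link.datum} --[{link.name}: {link.warrant}; {link.qualifier}]--> {link.claim}")
```

Listing the links in this way makes explicit which inferences rest on statistical evidence and which rest on judgment, which is the point of specifying the IUA before gathering validity evidence.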

An argument-based approach provides a validation framework that gives less attention to philosophical foundations and general concerns about the relationship between meaning and values than did Messick’s unified, construct-based validation framework, and more attention to the specific IUA under consideration. In doing so, an argument-based approach can provide necessary and sufficient conditions for validity in terms of the plausibility of the inferences and assumptions in the IUA. The validity argument is contingent on the specific interpretation and use outlined in the IUA; it is the proposed interpretation and uses that are validated and not the test or the test scores.

The argument-based approach recognizes the importance of philosophical foundations and of the relationship between meaning and values, but it focuses on how these issues play out in the context of particular testing programs with a particular interpretation and use proposed for the test scores. The conclusions of such argument-based analyses depend on the characteristics of the testing program and the proposed interpretation and uses of the scores; the claims based on the test scores are specified, and the validation effort is limited to evaluating these claims.

Chapelle et al. (2008, 2010) used the argument-based approach to analyze the validity of the TOEFL® test in some detail and, in doing so, provided insight into the meaning of the scores as well as their empirical characteristics and value implications. This work makes clear how two earlier themes play out in an argument-based approach to validation: the emphasis in the original conception of construct validity (Cronbach and Meehl 1955) on a program of validation research rather than a single study, and Messick’s emphasis on ruling out threats to validity (e.g., construct-irrelevant variance and construct underrepresentation).

Mislevy (1993, 1994, 1996, 2007) focused on the role of evidence in validation, particularly in terms of model-based reasoning from observed performances to more general claims about students and other test takers. Mislevy et al. (1999, 2002, 2003a, b) developed an evidence-centered design (ECD) framework that employs argument-based reasoning. ECD starts with an analysis of the attributes, or constructs, of interest and the social and cognitive contexts in which they function and then designs the assessment to generate the kinds and amounts of evidence needed to draw the intended inferences. The ECD framework involves several stages of analysis (Mislevy and Haertel 2006; Mislevy et al. 1999, 2002, 2003a). The first stage, domain analysis, concentrates on building substantive understanding of the performance domain of interest, including theoretical conceptions and empirical research on student learning and performance, and the kinds of situations in which the performances are likely to occur. The goal of this first stage is to develop an understanding of how individuals interact with tasks and contexts in the domain.

At the second stage, domain modeling, the relationships among student characteristics, task characteristics, and situational variables are specified (Mislevy et al. 2003a, b). The structure of the assessment to be developed begins to take shape, as the kinds of evidence that would be relevant to the goals of the assessment are identified.

The third stage involves the development of a conceptual assessment framework that specifies the operational components of the test and the relationships among these components, including a student model, task models, and evidence models. The student model provides an abstract account of the student in terms of ability parameters (e.g., in an IRT model ). Task models posit schemas for collecting data that can be used to estimate the student parameters and guidelines for task development. The evidence model describes how student performances are to be evaluated, or scored, and how estimates of student parameters can be made or updated. With this machinery in place, student performances on a sample of relevant tasks can be used to draw probabilistic inferences about student characteristics.
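The final step, drawing probabilistic inferences about the student parameters from scored performances, can be sketched with a simple one-parameter (Rasch) item response model. The item difficulties and responses below are invented, and operational ECD implementations use far more elaborate measurement models; the sketch is illustrative only.

```python
import math

def rasch_prob(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_likelihood(theta, difficulties, responses):
    """Log-likelihood of a scored response string at a given ability."""
    ll = 0.0
    for b, x in zip(difficulties, responses):
        p = rasch_prob(theta, b)
        ll += math.log(p) if x == 1 else math.log(1.0 - p)
    return ll

def ml_estimate(difficulties, responses):
    """Grid-search maximum-likelihood estimate of the student parameter."""
    grid = [g / 100.0 for g in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, difficulties, responses))

# Hypothetical task-model output: item difficulties and one student's scored responses.
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.0, 1.5]
responses = [1, 1, 1, 1, 0, 0]

print(round(ml_estimate(difficulties, responses), 2))  # point estimate of ability
```

In ECD terms, the evidence model supplies the scored responses and the statistical machinery for updating the student-model parameter; the task model determines which difficulties appear in the first place.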

The two dominant threads in these argument-based approaches to validation are the requirement that the claims to be made about test takers (i.e., the proposed interpretation and use of the scores) be specified in advance and then justified, and the requirement that inferences about specific test takers be supported by warrants or models that have been validated using empirical evidence and theoretical rationales. The argument-based approaches are consistent with Messick’s unified framework, but they tend to focus more on specific methodologies for the validation of proposed interpretations and uses than did the unified framework.

7 Applied Validity Research at ETS

In addition to the contributions to validity theory described above, ETS research has addressed numerous practical issues in documenting the validity of various score uses and interpretations and in identifying the threats to the validity of ETS tests. Relatively straightforward predictive validity studies were conducted at ETS from its earliest days, but ETS research also has addressed problems in broadening both the predictor and criterion spaces and in finding better ways of expressing the results of predictive validity studies. Samuel Messick’s seminal chapter in the third edition of Educational Measurement (Messick 1989) focused attention on the importance of identifying factors contributing to construct-irrelevant variance and identifying instances of construct underrepresentation , and numerous ETS studies have focused on both of these problems.

7.1 Predictive Validity

Consistent with the fundamental claim that tests such as the SAT test were useful because they could predict academic performance, predictive validity studies were common throughout the history of ETS. As noted earlier, the second Research Bulletin published by ETS (RB-48-02) was a predictive study titled The Prediction of First Term Grades at Hamilton College (Frederiksen 1948). The abstract noted, “It was found that the best single predictor of first term average grade was rank in secondary school (r = .57). The combination of SAT scores with school rank was found to improve the prediction considerably (R = .67).” By 1949, enough predictive validity studies had been completed that results of 17 such studies could be summarized by Allen (1949). This kind of study was frequently repeated over the years, but even in the very earliest days there was considerable attention to a more nuanced view of predictive validity from the perspective of both potential predictors and potential criteria. As noted, the Frederiksen study cited earlier was the second Research Bulletin published by ETS, but the first study published (College Board 1948) examined the relationship of entrance test scores at the U.S. Coast Guard Academy to outcome variables that included both course grades and nonacademic ratings. On the predictor side, the study proposed that “a cadet’s standing at the Academy be based on composite scores based on three desirable traits: athletic ability, adaptability, and academic ability.” A follow-up study (French 1948) reported intercorrelations of 76 measures, with academic and nonacademic tests as predictors and grades and personality ratings as criteria. The conclusions supported the use of the academic entrance tests but noted that the nonacademic tests in that particular battery did not correlate with either grades or personality ratings.
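The gain Frederiksen reported, from r = .57 for school rank alone to R = .67 when SAT scores were added, is an instance of the standard two-predictor multiple-correlation calculation. The sketch below uses invented intercorrelations chosen only to produce values of roughly that size; the original study’s correlation matrix is not reproduced here.

```python
import math

def multiple_R(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors,
    given the three pairwise correlations."""
    r_squared = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return math.sqrt(r_squared)

# Hypothetical values: school rank and SAT scores each correlate with first-term
# grades, and correlate moderately with each other.
r_rank_grades = 0.57
r_sat_grades = 0.52
r_rank_sat = 0.35

# Prints roughly 0.66, illustrating the kind of gain over a single predictor
# that Frederiksen reported.
print(round(multiple_R(r_rank_grades, r_sat_grades, r_rank_sat), 2))
```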

Although a number of studies in the 1950s focused on the prediction of first-year grades (e.g., Abelson 1952; Frederiksen et al. 1950a, b; Mollenkopf 1951; Schultz 1952), several went beyond that limited criterion. For example, Johnson and Olsen (1952) compared 1-year and 3-year predictive validities of the Law School Admission Test in predicting grades. Mollenkopf (1950) studied the ability of aptitude and achievement tests to predict both first- and second-year grades at the U.S. Naval Postgraduate School. Although the second-year validities were described as “fairly satisfactory,” they were substantially lower than the Year 1 correlations. This difference was attributed to a number of factors, including differences in the first- and second-year curricula, lower reliability of second-year grades, and selective dropout. Besides looking beyond the first year, these early studies also considered other criteria. French (1957), in a study of 12th-grade students at 42 secondary schools, related SAT scores and scores on the Tests of Developed Ability (TDA) to criteria that included high school grades as well as students’ self-reports of their experiences and interests and estimates of their own abilities. In addition, teachers nominated students who they believed exhibited outstanding ability. The study concluded not only that the TDA predicted grades in physics, chemistry, biology, and mathematics but that, more so than the SAT, it was associated with self-reported scientific interests and experiences.

From the 1960s through the 1980s, ETS conducted a number of SAT validity studies that focused on routine predictions of the freshman grade point average (FGPA) with data provided from colleges using the College Board/ETS Validity Study Service as summarized by Ramist and Weiss (1990). Ramist et al. (1994) then produced a groundbreaking SAT validity study that introduced a number of innovations not found in prior work. First, the study focused on course grades, rather than FGPA, as the criterion. Because some courses are graded much more strictly than others, when grades from these courses are combined without adjustment in the FGPA, the ability of the SAT to predict freshman performance is underestimated. Several different ways of making the adjustment were described and demonstrated. Second, the study corrected for the range restriction in the predictors caused by the absence of data for the low-scoring students not admitted to college. (Although the range restriction formulas were not new, they had not typically been employed in multicollege SAT validity studies.) Third, the authors adjusted course grades for unreliability. Fourth, they provided analyses separately for a number of subgroups defined by gender, ethnicity, best language, college selectivity, and college size. When adjustments were made for multivariate range restriction in the predictors, grading harshness/leniency for specific courses, and criterion unreliability, the correlation of the SAT with the adjusted grades was .64, and the multiple correlation of SAT and high school record with college grades was .75.
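Two of the adjustments in the Ramist et al. study correspond to standard univariate psychometric corrections: the correction of a validity coefficient for direct range restriction on the predictor and its disattenuation for unreliability in the criterion. The sketch below applies these textbook formulas to invented numbers; it is not the multivariate procedure actually used in the study.

```python
import math

def correct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Classical correction for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return (r_restricted * u) / math.sqrt(1 + r_restricted**2 * (u**2 - 1))

def correct_criterion_unreliability(r, criterion_reliability):
    """Disattenuate the validity coefficient for unreliability in the criterion only."""
    return r / math.sqrt(criterion_reliability)

# Hypothetical values: an observed test-grade correlation in an enrolled (restricted)
# sample, the predictor standard deviations in the applicant and enrolled groups,
# and an assumed reliability for the grade criterion.
r_observed = 0.35
r_corrected = correct_range_restriction(r_observed, sd_unrestricted=110, sd_restricted=85)
r_corrected = correct_criterion_unreliability(r_corrected, criterion_reliability=0.80)
print(round(r_corrected, 2))  # roughly 0.49 with these invented inputs
```

The point of the example is only that both corrections push the observed coefficient upward, which is why the adjusted correlations reported in the study are substantially higher than the unadjusted ones.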

Subsequent SAT validity studies incorporated a number of these methods and provided new alternatives. Bridgeman et al. (2000), for example, used the course difficulty adjustments from the 1994 study but noted that the adjustments could be quite labor intensive for colleges trying to conduct their own validity studies. They showed that simply dividing students into two categories based on intended major (math/science [where courses tend to be severely graded] vs. other) recovered many of the predictive benefits of the complex course difficulty adjustments. In a variation on this theme, a later study by Bridgeman et al. (2008c) provided correlations separately for courses in four categories (English, social science, education, and science/math/engineering) and focused on cumulative grades over an entire college career, not just the first year. This study also showed that, contrary to the belief that the SAT predicts only FGPA, predictions of cumulative GPA over 4 or 5 years are similar to FGPA predictions.

7.2 Beyond Correlations

From the 1950s through the early 2000s, the predictive validity studies for the major admissions testing programs (e.g., SAT, the GRE® test, GMAT) tended to rely on correlations to characterize the relationship between test scores and grades. Test critics would often focus on unadjusted correlations (typically around .30). Squaring this number to get “variance accounted for,” the critics would suggest that a test that explained less than 10% of the variance in grades must be of very little practical value (e.g., Fairtest 2003). To counter this perception, Bridgeman and colleagues started supplementing correlational results by showing the percentage of students who would succeed in college at various score levels (e.g., Bridgeman et al. 2008a, b, c; Cho and Bridgeman 2012). For example, in one study, 12,529 students at moderately selective colleges who had high school GPAs of at least 3.7 were divided into groups based on their combined Verbal and Mathematics SAT scores (Bridgeman et al. 2008a). Although college success can be defined in many different ways, this study defined success relatively rigorously as achieving a GPA of 3.5 or higher at the end of the college career. For students with total SAT scores (verbal + mathematics) of 1000 or lower, only 22% had achieved this level of success, whereas 73% of students in the 1410–1600 score category had finished college with a 3.5 or higher. Although SAT scores explained only about 12% of the variance in the overall group (which may seem small), the difference between 22% and 73% is substantial. This general approach to meaningful presentation of predictive validity results was certainly not new; rather, it is an approach that must be periodically rediscovered. As Ben Schrader noted in 1965,

during the past 60 years, correlation and regression have come to occupy a central position in measurement and research…. Psychologists and educational researchers use these methods with confidence based on familiarity. Many persons concerned with research and testing, however, find results expressed in these terms difficult or impossible to interpret, and prefer to have results expressed in more concrete form. (p. 29)

He then went on to describe a method using expectancy tables that showed how standing on the predictor, in terms of fifths, related to standing on the criterion, also in terms of fifths. He used scores on the Law School Admission Test as the predictor and law school grades as the criterion. Even the 1965 interest in expectancy tables was itself a rediscovery of their explanatory value. In their study titled “Prediction of First Semester Grades at Kenyon College, 1948–1949,” Frederiksen et al. (1950a) included an expectancy table that showed the chances in 100 that a student would earn an average of at least a specified letter grade given a predicted grade based on a combination of high school rank and SAT scores. For example, for a predicted grade of B, the chance in 100 of getting at least a C+ was 88, at least a B was 50, and at least an A− was 12.
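The score-band percentages reported by Bridgeman and colleagues and the expectancy tables used by Schrader and by Frederiksen et al. rest on the same simple cross-tabulation: group test takers by predictor level and report, within each group, the proportion reaching a criterion. The sketch below builds such a table from simulated data; none of the numbers come from the studies cited.

```python
import random

random.seed(0)
r = 0.35  # an illustrative predictor-criterion correlation

# Simulate standardized predictor (x) and criterion (y) scores with correlation r.
data = []
for _ in range(10000):
    x = random.gauss(0, 1)
    y = r * x + (1 - r**2) ** 0.5 * random.gauss(0, 1)
    data.append((x, y))

# Cut the predictor into fifths and report the percentage meeting a criterion
# (here, a criterion score in the top 30% overall), in the spirit of an expectancy table.
data.sort(key=lambda xy: xy[0])
y_cut = sorted(y for _, y in data)[int(0.70 * len(data))]
fifth_size = len(data) // 5
for i in range(5):
    group = data[i * fifth_size:(i + 1) * fifth_size]
    pct = 100 * sum(1 for _, y in group if y >= y_cut) / len(group)
    print(f"Predictor fifth {i + 1} (lowest to highest): {pct:.0f}% reach the criterion")
```

Even with a correlation that “explains” little more than 10% of the variance, the success rates in the top and bottom fifths differ markedly, which is exactly the rhetorical point of the expectancy-table presentation.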

Despite the appeal of the expectancy table approach, it lay dormant until modest graphical extensions of Schrader’s ideas were again introduced in 2008 and beyond. An example of this more graphical approach is in Fig. 16.2 (Bridgeman et al. 2008a, p. 10). Within each of 24 graduate biology programs, students were divided into quartiles based on graduate grades and into quartiles based on combined GRE verbal and quantitative scores. These results were then aggregated across the 24 programs and graphed. The graph shows that almost three times as many students with top-quartile GRE scores were in the high-GPA category (top quartile) as students with bottom-quartile GRE scores.

Fig. 16.2 Percentage of biology graduate students in GRE quartile categories whose graduate grade point averages were in the bottom quartile, mid-50%, or top quartile of the students in their departments across 24 programs (Adapted from Bridgeman et al. 2008a. Copyright 2008 by Educational Testing Service. Used with permission)

The same report also used a graphical approach to show a kind of incremental validity information. Specifically, the bottom and top quartiles in each department were defined in terms of both undergraduate grade point average (UGPA) and GRE scores. Then, within the bottom UGPA quartile, students with top or bottom GRE scores could be compared (and similarly for the top UGPA quartile). Because graduate grades tend to be high, success was defined as achieving a 4.0 grade average. Figure 16.3 indicates that, even within a UGPA quartile, GRE scores matter for identifying highly successful students (i.e., the percentage achieving a 4.0 average).

Fig. 16.3 Percentage of students in graduate biology departments earning a 4.0 grade point average by undergraduate grade point average and GRE high and low quartiles (Adapted from Bridgeman et al. 2008a. Copyright 2008 by Educational Testing Service. Used with permission)

7.3 Construct-Irrelevant Variance

The construct-irrelevant factors that can influence test scores are almost limitless. A comprehensive review of all ETS studies related to construct-irrelevant variance would well exceed the space limitations of this chapter; rather, a sampling of studies that explore various aspects of construct-irrelevant variance is presented. Research on one source of irrelevant variance, coaching, is described in a separate chapter by Donald Powers (Chap. 17, this volume).

7.3.1 Fatigue Effects

The potential for test-taker fatigue to interfere with test scores was already a concern in 1948, as suggested by the title of ETS Research Memorandum No. 48-02 by Tucker (1948), Memorandum Concerning Study of Effects of Fatigue on Afternoon Achievement Test Scores Due to Scholastic Aptitude Test Being Taken in the Morning. A literature review on the effects of fatigue on test scores completed in 1966 reached three conclusions:

1) Sufficient evidence exists in the literature to discount any likelihood of physiological consequences to the development of fatigue during a candidate’s taking the College Board SAT or Achievement Tests; 2) the decline in feeling-tone experienced by an individual is often symptomatic of developing fatigue, but this decline does not necessarily indicate a decline in the quantity or quality of work output; and 3) the amount of fatigue that develops as a result of mental work is related to the individual’s conception of, and attitude and motivation toward, the task being performed. (Wohlhueter 1966, Abstract)

A more recent experimental study conducted when the SAT was lengthened by the addition of the writing section reached a similar conclusion: “Results indicated that while the extended testing time for the new SAT may cause test takers to feel fatigued, fatigue did not affect test taker performance” (Liu et al. 2004, Abstract).

7.3.2 Time Limits

If a test is designed to assess speed of responding, then time limits contribute construct-relevant variance. But if the time limit is imposed primarily for administrative convenience, a strict time limit can introduce construct-irrelevant variance. On one hand, an early study on the influence of timing on Cooperative Reading Test scores suggested no significant changes in means or standard deviations with extended time (Frederiksen 1951). On the other hand, Lord (1953, Abstract) concluded that “unspeeded (power) tests are more valid” based on a study of 649 students at one institution. Evans (1980) created four SAT-like test forms that were administered in one of three speededness conditions: normal, speeded, and unspeeded. Degree of speededness affected scores but did not interact with gender or ethnicity. The technical handbook for the SAT by Donlon (1984) indicated that the speed with which students can answer the questions should play only a very minor role in determining scores. A study of the impact of extending the amount of time allowed per item on the SAT concluded that there were some effects of extended time (1.5 times regular time); average gains for the verbal score were less than 10 points on the 200–800 scale and about 30 points for the mathematics scores (Bridgeman et al. 2004b). But these effects varied considerably depending on the ability level of the test taker. Somewhat surprisingly, for students with SAT scores of 400 or lower, extra time had absolutely no impact on scores. Effects did not interact with either gender or ethnicity. Extended time on the GRE was similarly of only minimal benefit, with an average increase of 7 points for both verbal and quantitative scores on the 200–800 scale when the time limit was extended to 1.5 times standard time (Bridgeman et al. 2004a).

When new tests are created or existing tests are modified, appropriate time limits must be set. A special timing study was conducted when new item types were to be introduced to the SAT to provide an estimate of the approximate amount of time required to answer new and existing item types (Bridgeman and Cahalan 2007). The study used three approaches to estimate the amount of time needed to answer questions of different types and difficulties: (a) Item times were automatically recorded from a computer-adaptive version of the SAT, (b) students were observed from behind a one-way mirror in a lab setting as they answered SAT questions under strict time limits, and the amount of time taken for each question was recorded, and (c) high school students recorded the amount of time taken for test subsections that were composed of items of a single type. The study found that the rules of thumb used by test developers were generally accurate in rank ordering the item types from least to most time consuming but that the time needed for each question was higher than the developers had assumed.

Setting appropriate time limits that do not introduce construct-irrelevant variance is an especially daunting challenge for evaluating students with disabilities, as extended time is the most common accommodation for these students. Evaluating the appropriateness of extended time limits for students with disabilities has been the subject of several research reports (e.g., Cahalan et al. 2006; Packer 1987; Ragosta and Wendler 1992) and received considerable attention in the book Testing Handicapped People (Willingham et al. 1988).

Setting appropriate time limits on a computer-adaptive test (CAT) in which different students respond to different items can be especially problematic. Bridgeman and Cline (2000) showed that when the GRE was administered as a CAT, items at the same difficulty level and meeting the same general content specifications could vary greatly in the time needed to answer them. For example, a question assessing the ability to add numbers with negative exponents could be answered very quickly, while a question at the same difficulty level that required solving a pair of simultaneous equations would take much more time, even for very able students. Test takers who by chance received questions that could be answered quickly would then have an advantage on a test with relatively strict time limits. Furthermore, running out of time on a CAT and guessing to avoid the penalty for an incomplete test can have a substantial impact on the test score because the CAT scoring algorithm assumed that an incorrect answer reflected a lack of ability and not an unlucky guess (Bridgeman and Cline 2004). A string of unlucky guesses at the end of the GRE CAT (because the test taker ran out of time and had to randomly respond) could lower the estimated score by more than 100 points (on a 200–800 scale) compared to the estimated score at the point when the guessing began.
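The size of the penalty described by Bridgeman and Cline can be illustrated by scoring the same response string with and without a run of mostly incorrect guesses appended at the end. The two-parameter model, item parameters, response strings, and reporting scale below are all invented for illustration; they are not the operational GRE CAT algorithm or metric.

```python
import math

def p_correct(theta, a, b):
    """Two-parameter logistic item response model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def ml_theta(items, responses):
    """Grid-search maximum-likelihood ability estimate."""
    def log_likelihood(theta):
        total = 0.0
        for (a, b), x in zip(items, responses):
            p = p_correct(theta, a, b)
            total += math.log(p) if x else math.log(1.0 - p)
        return total
    grid = [g / 100.0 for g in range(-400, 401)]
    return max(grid, key=log_likelihood)

# Hypothetical item parameters (discrimination, difficulty) for a short adaptive section.
items = [(1.0, b / 10.0) for b in range(-10, 10)]   # 20 items
answered = [1] * 12 + [0, 1, 1, 0]                  # 16 items reached with time to think
guessed = [0, 0, 1, 0]                              # final 4 items answered by guessing

theta_reached = ml_theta(items[:16], answered)
theta_with_guesses = ml_theta(items, answered + guessed)

# Report on an invented 200-800-style scale to show the size of the drop.
def scale(theta):
    return 500 + 100 * theta

print(round(scale(theta_reached)), round(scale(theta_with_guesses)))
```

Because the likelihood treats each wrong answer as evidence of low ability, the mostly wrong guesses at the end pull the estimate down, which is the mechanism behind the score drops reported in the study.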

7.3.3 Guessing

Guessing can be a source of construct-irrelevant variance because noise is added to the measurement when test takers answer correctly by guessing but actually know nothing about the answer (Wendler and Walker 2006). Corrections for guessing, often referred to as formula scoring, attempt to limit this irrelevant variance by applying a penalty for incorrect answers so that answering incorrectly has more negative consequences than merely leaving a question blank. For example, with the five-option multiple-choice questions on the SAT (prior to 2016), a test taker received 1 point for a correct answer and 0 points for an omitted answer, and one-fourth of a point was subtracted for each incorrect answer; a simple sketch of this rule appears at the end of this section. (The revised SAT introduced in 2016 no longer has a correction for guessing.) By the time ETS was founded, there were already more than 20 years of research on the wisdom and effects of guessing corrections. Freeman (1952) surveyed this research and observed,

At the outset, it may be stated that the evidence is not conclusive. While much that is significant has been written about the theoretical need to correct for guessing, and about the psychological and instructional value of such a correction, the somewhat atomistic, or at least uncoordinated, research that has been done during the last 25 years fails to provide an answer that can be generalized widely. (p. 1)

More than 60 years later, research is still somewhat contradictory and a definitive answer is still elusive. Lord (1974) argued that under certain assumptions, formula scoring is “clearly superior” to number-right scoring, though it remains unclear how often those assumptions are actually met. Angoff (1987) conducted an experimental study with different guessing instructions for SAT Verbal items and concluded, “Formula scoring is not disadvantageous to students who are less willing to guess and attempt items when they are not sure of the correct answer” (Abstract). However, some individuals and population subgroups may differ in their willingness to guess so that conclusions based on averages in the population as a whole may not be valid for all people. Rivera and Schmitt (1988), for example, noted a difference in willingness to guess on the part of Hispanic test takers, especially Mexican Americans. Beginning in the 1981–1982 test year, the GRE General Test dropped formula scoring and became a rights-only scored test, but the GRE Subject Tests retained formula scoring. In 2011, the Advanced Placement® (AP®) test program dropped formula scoring and the penalty for incorrect answers. At the end of 2014, the SAT was still using formula scoring, but the announcement had already been made that the revised SAT would use rights-only scoring.
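As a concrete illustration of the formula-scoring rule described earlier (1 point per correct answer, no points for an omit, and a deduction of 1/(k − 1) points per wrong answer on k-option items), the sketch below shows why blind guessing has an expected formula score of zero.

```python
def formula_score(n_correct, n_wrong, n_options=5):
    """Formula score: one point per correct answer, minus 1/(k - 1) per wrong answer."""
    return n_correct - n_wrong / (n_options - 1)

# Expected result of blindly guessing on 20 five-option items:
# on average 4 correct and 16 wrong, so the expected formula score is zero.
print(formula_score(n_correct=4, n_wrong=16))   # 0.0

# A test taker who answers 30 items correctly, gets 8 wrong, and omits 2:
print(formula_score(n_correct=30, n_wrong=8))   # 28.0
```

The rule removes, on average, the points gained by pure chance, but as the research summarized above indicates, its effects in practice depend on test takers’ differing willingness to guess.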

7.3.4 Scoring Errors

Any mistakes made in scoring a test will contribute to irrelevant variance. Although the accuracy of machine scoring of multiple-choice questions is now almost taken for granted, early in the history of ETS, there were some concerns with the quality of the scores produced by the scanner. Note that the formula scoring policy put special demands on the scoring machine because omitted answers and incorrect answers were treated differently. The machine needed to determine if a light mark was likely caused by an incomplete erasure (indicating intent to omit) or if the relatively light mark was indeed the intended answer. The importance of the problem may be gauged by the status of the authors, Fan, Lord, and Tucker, who devised “a system for reducing the number of errors in machine-scoring of multiple-choice answer sheets” (Fan et al. 1950, Abstract). Measuring and reducing rater-related scoring errors on essays and other constructed responses were also of very early concern. A study of the reading reliability of the College Board English Composition test was completed in 1948 (Aronson 1948; ETS 1948). In the following years, controlling irrelevant variance introduced by raters of constructed responses (whether human or machine) was the subject of a great deal of research, which is discussed in another chapter (Bejar, Chap. 18, this volume).

7.4 Construct Underrepresentation

Whereas construct-irrelevant variance describes factors that should not contribute to test scores, but do, construct underrepresentation is the opposite—failing to include factors in the assessment that should contribute to the measurement of a particular construct. If the purpose of a test or battery of tests is to assess the likelihood of success in college (i.e., the construct of interest), failure to measure the noncognitive skills that contribute to such success could be considered a case of construct underrepresentation. As noted, from the earliest days of ETS, there was interest in assessing more than just verbal and quantitative skills. In 1948, the organization’s first president, Chauncey, called for a “Census of Abilities” that would assess attributes that went beyond just verbal and quantitative skills to include “personal qualities, … drive (energy), motivation (focus of energy), conscientiousness, … ability to get along with others” (Lemann 1995, p. 84). From 1959 to 1967, ETS had a personality research group headed by Samuel Messick. The story of personality research at ETS is described in two other chapters (Kogan, Chap. 14, this volume; Stricker, Chap. 13, this volume).

Despite the apparent value of broadening the college-readiness construct beyond verbal and quantitative skills, the potential of such additional measures as a part of operational testing programs needed to be rediscovered from time to time. Frederiksen and Ward (1978) described a set of tests of scientific thinking that were developed as potential criterion measures, though they could also be thought of as additional predictors. The tests assessed both quality and quantity of ideas in formulating hypotheses and solving methodological problems. In a longitudinal study of 3,500 candidates for admission to graduate programs in psychology, scores were found to be related to self-appraisals of professional skills, professional accomplishments in collaborating in research, designing research apparatus, and publishing scientific papers. In a groundbreaking article in the American Psychologist, Norman Frederiksen (1984) expanded the argument for a broader conception of the kinds of skills that should be assessed. In the article, titled “The Real Test Bias: Influences of Testing on Teaching and Learning,” Frederiksen argued that

there is evidence that tests influence teacher and student performance and that multiple-choice tests tend not to measure the more complex cognitive abilities. The more economical multiple-choice tests have nearly driven out other testing procedures that might be used in school evaluation. (Abstract)

Another article, published in the same year, emphasized the critical role of social intelligence (Carlson et al. 1984). The importance of assessing personal qualities in addition to academic ability for predicting success in college was further advanced in a multiyear, multicampus study that was the subject of two books (Willingham 1985; Willingham and Breland 1982). This study indicated the importance of expanding both the predictor and criterion spaces. The study found that if the only criterion of interest is academic grades, SAT scores and high school grades appear to be the best available predictors, but if criteria such as leadership in school activities or artistic accomplishment are of interest, the best predictors are previous successes in those areas.

Baird (1979) proposed a measure of documented accomplishments to provide additional evidence for graduate admissions decisions. In contrast to a simple listing of accomplishments, documented accomplishments require candidates to provide verifiable evidence for their claimed accomplishments. The biographical inventory developed in earlier stages was evaluated in 26 graduate departments that represented the fields of English, biology, and psychology. Responses to the inventory were generally not related to graduate grades, but a number of inventory responses reflecting preadmission accomplishments were significantly related to accomplishments in graduate school (Baird and Knapp 1981). Lawrence Stricker and colleagues further refined measures of documented accomplishments (Stricker et al. 2001).

Moving into the twenty-first century, there was rapidly increasing interest in noncognitive assessments (Kyllonen 2005), and a group was established at ETS to deal specifically with these new constructs (or to revisit older noncognitive constructs that in earlier years had failed to gain traction in operational testing programs). The label “noncognitive” is not really descriptive; it has served as a catch-all for any assessment that goes beyond the verbal, quantitative, writing, and subject matter skills and knowledge that formed the backbone of most testing programs at ETS. Key noncognitive attributes include persistence, dependability, motivation, and teamwork. One measure that was incorporated into an operational program was the ETS® Personal Potential Index (ETS® PPI) service, which was a standardized rating system in which individuals who were familiar with candidates for graduate school, such as teachers or advisors, could rate core personal attributes: knowledge and creativity, resilience, communication skills, planning and organization, teamwork, and ethics and integrity. All students who registered to take the GRE were given free access to the PPI, and a reported study demonstrated how the diversity of graduate classes could be improved by making the PPI part of the selection criteria (Klieger et al. 2013). Despite its potential value, the vast majority of graduate schools were reluctant to require the PPI, at least in part because they were afraid of putting in place any additional requirements that they thought might discourage applicants, especially if their competition did not have a similar requirement. Because of this very low usage, ETS determined that the resources needed to support this program could be better used elsewhere and, in 2015, announced the end of the PPI as part of the GRE program. This announcement certainly did not signal an end to interest in noncognitive assessments. The SuccessNavigator® assessment, a noncognitive assessment designed to assist colleges in making course placement decisions, was in use at more than 150 colleges and universities in 2015. An ongoing research program provided evidence related to placement validity claims, reliability, and fairness of the measure’s scores and placement recommendations (e.g., Markle et al. 2013; Rikoon et al. 2014).

The extent to which writing skills are an important part of the construct of readiness for college or graduate school also has been of interest for many years. Although a multiple-choice measure of English writing conventions, the Test of Standard Written English, was administered along with the SAT starting in 1977, it was seen more as an aid to placement into English classes than as part of the battery intended for admissions decisions. Rather than the 200–800 scale used for Verbal and Mathematics tests, it had a truncated scale running from 20 to 60. By 2005, the importance of writing skills to college preparedness was recognized by the inclusion of a writing score based on both essay and multiple-choice questions and reported on the same 200–800 scale as Verbal and Mathematics. Starting in the mid-1990s, separately scored essay-based writing sections became a key feature of high-stakes admissions tests at ETS, beginning with the GMAT, then moving on to the GRE and the TOEFL iBT® test. A major reason for the introduction of TOEFL iBT in 2005 was to broaden the academic English construct assessed (i.e., reduce the construct underrepresentation) by adding sections on speaking and writing skills. By 2006, the TOEIC® tests, which are designed to evaluate English proficiency in the workplace, were also offering an essay section.

The importance of writing in providing adequate construct representation was made clear for AP tests by the discovery of nonequivalent gender differences on the multiple-choice and constructed-response sections of many AP tests (Mazzeo et al. 1993). That finding meant that a different gender mix of students would be granted AP credit depending on which item type was given more weight, or if only one question type was used. Bridgeman and Lewis (1994) noted that men scored substantially higher than women (by about half of a standard deviation) on multiple-choice portions of AP history examinations but that women and men scored almost the same on the essays and that women tended to get slightly higher grades in their college history courses. Furthermore, the composite of the multiple-choice and essay sections provided better prediction of college history grades than either section by itself for both genders. Thus, had the construct been underrepresented by a failure to include the essay section, not only would correlations have been lower, but substantially fewer women would have been granted AP credit. Bridgeman and McHale (1998) performed a similar analysis for the GMAT, demonstrating that the addition of the essay would create more opportunities for women.

8 Fairness as a Core Concern in Validity

Fairness is a thread that has run consistently through this chapter because, as Turnbull (1951) and others have noted, the concepts of fairness and validity are very closely related. As noted at a number of points in this chapter, ETS has been deeply concerned about issues of fairness and consequences for test takers as individuals throughout its existence, and these concerns have permeated its operational policies and its research program (Bennett, Chap. 1, this volume; Messick 1975, 1989, 1994a, 1998, 2000; Turnbull 1949, 1951). However, with few exceptions, measurement professionals paid little attention to fairness across groups until the 1960s (D.R. Green 1982), when this topic became a widespread concern among test developers and many test publishers instituted fairness reviews and empirical analyses to promote item and test fairness (Zieky 2006).

Messick’s (1989) fourfold analysis of the evidential and consequential bases of test score interpretations and uses gave a lot of attention to evaluations of the fairness and overall effectiveness of testing programs in achieving intended outcomes and in minimizing unintended negative consequences. As indicated earlier, ETS researchers have played a major role in developing statistical models and methodology for identifying and controlling likely sources of construct-irrelevant variance and construct underrepresentation and thereby promoting fairness and reducing bias. In doing so, they have tried to clarify how the evaluation of consequences fits into a more general validation framework .

Frederiksen (1984, 1986) made the case that objective (multiple-choice) formats tended to measure a subset of the skills important for success in various contexts but that reliance on that format could have a negative, distorting effect on instruction. He recalled that, while conducting validity studies during the Second World War, he was surprised that reading comprehension tests and other verbal tests were the best predictors of grades in gunner’s mate school. When he later visited the school, he found that the instruction was mostly lecture–demonstration based on the content of manuals, and the end-of-course tests were based on the lectures and manuals. Frederiksen’s group introduced performance tests that required students to service real guns, and grades on the end-of-course tests declined sharply. As a result, the students began assembling and disassembling guns, and the instructors “moved out the classroom chairs and lecture podium and brought in more guns and gunmounts” (Frederiksen 1984, p. 201) . Scores on the new performance tests improved. In addition, mechanical aptitude and knowledge became the best predictors of grades:

No attempt was made to change the curriculum or teacher behavior. The dramatic changes in achievement came about solely through a change in the tests. The moral is clear: It is possible to influence teaching and learning by changing the tests of achievement. (p. 201)

Testing programs can have dramatic systemic consequences, positive or negative.

Negative consequences count against a decision rule (e.g., the use of a cut score), but they can be offset by positive consequences. A program can have substantial negative consequences and still be acceptable if the benefits outweigh those costs. Negative consequences that are not offset by positive consequences tend to render a decision rule unacceptable (at least for stakeholders who are concerned about these consequences).

In reviewing a National Academy of Sciences report on ability testing (Wigdor and Garner 1982), Messick (1982b) suggested that the report was dispassionate and wise but that it “evinces a pervasive institutional bias” (p. 9) by focusing on common analytic models for selection and classification, which emphasize the intended outcomes of the decision rule:

Consider that, for the most part, the utility of a test for selection is appraised statistically in terms of the correlation coefficient between the test and the criterion … but this correlation is directly proportional to the obtained gains over random selection in the criterion performance of the selected group…. Our traditional statistics tend to focus on the accepted group and on minimizing the number of poor performers who are accepted, with little or no attention to the rejected group or those rejected individuals who would have performed adequately if given the chance. (p. 10)

Messick went on to suggest that “by giving primacy to productivity and efficiency, the Committee simultaneously downplays the significance of other important goals in education and the workplace” (p. 11). It is certainly appropriate to evaluate a decision rule in terms of the extent to which it achieves the goals of the program, but it is also important to attend to unintended effects that have potentially serious consequences.
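The statistical point in the quotation can be made concrete with a standard result from selection theory: under bivariate-normal assumptions and top-down selection on the test, the expected standardized criterion performance of the selected group, relative to random selection, is the validity coefficient multiplied by a factor that depends only on the selection ratio, so the expected gain is directly proportional to the correlation. The sketch below is illustrative only; the numbers are not taken from the report under review.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def expected_criterion_gain(validity, selection_ratio):
    """Expected standardized criterion mean of a top-down-selected group,
    relative to random selection, under bivariate-normal assumptions."""
    # Find the cut z_c such that P(Z > z_c) = selection_ratio (simple bisection).
    lo, hi = -6.0, 6.0
    for _ in range(100):
        mid = (lo + hi) / 2
        upper_tail = 0.5 * math.erfc(mid / math.sqrt(2))
        if upper_tail > selection_ratio:
            lo = mid
        else:
            hi = mid
    z_cut = (lo + hi) / 2
    return validity * normal_pdf(z_cut) / selection_ratio

# With a selection ratio of 0.5, doubling the validity coefficient doubles the expected gain.
print(round(expected_criterion_gain(0.30, 0.5), 2))
print(round(expected_criterion_gain(0.60, 0.5), 2))
```

As Messick emphasized, this calculation summarizes only the intended outcomes for the selected group; it says nothing about rejected applicants who would have performed adequately if given the chance.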

Holland (1994) and Dorans (2012) have pointed out that different stakeholders (test developers, test users, test takers) can have very different but legitimate perspectives on testing programs and on the criteria to be used in evaluating the programs. For some purposes and in some contexts, it is appropriate to think of testing programs primarily as measurement procedures designed to produce accurate and precise estimates of some variable of interest; within this measurement perspective (Dorans 2012; Holland 1994), the focus is on controlling potential sources of random error and potential sources of bias (e.g., construct-irrelevant score variance, construct underrepresentation, method effects). However, in any applied context, additional considerations are relevant. For example, test takers often view testing programs as contests in which they are competing for some desired outcome, and whether they achieve their goal or not, they want the process to be fair; Holland (1994) and Dorans (2012) referred to this alternate, and legitimate, point of view as the contest perspective.

A pragmatic perspective (Kane 2013b) focuses on how well the program, as implemented, achieves its goals and avoids unintended negative effects. The pragmatic perspective is particularly salient for testing programs that serve as the bases for high-stakes decisions in public contexts. To the extent that testing programs play important roles in the public arena, their claims need to be justified. The pragmatic perspective treats fairness as a core concern but also values objectivity (defined as the absence of subjectivity or preference); decision makers want testing procedures to be clearly relevant, fair, and practical. In general, it is important to evaluate how well testing programs work in practice, in the contexts in which they are operating (e.g., as the basis for decisions in employment, in academic selection, in placement, in licensure and certification). Testing programs can have strong effects on individuals and institutions, both positive and negative (Frederiksen 1984). The pragmatic perspective suggests identifying those effects and explicitly weighing them against one another in considering the value, or functional worth, of a testing program.

9 Concluding Remarks

ETS has been heavily involved in the development of validity theory, the creation of models for validation, and the practice of validation since the organization’s founding. All of the work involved in designing and developing tests, score scales, and the materials and procedures involved in reporting and interpreting scores contributes to the soundness and plausibility of the results. Similarly, all of the research conducted on how testing programs function, on how test scores are used, and on the impact of such uses on test takers and institutions contributes to the evaluation of the functional worth of programs.

This chapter has focused on the development of validity theory, but the theory developed out of a need to evaluate testing programs in appropriate ways, and therefore it has been based on the practice of assessment. At ETS, most theoretical innovations have come out of perceived needs to solve practical problems, for which the then current theory was inadequate or unwieldy. The resulting theoretical frameworks may be abstract and complex, but they were suggested by practical problems and were developed to improve practice.

This chapter has been organized to reflect a number of major developments in the history of validity theory and practice. The validity issues and validation models were developed during different periods, but the fact that a new issue or model appeared did not generally lead to a loss of interest in the older topics and models. The issues of fairness and bias in selection and admissions were topics of interest in the early days of ETS; their conceptualization and work on them were greatly expanded in the 1960s and 1970s, and they continue to be areas of considerable emphasis today. Although the focus has shifted and the level of attention given to different topics has varied over time, the old questions have neither died nor faded away; rather, they have evolved into more general and sophisticated analyses of the issues of meaning and values that test developers and users have been grappling with for longer than a century.

Messick shaped validity theory in the last quarter of the twentieth century; therefore this chapter on ETS’s contributions has given a lot of attention to his views, which are particularly comprehensive and complex. His unified, construct-based framework assumes that “validation in essence is scientific inquiry into score meaning—nothing more, but also nothing less” (Messick 1989, p. 56) and that “judging validity in terms of whether a test does the job it is employed to do … requires evaluation of the intended or unintended social consequences of test interpretation and use” (pp. 84–85). Much of the work on validity theory at the beginning of the twenty-first century can be interpreted as attempts to build on Messick’s unified, construct-based framework, making it easier to apply in a straightforward way so that tests can be interpreted and used to help achieve the goals of individuals, education, and society.