This book has documented the history of ETS’s contributions to educational research and policy analysis, psychology, and psychometrics. We close the volume with a brief synthesis in which we try to make more general meaning from the diverse directions that characterized almost 70 years of work.

Synthesizing the breadth and depth of the topics covered over that time period is not simple. One way to view the work is across time. Many of the book’s chapters presented chronologies, allowing the reader to follow the path of a research stream over the years. Less evident from these separate chronologies was the extent to which multiple streams of work not only coexisted but sometimes interacted.

From its inception, ETS was rooted in Henry Chauncey’s vision of describing individuals through broad assessment of their capabilities, helping them to grow and society to benefit (Elliot 2014). Chauncey’s conception of broad assessment of capability required a diverse research agenda.

Following that vision, his research managers assembled an enormous range of staff expertise. Only through the assemblage of such expertise could one bring diverse perspectives and frameworks from many fields to a problem, leading to novel solutions.

In the following sections, we summarize some of the key research streams evident in different time periods, where each period corresponds to roughly a decade. Although the segmentation of these time periods is arbitrary, it gives a general sense of the progression of topics across time.¹ Also somewhat arbitrary is the use of publication date as the primary determinant of placement into a particular decade. Although the work activity leading up to publication may well have occurred in the previous period, the result of that activity, and the impact that it had, typically came through its dissemination.

1 The Years 1948–1959

1.1 Psychometric and Statistical Methodology

As will be the case for every period, a very considerable amount of work centered on theory and on methodological development in psychometrics and statistics. With respect to the former, the release of Gulliksen’s (1950) Theory of Mental Tests deserves special mention for its codification of classical test theory. More forward looking was work to create a statistically grounded foundation for the analysis of test scores, a latent-trait theory (Lord 1952, 1953). This direction would later lead to the groundbreaking development of item response theory (IRT; Lord and Novick 1968), which became a well-established part of applied statistical research in domains well beyond education and is now an important building block of generalized modeling frameworks, which connect the item response functions of IRT with structural models (Carlson and von Davier, Chap. 5, this volume). Green’s (1950a, b) work is an early example whose continued impact is not commonly recognized. It showed how latent structure and latent-trait models are related to factor analysis, while at the same time placing latent-trait theory into the context of latent class models. Green’s insights had profound impact, reemerging outside of ETS in the late 1980s (de Leeuw and Verhelst 1986; Follman 1988; Formann 1992; Heinen 1996) and, in more recent times, at ETS in work on generalized latent variable models (Haberman et al. 2008; Rijmen et al. 2014).

In addition to theoretical development, substantial effort was focused on methodological development for, among other purposes, the generation of engineering solutions to practical scale-linking problems. Examples include Karon and Cliff’s (1957) proposal to smooth test-taker sample data before equating, a procedure used today by most testing programs that employ equipercentile equating (Dorans and Puhan, Chap. 4, this volume); Angoff’s (1953) method for equating test forms by using a miniature version of the full test as an external anchor; and Levine’s (1955) procedures for linear equating under the common-item, nonequivalent-population design.
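The equipercentile idea that underlies such linking work can be sketched in a few lines. The following is a minimal, hypothetical illustration (toy frequencies, not ETS data): each form-X score point is mapped to the form-Y score point with the nearest percentile rank. Operational procedures add presmoothing of the sample distributions, as in Karon and Cliff’s proposal, and interpolation between score points.

```python
def percentile_ranks(freqs):
    """Percentile rank of each score point: proportion below plus half
    the proportion exactly at that score (the usual discrete convention)."""
    total = sum(freqs)
    ranks, below = [], 0
    for f in freqs:
        ranks.append((below + f / 2) / total)
        below += f
    return ranks

def equipercentile_equate(x_freqs, y_freqs):
    """Map each score point on form X to the form-Y score point whose
    percentile rank is closest (a crude discrete approximation)."""
    px = percentile_ranks(x_freqs)
    py = percentile_ranks(y_freqs)
    mapping = []
    for p in px:
        # nearest form-Y score point in percentile rank
        j = min(range(len(py)), key=lambda k: abs(py[k] - p))
        mapping.append(j)
    return mapping

# Toy frequency distributions on two 5-point forms (hypothetical data)
x = [2, 5, 10, 5, 2]   # form X frequencies for scores 0..4
y = [1, 3, 8, 7, 5]    # form Y: an easier form, scores piled higher
print(equipercentile_equate(x, y))
```

Because form Y is easier in this toy example, middle and high form-X scores map to higher form-Y score points.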

1.2 Validity and Validation

In ETS’s first 2 years, at the close of the 1940s, and in the 1950s that followed, great emphasis was placed on predictive studies, particularly for success in higher education. Studies were conducted against first-semester performance (Frederiksen 1948) as well as 4-year academic criteria (French 1958). As Kane and Bridgeman (Chap. 16, this volume) noted, this emphasis was very much in keeping with conceptions of validity at the time, and it was, of course, important to evaluating the meaning and utility of scores produced by the new organization’s operational testing programs. However, also getting attention were studies to facilitate trait interpretations of scores (French et al. 1952). These interpretations posited that response consistencies were the result of test-taker dispositions to behave in certain ways in response to certain tasks, dispositions that could be investigated through a variety of methods, including factor analysis. Finally, the compromising effects of construct-irrelevant influences, in particular those due to coaching, were already a clear concern (Dear 1958; French and Dear 1959).

1.3 Constructed-Response Formats and Performance Assessment

Notably, staff interests at this time were not restricted to multiple-choice tests because, as Bejar (Chap. 18, this volume) pointed out, the need to evaluate the value of additional methods was evident. Work on constructed-response formats and performance assessment was undertaken (Ryans and Frederiksen 1951), including development of the in-basket test (Frederiksen et al. 1957), subsequently used throughout the world for job selection, and a measure of the ability to formulate hypotheses as an indicator of scientific thinking (Frederiksen 1959). Research on direct writing assessment (e.g., through essay testing) was also well under way (Diederich 1957; Huddleston 1952; Torgerson and Green 1950).

1.4 Personal Qualities

Staff interests were not restricted to the verbal and quantitative abilities underlying ETS’s major testing programs, the Scholastic Aptitude Test (the SAT® test) and the GRE® General Test. Rather, a broad investigative program on what might be termed personal qualities was initiated. Cognition, more generally defined, was one key interest, as evidenced by publication of the Kit of Selected Tests for Reference Aptitude and Achievement Factors (French 1954). The Kit was a compendium of marker assessments investigated with sufficient thoroughness that they could be used in factor analytic studies of cognition, allowing results to be compared more directly across studies. Multiple reference measures were provided for each factor, including measures of abilities in the reasoning, memory, spatial, verbal, numeric, motor, mechanical, and ideational fluency domains.

In addition, substantial research targeted a wide variety of other human qualities. This research included personality traits, interests, social intelligence, motivation, leadership, level of aspiration and need for achievement, and response styles (acquiescence and social desirability), among other things (French 1948, 1956; Hills 1958; Jackson and Messick 1958; Melville and Frederiksen 1952; Nogee 1950; Ricciuti 1951).

2 The Years 1960–1969

2.1 Psychometric and Statistical Methodology

If nothing else, this period was notable for the further development of IRT (Lord and Novick 1968). That development is one of the major milestones of psychometric research. Although the organization made many important contributions to classical test theory, today psychometrics around the world mainly uses IRT-based methods, more recently in the form of generalized latent variable models. One of the important differences from classical approaches is that IRT properly grounds the treatment of categorical data in probability theory and statistics. The theory’s modeling of how responses statistically relate to an underlying variable allows for the application of powerful methods for generalizing test results and evaluating the assumptions made. IRT-based item functions are the building blocks that link item responses to underlying explanatory models (Carlson and von Davier, Chap. 5, this volume). Leading up to and concurrent with the seminal volume Statistical Theories of Mental Test Scores (Lord and Novick 1968), Lord continued to make key contributions to the field (Lord 1965a, b, 1968a, b).
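As a concrete illustration of an item response function, the sketch below evaluates the two-parameter logistic model, one common IRT form; the parameter values are hypothetical and chosen only to show the behavior.

```python
import math

def irf_2pl(theta, a, b):
    """Two-parameter logistic item response function: probability of a
    correct response given ability theta, discrimination a, and
    difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# At theta == b the probability is exactly 0.5, whatever a is
print(irf_2pl(0.0, a=1.2, b=0.0))

# Higher ability implies higher success probability on the same item
print(irf_2pl(1.0, a=1.2, b=0.0) > irf_2pl(-1.0, a=1.2, b=0.0))
```

The monotone link between the latent variable and the response probability is what permits the model-based generalization and assumption checking described above.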

In addition to the preceding landmark developments, a second major achievement was the invention of confirmatory factor analysis by Karl Jöreskog (1965, 1967, 1969), a method for rigorously evaluating hypotheses about the latent structure underlying a measure or collection of measures. This invention would be generalized in the next decade and applied to the solution of a great variety of measurement and research problems.

2.2 Large-Scale Survey Assessments of Student and Adult Populations

In this period, ETS contributed to the design and conducted the analysis of the Equality of Educational Opportunity Study (Beaton and Barone, Chap. 8, this volume). Also of note was that, toward the end of the decade, ETS’s long-standing program of longitudinal studies began with initiation of the Head Start Longitudinal Study (Anderson et al. 1968). This study followed a sample of children from before preschool enrollment through their experience in Head Start, in another preschool, or in no preschool program.

2.3 Validity and Validation

The 1960s saw continued interest in prediction studies (Schrader and Pitcher 1964), though noticeably less than in the prior period. The study of construct-irrelevant factors that had concentrated largely on coaching was less evident, with interest emerging in the phenomenon of test anxiety (French 1962). Of special note is that, due to the general awakening in the country over civil rights, ETS research staff began to focus on developing conceptions of equitable treatment of individuals and groups (Cleary 1968).

2.4 Constructed-Response Formats and Performance Assessment

The 1960s saw much investigation of new forms of assessment, including in-basket performance (Frederiksen 1962; L. B. Ward 1960), formulating-hypotheses tasks (Klein et al. 1969), and direct writing assessment. As described by Bejar (Chap. 18, this volume), writing assessment deserves special mention for the landmark study by Diederich et al. (1961) documenting that raters brought “schools of thought” to the evaluation of essays, thereby initiating interest in the investigation of rater cognition, or the mental processes underlying essay grading. A second landmark was the study by Godshalk et al. (1966) that resulted in the invention of holistic scoring.

2.5 Personal Qualities

The 1960s brought a very substantial increase in work in this area. The work on cognition produced the 1963 “Kit of Reference Tests for Cognitive Factors” (French et al. 1963), the successor to the 1954 “Kit.” Much activity concerned the measurement of personality specifically, although a range of related topics was also investigated, including continued work on response styles (Damarin and Messick 1965; Jackson and Messick 1961; Messick 1967), the introduction into the social–psychological literature of the concept of prosocial (or altruistic) behavior (Bryan and Test 1967; Rosenhan 1969; Rosenhan and White 1967), and risk taking (Kogan and Doise 1969; Kogan and Wallach 1964; Wallach et al. 1962). Also of note is that this era saw the beginnings of ETS’s work on cognitive styles (Gardner et al. 1960; Messick and Fritzky 1963; Messick and Kogan 1966). Finally, a research program on creativity began to emerge (Skager et al. 1965, 1966), including Kogan’s studies of young children (Kogan and Morgan 1969; Wallach and Kogan 1965), a precursor to the extensive line of developmental research that would appear in the following decade.

2.6 Teacher and Teaching Quality

Although ETS had been administering the National Teachers Examination since the organization’s inception, relatively little research had been conducted around the evaluation of teaching and teachers. The 1960s saw the beginnings of such research, with investigations of personality (Walberg 1966), values (Sprinthall and Beaton 1966), and approaches to the behavioral observation of teaching (Medley and Hill 1967).

3 The Years 1970–1979

3.1 Psychometric and Statistical Methodology

Causal inference was a major area of research in the field of statistics generally in this decade, and that activity included ETS. Rubin (1974b, 1976a, b, c, 1978) made fundamental contributions to the approach that allows for evaluating the extent to which differences observed between groups can be attributed to the effects of treatments rather than to other underlying variables.

More generally, causal inference as treated by Rubin can be understood as a missing-data and imputation problem. The estimation of quantities under incomplete-data conditions was a chief focus, as seen in work by Rubin (1974a, 1976a, b) and his collaborators (Dempster et al. 1977), who created the expectation-maximization (EM) algorithm, which has become a standard analytical method used not only in estimating modern psychometric models but throughout the sciences. As of this writing, the Dempster et al. (1977) article had more than 45,000 citations in Google Scholar.
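A minimal sketch of the EM idea, assuming a two-component Gaussian mixture with known common variance and equal weights (a textbook special case, not the general algorithm of Dempster et al.): the E-step treats the unknown component memberships as missing data and computes their posterior probabilities, and the M-step re-estimates the means given those probabilities.

```python
import math

def em_two_gaussians(data, mu1, mu2, iters=50, sigma=1.0):
    """Minimal EM loop for a two-component Gaussian mixture with known
    common variance and equal mixing weights."""
    for _ in range(iters):
        # E-step: posterior probability each point came from component 1
        resp = []
        for x in data:
            p1 = math.exp(-((x - mu1) ** 2) / (2 * sigma ** 2))
            p2 = math.exp(-((x - mu2) ** 2) / (2 * sigma ** 2))
            resp.append(p1 / (p1 + p2))
        # M-step: responsibility-weighted means
        w1 = sum(resp)
        w2 = len(data) - w1
        mu1 = sum(r * x for r, x in zip(resp, data)) / w1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / w2
    return mu1, mu2

# Two well-separated clusters; EM should recover means near 0 and 5
data = [-0.2, 0.1, 0.0, 0.2, 4.8, 5.1, 5.0, 5.2]
print(em_two_gaussians(data, mu1=1.0, mu2=4.0))
```

Each iteration provably does not decrease the observed-data likelihood, which is the property that makes EM attractive for the latent-variable models discussed throughout this chapter.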

Also falling under causal inference was Rubin’s work on matching. Matching was developed to reduce bias in causal inferences using data from nonrandomized studies. Rubin’s (1974b, 1976a, b, c, 1979) work was central to evaluating and improving this methodology.

Besides landmark contributions to causal inference, continued development of IRT was taking place. In addition to a further series of papers by Lord (1970, 1973, 1974a, b, 1975a, b, 1977), several applications of IRT were studied, including for linking test forms (Marco 1977; see also Carlson and von Davier, Chap. 5, this volume). Visiting scholars made seminal contributions as well, among them work on testing the Rasch model and on bias in estimates (Andersen 1972, 1973), ideas later generalized by scholars elsewhere (Haberman 1977).

Finally, this period saw Karl Jöreskog and colleagues implement confirmatory factor analysis (CFA) in the LISREL computer program (Jöreskog and van Thillo 1972) and generalize CFA for the analysis of covariance structures (Jöreskog 1970), path analysis (Werts et al. 1973), simultaneous factor analysis in several populations (Jöreskog 1971), and the measurement of growth (Werts et al. 1972). Their inventions, particularly LISREL, continue to be used throughout the social sciences within the general framework of structural equation modeling to pose and evaluate psychometric, psychological, sociological, and econometric theories and the hypotheses they generate.

3.2 Large-Scale Survey Assessments of Student and Adult Populations

Worthy of note were two investigations, one a continuation from the previous decade. That continuation, the Head Start Longitudinal Study, was documented in a series of program reports (Emmerich 1973; Shipman 1972; Ward 1973). Also conducted was the National Longitudinal Study of the High School Class of 1972 (Rock, Chap. 10, this volume).

3.3 Validity and Validation

In this period, conceptions of validity, and concerns for validation, were expanding. With respect to conceptions of validity, Messick’s (1975) seminal paper “The Standard Problem: Meaning and Values in Measurement and Evaluation” called attention to the importance of construct interpretations in educational measurement, a perspective largely missing from the field at that time. As to validation, concerns over the effects of coaching reemerged with research finding that two quantitative item types being considered for the SAT were susceptible to short-term preparation (Evans and Pike 1973), thus challenging the College Board’s position on the existence of such effects. Concerns for validation also grew with respect to test fairness and bias, with continued development of conceptions and methods for investigating these issues (Linn 1973, 1976; Linn and Werts 1971).

3.4 Constructed-Response Formats and Performance Assessment

Relatively little attention was given to this area. An exception was continued investigation of the formulating-hypotheses item type (Evans and Frederiksen 1974; Ward et al. 1980).

3.5 Personal Qualities

The 1970s saw the continuation of a significant research program on personal qualities. With respect to cognition, the third version of the “Factor Kit” was released in 1976: the “Kit of Factor-Referenced Cognitive Tests” (Ekstrom et al. 1976). Work on other qualities continued, including on prosocial behavior (Rosenhan 1970, 1972) and risk taking (Kogan et al. 1972; Lamm and Kogan 1970; Zaleska and Kogan 1971). Of special note was the addition to the ETS staff of Herman Witkin and colleagues, who significantly extended the prior decade’s work on cognitive styles (Witkin et al. 1974, 1977; Zoccolotti and Oltman 1978). Work on kinesthetic aftereffect (Baker et al. 1976, 1978, 1979) and creativity (Frederiksen and Ward 1978; Kogan and Pankove 1972; Ward et al. 1972) was also under way.

3.6 Human Development

The 1970s saw the advent of a large work stream that would extend over several decades. This work stream might be seen as a natural extension of Henry Chauncey’s interest in human abilities, broadly conceived; that is, to understand human abilities, it made sense to study from where those abilities emanated. That stream, described in detail by Kogan et al. (Chap. 15, this volume), included research in many areas. In this period, it focused on infants and young children, encompassing their social development (Brooks and Lewis 1976; Lewis and Brooks-Gunn 1979), emotional development (Lewis 1977; Lewis et al. 1978; Lewis and Rosenblum 1978), cognitive development (Freedle and Lewis 1977; Lewis 1977, 1978), and parental influences (Laosa 1978; McGillicuddy-DeLisi et al. 1979).

3.7 Educational Evaluation and Policy Analysis

One of the more notable characteristics of ETS research in this period was the emergence of educational evaluation, in good part due to an increase in policy makers’ interest in appraising the effects of investments in educational interventions. This work, described by Ball (Chap. 11, this volume), entailed large-scale evaluations of television programs like Sesame Street and The Electric Company (Ball and Bogatz 1970, 1973) and early computer-based instructional systems like PLATO and TICCIT (Alderman 1978; Murphy 1977), as well as a wide range of smaller studies (Marco 1972; Murphy 1973). Some of the accumulated wisdom gained in this period was synthesized in two books, the Encyclopedia of Educational Evaluation (Anderson et al. 1975) and The Profession and Practice of Program Evaluation (Anderson and Ball 1978).

Alongside the intense evaluation activity was the beginning of a work stream on policy analysis (see Coley et al., Chap. 12, this volume). That beginning concentrated on education finance (Goertz 1978; Goertz and Moskowitz 1978).

3.8 Teacher and Teaching Quality

Rounding out the very noticeable expansion of research activity that characterized the 1970s were several lines of work on teachers and teaching. One line concentrated on evaluating the functioning of the National Teachers Examination (NTE; Quirk et al. 1973). A second line revolved around observing and analyzing teaching behavior (Quirk et al. 1971, 1975). This line included the Beginning Teacher Evaluation Study, a portion of which was conducted by ETS under contract to the California Commission for Teacher Preparation and Licensing; its purpose was to identify teaching behaviors effective in promoting learning in reading and mathematics in elementary schools. The study included extensive classroom observation and analysis of the relations among the observed behaviors, teacher characteristics, and student achievement (McDonald and Elias 1976; Sandoval 1976). The final line of research concerned college teaching (Baird 1973; Centra 1974).

4 The Years 1980–1989

4.1 Psychometric and Statistical Methodology

As was true for the 1970s, in this decade, ETS methodological innovation was notable for its far-ranging impact. Lord (1980) furthered the development and application of IRT, with particular attention to its use in addressing a wide variety of testing problems, among them parameter estimation, linking, evaluation of differential item functioning (DIF), and adaptive testing. Holland (1986, 1987), as well as Holland and Rubin (1983), continued the work on causal inference, further developing its philosophical and epistemological foundations, including exploration of a long-standing statistical paradox described by Lord (1967).² An edited volume, Drawing Inferences From Self-Selected Samples (Wainer 1986), collected work on these issues.

Rubin’s work on matching, particularly propensity score matching, was a key activity through this decade. Rubin (1980a), as well as Rosenbaum and Rubin (1984, 1985), made important contributions to this methodology. These widely cited publications outlined approaches that are frequently used in scientific research when experimental manipulation is not possible.
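The matching step of propensity score methods can be illustrated with a short sketch. Here the propensity scores are assumed to have been estimated already (typically via logistic regression on observed covariates); the unit labels and score values are hypothetical.

```python
def nearest_neighbor_match(treated, controls):
    """Greedy 1:1 nearest-neighbor matching on estimated propensity
    scores: each treated unit is paired with the closest remaining
    control unit. Inputs are {unit_id: propensity_score} dicts."""
    available = dict(controls)
    pairs = {}
    # process treated units in score order so early picks do not
    # starve later, closely matched ones
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        pairs[t_id] = c_id
        del available[c_id]
    return pairs

# Hypothetical propensity scores (e.g., from a logistic regression)
treated = {"T1": 0.80, "T2": 0.30}
controls = {"C1": 0.28, "C2": 0.55, "C3": 0.79}
print(nearest_neighbor_match(treated, controls))
```

Comparing outcomes within the matched pairs, rather than between the raw groups, is what reduces the bias from nonrandom treatment assignment.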

Building on his research of the previous decade, Rubin (1980b, c) developed “multiple imputation,” a statistical technique for dealing with nonresponse by generating random draws from the posterior distribution of a variable, given other variables. The multiple-imputation methodology forms the underlying basis for several major group-score assessments (i.e., tests for which the focus of inference is on population, rather than individual, performance), including the National Assessment of Educational Progress (NAEP), the Programme for International Student Assessment (PISA), and the Programme of International Assessment of Adult Competencies (PIAAC; Beaton and Barone, Chap. 8, this volume; Kirsch et al., Chap. 9, this volume).
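A toy sketch of the multiple-imputation idea, assuming a single variable and a simple normal approximation to its predictive distribution (real applications draw from a full posterior given other variables and combine within- and between-imputation variances with Rubin’s rules):

```python
import random
import statistics

def multiply_impute(values, m=5, seed=0):
    """Create m completed datasets by drawing each missing value (None)
    from a normal approximation based on the observed values, then pool
    the m estimates by averaging (the first of Rubin's combining rules)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    mu = statistics.mean(observed)
    sd = statistics.stdev(observed)
    estimates = []
    for _ in range(m):
        completed = [v if v is not None else rng.gauss(mu, sd)
                     for v in values]
        estimates.append(statistics.mean(completed))
    return sum(estimates) / m, estimates

# Toy variable with two nonrespondents
pooled, ests = multiply_impute([10.0, 12.0, 11.0, None, 13.0, None])
print(round(pooled, 2))
```

Drawing several imputations rather than one is what lets the analyst propagate the uncertainty due to nonresponse, the property exploited by the plausible-values machinery of the group-score assessments named above.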

Also of note was the emergence of DIF as an important methodological research focus. The standardization method (Dorans and Kulick 1986), and the more statistically grounded technique of Mantel and Haenszel (1959) proposed for DIF analysis by Holland and Thayer (1988), became stock approaches used by operational testing programs around the world for assessing item-level fairness. Finally, the research community working on DIF was brought together for an invited conference in 1989 at ETS.
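The Mantel–Haenszel common odds ratio at the heart of this DIF procedure is straightforward to compute. The sketch below uses hypothetical counts, with one 2×2 table (reference vs. focal group, correct vs. incorrect) per matched score stratum.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across matched score strata.
    Each stratum is a tuple (ref_correct, ref_incorrect, focal_correct,
    focal_incorrect); a value near 1.0 suggests little DIF."""
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        num += a * d / n   # reference-favoring cell product
        den += b * c / n   # focal-favoring cell product
    return num / den

# Two toy score strata in which both groups perform alike -> ratio ~ 1
strata = [(30, 10, 15, 5), (20, 20, 10, 10)]
print(mantel_haenszel_odds_ratio(strata))
```

In operational reporting the ratio is often transformed to the delta metric (e.g., −2.35 times its natural log) so that zero indicates no DIF.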

Although there were a large number of observed-score equating studies in the 1980s, one development stands out in that it foreshadowed a line of research undertaken more than a decade later. The method of kernel equating was introduced by Holland and Thayer (1989) as a general procedure that combines smoothing, modeling, and transforming score distributions. This combination of statistical procedures was intended to provide a flexible tool for observed-score equating in a nonequivalent-groups anchor-test design.
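The continuization-and-inversion idea behind kernel equating can be sketched as follows, assuming a Gaussian kernel and toy score distributions (operational kernel equating also presmooths the discrete distributions and chooses the bandwidth h data-dependently):

```python
import math

def kernel_cdf(x, scores, probs, h=0.6):
    """Gaussian-kernel continuization of a discrete score distribution:
    each score point contributes a normal CDF centered at that point."""
    phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    return sum(p * phi((x - s) / h) for s, p in zip(scores, probs))

def kernel_equate(x, x_scores, x_probs, y_scores, y_probs, h=0.6):
    """Map a form-X score to the form-Y score with the same continuized
    percentile rank, via bisection on the monotone Y CDF."""
    target = kernel_cdf(x, x_scores, x_probs, h)
    lo, hi = min(y_scores) - 5, max(y_scores) + 5
    for _ in range(80):
        mid = (lo + hi) / 2
        if kernel_cdf(mid, y_scores, y_probs, h) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Identical toy distributions: equating should return (about) the input
scores = [0, 1, 2, 3]
probs = [0.1, 0.4, 0.4, 0.1]
print(round(kernel_equate(1.5, scores, probs, scores, probs), 3))
```

Continuizing the discrete distributions first is what lets the equipercentile mapping be defined and inverted at every score value, not just at observed score points.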

4.2 Large-Scale Survey Assessments of Student and Adult Populations

ETS was first awarded the contract for NAEP in 1983 after evaluating previous NAEP analytic procedures and releasing A New Design for a New Era (Messick et al. 1983). The award set the stage for advances in assessment design and psychometric methodology, including extensions of latent-trait models that employed covariates. These latent regression models used maximum likelihood methods to estimate population parameters from observed item responses without estimating individual ability parameters for test takers (Mislevy 1984, 1985). Many of the approaches developed for NAEP were later adopted by other national and international surveys, including the Progress in International Reading Literacy Study (PIRLS), the Trends in International Mathematics and Science Study (TIMSS), PISA, and PIAAC. These surveys are either directly modeled on NAEP or based on other surveys that were themselves direct derivatives of NAEP.

The major design and analytic features shared by these surveys include (a) a balanced incomplete block design that allows broad coverage of content frameworks , (b) use of modern psychometric methods to link across the multiple test forms covering this content, (c) integration of cognitive tests and respondent background data using those psychometric methods, and (d) a focus on student (and adult) populations rather than on individuals as the targets of inference and reporting.

Two related developments should be mentioned. The chapters by Kirsch et al. (Chap. 9, this volume) and Rock (Chap. 10, this volume) presented in more detail work on the 1984 Young Adult Literacy Study (YALS) and the 1988 National Educational Longitudinal Study, respectively. These studies also used multiple test forms and advanced psychometric methods based on IRT. Moreover, YALS was the first to apply a multidimensional item response model (Kirsch and Jungeblut 1986).

4.3 Validity and Validation

The 1980s saw the culmination of Messick’s landmark unified model (Messick 1989), which framed validity as a unitary concept. The highlight of the period, Messick’s chapter in Educational Measurement, brought together the major strands of validity theory, significantly influencing conceptualization and practice throughout the field.

Also in this period, research on coaching burgeoned in response to widespread public and institutional user concerns (see Powers, Chap. 17, this volume). Notable was publication of The Effectiveness of Coaching for the SAT: Review and Reanalysis of Research From the Fifties to the FTC (Messick 1980), though many other studies were also released (Alderman and Powers 1980; Messick 1982; Powers 1985; Powers and Swinton 1984; Swinton and Powers 1983). Other sources of construct-irrelevant variance were investigated, particularly test anxiety (Powers 1988). Finally, conceptions of fairness became broader still, motivated by concerns over the flagging of scores from admissions tests that were administered under nonstandard conditions to students with disabilities; these concerns had been raised most prominently by a National Academy of Sciences panel (Sherman and Robinson 1982). Most pertinent was the 4-year program of research on the meaning and use of such test scores for the SAT and GRE General Test that was initiated in response to the panel’s report. Results were summarized in the volume Testing Handicapped People by Willingham et al. (1988).

4.4 Constructed-Response Formats and Performance Assessment

Several key publications highlighted this period. Frederiksen’s (1984) American Psychologist article “The Real Test Bias: Influences of Testing on Teaching and Learning” made the argument for the use of response formats in assessment that more closely approximated the processes and outcomes important for success in academic and work environments. This classic article anticipated the K–12 performance assessment movement of the 1990s and its 2010 resurgence in the Common Core Assessments. Also noteworthy were Breland’s (1983) review showing the incremental predictive value of essay tasks over multiple-choice measures at the postsecondary level and his comprehensive study of the psychometric characteristics of such tasks (Breland et al. 1987). The Breland et al. volume included analyses of rater agreement, generalizability, and dimensionality. Finally, while research continued on the formulating-hypotheses item type (Ward et al. 1980), the investigation of portfolios also emerged (Camp 1985).

4.5 Personal Qualities

Although investigation of cognitive style continued in this period (Goodenough et al. 1987; Messick 1987; Witkin and Goodenough 1981), the death of Herman Witkin in 1979 removed its intellectual leader and champion, contributing to its decline. This decline coincided with a drop in attention to personal qualities research more generally, following a shift in ETS management priorities from the very clear think tank orientation of the 1960s and 1970s to a greater focus on research to assist existing testing programs and the creation of new ones. That focus remained centered largely on traditional academic abilities, though limited research proceeded on creativity (Baird and Knapp 1981; Ward et al. 1980).

4.6 Human Development

Whereas the research on personal qualities noticeably declined, the work on human development remained vibrant, at least through the early part of this period, in large part due to the availability of external funding and staff members highly skilled at attracting it. With a change in management focus, the reassignment of some developmental staff to other work, and the subsequent departure of the highly prolific Michael Lewis, interest began to subside. Still, this period saw a considerable amount and diversity of research covering social development (Brooks-Gunn and Lewis 1981; Lewis and Feiring 1982), emotional development (Feinman and Lewis 1983; Lewis and Michalson 1982), cognitive development (Lewis and Brooks-Gunn 1981a, b; Sigel 1982), sexual development (Brooks-Gunn 1984; Brooks-Gunn and Warren 1988), development of Chicano children (Laosa 1980a, 1984), teenage motherhood (Furstenberg et al. 1987), perinatal influences (Brooks-Gunn and Hearn 1982), parental influences (Brody et al. 1986; Laosa 1980b), atypical development (Brinker and Lewis 1982; Brooks-Gunn and Lewis 1982), and interventions for vulnerable children (Brooks-Gunn et al. 1988; Lee et al. 1988).

4.7 Educational Evaluation and Policy Analysis

As with personal qualities, the evaluation of educational programs began to decline during this period. In contrast to the work on personal qualities, evaluation activities had been almost entirely funded through outside grants and contracts, which diminished considerably in the 1980s. In addition, the organization’s most prominent evaluator, Samuel Ball, departed to take an academic appointment in his native Australia. The work that remained investigated the effects of instructional software like the IBM Writing to Read program (Murphy and Appel 1984), educational television (Murphy 1988), alternative higher education programs (Centra and Barrows 1982), professional training (Campbell et al. 1982), and the educational integration of students with severe disabilities (Brinker and Thorpe 1984).

Whereas funding for evaluation was in decline, support for policy analysis grew. Among other things, this work covered finance (Berke et al. 1984), teacher policy (Goertz et al. 1984), education reform (Goertz 1989), gender equity (Lockheed 1985), and access to and participation in graduate education (Clewell 1987).

4.8 Teacher and Teaching Quality

As with program evaluation, the departure of key staff during this period resulted in diminished activity, with only limited attention given to the three dominant lines of research of the previous decade: functioning of the NTE (Rosner and Howey 1982), classroom observation (Medley and Coker 1987; Medley et al. 1981), and college teaching (Centra 1983). Of particular note was Centra and Potter’s (1980) article “School and Teacher Effects: An Interrelational Model,” which proposed an early structural model for evaluating input and context variables in relation to achievement.

5 The Years 1990–1999

5.1 Psychometric and Statistical Methodology

DIF continued to be an important methodological research focus. In the early part of the period, an edited volume, Differential Item Functioning, was released based on the 1989 DIF conference (Holland and Wainer 1993). Among other things, the volume included research on the Mantel–Haenszel (1959) procedure. Other publications, including on the standardization method, have had continued impact on practice (Dorans and Holland 1993; Dorans et al. 1992). Finally, of note were studies that placed DIF into model-based frameworks. The use of mixture models (Gitomer and Yamamoto 1991; Mislevy and Verhelst 1990; Yamamoto and Everson 1997), for example, illustrated how to relax invariance assumptions and test DIF in generalized versions of item response models.

Among the notable methodological book publications of this period was Computerized Adaptive Testing: A Primer, edited by Wainer et al. (1990). This volume contained several chapters by ETS staff members and their colleagues.

Also worthy of mention was research on extended IRT models, which resulted in several major developments. Among these developments were the generalized partial credit model (Muraki 1992), extensions of mixture IRT models (Bennett et al. 1991; Gitomer and Yamamoto 1991; Yamamoto and Everson 1997), and models that were foundational for subsequent generalized modeling frameworks. Several chapters in the edited volume Test Theory for a New Generation of Tests (Frederiksen et al. 1993) described developments around these extended IRT models.
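
To give a flavor of the kind of model involved, the generalized partial credit model assigns a probability to each score category of a polytomous item from a discrimination parameter and a set of step difficulties. A minimal sketch, with our own notation and function name:

```python
import math

def gpcm_probs(theta, a, b):
    """Category response probabilities under a generalized partial
    credit model. `theta` is the latent trait, `a` the item's
    discrimination, and `b` the step difficulties b_1..b_m for an item
    scored 0..m."""
    # Cumulative sums of a*(theta - b_v); category 0 has an empty sum = 0.
    z = [0.0]
    for b_v in b:
        z.append(z[-1] + a * (theta - b_v))
    denom = sum(math.exp(v) for v in z)
    return [math.exp(v) / denom for v in z]
```

With a single step difficulty the model reduces to a two-parameter dichotomous model, e.g. `gpcm_probs(0.0, 1.0, [0.0])` gives equal probabilities of 0.5 for the two categories.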

5.2 Large-Scale Survey Assessments of Student and Adult Populations

NAEP entered its second decade with the new design and analysis methodology introduced by ETS. Articles describing these methodological innovations were published in a special issue of the Journal of Educational Statistics (Mislevy et al. 1992b; Yamamoto and Mazzeo 1992). Many of these articles remain standard references, used as a basis for extending the methods and procedures of group-score assessments. In addition, Mislevy (1991, 1993a, b) continued work on related issues.

A significant extension to the large-scale assessment work was a partnership with Statistics Canada that resulted in development of the International Adult Literacy Survey (IALS). IALS collected data in 23 countries or regions of the world, 7 in 1994 and an additional 16 in 1996 and 1998 (Kirsch et al., Chap. 9, this volume). Also in this period, ETS research staff helped the International Association for the Evaluation of Educational Achievement (IEA) move the TIMSS 1995 and 1999 assessments to a more general IRT model, later described by Yamamoto and Kulick (2002). Finally, this period saw the beginning of the Early Childhood Longitudinal Study, Kindergarten Class of 1998–1999 (ECLS-K), which followed students through the eighth grade (Rock, Chap. 10, this volume).

5.3 Validity and Validation

Following the focus on constructs advocated by Messick’s (1989) chapter, the 1990s saw a shift in thinking that resulted in concerted attempts to ground assessment design in domain theory, particularly in domains in which design had previously been driven by content frameworks. Such theories often offered a deeper and clearer description of the cognitive components that made for domain proficiency and of the relationships among those components. A grounding in cognitive-domain theory offered special advantages for highly interactive assessments like simulations, which are expensive to develop and become dramatically more so without theory to guide task creation and scoring. From Messick (1994a), and from work on an intelligent tutoring system that combined domain theory with rigorous probability models (Gitomer et al. 1994), the foundations of evidence-centered design (ECD) emerged (Mislevy 1994, 1996). ECD, a methodology for rigorously reasoning from assessment claims to task development, and from item responses back to claims, is now used throughout the educational assessment community as a means of creating a stronger validity argument a priori.

During this same period, other investigators explored how to estimate predictive validity coefficients by taking into account differences in grading standards across college courses (Ramist et al. 1994). Finally, fairness for population groups remained in focus, with continued attention to admissions testing for students with disabilities (Bennett 1999) and release of the book Gender and Fair Assessment by Willingham and Cole (1997), which comprehensively examined the test performance of males and females to identify potential sources of unfairness and possible solutions.

5.4 Constructed-Response Formats and Performance Assessment

At both the K–12 and postsecondary levels, interest in moving beyond multiple-choice measures was widespread. ETS work reflected that interest and, in turn, contributed to it. Highlights included Messick’s (1994a) paper on evidence and consequences in the validation of performance assessments, which provided part of the conceptual basis for the invention of ECD, and publication of the book Construction Versus Choice in Cognitive Measurement (Bennett and Ward 1993), framing the breadth of issues implicated in the use of non-multiple-choice formats.

In this period, many aspects of the functioning of constructed-response formats were investigated, including construct equivalence (Bennett et al. 1991; Bridgeman 1992), population invariance (Breland et al. 1994; Bridgeman and Lewis 1994), and the effects of allowing test takers a choice in task selection (Powers and Bennett 1999). Work covered a variety of presentation and response formats, including formulating hypotheses (Bennett and Rock 1995), portfolios (Camp 1993; LeMahieu et al. 1995), and simulations for occupational and professional assessment (Steinberg and Gitomer 1996).

Appearing in this decade were ETS’s first attempts at automated scoring, including of computer science subroutines (Braun et al. 1990), architectural designs (Bejar 1991), mathematical step-by-step solutions and expressions (Bennett et al. 1997; Sebrechts et al. 1991), short-text responses (Kaplan 1992), and essays (Kaplan et al. 1995). By the middle of the decade, the work on scoring architectural designs had been implemented operationally as part of the National Council of Architectural Registration Boards’ Architect Registration Examination (Bejar and Braun 1999). Also introduced at the end of the decade into the Graduate Management Admission Test was the e-rater® automated scoring engine, an approach to automated essay scoring (Burstein et al. 1998). The e-rater scoring engine continues to be used operationally for the GRE General Test Analytical Writing Assessment, the TOEFL® test, and other examinations.

5.5 Personal Qualities

Interest in this area had been in decline since the 1980s. The 1990s brought an end to the cognitive styles research, with only a few publications released (Messick 1994b, 1996). Some research on creativity continued (Bennett and Rock 1995; Enright et al. 1998).

5.6 Human Development

As noted, work in this area also began to decline in the 1980s. The 1990s saw interest diminish further with the departure of Jeanne Brooks-Gunn, whose extensive publications covered an enormous substantive range. Still, a significant amount of research was completed, including on parental influences and beliefs (Sigel 1992), representational competence (Sigel 1999), the distancing model (Sigel 1993), the development of Chicano children (Laosa 1990), and adolescent sexual, emotional, and social development (Brooks-Gunn 1990).

5.7 Education Policy Analysis

This period saw the continuation of a vibrant program of policy studies. Multiple areas were targeted, including finance (Barton et al. 1991), teacher policy (Bruschi and Coley 1999), education reform (Barton and Coley 1990), education technology (Coley et al. 1997), gender equity (Clewell et al. 1992), education and the economy (Carnevale 1996; Carnevale and DesRochers 1997), and access to and participation in graduate education (Ekstrom et al. 1991; Nettles 1990).

5.8 Teacher and Teaching Quality

In this period, a resurgence of interest occurred due to the need to build the foundation for the PRAXIS® program, which replaced the NTE. An extensive series of surveys, job analyses, and related studies was conducted to understand the knowledge, skills, and abilities required for newly licensed teachers (Reynolds et al. 1992; Tannenbaum 1992; Tannenbaum and Rosenfeld 1994). As in past decades, work was done on classroom performance (Danielson and Dwyer 1995; Powers 1992), some of which supplied the initial foundation for the widely used Framework for Teaching Evaluation Instrument (Danielson 2013).

6 The Years 2000–2009

6.1 Psychometric and Statistical Methodology

The first decade of the current century saw increased application of Bayesian methods in psychometric research, in which staff members continued ETS’s tradition of integrating advances in statistics with educational and psychological measurement. Among the applications were posterior predictive checks (Sinharay 2003), a method not unlike the frequentist resampling and resimulation studied in the late 1990s (M. von Davier 1997), as well as the use of Bayesian networks to specify complex measurement models (Mislevy et al. 2000). Markov chain Monte Carlo methods were employed to explore the comprehensive estimation of measurement and structural models in modern IRT (Johnson and Jenkins 2005) but, because of their computational requirements, currently remain limited to small- to medium-sized applications.
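
The logic of a posterior predictive check can be conveyed with a deliberately simple model, far simpler than the measurement models cited above: draw parameters from the posterior, simulate replicated data, and ask whether the observed discrepancy statistic looks typical of the replications. A toy sketch for a coin-flip model with a uniform Beta(1, 1) prior; the model, prior, and function name are our own choices, not Sinharay's application:

```python
import random

def posterior_predictive_check(successes, trials, stat, n_rep=2000, seed=7):
    """Toy posterior predictive check for a binomial model with a
    Beta(1, 1) prior: draw p from the Beta posterior, simulate a
    replicated data set, and compare the discrepancy statistic with the
    observed one. Returns the posterior predictive p-value."""
    rng = random.Random(seed)
    observed = stat(successes, trials)
    extreme = 0
    for _ in range(n_rep):
        p = rng.betavariate(1 + successes, 1 + trials - successes)
        rep = sum(rng.random() < p for _ in range(trials))
        if stat(rep, trials) >= observed:
            extreme += 1
    return extreme / n_rep
```

A posterior predictive p-value near 0.5 suggests the model reproduces the chosen feature of the data well; values near 0 or 1 flag misfit with respect to that statistic.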

Alternatives to these computationally demanding methods were considered to enable the estimation of high-dimensional models, including empirical Bayes methods and approaches that utilized Monte Carlo integration, such as the stochastic EM algorithm (M. von Davier and Sinharay 2007).

These studies were aimed at supporting the use of explanatory IRT applications taking the form of a latent regression that includes predictive background variables in the structural model. Models of this type are used in the NAEP, PISA, PIAAC, TIMSS, and PIRLS assessments, which ETS directly or indirectly supported. Sinharay and von Davier (2005) also presented extensions of the basic numerical integration approach to data having more dimensions. Similar to Johnson and Jenkins (2005), who proposed a Bayesian hierarchical model for the latent regression, Li et al. (2009) examined the use of hierarchical linear (or multilevel) extensions of the latent regression approach.

The kernel equating procedures proposed earlier by Holland and Thayer (1989; also Holland et al. 1989) were extended and designs for potential applications were described in The Kernel Method of Test Equating by A. A. von Davier, Holland, and Thayer (2004). The book’s framework for observed-score equating encapsulates several well-known classical methods as special cases, from linear to equipercentile approaches.
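
The core idea of the kernel method can be sketched compactly: continuize each discrete score distribution by placing a Gaussian kernel at every score point, then equate by matching percentiles on the continuized distributions. The sketch below is a simplification (it omits, among other things, the presmoothing and variance-preserving bandwidth adjustment of the full method; function names are ours):

```python
import math

def kernel_cdf(score_probs, x, h=0.6):
    """CDF of a discrete score distribution continuized by placing a
    Gaussian kernel of bandwidth h at each score point.

    `score_probs` is a list of (score, probability) pairs.
    """
    return sum(p * 0.5 * (1.0 + math.erf((x - s) / (h * math.sqrt(2.0))))
               for s, p in score_probs)

def equate(x, probs_x, probs_y, h=0.6, lo=-20.0, hi=80.0):
    """Equipercentile equating of score x on form X to the scale of
    form Y, found by bisection on the continuized CDFs."""
    target = kernel_cdf(probs_x, x, h)
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if kernel_cdf(probs_y, mid, h) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

When the two forms have identical score distributions, the equating function reduces to the identity, which is a handy sanity check on any implementation.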

A major reference work, Handbook of Statistics: Vol. 26. Psychometrics, edited by Rao and Sinharay (2006), was released. This volume contained close to 1200 pages and 34 chapters reviewing state-of-the-art psychometric modeling. Sixteen of the volume’s chapters were contributed by current or former ETS staff members.

The need to describe test-taker strengths and weaknesses has long motivated the reporting of subscores on tests that were primarily designed to provide a single score. Haberman (2008) presented the concept of proportional reduction of mean squared errors, which allows an evaluation of whether subscores are technically defensible. This straightforward extension of classical test theory derives from a formula introduced by Kelley (1927) and provides a tool to check whether a subscore is reliable enough to stand on its own or whether the true score of the subscore under consideration would be better represented by the observed total score. (Multidimensional IRT was subsequently applied to this issue by Haberman and Sinharay 2010, using the same underlying argument.)
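
In simplified form, the reasoning runs as follows: Kelley's formula regresses an observed score toward the group mean in proportion to its unreliability, and a Haberman-style check reports a subscore only when predicting its true score from the subscore itself beats predicting it from the total score. A hedged sketch (function names are ours; the published treatment contains considerably more detail):

```python
def kelley_true_score(observed, reliability, group_mean):
    """Kelley (1927): regress the observed score toward the group mean
    in proportion to (1 - reliability)."""
    return reliability * observed + (1.0 - reliability) * group_mean

def subscore_adds_value(subscore_reliability, r_truesub_total):
    """Simplified Haberman-style check. The proportional reduction in
    mean squared error from predicting the subscore's true score with
    the subscore equals the subscore's reliability; the reduction from
    predicting it with the total score is the squared correlation
    between the subscore's true score and the observed total. Report
    the subscore only if the first beats the second."""
    return subscore_reliability > r_truesub_total ** 2
```

For example, under these definitions a subscore with reliability 0.50 whose true score correlates 0.80 with the total would not be worth reporting, because the total score already predicts it better.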

Also for purposes of better describing test-taker strengths and weaknesses, generalized latent variable models were explored, but with the intention of application to tests designed to measure multiple dimensions. Apart from the work on Bayesian networks (Mislevy and Levy 2007; Mislevy et al. 2003), there were significant extensions of approaches tracing back to the latent class models of earlier decades (Haberman 1988) and to the rule space model (Tatsuoka 1983). Among these extensions were developments around the reparameterized unified model (DiBello et al. 2006), which was shown to partially alleviate the identification issues of the earlier unified model, as well as around the general diagnostic model (GDM; M. von Davier 2008a). The GDM was shown to include many standard and extended IRT models, as well as several diagnostic models, as special cases (M. von Davier 2008a, b). The GDM has been successfully applied to the TOEFL iBT® test, PISA, NAEP, and PIRLS data in this as well as in the subsequent decade (M. von Davier 2008a; Oliveri and von Davier 2011, 2014; Xu and von Davier 2008). Other approaches later developed outside of ETS, such as the log-linear cognitive diagnostic model (LCDM; Henson et al. 2009), can be directly traced to the GDM (e.g., Rupp et al. 2010) and have been shown to be special cases of the GDM (M. von Davier 2014).

6.2 Large-Scale Survey Assessments of Student and Adult Populations

As described by Rock (Chap. 10, this volume), the Early Childhood Longitudinal Study continued through much of this decade, with the last data collection, in the eighth grade, taking place in 2007. Also, recent developments in the statistical procedures used in NAEP were summarized and future directions described (M. von Davier et al. 2006).

A notable milestone was the Adult Literacy and Lifeskills (ALL) assessment, conducted in 2003 and 2006–2008 (Kirsch et al., Chap. 9, this volume). ALL was a household-based, international comparative study designed to provide participating countries with information about the literacy and numeracy skills of their adult populations. To accomplish this goal, ALL used nationally representative samples of 16- to 65-year-olds.

In this decade, ETS staff members completed a multicountry feasibility study of computer-based testing in multiple languages for PISA (Lennon, Kirsch, von Davier, Wagner, and Yamamoto 2003) and a report on linking and linking stability (Mazzeo and von Davier 2008).

Finally, in 2006, ETS and IEA established the IEA-ETS Research Institute (IERI), which promotes research on large-scale international skill surveys, publishes a journal, and provides training around the world through workshops on statistical and psychometric topics (Wagemaker and Kirsch 2008).

6.3 Validity and Validation

In the 2000s, Mislevy and colleagues elaborated the theory and generated additional prototypic applications of ECD (Mislevy et al. 2003, 2006), including proposing extensions of the methodology to enhance accessibility for individuals from special populations (Hansen and Mislevy 2006). Part of the motivation behind ECD was the need to more deeply understand the constructs to be measured and to use that understanding for assessment design. In keeping with that motivation, the beginning of this period saw the release of key publications detailing construct theory for achievement domains, which feed into the domain analysis and modeling aspects of ECD. Those publications concentrated on elaborating the construct of communicative competence for the TOEFL computer-based test (CBT), comprising listening, speaking, writing, and reading (Bejar et al. 2000; Butler et al. 2000; Cumming et al. 2000; Enright et al. 2000). Toward the end of the period, the Cognitively Based Assessment of, for, and as Learning (CBAL®) initiative (Bennett and Gitomer 2009) was launched. This initiative took an approach to construct definition similar to that of the TOEFL CBT but applied it to English language arts and mathematics constructs for elementary and secondary education.

At the same time, the communication of predictive validity results for postsecondary admissions tests was improved. Building upon earlier work, Bridgeman and colleagues showed how the percentage of students who achieved a given grade point average increased as a function of score level, a more easily understood depiction than the traditional validity coefficient (Bridgeman et al. 2008). Also advanced was the research stream on test anxiety, one of several potential sources of irrelevant variance (Powers 2001).

Notable too was the increased attention given to students from special populations. For students with disabilities, two research lines dominated: one related to testing and validation concerns that included but went beyond the postsecondary admissions focus of the 1980s and 1990s (Ekstrom and Smith 2002; Laitusis et al. 2002), and a second on accessibility (Hansen et al. 2004; Hansen and Mislevy 2006; Hansen et al. 2005). For English learners, topics covered accessibility (Hansen and Mislevy 2006; Wolf and Leon 2009), accommodations (Young and King 2008), validity frameworks and assessment guidelines (Pitoniak et al. 2009; Young 2009), and instrument and item functioning (Martiniello 2009; Young et al. 2008).

6.4 Constructed-Response Formats and Performance Assessment

Using ECD, several significant computer-based assessment prototypes were developed, including for NAEP (Bennett et al. 2007) and for occupational and professional assessment (Mislevy et al. 2002). The NAEP Technology-Rich Environments project was significant because assessment tasks involving computer simulations were administered to nationally representative samples of students and because it included an analysis of students’ solution processes. This study was followed by NAEP’s first operational technology-based component, the Interactive Computer Tasks, as part of the 2009 science assessment (U.S. Department of Education, n.d.-a). Also of note was the emergence of research on games and assessment (Shute et al. 2008, 2009).

With the presentation of constructed-response formats on computer came added impetus to investigate the effect of computer familiarity on performance. That issue was explored for essay tasks in NAEP (Horkay et al. 2006) as well as for the entry of complex expressions in mathematical reasoning items (Gallagher et al. 2002).

Finally, attention to automated scoring increased considerably. Streams of research on essay scoring and short-text scoring expanded (Attali and Burstein 2006; Leacock and Chodorow 2003; Powers et al. 2002; Quinlan et al. 2009), a new line on speech scoring was added (Zechner et al. 2007, 2009), and publications were released on the grading of graphs and mathematical expressions (Bennett et al. 2000).

6.5 Personal Qualities

Although it had almost disappeared in the 1990s, ETS’s interest in this topic reemerged following the popularization of so-called noncognitive constructs in education, the workplace, and society at large (Goleman 1995). Two highly visible topics accounted for a significant portion of the research effort, one being emotional intelligence (MacCann and Roberts 2008; MacCann et al. 2008; Roberts et al. 2006) and the other stereotype threat (Stricker and Bejar 2004; Stricker and Ward 2004), the notion that concern about a negative belief regarding the ability of one’s demographic group might adversely affect test performance.

6.6 Human Development

With the death of Irving Sigel in 2006, the multidecade history of contributions to this area ended. Before his death, however, Sigel continued to write actively on the distancing model, representation, parental beliefs, and the relationship between research and practice generally (Sigel 2000, 2006). Notable in this closing period was publication of his coedited Child Psychology in Practice, volume 4 of the Handbook of Child Psychology (Renninger and Sigel 2006).

6.7 Education Policy Analysis

Work in this area increased considerably. Several topics stood out for the attention given them. In elementary and secondary education, the achievement gap (Barton 2003), gender equity (Coley 2001), the role of the family (Barton and Coley 2007), and access to advanced course work in high school (Handwerk et al. 2008) were each examined. In teacher policy and practice, staff examined approaches to teacher preparation (Wang et al. 2003) and the quality of the teaching force (Gitomer 2007b).

With respect to postsecondary populations, new analyses were conducted of data from the adult literacy surveys (Rudd et al. 2004; Sum et al. 2002), and access to graduate education was studied (Nettles and Millett 2006). A series of publications by Carnevale and colleagues investigated the economic value of education and its equitable distribution (Carnevale and Fry 2001, 2002; Carnevale and Rose 2000). Among the many policy reports released, perhaps the highlight was America’s Perfect Storm (Kirsch et al. 2007), which wove labor market trends, demographics, and student achievement into a social and economic forecast that received international media attention.

6.8 Teacher and Teaching Quality

Notable in this period were several lines of research. One centered on the functioning and impact of the certification assessments created by ETS for the National Board for Professional Teaching Standards (Gitomer 2007a; Myford and Engelhard 2001), which included the rating of video-recorded classroom performances. A second line more generally explored approaches for the evaluation of teacher effectiveness and teaching quality (Gitomer 2009; Goe et al. 2008; Goe and Croft 2009) as well as the link between teaching quality and student outcomes (Goe 2007). Deserving special mention was Braun’s (2005) report “Using Student Progress to Evaluate Teachers: A Primer on Value-Added Models,” which called attention to the problems with this approach. Finally, a third work stream targeted professional development, including enhancing teachers’ formative assessment practices (Thompson and Goe 2009; Wylie et al. 2009).

7 The Years 2010–2016

7.1 Psychometric and Statistical Methodology

Advances in computation have historically been an important driver of psychometric developments. In this period, staff members continued to create software packages, particularly for complex multidimensional analyses. One example was software for the operational use of multidimensional item response theory (MIRT) for simultaneous linking of multiple assessments (Haberman 2010). Another example was software for the operational use of the multidimensional discrete latent-trait model for IRT (and MIRT) calibration and linking (M. von Davier and Rost 2016). This software is used extensively for PIAAC and PISA.

Whereas software creation has constituted a continued line of activity, research on how to reduce computational burden has also been actively pursued. Of note in this decade was the use of graphical modeling frameworks to reduce the calculations required for complex multidimensional estimation. Rijmen (2010) as well as Rijmen et al. (2014) showed how these advances can be applied in large-scale testing applications, producing research software for that purpose. On a parallel track, von Davier (2016) described the use of all computational cores of a workstation or server to solve measurement problems in many dimensions more efficiently and to analyze the very large data sets coming from online testing and large-scale assessments of national or international populations.

In the same way that advances in computing have spurred methodological innovation, those advances have made the use of new item response types more feasible (Bejar, Chap. 18, this volume). Such response types have, in turn, made new analytic approaches necessary. Research has examined psychometric models and latent-trait estimation for items with multiple correct choices, self-reports using anchoring vignettes, data represented as multinomial choice trees, and responses collected from interactive and simulation tasks (Anguiano-Carrasco et al. 2015; Khorramdel and von Davier 2014), in the last case including analysis of response time and solution process.

Notable methodological publications collected in edited volumes in this period covered linking (von Davier 2011), computerized multistage testing (Yan et al. 2014), and international large-scale assessment methodology (Rutkowski et al. 2013). In addition, several contributions by ETS authors appeared in a three-volume handbook on IRT (Haberman 2016; von Davier and Rost 2016). Chapters by other researchers detail methods and statistical tools explored while those individuals were at ETS (e.g., Casabianca and Junker 2016; Moses 2016; Sinharay 2016).

7.2 Large-Scale Survey Assessments of Student and Adult Populations

In this second decade of the twenty-first century, the work of many research staff members was shaped by the move to computer-based, large-scale assessment. ETS became the main contractor for the design, assessment development, analysis, and project management of both PIAAC and PISA. PIAAC was fielded in 2012 as a multistage adaptive test (Chen et al. 2014b). In contrast, PISA 2015 was administered as a linear test with three core domains (science, mathematics, and reading), one innovative assessment domain (collaborative problem solving), and one optional domain (financial literacy).

NAEP also fielded computer-based assessments in traditional content domains and in domains that would not be suitable for paper-and-pencil administration. Notable were the delivery of the 2011 NAEP writing assessment on computer (U.S. Department of Education, n.d.-b) and the 2014 Technology and Engineering Literacy assessment (U.S. Department of Education, n.d.-c). The latter assessment contained highly interactive simulation tasks involving the design of bicycle lanes and the diagnosis of faults in a water pump. A large pilot study exploring multistage adaptive testing was also carried out (Oranje and Ye 2013) as part of the transition of all NAEP assessments to administration on computers.

Finally, ETS received the contract for PISA 2018, which will also entail the use of computer-based assessments in both traditional and nontraditional domains.

7.3 Validity and Validation

The work on construct theory in achievement domains for elementary and secondary education that was begun in the prior decade continued with publications in the English language arts (Bennett et al. 2016; Deane et al. 2015; Deane and Song 2015; Sparks and Deane 2015), mathematics (Arieli-Attali and Cayton-Hodges 2014; Graf 2009), and science (Liu et al. 2013). These publications detailed the CBAL competency, or domain, models and their associated learning progressions, that is, the pathways most students might be expected to take toward domain competency. Also significant was the Reading for Understanding project, which reformulated and exemplified the construct of reading comprehension for the digital age (Sabatini and O’Reilly 2013). Finally, a competency model was released for teaching (Sykes and Wilson 2015), intended to lay the foundation for a next generation of teacher licensure assessment.

In addition to domain modeling, ETS’s work in validity theory was extended in several directions. The first was further development of ECD, in particular its application to educational games (Mislevy et al. 2014). A second resulted from the arrival of Michael Kane, whose work on the argument-based approach substantially strengthened the research program (Kane 2011, 2012, 2016). Finally, fairness and validity were combined in a common framework by Xi (2010).

Concerns for validity and fairness continued to motivate a wide-ranging research program directed at students from special populations. For those with disabilities, topics included accessibility (Hansen et al. 2012; Stone et al. 2016), accommodations (Cook et al. 2010), instrument and item functioning (Buzick and Stone 2011; Steinberg et al. 2011), computer-adaptive testing (Stone et al. 2013; Stone and Davey 2011), automated versus human essay scoring (Buzick et al. 2016), and the measurement of growth (Buzick and Laitusis 2010a, b). For English learners, topics covered accessibility (Guzman-Orth et al. 2016; Young et al. 2014), accommodations (Wolf et al. 2012a, b), instrument functioning (Gu et al. 2015; Young et al. 2010), test use (Lopez et al. 2016; Wolf and Farnsworth 2014; Wolf and Faulkner-Bond 2016), and the conceptualization of English learner proficiency assessment systems (Hauck et al. 2016; Wolf et al. 2016).

7.4 Constructed-Response Formats and Performance Assessment

As a consequence of growing interest in games within education, the work on games and assessment that had first appeared at the end of the previous decade increased dramatically (Mislevy et al. 2012, 2014, 2016; Zapata-Rivera and Bauer 2012).

Work on automated scoring also grew substantially. The focus remained on response types from previous periods, such as essay scoring (Deane 2013a, b), short-answer scoring (Heilman and Madnani 2012), speech scoring (Bhat and Yoon 2015; Wang et al. 2013), and mathematical responses (Fife 2013). However, important new lines of work were added. One such line, made possible by computer-based assessment, was the analysis of keystroke logs generated by students as they responded to essays, simulations, and other performance tasks (Deane and Zhang 2015; He and von Davier 2015, 2016; Zhang and Deane 2015). This analysis began to open a window into the processes used by students in problem solving. A second line, also made possible by advances in technology, was conversation-based assessment, in which test takers interact with avatars (Zapata-Rivera et al. 2014). Finally, a work stream was initiated on “multimodal assessment,” incorporating analysis of test-taker speech, facial expression, or other behaviors (Chen et al. 2014a, c).

7.5 Personal Qualities

While work on emotional intelligence (MacCann et al. 2011; MacCann et al. 2010; Roberts et al. 2010) and stereotype threat (Stricker and Rock 2015) continued, this period saw a significant broadening to a variety of noncognitive constructs and their applications. Research and product development were undertaken in education (Burrus et al. 2011; Lipnevich and Roberts 2012; Oliveri and Ezzo 2014) as well as for the workforce (Burrus et al. 2013; Naemi et al. 2014).

7.6 Education Policy Analysis

Although the investigation of economics and education had diminished due to the departure of Carnevale and his colleagues, attention to a wide range of policy problems continued. Those problems related to graduate education (Wendler et al. 2010), minority representation in teaching (Nettles et al. 2011), developing and implementing teacher evaluation systems (Goe et al. 2011), testing at the pre-K level (Ackerman and Coley 2012), achievement gaps in elementary and secondary education (Barton and Coley 2010), and parents opting their children out of state assessment (Bennett 2016).

A highlight of this period was the release of two publications from the ETS Opportunity Project. The publications, “Choosing Our Future: A Story of Opportunity in America” (Kirsch et al. 2016) and “The Dynamics of Opportunity in America” (Kirsch and Braun 2016), comprehensively analyzed and directed attention toward issues of equality, economics, and education in the United States.

7.7 Teacher and Teaching Quality

An active and diverse program of investigation continued. Support was provided for testing programs, including an extensive series of job analyses for revising PRAXIS program assessments (Robustelli 2010) as well as work toward the development of new assessments (Phelps and Howell 2016; Sykes and Wilson 2015). The general topic of teacher evaluation remained a constant focus (Gitomer and Bell 2013; Goe 2013; Turkan and Buzick 2016), including continued investigation into implementing it through classroom observation (Casabianca et al. 2013; Lockwood et al. 2015; Mihaly and McCaffrey 2014) and value-added modeling (Buzick and Jones 2015; McCaffrey 2013; McCaffrey et al. 2014). Researchers also explored the impact of teacher characteristics and teaching practices on student achievement (Liu et al. 2010), the effects of professional development on teacher knowledge (Bell et al. 2010), and the connection between teacher evaluation and professional learning (Goe et al. 2012). One highlight of the period was release of the fifth edition of AERA’s Handbook of Research on Teaching (Gitomer and Bell 2016), a comprehensive reference for the field. A second highlight was How Teachers Teach: Mapping the Terrain of Practice (Sykes and Wilson 2015), which, as noted earlier, laid out a conceptualization of teaching in the form of a competency model.

8 Discussion

As the previous sections might suggest, the history of ETS research is marked by both constancy and changes in focus. The constancy can be seen in persistent attention to problems at the core of educational and psychological measurement. Those problems have centered on developing and improving the psychometric and statistical methodology that helps connect observations to inferences about individuals, groups, and institutions. In addition, the problems have centered on evaluating those inferences—that is, the theory, methodology, and practice of validation.

The changes in focus across time have occurred both within these two persistently pursued areas and among those areas outside of the measurement core. For example, Kane and Bridgeman (Chap. 16, this volume) documented in detail the progression that has characterized ETS’s validity research, and multiple chapters did the same for the work on psychometrics and statistics. In any event, the emphasis given these core areas remained strong throughout ETS’s history.

As noted, other areas experienced more obvious peaks and valleys. Several of these areas did not emerge as significant research programs in their own right until considerably after ETS was established. That characterization would be largely true, for example, of human development (beginning in the 1970s), educational evaluation (1970s), large-scale assessment/adult literacy/longitudinal studies (1970s), and policy analysis (1980s), although there were often isolated activities that preceded these dates. Once an area emerged, it did not necessarily persist, the best examples being educational evaluation, which spanned the 1970s to 1980s, and human development, which began at a similar time point, declined through the late 1980s and 1990s, and reached its denouement in the 2000s.

Still other areas rose, fell, and rose again. Starting with the founding of ETS, work on personal qualities thrived for three decades, all but disappeared in the 1980s and 1990s, and returned by the 2000s close to its past levels, but this time with the added focus of product development. The work on constructed-response formats and performance assessment also began early on and appeared to go dormant in the 1970s, only to return in the 1980s. In the 1990s, the emphasis shifted from a focus on paper-and-pencil measurement to presentation and scoring by computer.

What drove the constancy and change over the decades? The dynamics were most likely due to a complex interaction among several factors. One factor was certainly the influence of the external environment, including funding, federal education policy, public opinion, and the research occurring in the field. That environment, in turn, affected (and was affected by) the areas of interest and expertise of those on staff who, themselves, had impact on research directions. Finally, the interests of the organization's management were affected by the external environment and, in turn, motivated actions that helped determine the staff composition and research priorities.

Aside from the changing course of research over the decades, a second striking characteristic is the vast diversity of the work. At its height, this diversity arguably rivaled that found in the psychology and education departments of major research universities anywhere in the world. Moreover, in some areas—particularly in psychometrics and statistics—it was often considerably deeper.

This breadth and depth led to substantial innovation, as this chapter has highlighted and the prior ones have detailed. That innovation was often highly theoretical—as in Witkin and Goodenough's (1981) work on cognitive styles, Sigel's (1990) distancing theory, Lord and Novick's (1968) seminal volume on IRT, Messick's (1989) unified conception of validity, Mislevy's (1994, 1996) early work on ECD, Deane et al.'s (2015) English language arts competency model, and Sykes and Wilson's (2015) conceptions of teaching practice. But that innovation was also very often practical—witness the in-basket test (Frederiksen et al. 1957), LISREL (Jöreskog and van Thillo 1972), the EM algorithm (Dempster et al. 1977), Lord's (1980) "Applications of Item Response Theory to Practical Testing Problems," the application of Mantel–Haenszel to DIF (Holland and Thayer 1988), the plausible-values solution to the estimation of population performance in sample surveys (Mislevy et al. 1992a), and e-rater (Burstein et al. 1998). These innovations were not only useful but used: all of the preceding cases were widely employed in the measurement community, and some found use throughout the sciences.

Of no small consequence is that ETS innovations—theory and practical development—were employed throughout the organization's history to support, challenge, and improve the technical quality of its testing programs. Among other things, the challenges took the form of a continuing program of validity research to identify and address construct-irrelevant influences that might unfairly affect the performance of individuals and groups, for example, test anxiety, coaching, stereotype threat, lack of computer familiarity, English language complexity in content assessments, and accessibility.

A final observation is that research was used not only for the generation of theory and of practical solutions in educational and psychological studies but also for helping government officials and the public address important policy problems. The organization's long history of contributions to informing policy is evident in its roles with respect to the Equality of Educational Opportunity Study (Beaton 1968); the evaluation of Sesame Street (Ball and Bogatz 1970); the Head Start, early childhood, and high school longitudinal studies; the adult literacy studies; NAEP, PISA, and PIAAC; and the many policy analyses of equity and opportunity in the United States (Kirsch et al. 2007; Kirsch and Braun 2016).

We close this chapter, and the book, by returning to the concept of a nonprofit measurement organization as outlined by Bennett (Chap. 1, this volume). In that conception, the organization’s raison d’être is public service. Research plays a fundamental role in realizing that public service obligation to the extent that it helps advance educational and psychological measurement as a field, acts as a mechanism for enhancing (and routinely challenging) the organization’s testing programs, and helps contribute to the solution of big educational and social challenges. We would assert that the evidence presented indicates that, taken over its almost 70-year history, the organization’s research activities have succeeded in filling that fundamental role.