
This chapter chronicles ETS research and development contributions related to the use of constructed-response item formats.Footnote 1 The use of constructed responses in testing dates back to imperial China, where tests were used in the selection of civil servants. In the United States, however, the multiple-choice format became dominant during the twentieth century, following its adoption in the SAT® examinations created by the College Board in 1926. When ETS was created in 1947, post-secondary admissions testing was largely based on tests consisting of multiple-choice items. However, from the start, there were two camps at ETS: those who believed that multiple-choice tests were adequate for the purpose of assessing "verbal" skills and those who believed that "direct" forms of assessment requiring written responses had a role to play. For constructed-response formats to regain a foothold in American education, several hurdles would need to be overcome. Research at ETS was instrumental in overcoming those hurdles.

The first hurdle was that of reliability, specifically the perennial issue of low interrater agreement, which plagued the acceptance of constructed-response formats for most of the twentieth century. The second hurdle was broadening the conception of validity to encompass more than predictive considerations, a process that began with the introduction of construct validity by Cronbach and Meehl (1955). Samuel Messick at ETS played a crucial role in this process by making construct validity relevant to educational tests. An inflection point in the process of reincorporating constructed-response formats more widely in educational tests was marked by the publication of Construction Versus Choice in Cognitive Measurement (Bennett and Ward 1993), following Norm Frederiksen's (1984) indictment of the multiple-choice format for its potentially pernicious influence on education. The chapters in the book made it clear that the choice of format (multiple choice vs. constructed response) involves considerations of validity broadly conceived. Even as concern grew about the almost exclusive reliance on the multiple-choice format, much work remained to be done to facilitate the operational use of constructed-response items, since over the preceding decades the profession had come to rely on the multiple-choice format. That work continues to this day at ETS and elsewhere.

Clearly there is more than one way to convey the scope of research and development at ETS to support constructed-response formats. The chapter proceeds largely chronologically in several sections. The first section focuses on the ETS contributions to scoring reliability, roughly through the late 1980s. The next section considers the evolution of validity toward a unitary conception and focuses on the critical contributions by Samuel Messick, with implications for the debate around constructed-response formats.

The third section argues that the interest in technology for testing purposes at ETS from early on probably accelerated the eventual incorporation of writing assessment into several ETS admissions tests. That section reviews work related to computer-mediated scoring, task design in several domains, and the formulation of an assessment design framework especially well suited for constructed-response tasks, evidence-centered design (ECD).

The fourth section describes ETS's involvement in school-based testing, including Advanced Placement® (AP®), the National Assessment of Educational Progress (NAEP), and the CBAL® initiative. A fifth section briefly discusses validity and psychometric research related to constructed-response formats. The chapter closes with some reflections on six decades of research.

1 Reliability

The acceptance of the multiple-choice format after its introduction in the 1926 SAT, together with the growing importance of reliability as a critical attribute of the scores produced by a test, seems to have contributed to the decline in the use of constructed-response forms of assessment in the United States. However, research at ETS was instrumental in helping to return those formats to the assessment of writing in high-stakes contexts. In this section, some of that research is described. Specifically, among the most important ETS contributions are the following:

  1. developing holistic scoring

  2. advancing the understanding of rater cognition

  3. conducting psychometric research in support of constructed responses

Reliability (Haertel 2006) refers to the level of certainty associated with scores from a given test administered to a specific sample and is quantified as a reliability or generalizability coefficient or as a standard error. However, the first sense of reliability that comes to mind in the context of constructed responses is that of interrater reliability, or agreement. Unlike responses to multiple-choice items, constructed responses need to be scored by a process (cf. Baldwin et al. 2005) that involves human judgment or, more recently (Williamson et al. 2006), by an automated process that is guided by human judgment. Those human judgments can be more or less fallible and give rise to concerns regarding the replicability of the assigned score by an independent scorer. Clearly, a low level of interrater agreement raises questions about the meaning of scores.
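To make the notion of interrater agreement concrete, the brief sketch below computes several common agreement indices for two hypothetical readers scoring the same set of essays; the data are invented for illustration and are not drawn from any study cited in this chapter.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical holistic scores (1-6 scale) assigned by two readers to ten essays.
reader1 = np.array([4, 3, 5, 2, 6, 4, 3, 5, 4, 2])
reader2 = np.array([4, 2, 5, 3, 5, 4, 4, 5, 3, 2])

exact = np.mean(reader1 == reader2)                             # proportion of exact agreement
r = np.corrcoef(reader1, reader2)[0, 1]                         # Pearson interrater correlation
qwk = cohen_kappa_score(reader1, reader2, weights="quadratic")  # quadratically weighted kappa
print(f"exact agreement={exact:.2f}, correlation={r:.2f}, weighted kappa={qwk:.2f}")
```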

The quantification of interrater disagreement begins with the work of the statistician F. Y. Edgeworth (as cited by Mariano 2002). As Edgeworth noted,

let a number of equally competent critics independently assign a mark to the (work) … even supposing that the examiners have agreed beforehand as to … the scale of excellence to be adopted … there will occur a certain divergence between the verdicts of competent examiners. (p. 2)

Edgeworth also realized that individual differences among readers could be the source of those errors by noting, for example, that some raters could be more or less severe than others, thus providing the first example of theorizing about rater cognition, a topic to which we will return later in the chapter. Edgeworth (1890) noted,

Suppose that a candidate obtains 95 at such an examination, it is reasonably certain that he deserves his honours. Still there is an appreciable probability that his real mark, as determined by a jury of competent examiners (marking independently and taking the average of those marks) is just below 80; and that he is pushed up into the honour class by the accident of having a lenient examiner. Conversely, his real mark might be just above 80; and yet by accident he might be compelled without honour to take a lower place as low as 63. (emphasis added, p. 470)

The lack of interrater agreement would plague attempts to reincorporate constructed responses into post-secondary admissions testing once multiple-choice items began to supplant them. An approach was needed to solve the interrater reliability problem. A key player in that effort was none other than Carl Brigham (1890–1943), the chief architect behind the SAT, which included only multiple-choice items.Footnote 2 Brigham was an atypical test developer and psychometrician in that he viewed the administration of a test as an opportunity to experiment and further learn about students' cognition. And experiment he did. He developed, for example, an "experimental section" (N. Elliot 2005, p. 75) that would contain item types that were not being used operationally. Importantly, he was keenly interested in the possibility of incorporating more "direct" measures of writing (Valentine 1987, p. 44). However, by the 1930s the SAT was generating significant income for the College Board, the sponsor of the test, and the Board seemed to have set some limits on the degree of experimentation. According to Hubin (1988),

the growth of the Scholastic Aptitude Test in the thirties, although quite modest by standards of the next decade, contrasted sharply with a constant decline in applicants for the traditional Board essay examinations. Board members saw the SAT ’s growth as evidence of its success and increasingly equated such success with the Board’s very existence. The Board’s perception decreased Brigham’s latitude to experiment with the instrument. (p. 241)

Nevertheless, Brigham and his associates continued to experiment with direct measures of writing. As suggested by the following excerpt (Jones and Brown 1935), there appeared to be progress in solving the rater agreement challenge:

Stalnaker and Stalnaker … present evidence to show that the essay-type test can be scored with rather high reliability if certain rules are followed in formulating questions and in scoring. Brigham … has made an analysis of the procedures used by readers of the English examinations of the College Entrance Examination Board , and believes that the major sources of errors in marking have been identified. A new method of grading is being tried which, he thinks, will lead to greatly increased reliability . (p. 489)

There were others involved in the improvement of the scoring of constructed responses. For example, Anderson and Traxler (1940) argued thatFootnote 3

by carefully formulating the test material and training the readers, it is possible to obtain highly reliable readings of essay examinations. Not only is the reliability high for the total score, but it is also fairly high for most of the eight aspects of English usage that were included in this study. The reliability is higher for some of those aspects that are usually regarded as fairly intangible than for the aspects that one would expect to be objective and tangible. The test makes fair, though by no means perfect, discrimination among the various years of the secondary school in the ability of the pupils to write a composition based on notes supplied to them. The results of the study are not offered as conclusive, but it is believed that, when they are considered along with the results of earlier studies, they suggest that it is highly desirable for schools to experiment with essay-test procedures as means for supplementing the results of objective tests of English usage in a comprehensive program of evaluation in English expression. (p. 530)

Despite these positive results, further resistance to constructed responses was to emerge. Besides reliability concerns, costs and efficiency also were part of the equation. For example, we can infer from the preceding quotation that the scoring being discussed is “analytic” and would require multiple ratings of the same response. At the same time, machine scoring of multiple-choice responses was rapidly becoming a realityFootnote 4 (Hubin 1988, p. 296). The potential efficiencies of machine scoring contrasted sharply with the inefficiencies and logistics of human scoring . In fact, the manpower shortages during World War II led the College Board to suspend examinations relying on essays (Hubin 1988, p. 297).

However, the end of the war did not help. Almost 10 years on, a study published by ETS (Huddleston 1954) concluded that

the investigation points to the conclusion that in the light of present knowledge, measurable “ability to write” is no more than verbal ability . It has been impossible to demonstrate by the techniques of this study that essay questions, objective questions, or paragraph-revision exercises contain any factor other than verbal; furthermore, these types of questions measure writing ability less well than does a typical verbal test. The high degree of success of the verbal test is, however, a significant outcome.

The results are discouraging to those who would like to develop reliable and valid essay examinations in English composition—a hope that is now more than half a century old. Improvement in such essay tests has been possible up to a certain point, but professional workers have long since reached what appears to be a stone wall blocking future progress. New basic knowledge of human capacities will have to be unearthed before better tests can be made or more satisfactory criteria developed. To this end the Educational Testing Service has proposed, pending availability of appropriate funds, a comprehensive factor study in which many types of exercises both new and traditional are combined with tests of many established factors in an attempt to discover the fundamental nature of writing ability . The present writer would like to endorse such a study as the only auspicious means of adding to our knowledge in this field. Even then, it appears unlikely that significant progress can be made without further explorations in the area of personality measurement .Footnote 5 (pp. 204–205)

In light of the limited conception of both "verbal ability" and "writing ability" at the time, Huddleston's conclusions appear, in retrospect, to be unnecessarily strong and overreaching. The conception of "verbal ability" continues to evolve to this day, and it is only recently that even basic skills, like vocabulary knowledge, have become better understood (Nagy and Scott 2000); the matter was not by any means settled in the early 1950s. Importantly, research readily available at the time was clearly pointing to a more nuanced understanding of writing ability. Specifically, the importance of the role of "fluency" in writing was beginning to emerge (C. W. Taylor 1947) well within the psychometric camp. Today, the assessment of writing is informed by a view of writing as a "complex integrated skill" (Deane et al. 2008; Sparks et al. 2014) with fluency as a key subskill.

By today's standards, the scope of the concept of reliability was not fully developed in the 1930s and 1940s, in the sense of understanding the components of unreliability. The conception of reliability emerged from Spearman's work (see Stanley 1971, pp. 370–372) and was focused on test-score reliability. If the assignment of a score to each component (item) is error free, because it is scored objectively, then the scoring does not contribute error to the total score, and in that case score reliability is a function of the number of items and their intercorrelations. In the case of constructed responses, the scoring is not error free, since the scorer renders a judgment, which is a fallible process.Footnote 6 Moreover, because items that require constructed responses typically take more time, fewer of them can be administered, which, other things being equal, reduces score reliability. The estimation of error components associated with ratings would develop later (Ebel 1951; Finlayson 1951), as would the interplay among those components (Coffman 1971), culminating in the formulation of generalizability theory (Cronbach et al. 1972).Footnote 7 Coffman (1971), citing multiple sources, summarized the state of knowledge on interreader agreement as follows:

The accumulated evidence leads, however, to three inescapable conclusions: a) different raters tend to assign different grades to the same paper; b) a single rater tends to assign different grades to the same paper on different occasions; and c) the differences tend to increase as the essay question permits greater freedom of response. (p. 277)

Clearly, this was a state of affairs not much different from what Edgeworth had observed 80 years earlier.
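In classical terms, the reasoning sketched above can be summarized with two standard formulas (shown here only for illustration; the notation is a generic rendering, not that of the sources cited). The Spearman-Brown relation shows why administering fewer (constructed-response) tasks lowers score reliability, and a simple person x task x rater generalizability coefficient shows how rater disagreement adds a further error component:

```latex
% Spearman-Brown: reliability of a k-part test with average inter-part correlation \bar{\rho}
\rho_{kk'} = \frac{k\,\bar{\rho}}{1 + (k-1)\,\bar{\rho}}

% Illustrative generalizability coefficient for a person (p) x task (t) x rater (r) design,
% with n_t tasks and n_r raters; rater-related variance components add to the error term.
E\rho^{2} = \frac{\sigma^{2}_{p}}
                 {\sigma^{2}_{p} + \dfrac{\sigma^{2}_{pt}}{n_t} + \dfrac{\sigma^{2}_{pr}}{n_r}
                  + \dfrac{\sigma^{2}_{ptr,e}}{n_t\, n_r}}
```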

1.1 The Emergence of a Solution

The Huddleston (1954) perspective could have prevailed at ETS and delayed the wider use of constructed responses, specifically in writing.Footnote 8 Instead, from its inception ETS research paved the way for a solution to reducing interrater disagreement. First, a groundbreaking investigation at ETS (funded by the Carnegie Corporation) established that raters operated with different implied scoring criteria (Diederich et al. 1961). The investigation was motivated by the study to which Huddleston refers in the preceding quotation. That study did not yield satisfactory results, and a different approach was suggested: "It was agreed that further progress in grading essays must wait upon a factor analysis of judgments of a diverse group of competent readers in an unstructured situation, where each could grade as he liked" (Diederich et al. 1961, p. 3). The motivating hypothesis was that different readers belong to different "schools of thought" that would presumably value qualities of writing differently. The methodology that made it possible to identify types of readers was first suggested by Torgerson and Green (1952) at ETS. To identify the schools of thought, 53 "distinguished readers" were asked to rate and annotate 300 papers without being given standards or criteria for rating. The factors identified from the interrater correlations consisted of groupings of raters (i.e., raters that loaded highly on a specific factor). What school of thought was represented by a given factor would not be immediately obvious without knowing the specifics of the reasoning underlying a rater's judgment. The reasoning of the readers was captured by means of the annotations each judge had been asked to make, which then had to be coded and classified.Footnote 9 The results showed that agreement among readers was poor and that the schools of thought differed in the aspects of writing they valued. The two most sharply defined groups were those that valued "ideas" and those that valued "mechanics."
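The kind of analysis run by Diederich and his colleagues can be sketched in a few lines: raters are treated as variables, essays as observations, and a factor analysis of the rater intercorrelations groups readers who weight the same qualities. The code below is only an illustration with simulated data; the reader groupings, weights, and sample sizes are invented, not taken from the 1961 study.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical data: rows are essays, columns are readers' scores on those essays.
rng = np.random.default_rng(0)
n_essays = 300
quality = rng.normal(size=n_essays)      # overall essay quality, seen by every reader
ideas = rng.normal(size=n_essays)        # content/ideas dimension
mechanics = rng.normal(size=n_essays)    # mechanics/usage dimension

# First five readers privilege "ideas"; the last five privilege "mechanics."
weights = np.array([[1.0, 0.8, 0.1]] * 5 + [[1.0, 0.1, 0.8]] * 5)  # (10 readers, 3 traits)
scores = np.column_stack([quality, ideas, mechanics]) @ weights.T
scores += rng.normal(scale=0.7, size=scores.shape)                 # reader-specific noise

# Factor-analyze the readers (columns); the loadings group readers into "schools of thought."
fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0).fit(scores)
print(np.round(fa.components_.T, 2))     # one row per reader, one column per factor
```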

The Diederich et al. (1961) study showed that judges, when left to their own analytical devices, will resort to particular, if not idiosyncratic, evaluative schemes and that such particularities could well explain the perennial lack of adequate interrater agreement. Important as that finding was, it did not by itself provide a solution to the problem of lack of interrater agreement. The solution took a few more years to emerge and marked a milestone in testing by means of constructed responses. The study that produced it was carried out at ETS and led by Fred I. Godshalk. The study was ambitious and included five 20-minute essays, six objective tests, and two interlinear exercises, administered to 646 12th graders over a period of several weeks. Importantly, the scoring of the essays was holistic. The scoring procedure for the essays was defined as follows (Godshalk et al. 1966):

The readers were asked to make global or holistic, not analytical, judgments of each paper, reading rapidly for a total impression. There were only three ratings: a score of “3” for a superior paper, “2” for an average paper, and “1” for an inferior paper. The readers were told to judge each paper on its merits without regard to other papers on the same topic; that is, they were not to be concerned with any ideas of a normal distribution of the three scores. They were advised that scores of “3” were possible and that the “safe” procedure of awarding almost all “2s” was to be avoided. Standards for the ratings were established in two ways: by furnishing each reader with copies of the sample essays for inspection and discussion, and by explaining the conditions of administration and the nature of the testing population; and by having all readers score reproduced sets of carefully selected sample answers to all five questions and to report the results. The scores were then tabulated and announced. No effort was made to identify any reader whose standards were out of line, because that fact would be known to him and would be assumed to have a corrective effect. The procedure was repeated several times during the first two days of scoring to assist readers in maintaining standards. (p. 10, emphasis added)

Perhaps the critical aspects of the directions were the instruction "to make global or holistic, not analytical, judgments" and the use of what are known today (Baldwin et al. 2005) as benchmark or range-finding papers to illustrate the criteria. The authors describe the procedure in the preceding quotation but do not provide a theoretical rationale. They were, of course, aware of the earlier Diederich study, and it could have influenced the conception of the holistic scoring instructions. That is, stressing that the scoring was to be holistic and not analytical could have been seen as a way to prevent the schools of thought from entering the scoring process and to make the scoring process that much faster.Footnote 10

Outside of ETS, the development of holistic scoring was well received by teachers of English (White 1984) and characterized as "undoubtedly one of the biggest breakthroughs in writing assessment" (Huot 1990, p. 201). Interestingly, other work in psychology from the same period, although relevant in retrospect, was not considered at the time as related to the scoring of essays. For example, N. Elliot (2005) postulated the relevance of Gestalt psychology to a possible adoption of holistic scoring, although there is no such evidence in the Godshalk et al. (1966) report. Another relevant line of research concerned models of judgment, such as the lens model proposed by Egon Brunswik (Brunswik 1952; Hammond et al. 1964; Tucker 1964).Footnote 11 The lens model, although intended as a perceptual model, has been used primarily in decision making (Hammond and Stewart 2001). According to the model, the perceiver or decision maker decomposes an object into its attributes and weighs those attributes in arriving at a judgment. The model is clearly applicable to modeling raters (Bejar et al. 2006). Similarly, a theory of personality of the same period, George Kelly's personal construct theory, included a method for eliciting "personal constructs" by means of the analysis of sets of important others.Footnote 12 The method, called the repertory grid technique, was later found useful for modeling idiographic, or reader-specific, rating behavior (Bejar et al. 2006; Suto and Nadas 2009).
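As an illustration of how a lens-model analysis can be applied to raters, the sketch below regresses one hypothetical rater's holistic scores on a few essay attributes ("cues") to recover the weights the rater implicitly applies. The features, weights, and sample are invented for the example and are not taken from the studies cited above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical cues extracted from 200 essays (columns: ideas, organization, mechanics).
rng = np.random.default_rng(1)
cues = rng.normal(size=(200, 3))
true_weights = np.array([0.6, 0.3, 0.1])               # this rater implicitly privileges ideas
scores = cues @ true_weights + rng.normal(scale=0.4, size=200)

# The "lens": a linear policy-capturing model of the rater's judgments.
model = LinearRegression().fit(cues, scores)
print(np.round(model.coef_, 2))                        # estimated cue weights, roughly [0.6, 0.3, 0.1]
print(round(model.score(cues, scores), 2))             # R^2, an index of the rater's consistency
```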

One additional area of relevant research was the work on clinical judgment. Meehl's (1954) influential monograph concluded that actuarial methods were superior to clinical judgment in predicting clinical outcomes. One reason given for the superiority of actuarial methods, often implemented as a regression equation or even the sum of unweighted variables (Dawes and Corrigan 1974), is that the actuarial method is provided with the variables from which to arrive at a judgment. By contrast, the clinician first needs to figure out the variables that are involved, the rubric, so to speak, and then determine the values of those variables to arrive at a judgment. As Meehl stressed, the clinician has limited mental resources to carry out the task. Under such conditions, it is not unreasonable for the clinician to perform inconsistently relative to actuarial methods. The overall and quick impression called for by the holistic instructions could have the effect of reducing the cognitive load demanded by a very detailed analysis. Such an analysis, because of its load, is likely to magnify whatever differences exist among readers in background and in capacity to carry out the task.

There was such relief once the holistic method had been found to help to improve interrater agreement that no one seems to have noted that the idea of holistic scoring is quite counterintuitive. How can a quick impression substitute for a deliberate and extensive analysis of a constructed response by a subject matter expert? Research on decision making suggests, in fact, that experts operate in a holistic sort of fashion and that it is a sign of their expertise to do so. Becoming an expert in any domain involves developing “fast and frugal heuristics” (Gigerenzer and Goldstein 1996) that can be applied to arrive at accurate judgments quickly.

Eventually, however, questions would be raised about holistic scoring. As Cumming et al. (2002) noted,

holistic rating scales can conflate many of the complex traits and variables that human judges of students’ written compositions perceive (such as fine points of discourse coherence, grammar, lexical usage, or presentation of ideas) into a few simple scale points, rendering the meaning or significance of the judges’ assessments in a form that many feel is either superficial or difficult to interpret. (p. 68)

That is, there is a price for the increased interreader agreement made possible by holistic scoring , namely, that we cannot necessarily document the mental process that scorers are using to arrive at a score. In the absence of that documentation, strictly speaking, we cannot be sure by what means scores are being assigned and whether those means are appropriate until evidence is presented.

Concerns such as these have given rise to research on rater cognition (Bejar 2012). The Diederich et al. (1961) study at ETS started this research tradition by attempting to understand the basis of the lack of agreement among scorers (see also Myers et al. 1966). The literature, a portion of it carried out at ETS, is vast and aims, in general, to unpack what goes on in the minds of raters as they score (Bejar et al. 2006; Crisp 2010; Elbow and Yancey 1994; Huot and Neal 2006; Lumley 2002; Norton 1990; Pula and Huot 1993; Vaughan 1991), the effect of a rater's background (Myford and Mislevy 1995; Shohamy et al. 1992), rater strategies (Wong and Kwong 2007), and methods to elicit raters' personal criteria (Bejar et al. 2006; Heller et al. 1998). Descriptions of the qualifications of raters have also been proposed (Powers et al. 1998; Suto et al. 2009). In addition, the nature of scoring expertise has been studied (Wolfe 1997; Wolfe et al. 1998). Methods to capture and monitor rater effects during scoring as a function of rater characteristics are similarly relevant (Myford et al. 1995; Myford and Mislevy 1995; Patz et al. 2002). Experimental approaches to modeling rater cognition have also emerged (Freedman and Calfee 1983), where the interest is in the systematic study of different factors that could affect the scoring process. The effectiveness of different approaches to the training of readers (Wolfe et al. 2010) and the qualifying of raters (Powers et al. 1998) has also been studied. In short, the Diederich et al. study was the first in a long line of research concerned with better understanding and improving the processes in which raters engage.

A second concern regarding holistic scoring is the nature of the inferences that can be drawn from scores. Current rubrics described as holistic, such as those used for scoring the GRE ® analytical writing assessment , are very detailed, unlike the early rubrics. That is, holistic scoring has evolved from its inception, although quietly. Early holistic scoring had as a goal the ranking of students’ responses.

Holistic scoring emerged in the context of admissions testing, which is to say a norm-referenced context. In that context, the ranking or comparative interpretation of candidates is the goal. Points along the scale of such a test do not immediately have implications for what a test taker knows and can do; that is, they do not by themselves attach an interpretation to a score or score range. The idea of criterion-referenced measurement (Glaser 1963) emerged in the 1960s and was quickly adopted as an alternative conception to norm-referenced testing, especially in the context of school-based testing. Today it is common (Linn and Gronlund 2000) to talk about standards-based assessments to mean assessments that have been developed following a framework that describes the content to be assessed such that scores on the test can be interpreted with respect to what students know and can do. Such interpretations can be assigned to a single score or, more commonly, a range of scores by means of a process called standard setting (Cizek and Bunch 2007; Hambleton and Pitoniak 2006), in which panels of experts examine the items, or performance on the items, to determine what students in those score regions know and can do.

NAEP had from its inception a standards-based orientation. The initial implementation of NAEP in the 1960s, led by Ralph Tyler, did not report scores but rather performance on specific items, and did not include constructed responses. When writing was first introduced in the late 1960s, the scoring methodology was holistic (Mullis 1980, p. 2). However, the methodology was not found adequate for NAEP purposes, and instead the method of primary traits was developed for the second NAEP writing assessment in 1974 (Cooper 1977, p. 11; Lloyd-Jones 1977). The reasons holistic scoring did not suit NAEP's measurement purposes are given by Mullis (1980):

NAEP needed to report performance levels for particular writing skills, and the rank ordering did not readily provide this information. Also, NAEP for its own charge of measuring change over time, as well as for users interested in comparisons with national results, needed a scoring system that could be replicated, and this is difficult to do with holistic scoring . (p. 3)

The criterion-referenced rationale that Mullis advocated was very much aligned with the standards-based orientation of NAEP. According to Bourque (2009), "by the mid-1980s, states began to realize that better reporting mechanisms were needed to measure student progress" (p. 3). A policy group, the National Assessment Governing Board (NAGB), was established to direct NAEP, and shortly thereafter the "Board agreed to adopt three achievement levels (Basic, Proficient, and Advanced) for each grade and subject area assessed by NAEP" (Bourque 2009, p. 3).

With respect to writing, Mullis (1984) noted ,

For certain purposes, the most efficient and beneficial scoring system may be an adaptation or modification of an existing system. For example, the focused holistic system used by the Texas Assessment Program … can be thought of as a combination of the impressionistic holistic and primary trait scoring systems. (p. 18)

To this day, the method used by NAEP to score writing samples is a modified holistic method called focused holistic (H. Persky, personal communication, January 25, 2011; see also Persky 2012) that seems to have originated in Texas around 1980 (Sachse 1984).

Holistic scoring also evolved within admissions testing, for different reasons, albeit in the same direction. N. Elliot (2005, p. 228) gives Paul Ramsey at ETS credit for instituting a "modified holistic" method, meaning that the scoring was accompanied by detailed scoring guides. In 1992 the College Board's English Composition Test (which would become SAT Writing) began using scoring guides as well. The rationale, however, was different, namely, comparability:Footnote 13

We need a scoring guide for the SAT Writing test because, unlike the ECT [English Composition Test] which gives an essay once a year, the SAT will be given 5 times a year and scoring of each administration must be comparable to scoring of other administrations. Other tests, like TOEFL , which give an essay several times a year use a scoring guide like this. (Memorandum from Marylyn Sudlow to Ken Hartman, August 6, 1992)

Clearly, the approach to scoring constructed responses had implications for score meaning and score comparability. However, the psychometric support for constructed responses was limited, at least compared with the support available for multiple-choice tests. Psychometric research at ETS since the 1950s was initially oriented to dichotomously scored items; a historical account can be found in Carlson and von Davier (Chap. 5, this volume). Fred Lord's work (Lord 1952) was critical for developing a broadly applicable psychometric framework, item response theory (IRT), that would eventually include ordered polytomously scored items (Samejima 1969),Footnote 14 a needed development to accommodate constructed responses. Indeed, IRT provided the psychometric backbone for developing the second generation of NAEP (Messick et al. 1983), including the incorporation of polytomously scored constructed-response items at a time when doing so in large-scale testing was rare. (For a detailed discussion of the ETS contributions to psychometric theory and software in support of constructed-response formats, see Carlson and von Davier, Chap. 5, this volume.)
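For readers unfamiliar with Samejima's graded response model, the following equations sketch its basic form for an item i scored in ordered categories k = 0, ..., m; the notation is a generic textbook rendering used here for illustration, not a reproduction of Samejima's 1969 formulation.

```latex
% Probability of responding in category k or higher on item i
P^{*}_{ik}(\theta) = \frac{1}{1 + \exp\!\left[-a_i\,(\theta - b_{ik})\right]},
\qquad P^{*}_{i0}(\theta) = 1, \quad P^{*}_{i,m+1}(\theta) = 0

% Probability of responding exactly in category k
P_{ik}(\theta) = P^{*}_{ik}(\theta) - P^{*}_{i,k+1}(\theta)
```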

The sense of error of measurement within IRT, as represented by the idea of an information function (Birnbaum 1968), was conditional and sample independent (in a certain sense), an improvement over the conception of error in classical test theory, which was global and sample specific. IRT explicitly introduced the idea that the error of measurement was not constant at all ability levels, although it did not allow for the identification of sources of error. Concurrent developments outside the IRT sphere made it possible to begin teasing out the contribution of the scoring process to score reliability (Ebel 1951; Finlayson 1951; Lindquist 1953), culminating in generalizability theory (Cronbach et al. 1972). Such analyses were useful for characterizing what portion of the error variability was due to different sources, among them lack of reader agreement. Bock et al. (2002), however, proposed a way to incorporate that framework into IRT whereby the conditional standard error of measurement derived from IRT could be partitioned to identify the portion due to the rating process. (Briggs and Wilson 2007 provide a more elaborate integration of IRT and generalizability theory.)
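The conditional error concept mentioned above is usually written in terms of the test information function; the second expression below is only a schematic illustration, not Bock et al.'s exact formulation, of how a rating-related error component can be added to the IRT-based error.

```latex
% Conditional standard error of measurement from the test information function
SE(\hat{\theta}) = \frac{1}{\sqrt{I(\theta)}}, \qquad I(\theta) = \sum_{i} I_i(\theta)

% Schematic augmentation with an error component attributable to the rating process
SE_{\text{total}}(\hat{\theta}) \approx
\sqrt{\frac{1}{I(\theta)} + \sigma^{2}_{\text{rating}}(\theta)}
```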

As Edgeworth (1890) had recognized, readers can differ in the stringency of the scores they assign, and such disagreements contribute to the error of measurement. Henry BraunFootnote 15 appears to have been the first at ETS to introduce the idea of rater calibration, described earlier by Paul (1981), as an approach to compensate for systematic disagreements among raters. The logic of the approach was described as follows (Braun 1988): "This new approach involves appropriately adjusting scores in order to remove the noise contributed by systematic sources of variation; for example, a reader consistently assigning higher grades than the typical reader. Such adjustments are akin to an equating process" (p. 2).

The operational implementation of the idea would prove challenging, however. To implement the idea economically, specialized data collection designs were necessary and had to be embedded in the operational scoring process over several days. The effects estimated from such an analysis would then be used to adjust the raw scores. Along the same lines, Longford (1994) also studied the possibility of adjusting scores by taking into account rater severity and consistency.
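The sketch below illustrates the basic logic of such an adjustment in its simplest possible form: each reader's severity is estimated as the mean deviation of that reader's scores from the overall mean and is then subtracted from the raw scores. The data and the estimation shortcut are purely illustrative; an operational calibration would rely on a linked scoring design and model-based estimates rather than raw means.

```python
import pandas as pd

# Hypothetical ratings: one row per reading of an essay by a reader (1-6 holistic scale).
ratings = pd.DataFrame({
    "reader": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "essay":  [1,   2,   3,   1,   2,   4,   3,   4,   2],
    "score":  [4,   3,   5,   5,   4,   6,   2,   4,   3],
})

grand_mean = ratings["score"].mean()
# Crude severity/leniency estimate: each reader's mean deviation from the grand mean.
severity = ratings.groupby("reader")["score"].mean() - grand_mean
# "Equating-like" adjustment: remove each reader's estimated effect from the raw scores.
ratings["adjusted"] = ratings["score"] - ratings["reader"].map(severity)
print(severity.round(2))
print(ratings)
```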

An alternative to adjusting scores retrospectively is to identify those readers who appear to be unusually severe or lenient so that they can receive additional training. Bejar (1985) experimented with approaches to identifying "biased" readers by means of multivariate methods in the Test of Spoken English. Myford et al. (1995) approached the problem of rater severity by applying FACETS (Linacre 2010), an extension of the IRT Rasch model that includes rater parameters as well as parameters for test takers and items.
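In such a many-facet Rasch formulation, the log-odds of a response being rated in category k rather than k-1 is modeled additively in terms of the test taker, the task, the rater, and the category threshold; the rendering below is a standard textbook form of the model rather than a quotation from the sources cited.

```latex
% Many-facet Rasch (rating scale) model: person n, item i, rater j, category k
\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = \theta_n - \delta_i - \lambda_j - \tau_k
% \theta_n: proficiency; \delta_i: item difficulty; \lambda_j: rater severity; \tau_k: category threshold
```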

1.2 Conclusion

When ETS was formed, the pragmatics of increasingly large-scale testing, together with psychometric considerations, set a barrier to the use of constructed-response formats, namely, the unreliability attributable to inadequate interrater agreement. Carl Brigham, the chief developer of the SAT, was also a strong proponent of more direct measures, but a solution to the scoring problem eluded him. After Brigham's death, there appeared to be no strong proponent of the format, at least not within the College Board, nor in the initial years of ETS. Without Brigham to push the point, and with the strong undercurrent against constructed responses illustrated by Huddleston's (1954) perspective that writing skills do not merit their own construct, the prospects for constructed-response testing seemed dire. However, the ETS staff also included writing scholars such as Paul Diederich and Fred Godshalk, and because of them, and others, there was ultimately significant progress in solving the interrater agreement challenge with the emergence of holistic scoring. That method, which was also widely accepted outside of ETS, paved the way for an increase in the use of essays. However, as we will see in the next sections, much more was needed for constructed-response formats to become viable.

2 Validity

Making progress on the scoring of constructed responses was critical but far from sufficient to motivate a wider reliance on constructed-response formats. Such formats necessarily require longer response times, which means fewer items can be administered in a given time, threatening score reliability. The conception of validity prevailing in the mid-twentieth century emphasized predictive validity, which presented a challenge for the adoption of constructed-response formats since their characteristically lower score reliability would attenuate predictive validity. The evolution of validity theory would be highly relevant to decisions regarding the use of response format, as we will see shortly. Research at ETS played a key role and was led by Samuel Messick, who argued, along with others, not only for a unitary conception of validity, as opposed to the so-called Trinitarian conception (Guion 1980) consisting of content, criterion, and construct "validities," but also, just as important, for the relevance of such a unitary conception of validity to educational measurement. First, it is informative to review briefly the historical background.

The notion that eventually came to be known as content validity, and was seen as especially relevant to educational testing, probably has its roots in the idea of the sampling of items as a warrant for score interpretation. That notion was proposed early on by Robert C. Tryon as a reaction to the factor analytic conception of individual differences that prevailed at the time. Tryon (1935) argued,

The significant fact to observe about mental measurement is that, having marked out by definition some domain for testing, the psychologist chooses as a method of measurement one which indicates that he knows before giving the test to any subjects a great deal about the nature of the factors which cause individual differences in the domain. The method is that of sampling behavior, and it definitely presupposes that for any defined domain there exists a universe of causes, or factors, or components determining individual differences. Each test-item attempts to ‘tap’ one or more of these components. (p. 433, emphasis in the original)

Tryon was on track with respect to assessment design in suggesting that the assessment developer should know much about what is to be tested "before giving the test to any subjects," thereby implying the need to explicate what is to be measured in some detail as a first step in the design of an assessment (a principle fully fleshed out much later in ECD; Mislevy et al. 2003). However, his rejection of the prevailing factor analytic perspective advocated by the prominent psychologists of the day (Spearman 1923; Thurstone 1926) was probably responsible for the lack of acceptance of his perspective.Footnote 16 Among the problems raised about the sampling perspective as a warrant for score interpretation was that, in principle, it seemed to require the preexistence of a universe of items so that random samples could be taken from it. Such an idea presupposes some means of defining the universe of items. The resistance to the idea was most vocally expressed by Jane Loevinger (1965), who could not envision how to explicate such universes. Nevertheless, the relevance of sampling in validation was affirmed by Cronbach (1980) and Kane (1982), although not as a sufficient consideration, even though the link back to Tryon was lost along the way.

What appears to have been missed in Tryon's argument is that he intended the universe of items to be isomorphic with a "universe of factors, causes, or components determining individual differences" (p. 433), which would imply a crossing of content and process in the creation of a universe of items. Such an idea foreshadows notions of validity that would be proposed many decades later, specifically notions related to construct representation (Embretson 1983). Instead, in time, the sampling perspective became synonymous with content validity (Cronbach 1971): "Whether the operations that finally constitute the test correspond to the specified universe is the question of content validity" (p. 452, emphasis added). The idea of a universe was taken seriously by Cronbach (although using for illustration an example from social psychology, which, interestingly, implies a constructed-response test design):

For observation of sociability, the universe specification presumably will define a category of “social acts” to be tallied and a list of situations in which observations are to be made. Each observation ought to have validity as a sample from this universe. (p. 452)

While sampling considerations evolved into content validity, and were thought to be especially applicable to educational (achievement) testing (Kane 2006), the predictive or criterion notion of "validity" dominated from 1920 to 1950 (Kane 2006) and served to warrant the use of tests for selection purposes, which in an educational context meant admissions testing. The research at ETS described earlier on writing assessment took place in that context. The predictive view presented a major hurdle to the use of constructed-response formats because, in a predictive context, it is natural to evaluate any modifications to the test, such as adding constructed-response formats, with respect to increases in prediction (Breland 1983):

Because of the expense of direct assessments of writing skill, a central issue over the years has been whether or not an essay adds significantly to the measurement accuracy provided by other available measures—the high school record, objective test scores, or other information. (p. 14)

Breland provided a meta-analysis of writing assessment research showing the incremental prediction of writing samples over measures consisting only of multiple-choice items. Although he presented a fairly compelling body of evidence, a cost-conscious critic could have argued that the increases in prediction could just as easily have been obtained more economically by lengthening the multiple-choice component.

The third conception of validity is construct validity , dating back to the mid-twentieth-century seminal paper introducing the term (Cronbach and Meehl 1955) . In that paper, validation is seen as a process that occurs after the assessment has been completed, although the process is driven by theoretical expectations. However, Cronbach and Meehl did not suggest that those expectations should be used in developing the test itself. Instead, such theoretical expectations were to be used to locate the new test within a nomological network of relationships among theoretically relevant variables and scores. At the time Cronbach and Meehl were writing, developing a test was a matter of writing items as best one could and then pretesting them. The items that did not survive were discarded. In effect, the surviving items were the de facto definition of the construct, although whether it was the intended construct could not be assumed until a conclusion could be reached through validation . In the wrong hands, such an ad hoc process could converge on the wrong test.Footnote 17 Loevinger (1957) argued that “the dangers of pure empiricism in determining the content of a test should not be underestimated” (p. 657) and concluded that

there appears to be no convincing reason for ignoring content nor for considering content alone in determining the validity of a test or individual items. The problem is to find a coherent set of operations permitting utilization of content together with empirical considerations. (p. 658)

Clearly Loevinger considered content important, but the “coherent set of operations” she referred to was missing at the time, although it would appear soon as part of the cognitive science revolution that was beginning to emerge in the 1950s.Footnote 18

Toward the end of that decade, another influential article was published that would have important repercussions for the history of research on constructed-response formats. D. T. Campbell and Fiske (1959) made an important distinction: "For the justification of novel trait measures, for the validation of test interpretation, or for the establishment of construct validity, discriminant validation as well as convergent validation is required" (p. 81, emphasis in the original).

The paper is significant for contrasting the evidentiary basis for and against a psychometric claim.Footnote 19 In addition, the paper formalized the notion of method variance, which would surface later in research about constructed-response formats, especially in evaluating the measurement equivalence of the multiple-choice and constructed-response formats.

As can be seen, the 1950s was a contentious and productive decade in the conceptual development of testing. Importantly, the foregoing discussion about the nature of validity did not take place at ETS. Nevertheless, it is highly relevant to this chapter. These developments in validity theory may even have been seen as tangential to admissions tests,Footnote 20 which represented the vast majority of ETS operations at the time. In that context, the normative interpretations of scores together with predictive validity were the accepted practice.

As mentioned earlier, in 1963 a most influential paper was published by Glaser, proposing an alternative approach to score interpretation and assessment design in an educational setting, namely, interpretation by reference to the level of proficiency within a very well-defined content domain. Glaser's intent was to provide an alternative to normative interpretations, since norms were less relevant in the context of individualized instruction.Footnote 21 Whereas norms provide the location of a given score in a distribution of scores, a criterion-referenced interpretation was intended to be more descriptive of the test taker's skills than a normative interpretation. The criterion-referenced approach became aligned early on with the idea of mastery testing (Hambleton and Novick 1973), whereby the objective of measurement was to determine whether a student had met the knowledge requirements associated with a learning objective.

Criterion-referenced tests were thought to yield more actionable results in an educational context not by considering a score as a deviation from the mean of a distribution, the normative interpretation, but by locating the score within an interval such that all scores in that interval would have a similar interpretation. In the simplest form, this meant determining a cut score that would define the range of passing scores and the range of failing scores, with passing implying mastery. To define those intervals, cut scores along the score scale needed to be decided on first. However, as noted by Zieky (1995), the methodology for setting such cut scores had not yet emerged. In retrospect, it is clear that if the deviation from a mean was not adequate for score interpretation, locating a score within an interval would not necessarily help either; much more was needed. In fact, reflecting on his 1963 paper, Glaser (1994) noted that "systematic techniques needed to be developed to more adequately identify and describe the components of performance, and to determine the relative weighting of these components with respect to a given task" (p. 9).

The “components of performance” that Glaser thought needed to be developed echoed both Tryon’s earlier “components determining individual differences” and the “coherent set of operations permitting utilization of content” that Loevinger called for. That is, there had been an implied consensus all along as to a key ingredient for test meaning, namely, identifying the underlying sources of variability in test performance , which meant a deeper understanding of the response process itself.Footnote 22

2.1 Validity Theory at ETS

With the benefit of hindsight, it seems that by 1970 the conception of validity remained divided, consisting of different "validities," which had significant implications for the use of constructed-response formats in education. Three developments were needed to further that use:

  • With criterion (and especially predictive) validity as the primary conception of validity, economics would delay wider use of constructed-response formats. Replacing the Trinitarian view with a unitary view was needed to avoid associating the different "validities" with specific testing contexts.

  • Even under a unitary view, the costs of constructed-response formats would remain an obstacle. An expansion of the unitary conception was necessary to explicitly give the evidential and consequential aspects of validity equal footing. By doing so, the calculus for the deployment of constructed-response formats would balance monetary cost with the (possibly intangible) benefits of using the format.

  • As alluded to earlier, by and large, the broader discussion of validity theory was not directed at educational achievement testing. Thus, the third needed development was to make the evolution of validity theory applicable to educational testing.

These three developments were related and enormous in scope. Whereas the earlier evolution of validity had taken place outside of ETS, Sam Messick dedicated two decades to explicating the unitary view, bringing its evidential and consequential aspects more into line with one another, and making the view relevant, if not central, to educational testing. These advances, arguably, were essential to the wider use of constructed-response formats in education.

Calls for a unitary view in the form of construct validity began early on. Messick (1989) quoted Loevinger's assertion that, "since predictive, concurrent, and content validities are all essentially ad hoc, construct validity is the whole of validity from a scientific point of view" (p. 17). Messick elaborated that idea, stating that "almost any kind of information about a test can contribute to an understanding of construct validity, but the contribution becomes stronger if the degree of fit of the information with the theoretical rationale underlying score interpretation is explicitly evaluated" (p. 17). That is, Messick stressed the need for a theoretical rationale to integrate the different sources of validity evidence.

Importantly, Messick's (1980, 1989) unitary view explicitly extended to the consequences of test use, with implications for the use of constructed-response formats. Although the message was not well received in some quarters (Kane 2006, p. 54), it was in others. For example, Linn et al. (1991) argued,

If performance-based assessments are going to have a chance of realizing the potential that the major proponents in the movement hope for, it will be essential that the consequential basis of validity be given much greater prominence among the criteria that are used for judging assessments. (p. 17)

By the 1990s, there had been wider acceptance that consequential evidence was relevant to validity. But that acceptance was one of the later battles that needed to be fought. The relevance of the unitary view to educational testing needed to be established first. In 1975, Messick wondered, "Why does educational measurement, by and large, highlight comparative interpretations, whether with respect to norms or to standards,Footnote 23 and at the same time play down construct interpretations?" (p. 957).

This question was raised in reaction to the predominance that criterion-referenced testing had acquired by the 1970s. Among the possible answers Messick (1975) proposed for the absence of construct interpretations was the "legacy of behaviorism and operationism that views desired behaviors as ends in themselves with little concerns for the processes that produce them" (p. 959, emphasis added). That speculation was later corroborated by Lorrie Shepard (1991), who found that, for the most part, state testing directors had a behaviorist conception of student learning.

The positive attitude toward behaviorism among state testing directors is informative because the so-called cognitive revolution had been under way for several decades. Although its relevance to testing was recognized early on, its impact on testing practice was meager. Susan Embretson, who was not associated with ETS, recognized those implications (Whitely and Dawis 1974).Footnote 24 In an important paper, Embretson (1983) integrated ideas from cognitive science into testing and psychometric theory by building on Loevinger's argument and layering a cognitive perspective on it. She proposed the term construct representation to describe the extent to which performance on a test is a function of the mental processes hypothesized to underlie test performance. An approach to documenting construct representation is modeling the difficulty of items as a function of variables representing the response process and knowledge hypothesized to underlie performance.Footnote 25 Modeling of item difficulty is well suited to multiple-choice items but less so to items requiring a constructed response, since there would typically be fewer of them in any given test. Nevertheless, the concept is equally applicable, as Messick (1994) noted with specific reference to performance assessment: "Evidence should be sought that the presumed sources of task complexity are indeed reflected in task performance and that the complex skill is captured in the test scores with minimal construct underrepresentation" (p. 20).
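One common way to carry out the kind of difficulty modeling mentioned above is a linear model in the spirit of the linear logistic test model, in which each item's difficulty is expressed as a weighted combination of features of the hypothesized response process; the equation below is a generic illustration of that idea rather than Embretson's specific formulation.

```latex
% Item difficulty b_i decomposed into K response-process features with weights \eta_k
b_i = \sum_{k=1}^{K} q_{ik}\,\eta_k + \varepsilon_i
% q_{ik}: amount of feature k required by item i; \eta_k: estimated contribution of feature k
```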

Embretson’s construct representation fit well with Messick’s calls for a fuller understanding of the response process as a source of validity evidence. But Messick (1990) also understood that format had the potential to introduce irrelevancies:

Inferences must be tempered by recognizing that the test not only samples the task universe but casts the sampled tasks in a test format, thereby raising the specter of context effects or irrelevant method [i.e., format] variance possibly distorting test performance vis-a-vis domain performance. (p. 9)

Independently of the evolution of validity theory that was taking place, the calls for direct and authentic forms of assessment never stopped, as evidenced by the work on portfolio assessments at ETS (Camp 1993; Gentile 1992; Myford and Mislevy 1995) and elsewhere. Following the period of "minimum competency testing" in the 1980s, there were calls for testing higher order forms of educational achievement (Koretz and Hamilton 2006), including the use of so-called authentic assessments (Wiggins 1989). The deployment of highly complex forms of assessment in the early 1990s was intended to maximize the positive educational consequences of constructed-response formats and avoid the negative consequences of the multiple-choice format, such as teaching to the narrow segment of the curriculum that a multiple-choice test would represent. However, despite the appeal of constructed-response formats, such forms of assessment still needed to be evaluated from a validity perspective encompassing both evidential and consequential considerations. As Messick (1994) noted,

some aspects of all testing, even performance testing, may have adverse as well as beneficial educational consequences . And if both positive and negative aspects, whether intended or unintended, are not meaningfully addressed in the validation process, then the concept of validity loses its force as a social value. (p. 22)

Indeed, following the large-scale deployment of performance assessments in K–12 education in the 1990s (Koretz and Hamilton 2006), it became obvious that overcoming the design challenges would take time. Although the assessments appeared to have positive effects on classroom practice, they did not meet technical standards, especially with respect to score reliability. As a result, the pendulum swung back to the multiple-choice format (Koretz and Hamilton 2006, p. 535).

Not surprisingly, after the long absence of constructed-response formats from educational testing, the know-how for using such formats was not fully developed. Reintroducing them would require additional knowledge and a technological infrastructure that would make the format affordable.

2.2 Conclusion

Arguably, the predictive conception of validity prevalent through most of the twentieth century favored the multiple-choice format. The evolution of validity into a more unitary concept was not initially seen as relevant to educational measurement. Samuel Messick thought otherwise and devoted two decades to explicating the relevance of a unitary conception, incorporating along the way consequential, not just evidential, considerations, a step that was critical to reasoning about the role of response format in educational measurement.

3 The Interplay of Constructs and Technology

The evolution of validity theory may have been essential to providing a compelling rationale for the use of constructed-response formats. However, cost remains an issue, especially in an educational context: According to Koretz and Hamilton (2006), “concerns about technical quality and costs are likely to dissuade most states from relying heavily on performance assessments in their accountability systems … particularly when states are facing heavy testing demands and severe budget constraints” (p. 536).

An important contribution by ETS to the development of constructed -response formats has been to take advantage of technological developments for educational and professional testing purposes. Among the most salient advances are the following:

  • using computers to deploy constructed -response formats that expand construct coverage

  • taking advantage of technology to enable more efficient human scoring

  • pioneering research on automated scoring in a wide range of domains to improve cost effectiveness and further leverage the computer as a delivery medium

If the scanner enabled the large-scale use of multiple-choice tests, the advent of the computer played a similar role in enabling the large-scale use of constructed-response formats.Footnote 26 Incorporating technological advances into operational testing had been common practice at ETS almost from its inception (Traxler 1951, 1954). However, a far more visionary perspective was apparent at the highest levels of the organization. In 1951, ETS officer William Turnbull coined the term tailored testing (Lord 1980, p. 151), that is, the idea of adapting the test to the test taker.Footnote 27 Some years later, as the organization’s executive vice president, he elaborated on the idea (Turnbull 1968):

The next step should be to provide examinations in which the individual questions are contingent on the student’s responses to previous questions. If you will permit the computer to raise its ugly tapes, I would like to put forward the prospect of an examination in which, for each examinee, the sequence of questions is determined by his response to items earlier in the sequence. The questions will be selected to provide the individual student with the best opportunity to display his own profile of talent and accomplishment, without wasting time on tasks either well below or well beyond his level of developed ability along any one line. Looking farther down this same path, one can foresee a time when such tailor-made tests will be part and parcel of the school’s instructional sequence; when the results will be accumulated and displayed regularly as a basis for instruction and guidance; and when the pertinent elements of the record will be banked as a basis for such major choice points as the student’s selection of a college. (p. 1428, emphasis added)

Although Turnbull was not addressing the issue of format, his interest in computer-based testing is relevant to the eventual wider use of constructed-response formats, which perhaps would not have been feasible in the absence of computer-based testing. (The potential of microcomputers for testing purposes was recognized early at ETS; Ward 1984.) That an officer and future president of ETS would envision in such detail the use of computers in testing could have set the stage for an earlier use of computers for test delivery than might otherwise have been the case. And if, as Fowles (2012) argued, computer delivery was in part responsible for the adoption of writing in postsecondary admissions tests like the GRE General Test, then it is possible that the early adoption of computer delivery by ETS accelerated that process.Footnote 28 The transition to computer delivery started with what was later named the ACCUPLACER ® test, a placement test consisting entirely of multiple-choice items developed for the College Board. It was first deployed in 1985 (Ward 1988) and was an important first success because it opened the door for other tests to follow.Footnote 29

Once computer delivery was successfully implemented, it was natural for other ETS programs to look into the possibility. Following the deployment of ACCUPLACER, a computer-adaptive version of the GRE General Test was introduced in 1992 (Mills and Steffen 2000). The 1992 examination was an adaptive test consisting of multiple-choice sections for Verbal Reasoning, Quantitative Reasoning, and Analytical Reasoning. However, the Analytical Reasoning measure was replaced in 2002 by the Analytical Writing section, consisting of two writing tasks: an issue essay (45 minutes, with a choice between two prompts) and an argument essay (30 minutes).

The transition to computer delivery in 1992 and the addition of writing in 2002 appear to have flowed seamlessly, but in fact the process was far more circuitous. The issue and argument prompts that composed the Analytical Writing measure were a significant innovation in assessment design and an interesting example of serendipity, of the interplay of formats and technology, and of attention to the consequences of testing.

Specifically, the design of the eventual GRE Analytical Writing measure evolved from the GRE Analytical Reasoning (multiple-choice) measure, which was itself a major innovation in the assessment of reasoning (Powers and Dwyer 2003). The Analytical Reasoning measure evolved by including and excluding different item types. In its last incarnation, it consisted of two multiple-choice item types, analytical reasoning and logical reasoning. The logical reasoning item type called for evaluating plausible conclusions, determining missing premises, finding the weakness of a conclusion, and so on (Powers and Dwyer 2003, p. 19). The analytical reasoning item type presented a set of facts and rules or restrictions. The test taker was asked to ascertain the relationships permissible among those facts, and to judge what was necessary or possible under the given constraints (Chalifour and Powers 1989).

Although an extensive program of research supported the development of the Analytical Reasoning measure, it also presented several challenges, especially under computer delivery. In particular, performance on the logical reasoning items correlated highly with the verbal reasoning items, whereas performance on the analytical reasoning items correlated highly with the quantitative reasoning items (Powers and Enright 1987), raising doubts about the construct the measure assessed. Moreover, no conclusive validity evidence for the measure as a whole was found when using an external criterion (Enright and Powers 1991). The ambiguous construct underpinnings of the Analytical Reasoning measure were compounded by the presence of speededness (Bridgeman and Cline 2004), which was especially harmful under computer delivery. Given these challenges, it is no surprise that the Analytical Reasoning measure was ultimately replaced by the Analytical Writing measure, which offered a well-balanced design.

The issue prompt has roots in the pedagogy of composition. As D’Angelo (1984) noted, textbooks dating back to the nineteenth century distinguish four genres: narration, description, exposition, and argumentation. Argumentation was defined as “the attempt to persuade others of the truth of a proposition” (p. 35, emphasis added). There is less precedent, if any, for the GRE argument prompt, which presents the task of critiquing an argument. The germ of the idea for an argument-critique prompt was planted during efforts to better prepare minority students for the GRE Analytical Reasoning measure, specifically the logical reasoning item type (Peter Cooper, personal communication, November 27, 2013):

The Logical Reasoning items … took the form of a brief stimulus passage and then one or more questions with stems such as “Which of the following, if true, weakens the argument?,” “The argument above rests on which of the following assumptions,” and so forth, with five options. At a workshop in Puerto Rico, a student commented that he would prefer questions that allowed him to comment on the argument in his own terms, not just pick an answer someone else formulated to a question someone else posed. I thought to myself, “Interesting concept … but be careful what you wish for” and did nothing for a couple of years, until [Graduate Management Admission Test] GMAT … told us in the summer of 1993 that it wanted to add a constructed -response measure, to be operational by October 1994, that would get at analytical reasoning—i.e., not just be another writing measure that rewarded fluency and command of language, although these would matter as well. Mary Fowles had discussed “Issue”-like prototypes with the [Graduate Management Admission Council] GMAC’s writing advisory committee, which liked the item type but seemed to want something more “analytical” if possible. I recalled the student’s comment and thought that a kind of constructed -response Logical Reasoning item could pair well with the Issue-type question to give a complementary approach to analytical writing assessment : In one exercise, students would make their own argument, developing a position on an issue, and in the other exercise they would critically evaluate the line of reasoning and use of evidence in an argument made by someone else. Both kinds of skills are important in graduate-level work.

Mary Fowles (2012) picked up the story from there: “What caused this seemingly rapid introduction of direct writing assessment for admission to graduate and professional programs?” (pp. 137–138). She cited factors such as the “growing awareness [of the relationship] between thinking and writing”; the availability of the computer as a delivery medium, which “enabled most examinees to write more fluently” and “streamlined the process of collecting written responses”; and “essay testing programs [that] now had the advantage of using automated scoring ” (pp. 137–138).

Although the genesis of the argument prompt type came from attempts to help prepare students of diverse backgrounds for the multiple-choice GRE Analytical Reasoning section, the analytical writing measure comprising issue and argument prompts was used first by the GMAT . In 1994, that measure was offered in paper-and-pencil form, and then moved to computer when GMAT converted to an adaptive test in 1997. The GRE first used the measure as a stand-alone test (the GRE Writing Assessment ) in 1999 and incorporated it into the General Test in 2002, as noted earlier.

The transition to computer delivery in the 1990s was not limited to the GRE and GMAT . The TOEFL ® test transitioned as well. It evolved from a test conceived in the 1960s to a measure rooted in the communicative competence construct (Canale and Swain 1980; Duran et al. 1987) . The earlier efforts to bolster TOEFL by introducing stand-alone writing and speaking tests—the Test of Written English (TWE® test) and the Test of Spoken English (TSE® test)— were seen as stopgap measures that led to an “awkward” situation for the “communication of score meaning” (C. A. Taylor and Angelis 2008, p. 37). Importantly, communicative competence called for evidence of proficiency in productive skills, which meant the assessment of writing and speaking proficiency in academic settings. In the case of speaking, these requirements meant that ultimately complex multimodal tasks were needed where students would read or listen to a stimulus and provide a spoken response. The construct of communicative competence was unpacked in frameworks corresponding to the four skills thought to compose it: reading (Enright et al. 2000) , listening (Bejar et al. 2000), writing (Cumming et al. 2000), and speaking (Butler et al. 2000). The frameworks served as the basis for experimentation, after which the blueprint for the test was set (Pearlman 2008a).

Computer delivery would prove critical to implementing such an ambitious test, especially the measurement of the productive skills. The inclusion of writing was relatively straightforward because there was already experience from GRE and GMAT. In fact, when TOEFL first transitioned to computer in 1998, the TOEFL CBT used the TWE prompt, answered as either a typed or handwritten essay. Nevertheless, there were still significant challenges, especially technological and assessment design challenges. The assessment of computer-delivered speaking on an international scale was unprecedented, especially given the test security requirements.Footnote 30 The first generation of computer delivery that had served GRE and TOEFL CBT was less than ideal for effectively and securely delivering an international test administered every week. For one, test takers speaking their responses could disturb other test takers. In addition, the quality of the captured speech needed to be high in all test centers to avoid potential construct-irrelevant variance. These requirements meant changes at the test centers, as well as research on the best microphones for capturing spoken responses. On the back end, written and spoken responses needed to be scored quickly to meet a turnaround of no more than 10 days. These requirements influenced the design of the next-generation test delivery system at ETS, iBT (Internet-based testing), and when the latest version of TOEFL was released in 2005, it was called the TOEFL iBT ® test (Pearlman 2008a).

In addition to the technological challenges of delivering and scoring a secure speaking test, there were several assessment design challenges. To accommodate the international volume of test takers, it was necessary to administer the test 50 times a year. Clearly, the forms from week to week needed to be sufficiently different to prevent subsequent test takers from being able to predict the content of the test. The central concept was that of reusability, a key consideration in ECD , which was implemented by means of item templates (Pearlman 2008a).
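
One hypothetical way an item template of this kind might be realized is sketched below; the template text, variables, and fillers are invented and are not drawn from actual TOEFL materials, but they illustrate how a fixed task structure can be reused across weekly forms while surface content rotates.

```python
# Hypothetical sketch of an item template of the kind ECD-style reusability
# implies: the fixed task structure is reused while content variables change
# from form to form. The template wording and fillers are invented.
from string import Template
from itertools import product

listening_task = Template(
    "Listen to a short lecture about $topic. In 60 seconds, summarize the "
    "lecturer's main claim about $aspect and one piece of supporting evidence."
)

# Variable fillers that a test developer might rotate across weekly forms.
topics = ["urban planning", "marine biology", "art history"]
aspects = ["its practical consequences", "its historical origins"]

variants = [
    listening_task.substitute(topic=t, aspect=a)
    for t, a in product(topics, aspects)
]

for variant in variants[:3]:
    print(variant)
```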

3.1 Computer-Mediated Scoring

Once tests at ETS began to transition to computer delivery, computer-mediated scoring became of interest. Typically, faculty, in the case of educational tests, or practitioners, in the case of professional assessments, would congregate at a central location to conduct the scoring. As volume grew, best practices were developed, especially in writing (Baldwin 2004) , and more generally (Baldwin et al. 2005; McClellan 2010). However, the increase in testing volumes called for better utilization of technology in the human scoring process.

Perhaps anticipating the imminence of larger volumes and the increasing availability of computers, there was experimentation with “remote scoring” fairly earlyFootnote 31 (Breland and Jones 1988). In the Breland and Jones study, the essays were distributed via courier to the raters at home. The goal was to evaluate whether solo scoring was feasible compared to centralized or conference scoring. Not surprisingly, this form of remote scoring was not found to be as effective as conference scoring. The full affordances of computer technology were not exploited until a few years later (Bejar and Whalen 2001; Driscoll et al. 1999; Kuntz et al. 2006).

The specialized needs of NAEP motivated a somewhat different use of the computer to mediate the human scoring process. In the early 1990s the NAEP program started to include state samples, which led to large increases in the volume of constructed responses. Such responses were contained in a single booklet for each student. To avoid potential scoring bias that would result from a single reader scoring all the constructed responses from a given student, a system was developed where the responses would be physically clipped and scanned separately. The raters would then score the scanned responses displayed on a terminal, with each response for a student routed to a different rater. The scoring of NAEP constructed responses was carried out by a subcontractor (initially NCS, and then Pearson after it acquired NCS) under direction from ETS. Scoring was centralized (all the raters were at the same location), but computer images of the work product were presented on the screen and the rater entered a score that went directly into a database.Footnote 32
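
The routing idea can be illustrated with a minimal sketch; the student, item, and rater identifiers are invented, and the rotation rule is only one simple way to guarantee that no single rater scores an entire booklet.

```python
# Minimal sketch of the routing idea described above: each scanned response
# from a given student goes to a different rater, so no single rater scores a
# student's whole booklet. Students, items, and raters are invented.
from collections import defaultdict

raters = ["R1", "R2", "R3", "R4"]
students = ["S001", "S002", "S003"]
items = ["Q1", "Q2", "Q3", "Q4"]

assignments = defaultdict(list)   # rater -> list of (student, item)
for s_index, student in enumerate(students):
    for i_index, item in enumerate(items):
        # Offset by student index so consecutive responses from the same
        # student rotate across raters.
        rater = raters[(s_index + i_index) % len(raters)]
        assignments[rater].append((student, item))

for rater, work in assignments.items():
    print(rater, work)
```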

3.2 Automated Scoring

Though technology has had an impact on human scoring , a more ambitious idea was to automate the scoring of constructed responses. Page, a professor at the University of Connecticut, first proposed the idea for automated scoring of essays (Page 1966). It was an idea ahead of its time, because for automated scoring to be maximally useful, the responses need to be in digital form to begin with; digital test delivery was some decades away. However, as the computer began to be used for test delivery, even if it was limited to multiple-choice items, it was natural to study how the medium might be leveraged for constructed-response scoring purposes. Henry Braun , then vice president for research management, posed precisely that question (personal communication, July 9, 2014). Although a statistician by training, he was familiar with the literature on expert systems that had proliferated by the 1980s as a means of aiding and even automating expert judgment. In contrast to earlier research on actuarial judgment (Bejar et al. 2006), where the clinician and a regression equation were compared, in expert systems the role of the computer is more ambitious and consists of both analyzing an object (e.g., a doctor’s course of treatment for a patient, an architectural design) and, based on that analysis, making a decision about the object, such as assigning a score level.

Randy Bennett took the lead at ETS in exploring the technology for scoring constructed responses in concert with theory about the relevant constructs, including mathematics (Bennett and Sebrechts 1996; Bennett et al. 1999, 2000a; Sandene et al. 2005; Sebrechts et al. 1991, 1996), computer science (Bennett and Wadkins 1995), graphical items (Bennett et al. 2000a; b), and formulating hypotheses (Bennett and Rock 1995). The scoring of mathematics items has reached a significant level of maturity (Fife 2013), as has the integration of task design and automated scoring (Graf and Fife 2012).

Much of the research on automated scoring was experimental, in the sense that actual applications needed to await the delivery of tests by computer. One ETS client, the National Council of Architectural Registration Boards (NCARB), was independently weighing the implications of technology for the profession. The software used in engineering and architecture, computer-assisted design (CAD), was transitioning during the 1980s from minicomputers to desktop computers. A major implication of that transition was that the cost of the software came down significantly and became affordable to an increasingly larger number of architecture firms, thereby changing, to some extent, the entry requirements for the profession. Additionally, the Architectural Registration Examination introduced in 1983 was somewhat unwieldy: It consisted of many parts that required several years to complete, since they could not all be taken together during the single testing window made available every June. A partnership between ETS and NCARB was established to transition the test to computer delivery and allow continuous testing, to revise the content of the test, and to take advantage of computer delivery, including automated scoring.

Management of the relationship between ETS and NCARB was housed in ETS’s Center for Occupational and Professional Assessment (COPA), led by vice president Alice Irby, who was aware of the research on the utilization of computers for test delivery and scoring under Henry Braun. A project was initiated between ETS and NCARB that entailed developing new approaches to adaptive testing with multiple-choice items in a licensing context (Lewis and Sheehan 1990; Sheehan and Lewis 1992) and that had the more ambitious goal of delivering and scoring on computer the parts of the examination that required the demonstration of design skills.

The paper-and-pencil test used to elicit evidence of design skills included a very long design problem that took some candidates up to 14 hours to complete. Scoring such a work product was a challenge even for the practicing architects, called jurors. The psychometric drawbacks of a test consisting of a single item were not necessarily apparent to the architects. However, they had realized that a single-item test could make it difficult for a candidate to recover from an early wrong decision. That insight led to an assessment consisting of smaller constructed-response design tasks that required demonstrations of competence in several aspects of architectural practice (Bejar 2002; Bejar and Braun 1999). The process of narrowing the test design to smaller tasks was informed by practice analyses intended to identify the knowledge, skills, and abilities (so-called KSAs) required of architects, and their relative importance. This information was used to construct the final test blueprint, although many other considerations entered the decision, including considerations related to interface design and scorability (Bennett and Bejar 1998).

Reconceptualizing the examination to better comply with psychometric and technological considerations was a first step; the challenge of delivering and scoring the architectural designs remained. Work on the interface and delivery, as well as supervision of the engineering of the scoring engines, was led by Peter Brittingham, while the test development effort was led by Dick Devore. The scoring approach was conceived by Henry Braun (Braun et al. 2006) and Bejar (1991). Irv Katz contributed a cognitive perspective to the project (Katz et al. 1998). The work led to operational implementation in 1997, possibly the first high-stakes operational application of automated scoring.Footnote 33

While ETS staff supported research on automated scoring in several domains, perhaps the ultimate target was essays, especially in light of their increasing use in high-volume testing programs. Research on automated scoring of textual responses began at ETS as part of an explicit effort to leverage the potential of technology for assessment. However, the first thorough evaluation of the feasibility of automated essay scoring was somewhat fortuitous and was carried out as a collaboration with an external partner. In the early 1990s, Nancy Petersen heard Ellis B. Page discuss his system, PEG, for scoring essaysFootnote 34 at an AERA reception. Petersen suggested to Page the possibility of evaluating the system in a rigorous fashion using essays from 72 prompts taken from the PRAXIS ® program, which had recently begun to collect essays on computer. The report (Page and Petersen 1995) was optimistic about the feasibility of automated scoring but lacked detail on the functioning of the scoring system. Based on the system’s relatively positive performance, there was discussion between ETS and Page regarding a possible licensing of the system for nonoperational use, but the fact that Page would not fully revealFootnote 35 the details of the system motivated ETS to invest further in its own development and research on automated scoring of essays. That research paid off relatively quickly since the system developed, the e-rater ® engine, was put into operation in early 1999 to score GMAT essays (Burstein et al. 1998) . The system has continued to evolve (Attali and Burstein 2006; Burstein et al. 2004; Burstein et al. 2013) and has become a major ETS asset. Importantly, the inner workings of e-rater are well documented (Attali and Burstein 2006; Quinlan et al. 2009) , and disclosed through patents.
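
The general logic of feature-based essay scoring can be sketched as follows; the features, weights, and sample text below are invented and deliberately crude, and they are not e-rater's actual features or model.

```python
# Illustrative sketch of the general logic of feature-based essay scoring:
# extract coarse linguistic features and combine them with weights estimated
# against human scores. The features, weights, and sample essay are invented
# and are not e-rater's actual features or scoring model.
import math
import re

def extract_features(essay: str) -> dict:
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "log_word_count": math.log(len(words) + 1),        # development proxy
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }

# In practice, weights of this kind are estimated by regressing human scores
# on the features over a large set of previously scored essays.
WEIGHTS = {"log_word_count": 0.6, "avg_sentence_length": 0.05, "type_token_ratio": 1.5}
INTERCEPT = 0.4

def predicted_score(essay: str) -> float:
    features = extract_features(essay)
    return INTERCEPT + sum(WEIGHTS[name] * value for name, value in features.items())

sample = ("Testing organizations adopted computer delivery gradually. "
          "Digital responses then made automated scoring practical at scale.")
print(round(predicted_score(sample), 2))
```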

The e-rater engine is an example of scoring based on linguistic analysis, which is a suitable approach for essays (Deane 2006) . While the automated scoring of essays is a major accomplishment, many tests rely on shorter textual responses, and for that reason approaches to the scoring of short textual responses have also been researched. The basic problem of short-answer scoring is to account for the multiple ways in which a correct answer can be expressed. The scoring is then a matter of classifying a response, however expressed, into a score level. In the simplest case, the correct answer requires reference to a single concept, although in practice a response may require more than one concept. Full credit is given if all the concepts are present in the response, although partial credit is also possible if only some of the concepts are offered.
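
A minimal sketch of such concept-based scoring, with invented concepts, phrase lists, and responses, might look like this:

```python
# Minimal sketch of concept-based scoring of short answers, as described
# above: one point per key concept detected, with partial credit when only
# some concepts appear. The concepts, phrase lists, and responses are
# invented for illustration.
KEY_CONCEPTS = {
    "evaporation": ["evaporat", "turns into vapor", "vaporiz"],
    "heat source": ["heat", "sunlight", "warm"],
}

def score_response(response: str, max_score: int = 2) -> int:
    text = response.lower()
    concepts_found = sum(
        any(phrase in text for phrase in phrases)
        for phrases in KEY_CONCEPTS.values()
    )
    return min(concepts_found, max_score)

print(score_response("The sun's heat makes the water evaporate."))  # full credit: 2
print(score_response("The water slowly evaporates."))               # partial credit: 1
print(score_response("The water just disappears."))                 # no credit: 0
```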

Whereas the score humans would assign to an essay can be predicted from linguistic features that act as correlates of writing quality, in the case of short responses there are fewer correlates on which to base a prediction of a score. In a sense, the scoring of short responses requires an actual understanding of the content of the response so that it can then be classified into a score level. The earliest report on short-answer scoring at ETS (Kaplan 1992) was an attempt to infer a “grammar” from a set of correct and incorrect responses that could be used to classify future responses. The approach was subsequently applied to scoring a computer-delivered version of a task requiring the generation of hypotheses (Kaplan and Bennett 1994). A more refined approach to short-answer scoring, relying on a more robust linguistic representation of responses, was proposed by Burstein et al. (1999), although it was not applied further.

As the complexities of scoring short answers became better understood, the sophistication of the scoring approaches grew as well. The next step in this evolution was the c-rater ™ automated scoring engine (Leacock and Chodorow 2003).Footnote 36 The system was motivated by a need to lower the scoring load of teachers. Unlike earlier efforts, c-rater requires a model of the correct answer, such that scoring a response is a matter of deciding whether it matches the model response. Developing such a model is not a simple task given the many equivalent ways of expressing the same idea. One of the innovations introduced by c-rater was an interface for modeling the ideal response. In effect, a model response is defined by a set of possible paraphrases of the correct answer that are then represented in canonical or standard form. Evaluating whether a given response is in that set requires linguistic processing to deal with spelling and other issues so that the student response can be recast into the same canonical form as the model. The actual scoring is a matter of matching the student response against the model, guided by a set of linguistic rules. Because student responses can contain many spelling and grammatical errors, the matching process is “fairly forgiving” (Leacock and Chodorow 2003, p. 396). The c-rater engine was evaluated in studies for NAEP (Sandene et al. 2005) and has been found useful in other studies for providing feedback to students (Attali 2010; Attali and Powers 2008, 2010). The most recent evaluation of the c-rater approach (Liu et al. 2014) took advantage of refinements introduced by Sukkarieh and Bolge (2008). O. L. Liu et al. (2014) concluded that c-rater cannot replace human scores, although it has shown promise for use in low-stakes settings.
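
The matching idea can be sketched as follows; the normalization rules, synonym table, and model answer are invented and are far simpler than c-rater's actual linguistic processing.

```python
# Hedged sketch of the matching idea behind model-based short-answer scoring:
# both the model answer and the student response are reduced to a canonical
# form before comparison. The normalization rules and answers are invented.
import re

SYNONYMS = {"h2o": "water", "liquid": "water", "boils": "boil", "boiling": "boil"}
STOPWORDS = {"the", "a", "an", "will", "to", "is", "it"}

def canonical(text: str) -> frozenset:
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    mapped = [SYNONYMS.get(t, t) for t in tokens if t not in STOPWORDS]
    return frozenset(mapped)

MODEL_ANSWER = canonical("The water will boil at 100 degrees")

def matches_model(response: str, required_overlap: float = 0.75) -> bool:
    overlap = len(canonical(response) & MODEL_ANSWER) / len(MODEL_ANSWER)
    return overlap >= required_overlap   # a "fairly forgiving" threshold

print(matches_model("H2O boils at 100 degrees"))    # True
print(matches_model("It freezes at 100 degrees"))   # False
```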

One limitation of c-rater is scalability. A scoring model needs to be developed for each question, a rather laborious process. A further limitation is that it is oriented to scoring responses that are verbal. However, short answers potentially contain numbers, equations, and even drawings.Footnote 37

More recent approaches to short-answer scoring have been developed, including one referred to as Henry ML. Whereas c-rater attempts to understand the response by identifying the presence of concepts, these newer approaches evaluate lower level aspects of the response, including “sparse features” such as word and character n-grams, as well as “dense features” that compare the semantic similarity of a response to responses with agreed-upon scores (Liu et al. 2016; Sakaguchi et al. 2015).
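
A rough sketch of the two feature families, using invented responses, is shown below; a real system would feed such features to a trained classifier rather than simply print them.

```python
# Rough sketch of the two feature families mentioned above: "sparse" features
# here are character n-gram counts, and the "dense" feature is a similarity
# between a new response and previously scored responses. All responses and
# scores are invented; this is not the Henry ML system.
from collections import Counter
import math

def char_ngrams(text: str, n: int = 3) -> Counter:
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    shared = set(a) & set(b)
    dot = sum(a[g] * b[g] for g in shared)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

scored_exemplars = [
    ("plants use sunlight to make food", 2),
    ("the plant grows in the dirt", 0),
]

new_response = "plants turn sunlight into food"
sparse = char_ngrams(new_response)             # would feed a trained classifier
dense = [(cosine(sparse, char_ngrams(text)), score)
         for text, score in scored_exemplars]  # similarity to scored responses

print(sorted(dense, reverse=True))
```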

The foregoing advances were followed by progress in the scoring of spoken responses. An automated approach had been developed during the 1990s by the Ordinate Corporation based on “low-entropy” tasks, such as reading a text aloud (Bernstein et al. 2000). The approach was, however, at odds with the communicative competence perspective that was by then driving the thinking of TOEFL developers. ETS therefore experimented with automated scoring of high-entropy spoken responses (Zechner, Bejar, and Hemat 2007). That is, instead of reading a text aloud, the tasks called for responses that were relatively extemporaneous and therefore more in line with a communicative perspective. The initial experimentation led rather quickly to an approach that could provide more comprehensive coverage of the speaking construct (Zechner et al. 2007b, 2009a). The current system, known as the SpeechRater SM service, is used to score the TOEFL Practice Online (TPO™) test, which is modeled after the speaking component of the TOEFL. Efforts continue to further expand the construct coverage of the scoring engine by integrating additional aspects of speaking proficiency, such as content accuracy and discourse coherence (Evanini et al. 2013; Wang et al. 2013; Yoon et al. 2012). Additionally, the scope of applicability has been expanded beyond English as a second language (ESL) to include the assessment of oral reading proficiency for younger students by means of low-entropy tasks (Zechner et al. 2009b, 2012). Importantly, the same underlying engine is used in this latter case, which bodes well for its potential to support multiple types of assessments.

3.3 Construct Theory and Task Design

Technology was as important to the adoption of constructed -response formats as it was for the multiple-choice format, where the scanner made it possible to score large volumes of answer sheets. However, much more was needed in the case of constructed -response formats besides technology. Invariably, progress was preceded or accompanied by work on construct definition.

3.3.1 Writing

The publication that may have been responsible for the acceptance of holistic scoring (Godshalk et al. 1966) was, in fact, an attempt to empirically define the writing construct. Over the years, many other efforts followed, with various emphases (Breland 1983; Breland and Hart 1994; Breland et al. 1984, 1987) . Surveys of graduate faculty identified written argumentation, both constructing and critiquing arguments, as an important skill for success in graduate school (Enright and Gitomer 1989) . Summaries of research through 1999 (Breland et al. 1999) show convergence on various issues, especially the importance of defining the construct, and then designing the test accordingly to cover the intended construct, while simultaneously avoiding construct-irrelevant variance. In the case of the GMAT and GRE ,Footnote 38 a design consisting of two prompts, creating and evaluating arguments, emerged after several rounds of research (Powers et al. 1999a). The design remains in GMAT and GRE.

Writing was partially incorporated into the TOEFL during the 1980s in the form of the TWE . It was a single-prompt “test.” A history of the test is provided by Stansfield (1986a). With plans to include writing in the revised TOEFL, more systematic research among English language learners began to emerge, informed by appropriate theory (Hamp-Lyons and Kroll 1997) . Whereas the distinction between issue and argument is thought to be appropriate for GRE and GMAT , in the case of TOEFL the broader construct of communicative competence has become the foundation for the test. With respect to writing, a distinction is made between an independent and an integrated prompt. The latter requires the test takers to refer to a document they read as part of the prompt. (See TOEFL 2011, for a brief history of the TOEFL program.)

Understandably, much of the construct work on writing has emphasized the postsecondary admissions context. However, in recent years, K-12 education reform efforts have increasingly incorporated test-based accountability approaches (Koretz and Hamilton 2006). As a result, there has been much reflection about the nature of school-based testing. The research initiative known as CBAL (Cognitively Based Assessment of, for, and as Learning ) serves as an umbrella for experimentation on next-generation K–12 assessments. Under this umbrella, the writing construct has expanded to acknowledge the importance of other skills, specifically reading and critical thinking, and the developmental trajectories that underlie proficiency (Deane 2012; Deane and Quinlan 2010; Deane et al. 2008, 2012). In addition to expanding the breadth of the writing construct, recent work has also emphasized depth by detailing the nature of the evidence to be sought in student writing, especially argumentative writing (Song et al. 2014) . Concomitant advances that would enable automated scoring for rich writing tasks have also been put forth (Deane 2013b).

3.3.2 Speaking

The assessment of speaking skills has traditionally taken place within an ESL context. The TSE (Clark and Swinton 1980) was the first major test of English speaking proficiency developed at ETS. Nevertheless, Powers (1984) noted that among the challenges facing the development of speaking measures were construct definition and cost. With respect to construct definition, a major conference was held at ETS in the 1980s (Stansfield 1986b) to discuss the relevance of communicative competence for the TOEFL. Envisioning TOEFL from that perspective was a likely outcome of the conference (Duran et al. 1987). Evidence of the acceptance of the communicative competence construct can be seen in its use to validate TSE scores (Powers et al. 1999b) and in the framework for incorporating a speaking component in a revised TOEFL (Butler et al. 2000). The first step in the development of an operational computer-based speaking test was the TOEFL Academic Speaking Test (TAST), a computer-based test intended to familiarize TOEFL test takers with the new format. TAST was introduced in 2002 and served to refine the eventual speaking measure included in TOEFL iBT. Automated scoring of speaking, as discussed above, could help to reduce costs but is not yet sufficiently well developed (Bridgeman et al. 2012). The TOEIC ® Speaking and Writing test followed the TOEFL (Pearlman 2008b) in using ECD for assessment design (Hines 2010), as well as in the inclusion of speaking (Powers 2010; Powers et al. 2009).

3.3.3 Mathematics

Constructed-response items have been standard in the AP program since inception and were already used in NAEP by 1990 (Braswell and Kupin 1993). The SAT relied on multiple-choice items for much of its history (Lawrence et al. 2002) but also introduced in the 1990s a simple constructed-response format, the grid-in item, which allowed students to enter numeric responses. Because of the relative simplicity of numeric responses, they could be recorded on a scannable answer sheet and therefore scored along with the multiple-choice responses. Various construct-related considerations motivated the introduction of the grid-in format, among them the influence of the standards produced by the National Council of Teachers of Mathematics (Braswell 1992), as well as considerations about the response process. For example, Bridgeman (1992) argued that in a mathematics context, the multiple-choice format could provide the student inadvertent hints and also make it possible to arrive at the right answers by reasoning backward from the options. He evaluated the SAT grid-in format with GRE items and concluded that the multiple-choice and grid-in versions of GRE items behaved very similarly. Following the adoption of the grid-in format in the SAT, a more comprehensive examination of mathematics item formats that could serve to elicit quantitative skills was undertaken, informed by advances in the understanding of mathematical cognition and a maturing computer-based infrastructure (Bennett and Sebrechts 1997; Bennett et al. 1997, 1999, 2000a, b; Sandene et al. 2005; Sebrechts et al. 1996). More recently, the mathematics strand of the CBAL initiative has attempted to unpack mathematical proficiency by means of competency models, the corresponding constructed-response tasks (Graf 2009), and scoring approaches (Fife 2013).
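
Part of what made the grid-in format practical is that numeric responses are easy to machine score; the hypothetical checker below, with an invented answer key and tolerance rule, illustrates the idea but is not the actual SAT scoring rule.

```python
# Hypothetical sketch of why numeric grid-in responses are easy to machine
# score: equivalent forms of the same quantity (fraction, decimal) can be
# reduced to a number and compared. The key, entries, and tolerance rule are
# invented and do not reproduce the actual SAT scoring rule.
from fractions import Fraction

def parse_gridded(entry: str) -> float:
    entry = entry.strip()
    if "/" in entry:
        return float(Fraction(entry))
    return float(entry)

def is_correct(entry: str, key: float, tol: float = 1e-4) -> bool:
    try:
        return abs(parse_gridded(entry) - key) <= tol
    except (ValueError, ZeroDivisionError):
        return False   # unparsable or malformed entries earn no credit

key = 0.75
for entry in ["3/4", ".75", "0.75", "0.749", "three fourths"]:
    print(entry, is_correct(entry, key))
```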

3.3.4 History

A design innovation introduced by the AP history examinations was the document-based question (DBQ). Such questions require the test taker to incorporate, in a written response, information from one or more historical documents.Footnote 39 The idea for the format apparently originated with a committee member who, while visiting libraries in England, had seen portfolios of primary historical documents. The DBQ was first used with the U.S. History examination; the European History examination adopted the format the following year, as did World History when it was introduced in 2002. The scoring of document-based responses proved to be a challenge initially, but because the task's rationale was so closely linked to the construct, it has remained.

3.3.5 Interpersonal Competence

Interpersonal competence has been identified as a twenty-first-century educational skill (Koenig 2011) as well as a workforce skill (Lievens and Sackett 2012) . The skill was assessed early on at ETS by Larry Stricker (Stricker 1982; Stricker and Rock 1990) in a constructed -response format by means of videotaped stimuli, a relatively recent invention at the time. The recognition of the affordances of technology appears to have been the motivation for the work (Stricker 1982): “The advent of videotape technology raises new possibilities for assessing interpersonal competence because videotape provides a means of portraying social situations in a comprehensive, standardized, and economical manner” (p. 69).

3.3.6 Professional Assessments

Historically, ETS tests have been concerned with aiding the transition to the next educational level and, to a lesser extent, with certifying professional knowledge. Perhaps the earliest instance of this latter line of work is the “in-basket test” developed by Frederiksen et al. (1957). Essentially, the in-basket format is used to simulate an office environment in which the test taker plays the role of, for example, a school principal or business executive. The format was used in an extended study concerned with measuring the administrative skills of school principals in a simulated school (Hemphill et al. 1962). Apart from the innovative constructed-response format, the assessment was developed following what, in retrospect, was a very sophisticated assessment design approach. First, a job analysis was conducted to identify the skills required of an elementary school principal. In addition, the types of problems an elementary school principal confronts were identified and reduced to a series of incidents. Crossing those typical problems with the skills assumed, on the basis of the best research at the time, to be required of a principal produced a universe of potential items (Hemphill et al. 1962, p. 47). The three skills were assumed to be (a) technical, (b) human, and (c) conceptual. The four facets of the job were taken to be (a) improving educational opportunity, (b) obtaining and developing personnel, (c) maintaining effective interrelationships with the community, and (d) providing and maintaining funds and facilities. Crossing the facets with the skills led to a 4 × 3 matrix, and items were then written for each cell.
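
The resulting design matrix can be enumerated in a few lines; the sketch below simply crosses the facets and skills listed above to show the twelve cells for which items were written.

```python
# Small sketch of the design matrix described above: crossing the three
# assumed skills with the four job facets yields the 4 x 3 grid of cells for
# which in-basket items were written. (Cell labels only; no items are shown.)
from itertools import product

skills = ["technical", "human", "conceptual"]
facets = [
    "improving educational opportunity",
    "obtaining and developing personnel",
    "maintaining effective interrelationships with the community",
    "providing and maintaining funds and facilities",
]

for facet, skill in product(facets, skills):
    print(f"{facet} x {skill}")

print(len(facets) * len(skills), "cells in total")
```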

While the research on the assessment of school principals was highly innovative, ETS also supported the assessment of school personnel with more traditional measures. The first such assessment was bequeathed to the organization when the American Council on Education transferred the National Teacher Examination (NTE ) in 1948 to the newly founded ETS. However, in the early 1980s, under President Greg AnrigFootnote 40 a major rethinking of teacher testing took place and culminated in the launching, in 1993, of the PRAXIS SERIES ® tests. The PRAXIS I ® and PRAXIS II ® tests were concerned with content and pedagogical knowledge measured by multiple-choice items, as well as some types of constructed -response tasks. However, the PRAXIS III ® tests were concerned with classroom performance and involved observing teachers in situ, a rather sharp departure from traditional measurement approaches. Although classroom observation has long been used in education, PRAXIS III appears to be among the first attempts to use observations-as-measurement in a classroom context. The knowledge base for the assessment was developed over several years (Dwyer 1994) and included scoring rubrics and examples of the behavior that would be evidence of the different skills required of teachers. The PRAXIS III work led to the Danielson Framework for Teaching,Footnote 41 which has served as the foundation for school-leader evaluations of teachers in many school districts, as well as for video-based products concerned with evaluation,Footnote 42 including those of the MET project (Bill and Melinda Gates Foundation 2013).

Whereas PRAXIS III was oriented toward assessing beginning teachers, ETS was also involved with the assessment of master teachers as part of a joint project with the National Board of Professional Teaching Standards (NBPTS ). The goal of the assessment was to certify the expertise of highly accomplished practitioners. Pearlman (2008a, p. 88) described the rich history as “a remarkable journey of design, development, and response to empirical evidence from practice and use,” including the scoring of complex artifacts. Gitomer (2007) reviewed research in support of NBPTS .

COPA was devoted to developing assessments for licensing and certification in fields outside education. In addition to the architects' examination mentioned earlier, COPA also considered the licensing of dental hygienists (Cameron et al. 2000; Mislevy et al. 1999, 2002b), which was one of the earliest applications of the ECD framework, discussed next.

3.3.7 Advances in Assessment Design Theory

For most of the twentieth century, there did not exist a comprehensive assessment design framework that could be used to help manage the complexity of developing assessments that go beyond the multiple-choice format. Perhaps this was not a problem because such assessments were relatively few and any initial design flaws could be remedied over time. However, several factors motivated the use of more ambitious designs, including the rapid technological innovations introduced during the second half of the twentieth century, concerns about the levels of achievement and competitiveness of U.S. students, the continued interest in forms of assessment beyond the multiple-choice item, and educational reform movements that have emphasized test-based accountability . A systematic approach to the design of complex assessments was needed, including ones involving the use of complex constructed responses.

ECD is rooted in validity theory. Its genesis (Mislevy et al. 2006) is in the following quote from Messick (1994) concerning assessment design, which, he argued,

would begin by asking what complex of knowledge, skills, and other attributes should be assessed, presumably because they are tied to explicit or implicit objectives of instruction or are otherwise valued by society. Next, what behaviors or performances should reveal those constructs, and what task or situations should elicit those behaviors? Thus, the nature of the construct guides the selection or construction of relevant tasks as well as the rational development of construct-based scoring criteria and rubrics. (p. 17, emphasis added)

ECD fleshes out this quote into a comprehensive framework consisting of interlocking models. The student model describes the test taker, whereas the evidence model specifies the nature and analysis of the responses; the evidence model passes its information to the student model to update the characterization of what the examinee knows and can do. Finally, the task model describes the items. Thus, if the goal is to characterize students’ communicative competence, an analysis of the construct is likely to identify writing and speaking skills as components, which means the student model should include characterizations of those skills. With that information in hand, the details of the evidence model can be fleshed out: What sort of student writing and speaking performance or behavior constitutes evidence of those skills? The answer to that question, in turn, informs the task models: What sorts of tasks are required to elicit the necessary evidence? ECD is especially useful in the design of assessments that call for constructed responses because it requires the behavior that constitutes relevant evidence of, say, writing and speaking skills to be spelled out, and then prescribes the task attributes that would elicit that behavior. The evidence model, apart from informing the design of the tasks, is also the basis for scoring the responses (Mislevy et al. 2006).
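
One way the interlocking models might be rendered as simple data structures is sketched below; the class names, fields, and numbers are invented for illustration and do not reproduce any ETS implementation.

```python
# Hedged sketch of how the interlocking ECD models described above might be
# rendered as data structures. The class and field names are invented.
from dataclasses import dataclass, field

@dataclass
class StudentModel:
    # Claims about the test taker, e.g., current estimates of writing and
    # speaking proficiency on some reporting scale.
    proficiencies: dict = field(default_factory=lambda: {"writing": 0.0, "speaking": 0.0})

@dataclass
class EvidenceModel:
    # Which observable behavior counts as evidence, and for which proficiency.
    observable: str          # e.g., "organizes an argumentative essay"
    updates: str             # proficiency the observation bears on
    weight: float            # how strongly the observation shifts the estimate

@dataclass
class TaskModel:
    # Features of tasks designed to elicit the observables named above.
    stimulus: str            # e.g., "issue prompt" or "reading plus lecture"
    response_format: str     # e.g., "typed essay, 30 minutes"
    elicits: list            # observables the task is meant to produce

def update(student: StudentModel, evidence: EvidenceModel, observed_quality: float) -> None:
    """Pass scored evidence to the student model (here, a simple weighted shift)."""
    student.proficiencies[evidence.updates] += evidence.weight * observed_quality

student = StudentModel()
organizing = EvidenceModel("organizes an argumentative essay", "writing", 0.4)
essay_task = TaskModel("issue prompt", "typed essay, 30 minutes", [organizing.observable])

update(student, organizing, observed_quality=0.8)
print(essay_task.elicits, student.proficiencies)
```

In an operational system the update step would of course be a statistical model, such as an IRT or Bayes-net update, rather than a simple weighted shift.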

ECD did not quickly become institutionalized at ETS, as Zieky (2014) noted. Nevertheless, over time the approach has become widely used. Its applications include science (Riconscente et al. 2005), language (Mislevy and Yin 2012), professional measurement (Mislevy et al. 1999), technical skills (Rupp et al. 2012), automated scoring (Williamson et al. 2006), accessibility (Hansen and Mislevy 2008; T. Zhang et al. 2010), and task design and generation (Huff et al. 2012; Mislevy et al. 2002a). It has also been used, to different degrees, in the latest revisions of several ETS tests, such as TOEFL (Pearlman 2008b), in revisions of the College Board’s AP tests (Huff and Plake 2010), and by the assessment community more generally (Schmeiser and Welch 2006, p. 313). Importantly, ECD is a broad design methodology that is not limited to items as the means of eliciting evidence. Games and simulations are being used with increasing frequency in educational contexts, and ECD is equally applicable there (Mislevy 2013; Mislevy et al. 2014, 2016).

3.4 Conclusion

It is clear that at ETS the transition to computer-based test delivery began early on and was sustained. The transition had an impact on constructed responses by enabling their use earlier than might have otherwise been the case. As Table 18.1 shows, online essay scoring and automated essay scoring were part of the transition for three major admissions tests: GMAT , GRE and TOEFL .

Table 18.1 Writing assessment milestones for GMAT , GRE and TOEFL tests

4 School-Based Testing

Although postsecondary admissions tests have been the main form of operational testing at ETS, school-based testing has been and continues to be an important focus. The Sequential Tests of Educational Progress (STEP), an early ETS product in this domain, at one point included a writing test that consisted of multiple-choice questions and an essay,Footnote 43 although the program is no longer extant. By contrast, ETS involvement in two major twentieth-century school-based assessments, the Advanced Placement Program ® examinations and the NAEP assessments, as well as in state assessments, has grown. Constructed-response formats have played a major role, especially in AP and NAEP. In addition, the CBAL initiative has been prominent in recent years. These efforts are discussed further in this section.

4.1 Advanced Placement

While the use of constructed responses encountered resistance at ETS in the context of admissions testing, the same was not true for the AP program, introduced in the mid-1950s. From the start, the AP program was oriented to academically advanced students who would be going to college; its specific purpose was to grant college credit or advanced placement on the basis of an examination. The seeds for the program were two reports (Lacy 2010), one commissioned by Harvard president James Bryant Conant (Committee on the Objectives of a General Education in a Free Society 1945), the other (General Education in School and College 1952) also produced at Harvard. These reports led to a trial of the idea in an experiment known as the Kenyon Plan.Footnote 44

The eventual acquisition of the program by the College Board was not a given. Valentine (1987) noted that “Bowles [College Board president at the time] was not sure that taking the program was in the Board’s interest” (p. 85). Initially, some of the AP examinations were entirely based on constructed responses, although eventually all, with the exception of Studio Art, included a mix of constructed -response and multiple-choice items. A program publication, An Informal History of the AP Readings 1956–1976 (Advanced Placement Program of the College Board 1980), provides a description of the scoring process early in the program’s history.

Interestingly, in light of the ascendancy of the multiple-choice format during the twentieth century, the use of constructed responses in AP does not appear to have been questioned. Henry Dyer, a former ETS vice president (1954–1972), seems to have been influential in determining the specifications of the test (Advanced Placement Program of the College Board 1980, p. 2). Whereas Dyer did not seem to have been opposed to the use of constructed responses in the AP program, he was far more skeptical of their value in the context of another examination being conceived at about the same time, the Test of Developed Ability.Footnote 45 In discussing the creation of that test, Dyer (1954) noted that

there may be one or two important abilities which are measureable only through some type of free response question. If an examining committee regards such abilities as absolutely vital in its area, it should attempt to work out one or two free response questions to measure them. Later on, we shall use the data from the tryouts to determine whether the multiple-choice sections of the test do not in fact measure approximately the same abilities as the free response sections. If they do, the free response section will be dropped, if not, they will be retained. (p. 7)

Thus, there was a realization that the AP program was unique relative to other tests, in part because of its use of constructed responses. In An Informal History of the AP Readings 1956–76 (Advanced Placement Program of the College Board 1980), it was noted that

neither the setting nor the writing of essay examination was an innovation. The ancient Chinese reputedly required stringent written examinations for high government offices 2,500 years ago. European students have long faced pass-or-perish examinations at the end of their courses in the Lycée, Gymnasium, or British Secondary system. In this country, from 1901 to 1925, the College Board Comprehensives helped to determine who would go to the best colleges. But the Advanced Placement Program was new, and in many ways unique. (p. 2)

As the College Board’s developer and administrator for the AP program, ETS has conducted much research to support it. The contributions focused on fairness (e.g., Breland et al. 1994; Bridgeman et al. 1997; Dorans et al. 2003; Stricker and Ward 2004), scoring (e.g., Braun 1988; Burstein et al. 1997; Coffman and Kurfman 1968; Myford and Mislevy 1995; Zhang et al. 2003), psychometrics (e.g., Bridgeman et al. 1996a, b; Coffman and Kurfman 1966; Lukhele et al. 1994; Moses et al. 2007), and validity and construct considerations (e.g., Bennett et al. 1991; Bridgeman 1989; Bridgeman and Lewis 1994).

4.2 Educational Surveys

As noted earlier, NAEP has been a locus of constructed -response innovation at ETS. NAEP was managed by the Education Commission of the States until 1983 when ETS was awarded the contract to operate it. With the arrival of NAEP , ETS instituted matrix sampling, along with IRT (Messick et al. 1983); both had been under development at ETS under Fred Lord,Footnote 46 and both served to undergird a new approach to providing the “Nation’s Report Card” in several subjects, with extensive use of constructed -response formats. To NAEP ’s credit, explicating the domain of knowledge to be assessed by means of “frameworks ” had been part of the assessment development process from inception. Applebee (2007) traced the writing framework back to 1969. Even before that date, however, formal frameworks providing the rationale for content development were well documented (Finley and Berdie 1970) . The science framework refers early on to “inquiry skills necessary to solve problems in science, specifically the ability to recognize scientific hypotheses” (p. 14). The assessment of inquiry skills has since become standard in science assessment but was only recently implemented operationally with the redesigned AP exams.

NAEP has been the source of multiple content and psychometric innovations (Mazzeo et al. 2006), including the introduction of mixed-format assessments consisting of both multiple-choice items and, on a large scale, constructed-response items. Practical polytomous IRT was developed in a NAEP context, as documented by Carlson and von Davier (Chap. 5, this volume), and, as described earlier, NAEP introduced innovations concerned with the scoring of written responses. Finally, ETS continues to collaborate with NAEP in the exploration of technological advances to testing (Bennett et al. 2010).Footnote 47 The transition to digital delivery is underway as of this writing. In fact, the 2017 writing assessment was administered on tablets supplied by NAEP, and research into the use of mixed-format adaptive testing in mathematics has also been carried out (Oranje et al. 2014).

4.3 Accountability Testing

The start of K–12 testing in the United States dates back to the nineteenth century, when Horace Mann, an educational visionary, introduced several innovations into school testing, among them the use of standardized (written constructed-response) tests (U.S. Congress and Office of Technology Assessment 1992, chapter 4). The innovations Mann introduced were motivated, in part, by a perception that schools were not performing as well as could be expected. Such perceptions have endured and have continued to fuel the debate about the appropriate use of tests in K–12. More recently, the report A Nation at Risk (National Commission on Excellence in Education 1983) warned that “the educational foundations of our society are presently being eroded by a rising tide of mediocrity that threatens our very future as a Nation and a people” (para. 1). Similarly, the linking of the state of education to the nation’s economic survivalFootnote 48 was behind one effort in the early 1990s (U.S. Department of Labor and Secretary’s Commission on Achieving Necessary Skills 1991; known as SCANS), and it had significant implications for the future of testing. As Linn (1996) noted, the system of assessment expected to emerge from the SCANS effort and to be linked to instruction “would require direct appraisals of student performance” (p. 252) and would serve to promote the measured skills.

The calls for direct assessment that promotes learning joined the earlier assault by Frederiksen (1984) on the multiple-choice item type, an assault that had been heard loudly and clearly, judging by the number of citations to that article.Footnote 49 The idea of authentic assessment (Wiggins 1989) as an alternative to the standardized multiple-choice test took hold among many educators, and several states launched major performance-based assessments. ETS participated in exploring these alternatives, especially portfolio assessment (Camp 1985, 1993), including their evaluation in at least one state, California (Thomas et al. 1998).

Stecher (2010) provided a detailed review of the different state experiments in the early 1990s in Vermont, Kentucky, Maryland, Washington, California, and Connecticut. A summary of a conference (National Research Council 2010, p. 36) noted several factors that led to the demise of these innovative programs:

  • Hurried implementation made it difficult to address scoring, reliability , and other issues.

  • The scientific foundation required by these innovative assessments was lacking.

  • The cost and burden to the schools were great, and questions were raised as to whether they were worth it.

  • There were significant political considerations, including cost, time, feasibility of implementation, and conflicts in purpose among constituencies.

Not surprisingly, following this period of innovation, there was a return to the multiple-choice format. Under the No Child Left Behind (NCLB) legislation,Footnote 50 the extent of federally mandated testing increased dramatically, and once again concerns were raised about the negative consequences of the predominant use of multiple-choice formats. In response, a research initiative known as CBAL was launched at ETS (Bennett and Gitomer 2009).Footnote 51 Referring to the circumstances surrounding accountability testing under NCLB, Bennett and Gitomer noted,

In the United States, the problem is … an accountability assessment system with at least two salient characteristics. The first characteristic is that there are now significant consequences for students, teachers, school administrators, and policy makers. The second characteristic is, paradoxically, very limited educational value. This limited value stems from the fact that our accountability assessments typically reflect a shallow view of proficiency defined in terms of the skills needed to succeed on relatively short and, too often, quite artificial test items (i.e., with little direct connection to real-world contexts). (p. 45)

The challenges that needed to be overcome to develop tests based on a deeper view of student achievement were significant. Among them was the fact that more meaningful tests would require constructed-response formats to a larger degree, which in turn required a means of handling the trade-off between reliability and testing time. As Linn and Burton (1994), and many others, have reminded us regarding constructed-response tests, “a substantial number of tasks will still be needed to have any reasonable level of confidence in making a decision that an individual student has or has not met the standard” (p. 10); a simple calculation following this paragraph illustrates the point. Such a test could not reasonably be administered in a single sitting. A system was needed in which tests would be administered on more than one occasion. A multi-occasion testing system raises methodological problems of its own, as was illustrated by the California CLAS assessment (Cronbach et al. 1995). Apart from methodological constraints, the increase in testing time could be resented by schools and students unless the tests departed from the traditional mold and actually promoted, not just probed, learning. This meant that the new assessment needed to be an integral part of the educational process. To help achieve that goal, a theory of action was formulated (Bennett 2010) to link the attributes of the envisioned assessment system to a set of hypothesized action mechanisms leading to improved student learning. (Of course, a theory of action is a theory, and whether it holds is an empirical question.)
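The calculation alluded to above is a minimal sketch based on the classical Spearman–Brown relation; the single-task reliability used below is an assumed value for illustration only and is not taken from the cited sources. For a test of k parallel tasks, each with reliability ρ₁, the reliability of the total score is

\[
\rho_k \;=\; \frac{k\,\rho_1}{1 + (k-1)\,\rho_1}.
\]

If a single extended constructed-response task yields a reliability of roughly ρ₁ = .40, then reaching ρ_k = .90 requires solving .90 = .40k / (1 + .40(k − 1)), which gives k ≈ 13.5, or about 14 tasks, far more than can reasonably fit into a single sitting.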

Even with a vision of an assessment system, and a rationale for how such a vision would lead to improved student learning, considerable effort is required to explicate the system and to leverage technology to make such assessments scalable and affordable. The process entailed the formulation of competency models for specific domains, including reading (Sheehan and O’Reilly 2011), writing (Deane et al. 2012), mathematics (Graf 2009), and science (Liu et al. 2013); the elaboration of constructs, especially writing (Song et al. 2014) ; and innovations in automated scoring (Deane 2013a, b; Fife 2013) and task design (Bennett 2011; Sheehan and O’Reilly 2011).

The timing of the CBAL system coincided roughly with the start of a new administration in Washington that had educational plans of its own, ultimately cast as the Race to the Top initiative .Footnote 52 The assessments developed under one portion of the Race to the Top initiative illustrate a trend toward the use of significant numbers of items requiring constructed responses. In addition, technology is being used more extensively, including adaptive testing by the Smarter Balanced Assessment Consortium, and automated scoring by some of its member states.

4.4 Conclusion

Admissions testing has been the primary business at ETS for most of its existence. Constructed-response formats were resisted for a long time in that context, although in the end they were incorporated. By contrast, such resistance was not encountered in some school-oriented assessments; constructed-response formats were used from the start in the AP program as well as in NAEP. The CBAL initiative has continued and significantly expanded that tradition by conceiving of instructionally rich computer-based tasks grounded in scientific knowledge about student learning.

5 Validity and Psychometric Research Related to Constructed-Response Formats

The foregoing efforts occurred in the context of a vigorous validity and psychometric research program over several decades in support of constructed -response formats. It is beyond the scope of this chapter to review the literature resulting from that effort. However, the scope of the research is noteworthy and is briefly and selectively summarized below.

5.1 Construct Equivalence

The choice between the multiple-choice and constructed-response formats, or a mix of the two, is an important design question that is informed by whether the two formats function in similar ways. The topic has been approached both conceptually and empirically (Bennett et al. 1990, 1991; Bridgeman 1992; Enright et al. 1998; Katz et al. 2000; Messick 1993; Wainer and Thissen 1993; Ward 1982; Ward et al. 1980).

5.2 Predictive Validity of Human and Computer Scoring

The predictive validity of tests based on constructed responses scored by humans and computers has not been studied extensively; a study by Powers et al. (2002) appears to be one of the few on the subject. More recently, Bridgeman (2016) documented the substantial predictive power of the GRE and TOEFL writing assessments.

5.3 Equivalence Across Populations and Differential Item Functioning

The potential incomparability of the evidence elicited by different test formats has fairness implications and not surprisingly has received much attention (e.g., Breland et al. 1994; Bridgeman and Rock 1993; Dorans 2004; Dorans and Schmitt 1993; Schmitt et al. 1993; Zwick et al. 1993, 1997). The challenges of differential item functioning across language groups have also been addressed (Xi 2010). Similarly, the role of different response formats when predicting external criterion measures has been investigated (Bridgeman and Lewis 1994) , as have the broader implications of format for the admissions process (Bridgeman and McHale 1996) .

5.4 Equating and Comparability

The use of constructed-response formats presents many operational challenges. For example, the need to ensure the comparability of scores from different forms applies as much to tests comprising constructed-response items as it does to multiple-choice tests. The primary approach to ensuring score comparability is equating (Dorans et al. 2007), a methodology that had been developed for multiple-choice tests. As the use of constructed-response formats has grown, there has been an increase in research on equating tests composed entirely, or partly, of constructed responses (Kim and Lee 2006; Kim and Walker 2012; Kim et al. 2010). Approaches to achieving comparability without equating, which rely instead on designing tasks to be comparable, have also been studied (Bejar 2002; Bridgeman et al. 2011; Golub-Smith et al. 1993).
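As a minimal sketch of what equating itself involves in the simplest (linear, single-group) case, a score x on form X is placed onto the scale of form Y by matching means and standard deviations; the cited research addresses far more complex designs, particularly when constructed responses are scored by different raters over time:

\[
l_Y(x) \;=\; \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X) + \mu_Y .
\]

Equipercentile and IRT-based methods generalize this idea, but all equating methods presuppose score distributions that can be estimated with adequate precision, which is harder to guarantee when a form contains only a small number of human-scored constructed-response items.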

5.5 Medium Effects

Under computer delivery, task presentation and the recording of responses are very different for multiple-choice and constructed-response items. These differences could introduce construct-irrelevant variance due to the testing medium. The investigation of that question has received significant attention (Gallagher et al. 2002; Horkay et al. 2006; Mazzeo and Harvey 1988; Powers et al. 1994; Puhan et al. 2007; Wolfe et al. 1993).

5.6 Choice

Students’ backgrounds can influence their interest in, and familiarity with, the topics presented in some types of constructed-response items, which can lead to an unfair assessment. The problem can be compounded by the fact that relatively few constructed-response questions can typically be included in a test, since responding to them is more time consuming. A potential solution is to let students choose from a set of possible questions rather than assigning the same questions to everyone. The effects of choice have been investigated primarily in writing (Allen et al. 2005; Bridgeman et al. 1997; Lukhele et al. 1994) but also in other domains (Powers and Bennett 1999).

5.7 Difficulty Modeling

The difficulty of constructed -response items and the basis for, and control of, variability in difficulty have been studied in multiple domains, including mathematics (Katz et al. 2000), architecture (Bejar 2002), and writing (Bridgeman et al. 2011; Joe et al. 2012).

5.8 Diagnostic and Formative Assessment

Diagnostic assessment is a broad topic that has much in common with formative assessment: In both cases, the information provided is expected to lead to actions that enhance student learning. ETS contributions in this area have included the development of psychometric models to support diagnostic measurement based on constructed responses. Two such developments attempt to provide a psychometric foundation for diagnostic assessments. Although neither effort is explicitly concerned with constructed responses, both support such assessments by accommodating polytomous responses. One approach is based on Bayesian networks (Almond et al. 2007), whereas the other follows a latent variable tradition (von Davier 2013).
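Although the two cited approaches are parameterized quite differently, both can be viewed, at a very high level, as latent-variable models that relate examinees' configurations of discrete skill attributes to their observed responses; the generic form below is offered only as an orienting sketch and is not the specific model of either source. With α denoting a vector of skill attributes, π(α) its population probability, and X_i the (possibly polytomous) score on task i, local independence given α yields

\[
P(X_1 = x_1, \ldots, X_I = x_I) \;=\; \sum_{\boldsymbol{\alpha}} \pi(\boldsymbol{\alpha}) \prod_{i=1}^{I} P(X_i = x_i \mid \boldsymbol{\alpha}).
\]

Diagnostic reporting then amounts to computing the posterior probability of each attribute configuration given an examinee's observed responses.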

6 Summary and Reflections

The multiple-choice item format is an early-twentieth-century American invention. Once the format became popular following its use in the Army Alpha and the SAT, it became difficult for constructed-response formats to regain a foothold. The psychometric theory that also emerged in the early twentieth century emphasized score reliability and predictive validity. Those emphases presented further hurdles. The interest in constructed-response formats, especially to assess writing skills, did not entirely die, however. In fact, there was early research at ETS that would be instrumental in eventually institutionalizing constructed-response formats, although it was a journey of nearly 50 years. The role of ETS in that process has been significant. The chapter on performance assessment by Suzanne Lane and Clement Stone (Lane and Stone 2006) in Educational Measurement offers an objective measure of that role: Approximately 20% of the chapter’s citations were to publications authored by ETS staff. This fact is noteworthy because Carl Brigham had objected to the creation of an ETS-like organization on the grounds that an organization that produced tests would work to preserve the status quo, with little incentive to pursue innovation. As he noted in a letter to Conant (cited in Bennett, Chap. 1, this volume):

one of my complaints against the proposed organization is that although the word research will be mentioned many times in its charter, the very creation of powerful machinery to do more widely those things that are now being done badly will stifle research, discourage new developments, and establish existing methods, and even existing tests, as the correct ones. (p. 6)

His fears were not unreasonable in light of what we know today about the potential for lack of innovation in established organizations (Dougherty and Hardy 1996). However, according to Bennett (2005), from its inception the ETS Board of Trustees heeded Brigham’s concerns, as did the first ETS president (from 1947 until 1970), Henry Chauncey. That climate was favorable to conducting research that would address how to improve and modernize existing tests.Footnote 53 Among the many areas of research were investigations related to the scoring of writing. That early research led to a solution to the long-standing problem of scoring essays operationally with acceptable reliability.

Even if the scoring agreement problem was on its way to being solved, it was still the case that tasks requiring a longer constructed response would also take more time and that therefore fewer items could be administered in a given period. With predictive validity as the key metric for evaluating the “validity ” of scores, the inclusion of constructed -response tasks continued to encounter resistance. An exception to this trend was the AP program , which relied on constructed -response tasks from its inception. There was also pioneering work on constructed -response assessments early in ETS’s history (Frederiksen et al. 1957; Hemphill et al. 1962). However, in both of these cases, the context was very different from the admissions testing case that represented the bulk of ETS business.

Thus a major development toward wider use of constructed -response formats was the evolution of validity theory away from an exclusive focus on predictive considerations. Messick’s (1989) work was largely dedicated to expanding the conception of validity to include not only the psychometric attributes of the test, the evidentiary aspect of validation , but also the repercussions that the use of the test could have, the consequential aspect. This broader view did not necessarily endorse the use of one format over another but provided a framework in which constructed -response formats had a greater chance for acceptance.

With the expansion of validity, the doors were opened a bit more, although cost and scalability considerations remained. Addressing those considerations was aided by the transition of assessment from paper to computer. The transition to computer-delivered tests at ETS, which started in 1985 with the deployment of ACCUPLACER, set the stage for the transition of other tests, such as the GMAT, GRE, and TOEFL, to digital delivery and for the expansion of construct coverage and constructed-response formats, especially for writing and eventually speaking.

Along the way, there was abundant research, and implementation of its results, in response to the demands of expanded constructs and the use of the computer to support those demands. For example, the psychometric infrastructure for mixed-format designs, including the modeling of polytomous responses, was developed at ETS and first used in 1992 (Campbell et al. 1996, p. 113). The use of constructed-response formats also required an efficient means of scoring responses captured from booklets. ETS collaborated with subcontractors in developing the necessary technology, as well as control procedures to monitor the quality of scoring. Online scoring systems were also developed to accommodate the transition to continuous administration that accompanied computer-based testing. Similarly, automated scoring was first deployed in 1997, when the licensing test for architects developed by ETS became operational (Bejar and Braun 1999; Kenney 1997). The automated scoring of essays followed in 1999, when it was used to score GMAT essays.

Clearly, by the last decade of the twentieth century, the fruits of research at ETS around constructed -response formats were visible. The increasingly ambitious assessments that were being conceived in the 1990s stimulated a rethinking of the assessment design process and led to the conception of ECD (Mislevy et al. 2003). In addition , ETS expanded its research agenda to include the role of assessment in instruction and forms of assessment that, in a sense, are beyond format. Thus the question is no longer one of choice between formats but rather whether an assessment that is grounded in relevant science can be designed, produced, and deployed. That such assessments call for a range of formats and response types is to be expected. The CBAL initiative represents ETS’s attempt to conceptualize assessments that can satisfy the different information needs of K–12 audiences with state-of-the-art tasks grounded in the science of student learning, while taking advantage of the latest technological and methodological advances. Such an approach seems necessary to avoid the difficulties that accountability testing has encountered in the recent past.

6.1 What Is Next?

If, as Alphonse De Lamartine (1849) said, “history teaches us everything, including the future” (p. 21), what predictions about the future can be made based on the history just presented? Although for expository reasons I have laid out the history of constructed-response research at ETS as a series of sequential hurdles that appear to have been cleared in an orderly fashion, in reality it is hard to imagine how the story would have unfolded at the time ETS was founded. While there were always advocates of the use of constructed-response formats, especially in writing, Huddleston’s view that writing was essentially verbal ability, and therefore could be measured with multiple-choice verbal items, permeated decision making at ETS.

Given the high stakes associated with admissions testing and the technological limitations of the time, relying on the multiple-choice format arguably was, in retrospect, the right course of action from both the admissions committee’s and the student’s points of view. It is well known that James Bryant Conant instituted the use of the SAT at Harvard for scholarship applicants (Lemann 2004) shortly after his appointment as president in 1933, based on a recommendation by his then assistant Henry Chauncey, who subsequently became the first president of ETS.Footnote 54 Conant was motivated by a desire to give students from more diverse backgrounds an opportunity to attend Harvard, which in practice meant giving students from other than elite schools a chance to enroll. The SAT, with its curriculum-agnostic approach to assessment, was fairer to students attending public high schools than the preparatory-school-oriented essay tests that preceded it. That is, the consequential aspect of validation may have been at play much earlier than Messick’s proposal to incorporate consequences of test use as an aspect of validity, and in this sense it may be partially responsible for the long period of limited use into which the constructed-response format fell.

However well-intentioned the use of the multiple-choice format may have been, Frederiksen (1984) claimed that it represented the “real test bias.” In doing so, he helped fuel the demand for constructed-response forms of assessment. It is possible to imagine that the comforts of the familiar, the multiple-choice format, could have closed the door to innovation, a fear that Carl Brigham had expressed more generally.Footnote 55 For companies emerging in the middle of the twentieth century, a far more ominous danger was the disruption that could result from the transition to the digital medium that would take place during the second half of the century. The Eastman Kodak Company, known for its film and cameras, is perhaps the best known example of the disruption that the digital medium could bring: It succumbed to digital competition and filed for bankruptcy in 2012. However, this is not a classic case of being disrupted out of existence,Footnote 56 because Kodak invented the first digital camera! The reasons for Kodak’s demise are far more nuanced and include the inability of the management team to figure out in time how to operate in a hybrid digital and analog world (Chopra 2013). Presumably a different management team could have successfully transitioned the company to a digital world.Footnote 57

In the testing industry, by contrast, ETS not only successfully navigated the digital transition but actually led the move to a digital testing environment with the launch of ACCUPLACER in 1985.Footnote 58 The transition to adaptive testing must have been accompanied by a desire to innovate and to explore how technology could be used in testing, because in reality there were probably few compelling business or even psychometric reasons to launch a computer-based placement test in the 1980s. Arguably, the early transition made it possible for ETS to incorporate constructed-response formats into its tests sooner, even if the transition was not even remotely motivated by the use of constructed-response formats. Because ETS could build on a repository of research on constructed-response formats motivated by validity and fairness considerations, the larger transition to a digital ecosystem ultimately proved not to be disruptive; instead, it made it possible to take advantage of the medium, finally, to deploy assessments containing constructed-response formats and to envision tests as integral to the educational process rather than purely technological add-ons.

In a sense, the response format challenge has been solved: Admissions tests now routinely include constructed-response items, and the assessments developed to measure the Common Core State Standards also include a significant number of constructed-response items. Similarly, NAEP, which has included constructed-response items for some time, is making the transition to digital delivery via tablets. ETS has had a significant role in this long journey. From inception, there was a perspective at ETS that research is critical to an assessment organization (Chauncey, as cited by Bennett, Chap. 1, this volume). Although the formative years of the organization were in the hands of enlightened and visionary individuals, it appears that the research that supported the return of constructed-response formats was not prescribed from above but rather was the result of intrapreneurship,Footnote 59 that is, of individual researchers largely pursuing their own interests.Footnote 60 If this is the formula that worked in the past, it could well continue to work in the future, if we believe De Lamartine. Of course, Santayana argued that history is always written wrong and needs to be rewritten. Complacency about the future, therefore, is not an option: The future will still need to be constructed.