JR correctly emphasize the importance of understanding what it is that the test measures; that is, the underlying construct that delineates the cognitive skills that are the target of inference. To say that the test measures first year high school math is insufficient as there are many ways to conceptualize this construct, leading to very different tests. For example, both PISA and TIMSS assess topics appearing in early secondary mathematics curricula. However, the assessment frameworks (OECD 2013; Mullis and Martin 2009) they have developed are quite different, as are the assessment instruments that are aligned to those frameworks. Consequently, it is not surprising that the relative performance of subpopulation groups and administrative jurisdictions can vary considerably across these two LSAS (e.g. Wu 2010). Related issues regarding careful definition of the underlying constructs and how they affect test development and test use are treated in Braun and Mislevy (2005) and Mislevy and Haertel (2006).
Traditionally, the statistical models employed in the analysis of test data are those associated with classical test theory (CTT) or item response theory (IRT) (Lord and Novick 1968). Briefly, CTT conceives of an observed test score as the sum of a “true score” and a random disturbance. Under reasonable assumptions, CTT leads to definitions and calculation formulas for such familiar quantities as test reliability. It is still an important tool in day-to-day test analysis, especially if a single test form with a simple (linear) test design is used.
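To make the decomposition concrete, a minimal simulation (with invented variance components, not taken from any actual assessment) recovers the reliability coefficient as the ratio of true-score variance to observed-score variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # hypothetical number of examinees

# CTT: observed score X = true score T + random disturbance E,
# with T and E uncorrelated.  The variance components are invented.
true_var, error_var = 4.0, 1.0
T = rng.normal(50.0, np.sqrt(true_var), n)
E = rng.normal(0.0, np.sqrt(error_var), n)
X = T + E

# Reliability = var(T) / var(X); here 4 / (4 + 1) = 0.8 in the population.
reliability = T.var() / X.var()
print(round(float(reliability), 2))  # close to 0.8
```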
By contrast, in its simplest incarnation, IRT begins with the notion of a latent trait that represents the construct and the assumption that each individual has some (unknown) value with respect to that trait. The goal of the test is to estimate as well as possible the unknown value for each individual. For each item in the test it is assumed that there is a “dose–response” curve that describes the probability of obtaining the correct answer (the response) as a function of the values of the latent trait (the dose).
JR describe different types of estimators of proficiency used in test analyses with IRT. However, their statements regarding these estimators are at odds with the literature. Specifically, JR state that “…the typical test has relatively few items…” and directly below that “…student ability [is] …estimated directly via maximum likelihood …[and] resulting estimate is (approximately) unbiased in most cases…” (p. 95). However, the maximum likelihood estimator (MLE) for the latent trait in IRT models is not unbiased (Kiefer and Wolfowitz 1956; Andersen 1972; Haberman 1977). The extent of the bias is inversely related to the number of items, so that MLEs for tests with “relatively few items” will exhibit a more pronounced bias, as well as considerable noise. The bias of ML estimates can be examined formally, and even reduced, by the methods proposed by Warm (1989) for IRT and by Firth (1992, 1993) for a more general class of latent variable models. The weighted likelihood estimate (WLE), as the estimator proposed by Warm (1989) came to be known, eliminates the first-order bias of the MLE (see also Firth 1993). It turns out that these bias corrections of the MLE are equivalent to Bayes modal estimators using the Jeffreys (1946) prior for a number of commonly used IRT models. This assertion holds for the Rasch model and the 2PL model (Warm 1989), and estimators of this type are available for several polytomous IRT models, including some polytomous Rasch models (von Davier and Rost 1995; von Davier 1996). Recently, Magis (2015) verified this equivalence for ‘divide by total’ and ‘difference’ type polytomous IRT models.
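As a sketch of the point (the item difficulties, true ability, and sample size are invented for illustration; this is not the operational estimator of any particular LSAS), the following simulation computes Warm's WLE for a hypothetical five-item Rasch test. Unlike the MLE, which is infinite for zero and perfect raw scores and noticeably biased for short tests, the WLE is finite for every raw score and its mean stays close to the generating ability:

```python
import numpy as np

def wle(r, b, lo=-10.0, hi=10.0):
    """Warm's (1989) weighted likelihood estimate for the Rasch model:
    solves S(theta) + I'(theta) / (2 I(theta)) = 0, where S is the
    score function and I the test information.  Unlike the MLE, the
    solution is finite even for zero and perfect raw scores."""
    def g(theta):
        p = 1.0 / (1.0 + np.exp(-(theta - b)))
        info = (p * (1 - p)).sum()
        dinfo = (p * (1 - p) * (1 - 2 * p)).sum()
        return r - p.sum() + dinfo / (2 * info)
    for _ in range(100):          # bisection: g is decreasing in theta
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(7)
b = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # invented item difficulties
theta_true = 0.5                             # invented generating ability
p = 1.0 / (1.0 + np.exp(-(theta_true - b)))
scores = (rng.random((20_000, b.size)) < p).sum(axis=1)

estimates = np.array([wle(r, b) for r in range(b.size + 1)])
mean_wle = estimates[scores].mean()
print([round(float(t), 2) for t in estimates])  # finite for r = 0 .. 5
print(round(float(mean_wle - theta_true), 2))   # small residual bias
```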
The use of IRT in the construction of score scales is now considered best practice in many areas of test analysis (van der Linden 2016; Carlson and von Davier 2013; Embretson and Reise 2000). It has gained further support from research that shows IRT to be a special case of a much larger class of latent variable models that is commonly used in applied statistics (Takane and de Leeuw 1987; Moustaki and Knott 2000; Skrondal and Rabe-Hesketh 2004).
In IRT, the dose–response curve is referred to as the item response function and is usually modeled as a logistic function with one, two, or three parameters. A test of \(K\) dichotomously scored items administered to \(N\) examinees yields an \(N \times K\) matrix of zeros and ones from which one can estimate the parameters of the \(K\) items and the values of the latent trait for the \(N\) examinees. The former is referred to as item calibration and the latter as ability estimation. Note that this estimation problem is an order of difficulty greater than that typically encountered in biostatistics, where the doses are known.
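The three-parameter logistic (3PL) item response function, of which the one- and two-parameter versions are special cases, can be written down in a few lines (the parameter values in the usage example are invented):

```python
import math

def irf(theta, a=1.0, b=0.0, c=0.0):
    """Item response function of the 3PL model:
    P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b))),
    where a is the discrimination, b the difficulty, and c the lower
    asymptote (pseudo-guessing).  Setting c = 0 gives the 2PL;
    additionally fixing a = 1 gives the one-parameter (Rasch-type) model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

print(round(irf(0.0, b=0.0), 2))                  # 0.5 at theta = b
print(round(irf(-4.0, a=1.5, b=0.0, c=0.2), 2))   # approaches c = 0.2
```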
Estimation can be done by some variant of maximum likelihood or by Bayesian techniques. Maximum likelihood estimation, either in the form of marginal maximum likelihood (MML) or conditional maximum likelihood (CML) estimation of item parameters in IRT, is the method of choice in LSAS (e.g. Adams et al. 2007; von Davier et al. 2007, 2013). Ability estimation is often done using either (weighted or bias-corrected) maximum likelihood or Bayesian approaches.
For use in LSAS and other more complex test designs, IRT models have been extended to allow for items that are not dichotomously scored and for multi-dimensional latent traits. In particular, LSAS test batteries represent the focal skill domains very comprehensively by employing a large number of test items. Consequently, inferences are based on an item pool that would typically require several hours of testing. Each respondent takes only a carefully selected, small subset of the item pool. However, through the application of an appropriate experimental design (balanced incomplete block designs), each item in the pool is, overall, administered to a random sample of respondents. Although the formulas and estimation techniques are necessarily more complicated for this extended IRT approach, the basic ideas remain the same.
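A toy instance of such a design (the block labels and booklet composition are invented; operational assessments use many more blocks) is the classic (7, 3, 1) balanced incomplete block design: seven booklets of three blocks each cover seven item blocks so that every block appears equally often and every pair of blocks appears together exactly once, which supports joint calibration of all items:

```python
from itertools import combinations

# Hypothetical BIB design: 7 item blocks (0-6) distributed over 7 booklets
# of 3 blocks each.  Each respondent takes one booklet.
booklets = [
    (0, 1, 2), (0, 3, 4), (0, 5, 6),
    (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5),
]

# Each block appears in exactly 3 booklets ...
appearances = [sum(blk in bk for bk in booklets) for blk in range(7)]
# ... and each pair of blocks appears together in exactly 1 booklet.
pair_counts = {p: sum(set(p) <= set(bk) for bk in booklets)
               for p in combinations(range(7), 2)}

print(appearances)                # [3, 3, 3, 3, 3, 3, 3]
print(set(pair_counts.values()))  # {1}
```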
JR correctly point out that there is an essential indeterminacy in IRT estimation: The quality of the fit of the model to the data is the same under monotone transformations of the underlying scale. Secondary analysis results can differ with the choice of scale (Ballou 2009). Consequently, appropriate cautions in interpretation are in order. However, the reporting scales are typically set in the first round of assessment and establish a mean, standard deviation, and range that are based on the score distribution of a well-defined initial set of populations.
For example, the PISA scale, with a mean of 500 and a standard deviation of 100, was set with respect to the group of OECD countries that participated in the first administration. One can call this an arbitrary or a pragmatic approach, but it is certainly not an uncommon one. Temperature scales differ in how they set their reference points, and the mere coexistence and simple interconvertibility of the metric and imperial systems of measurement show that neither inches nor centimeters are more rational, or more arbitrary, than the other.
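The scale-setting step itself is just a linear transformation; a minimal sketch, with simulated latent-trait values standing in for the actual estimates of the reference population:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical latent-trait estimates for the reference population,
# on the internal IRT theta metric (roughly mean 0, sd 1).
theta = rng.normal(0.0, 1.0, 50_000)

# Set a PISA-like reporting scale: linearly transform so the reference
# population has mean 500 and standard deviation 100.
scale = 500 + 100 * (theta - theta.mean()) / theta.std()

print(round(float(scale.mean())), round(float(scale.std())))  # 500 100
```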
JR also correctly point out that there is no evidence that these psychometric proficiency scales have the interval scale properties that are implicit in many secondary analyses (e.g. regression modeling). This is indeed a point of concern, as the properties of the scale are an assumption, as is also the case with wages (log-wages are often used), measures of time and age, and many medical measures. (Is a change in heart rate from 80 to 140 of the same medical concern as a change from 140 to 200? Is the change in physical strength or vocabulary between ages 1 and 6 the same as between ages 24 and 29? Aging 5 years may indeed mean very different things at different initial ages.) There are many measures in the physical as well as the behavioral sciences that are represented as real values or integers and employed as such in models, yet that may not have the desired ‘interval’ properties (or, better, the same interpretation of scale score differences) on related scales of interest. On this point, a remark by Tukey (1969, p. 87) seems apropos:
Measuring the right thing on a communicable scale lets us stockpile information about amounts. Such information can be useful, whether or not the chosen scale is an interval scale. Before the second law of thermodynamics—and there were many decades of progress in physics and chemistry before it appeared—the scale of temperature was not, in any nontrivial sense, an interval scale. Yet these decades of progress would have been impossible had physicists and chemists refused either to record temperatures or to calculate with them.
In point of fact, we cannot verify the scale properties of many variables that we use in our analyses, but we can perform a variation of what one may call a scaling sensitivity analysis. As an example, educational attainment is another variable frequently seen in regressions used by labor economists, measured either in years of schooling (Do we start counting at kindergarten, or at pre-school?) or in universally defined levels of education described by the International Standard Classification of Education codes (ISCED; UNESCO 2011). While the ISCED codes do not perfectly cross-classify all national education systems, they do yield an ordered set of educational attainment levels that attempts to represent equivalent types of education rather than just the number of years someone remained at school.
Although the choice of a particular numerical scale in an application of IRT is arbitrary, there are mathematical results on monotonicity properties that describe how increasing scale values are associated with increased expected outcomes on task performance and other variables (e.g. Junker and Sijtsma 2000). Moreover, similar to the ISCED levels, most test score scales are accompanied by a categorization of levels; for example, they can be based on typical tasks that are carried out correctly with high probability by individuals in these levels. Other examples are scales that are anchored (i.e. given meaning) by relevant variables linked to the scale. The types of variables that are used for anchoring can be job categories or ISCED levels (What is the average score of test takers whose parents have a high-school degree as their highest degree of education? What is the average score of test takers with one or two parents with masters or PhD level degrees?). Other variables used for anchoring can include, for example, the average scores achieved by developed countries, by developing countries, by student populations defined by school type, by native language, etc. In assessments such as PIAAC, scales can be described further in terms of expected scores on the scale for individuals in different job categories, at different levels of educational attainment, or at different income levels for that matter.
The reporting scales in LSAS are usually anchored by a process called proficiency scaling, which defines contiguous intervals on the scale associated with typical classes of problems or activities that individuals scoring at that level can master on a consistent basis. Finally, repeated use of a particular scale, together with the associated validity data and descriptions of what students at different ability levels can typically do, does achieve a certain interpretive familiarity over time.
JR refer to an example by Bond and Lang (2013) in which three different skills are measured in black and white student groups and the gap in test scores depends on how the skills are weighted. We could not agree more with the statement that the values of derived variables, such as group differences, will depend on how the scale is constructed through the weights assigned to the different component skill subscales. This is similar to what one sees in stock market indices: Different companies from different segments of the economy are included or excluded, and indices will be differentially sensitive to different events that may trigger market reactions.
What underlies the Bond and Lang (2013) example is an issue that occurs whenever one measures a variable that is not directly observable. Whether it is the literacy skills of students, or the health of a market segment, the degree to which a measure is sensitive to differences or changes depends on the choice of indicators (tasks in student assessments and stocks in indices).
To some extent, we cannot escape this scaling (and weighting and selection of indicators) problem in any science, whether it is the use of adjustments in carbon-14 dating of archeological artifacts, the measurement of cosmic distances based on signals collected by radio telescopes, or stock indices and educational tests as measures of underlying constructs. More importantly, the different indicators will be differentially sensitive to underlying group differences in the construct, as each indicator was selected with the goal of representing different aspects of the underlying phenomenon. Thus, ‘varying gaps’ are not an indicator of a deficiency but, rather, a consequence of the different ways reasonable indices (or tests) can be constructed. The phenomenon is often referred to as the reliability/validity dilemma: A test (or index) that maximizes reliability will contain only very similar components and will hence not be sensitive to differences on outcome variables that are caused by, cause, or are merely correlated with a much broader range of other measures.
Indeed, one could turn the question around and argue that there is absolutely no reason why the black–white gap should be the same across different indicators of literacy. Different aspects of literacy are, by definition, distinguishable attributes of a broader construct. Changes in pedagogy or policy may affect one attribute more than another. To that point, the expectation that the gap should be time-invariant when the measurement instrument changes to account for the more complex literacy-related activities of higher grade levels seems somewhat counter-intuitive; at the very least, it would require a rather elaborate theoretical explanation of why that should be the case.
Coming back to the scaling and scale level issue raised by JR, it is possible—and one could argue even necessary—to exploit this essential indeterminacy of the latent variable scale by utilizing scale transformation and linking methods to make proficiency scales comparable across cycles. This means that once a scale has been set (and anchored by proficiency level descriptors), then it can be used as the reference for future assessments. The methods that help to ensure the comparability of the scales of future assessments typically utilize common blocks of items over time. Measurement invariance models based on factor analysis, and their IRT equivalents, are then used to align the results of the current assessment cycle to the reference scale (Yamamoto and Mazzeo 1992; Bauer and Hussong 2009; Mazzeo and von Davier 2008, 2013). This is done on an ongoing basis for NAEP, as well as for international assessments such as PISA, TIMSS, PIRLS, and PIAAC. Of course the defensibility of these linking procedures depends on the validity of certain invariance assumptions with respect to how the tasks on the test are responded to across different populations and cohorts (Mazzeo et al. 2008, 2013).
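One simple member of this family of linking methods is mean/sigma linking through common items; the difficulty values below are invented for illustration, and operational programs typically use more elaborate IRT linking procedures (e.g. concurrent calibration or test characteristic curve methods):

```python
import numpy as np

# Hypothetical mean/sigma linking through common items.  b_ref holds the
# difficulty estimates of the common items on the established reference
# scale; b_new holds the estimates of the same items from the current
# cycle's calibration, which sits on its own arbitrary theta metric.
b_ref = np.array([-1.2, -0.5, 0.1, 0.8, 1.6])
b_new = np.array([-0.9, -0.3, 0.2, 0.9, 1.6])  # shifted/stretched metric

A = b_ref.std() / b_new.std()        # slope of the linear transformation
B = b_ref.mean() - A * b_new.mean()  # intercept

# Any ability or difficulty estimate from the new calibration can now be
# placed on the reference scale via theta_ref = A * theta_new + B.
linked = A * b_new + B
print(float(linked.mean() - b_ref.mean()),  # ~0: common-item means agree
      float(linked.std() - b_ref.std()))    # ~0: common-item sds agree
```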
In addition, as JR point out (p. 92), it has to be understood that any transformation, whether based on small or large samples, arbitrarily applied to test scores may not yield comparable test scores even if they are numerically transformed onto the same scale. The linking methods and comparability/measurement invariance approaches cited above do not apply such transformations; rather they utilize (and test) invariance assumptions in the form of parameter constraints that lead to linked test scores on the same scale, thereby allowing comparisons across countries and over time.