Reliability Issues in High-Stakes Educational Tests

High-stakes tests and examinations often give rise to rather specific measurement problems. Though nowadays item response theory (IRT) has become the standard theoretical framework for educational measurement, in practice, number-correct scores are still prominent in the definition of standards and norms. Therefore, in this chapter methods are developed for relating standards on the number-correct scale to standards on the latent IRT scale. Further, this chapter focuses on two related issues. The first issue is estimating the size of standard errors when equating older versions of a test to the current version. The second issue is estimating the local reliability of number-correct scores and the extra error variance introduced through number-correct scoring rather than using IRT proficiency estimates. It is shown that the first issue can be solved in the framework of maximum a posteriori (MAP) estimation, while the second issue can be solved in the framework of expected a posteriori (EAP) estimation. The examples that are given are derived from simulations studies carried out for linking the nation-wide tests at the end of primary education in the Netherlands.


Outline of the Problem
The problem addressed here is that the standard scoring rule in much educational measurement, that is, the number-correct score, is not the same one as the optimal scoring rule that is derived from the IRT model that fits the data. In this chapter, a method is outlined for how to evaluate the consequences of this discrepancy for an important inference that is often made using IRT, that is, the consequences for test equating. To explain this further, we first introduce an IRT model and outline the principle of test equating.
The IRT models used in this chapter are the one-, two-and three-parameter Logistic models. The data are responses of students labeled with an index n = 1, …, N to items labeled with an index i = 1, …, K. To indicate whether a response is available, we define a variable d ni = 1 if a response of student n to item i is available 0 if this is not the case. (11.1) The responses will be coded by a stochastic variable Y ni . In the sequel, upper-case characters will denote stochastic variables and lower-case characters will denote realizations. In the present case, there are two possible realizations, defined by y ni = ⎧ ⎨ ⎩ 1 if d ni = 1 and student n gave a correct response to item i 0 if d ni = 1 and student n did not give a correct response to item i c if d ni = 0, where c is an arbitrary constant unequal 0 or 1.
(11.2) Define the logistic function (.) as: In the 3-parameter logistic model (3PLM, Birnbaum 1968) the probability of a correct response depends on three item parameters, a i , b i and c i which are called the discrimination, difficulty and guessing parameter, respectively. The parameter θ n is the latent proficiency parameter of student n. The model is given by . (11.3) The 2-parameter logistic model (2PLM, Birnbaum 1968) follows by setting the guessing parameter equal to zero, so by introducing the constraint c i = 0. The 1-parameter logistic model (1PLM, Rasch 1960) follows by introducing the additional constraint a i = 1. Note that in the application of the models in high-stakes situations, the number of proficiency parameters θ n can become very large. Besides the practical problem of computing estimates of all model parameters concurrently, this also leads to theoretical problems related to the consistency of the estimates (see, Neyman and Scott 1948;Kiefer and Wolfowitz 1956). Therefore, it is usually assumed that the proficiency parameters are drawn from one or more normal proficiency distributions, indexed g = 1, …, G, which are often also referred to as population distributions. That is, θ n has the density function g(θ n ; μ g(n) , σ 2 g(n) ) = 1 σ √ 2π exp − 1 2 (θ n − μ n(g) ) 2 σ 2 , (11.4) where g(n) is the population to which student n belongs.
Test equating relates the scores on one test to the scores on another test. Consider a simulated example based on the estimates displayed in Table 11.1. The estimates emanate from two tests. A sample of 5000 students of a population A was given a test A consisting of the items i = 1,…,10, while a sample of 5000 other students of a population B was given a test B consisting of the items i = 6,…,15. So the anchor between the two tests, that is, the overlap between the two tests, consists of 5 items. The anchor supports the creation of a common scale for all parameter estimates. The responses were generated with the 2PLM. The difficulties of the two tests differed: test A had a mean difficulty parameter, b A , of 0.68, while the difficulty level of test B, b B , was equal to −0.92. The mean of the proficiency parameters θ n of sample A, μ A was equal to −0.25, while the mean of the proficiency parameters of sample B, μ B was equal to 0.25. The variances of the proficiency parameters and the mean of the discrimination parameters were all equal to one.
Suppose that test A has a cutoff score of 4, where 4 is the highest number-correct score that results in failing the test. In the fourth column of Table 11.1, the column labeled θ , it can be seen that the associated estimate on the latent θ -scale is 0.02. We chose this point as a latent-cutoff point, that is, θ 0 = 0.02. If the Rasch model would hold for these data, the number-correct score would be the sufficient statistic for θ . In the 2PLM, the relation between a number-correct score and a θ -estimate is more complicated; this will be returned to below. Through searching for number-correct scores on Test B with θ -estimates closest to the latent cutoff point, we find that a cutoff score 6 on Test B best matches a cutoff score 4 on Test A. This conclusion is consistent with the fact the average difficulty of test A was higher than the average difficulty of test B. On the other hand, the sample administered test B was more proficient than the sample of test A. The columns labeled "Freq" and "Prob" give the frequency distributions of the number-correct scores and the associated cumulative proportions, respectively. Note that 73% of sample A failed their test, while 39% of sample B failed theirs. Again, this is as expected.
The next question of interest is the reliability of the equating procedure. This can be translated into the question how precise the two cutoff scores can be distinguished. If we denote the cutoff scores by S A and S B , and denote the estimates of the positions on the latent scale associated with these two cutoff points byθ S A andθ S B , then Se(θ S A − θ S B ) can be used as a measure of the precision with which we can distinguish the two scores. The estimatesθ S A andθ S B are not independent. Firstly, they both depend on the same linked data set and, secondly, they both depend on a concurrent estimate of all item-parameters, a i , b i and c i , and (functions of) all latent proficiency parameters θ n . Therefore, the standard error Se(θ S A −θ S B ) cannot be merely computed as the square root of V ar(θ S A −θ S B ) = V ar(θ S A ) + V ar(θ S B ), but the covariance of the estimates must be taken into account also. The method to achieve this is outlined below, after the outline of a statistical framework and considering the problem of test scoring with number-correct scores when these are not sufficient statistics.

Preliminaries
Nowadays, marginal maximum likelihood (MML, see, Bock and Aitkin 1981) and fully Bayesian estimation (Albert 1992;Johnson and Albert 1999) are the prominent frameworks for estimating IRT models. Mislevy (1986, also see, Glas 1999) point out that they are closely related, because MML estimation is easily generalized to Bayes modal estimation, an estimation method that seeks the mode of the posterior distribution rather than the mode of the likelihood function. In this chapter, we adopt the MML and Bayes modal framework. In this framework, it is assumed that the θ -parameters are drawn from a common distribution, say, a population proficiency distribution as defined in Formula (11.4). Estimates of the item parameters and the parameters of the population proficiency distribution are obtained by maximizing a likelihood function that is marginalized with respect to the θ -parameters.
An important tool for deriving the estimation equations is Fisher's identity (Efron 1977;Louis 1982). For this identity, we distinguish N independent observations y n and unobserved data z n . The identity states that the first order derivatives of the parameters of interest δ with respect to the log-likelihood function L(.) are given by , . . . , log p(y n , z n ; δ) ∂δ p(z n |y n ; δ)dz n , where p(y n , z n ; δ) is the likelihood if z n would be observed, ∇ n (δ) is the firstorder derivative of its logarithm, and p(z n |y n ; δ) is the posterior distribution of the unobserved data given the observations. Bock and Aitkin (1981) consider the θ -parameters as unobserved data and use the EM-algorithm (Dempster et al. 1977) for maximum likelihood estimation from incomplete data to obtain estimates of the item and population parameters. In this framework, Glas (1999Glas ( , 2016 uses Fisher's identity to derive estimation and testing procedures for a broad class of IRT models. Standard errors can be obtained as the square roots of the covariance matrix of the estimates Cov(δ,δ) which can be obtained by inverting the observed Fisher information matrix, say, Cov(δ,δ) = I (δ,δ) −1 . Louis (1982) shows that this matrix is given by where ∇ n (δ, δ t ) stands for the second-order derivatives of log p(y n , z n ; δ) with respect to δ. Evaluated at the MML estimates, the information matrix can be approximated by (see Mislevy 1986). In the next sections, this framework will be applied to the issues addressed in this chapter: the reliability of tests scored with number-correct scores and to equating errors. Glas (1999Glas ( , 2016 shows how the estimation equations for the item and population parameters of a broad class of IRT models can be derived using Fisher's identity. This identity can also be applied to derive an estimation equation for a proficiency estimate based on a number-correct score (11.8) with d i and y i as defined in (11.1) and (11.2) dropping the subscript n. The application of Fisher's identity is based on viewing a response pattern as unobserved and the number-correct score as observed. Define L s (θ ) as the product of the normal prior distribution g(θ ; λ) with λ = (μ, σ 2 ) and the probability of a number-correct score s given θ . Define {y|s} as the set of all response patterns resulting in a number correct score s. Then the probability of a number-correct score s given θ is equal to the sum over {y|s} of the probabilities of response patters P(y|θ, β) given item parameters β and proficiency parameters θ . Application of Fisher's identity results in a first order derivative

MAP Proficiency Estimates Based on Number-Correct Scores
. (11.9) Equating this expression to zero gives the expression for the MAP estimate. Computation of the summation over {y|s} can be done using the recursive algorithm by Lord and Wingersky (1984). The algorithm is also used by Orlando and Thissen (2000) for the computation of expected a-posteriori estimates of θ given a number-correct score s. Note that in expression (11.9), the prior g(θ ; λ) cancels in the posterior, so p(y|s; θ, β, λ) ≡ p(y|s; θ, β).
As an example, consider the 2PLM, given by expression (11.3) with c i = 0. The probability of a response pattern becomes and The estimation equation can be solved by either the Newton-Raphson algorithm, or by the EM algorithm. Standard errors can be based on observed information as defined in expression (11.7). One way of estimating θ and computing the standard errors is to impute the item parameters as known constants. However, when we want to compare the estimated proficiencies obtained for two tests through their difference, say, Se(θ S A −θ S B ), we explicitly need to take the precision of the estimates of all item and population parameters into account. How this is accomplished is outlined in the next section.

Equating Error
Suppose θ 0 is a cutoff point on the latent scale and we want to impose this cutoff point on several test versions. Further, we want to estimate the reliability of the created link. Three procedures for the computation of equating errors will be discussed, using some possible data collection designs displayed in Fig. 11.1.
To introduce the first method, consider the design displayed in Fig. 11.1a. In this design, students were administered both test versions, that is, Version A and Version B. The first measure for the strength of the link is based on the standard error of the difference between the average difficulties of the two versions, say, where b A is the estimate of the mean difficulty of Version A and b B the estimate of the mean difficulty of Version B. The strength of the link is mainly determined by the number of students, but also by the number of item parameters making up the two means. Since the estimates are on a latent scale that is subject to linear transformations, we standardize the standard error with the standard deviation of the proficiency distribution. This leads to the definition of the index The standard error can be computed as the square root of V ar(b A − b B ), which can be computed by pre-and post-multiplying the covariance matrix by a vector of weights, that is, w t Cov(δ,δ)w, if this is not the case, (11.13) where d i A and d i B are defined by expression (11.1), for a student administered test A and a student administered test B, respectively. Figure 11.1b gives an example of equating two tests via common items (the socalled anchor). The test consisting of the items A1 and A2 is linked to the test consisting of the items B2 and B3, because A2 and B2 consist of the same items. The larger the anchor, the stronger the link. In this design it is usually assumed that the means of the two proficiency distributions are different. This leads to a second definition of an index for equating error, that is: where Sd(θ ) is a pooled estimate of the standard deviations of the proficiency distributions of the two populations. In Fig. 11.1c, the test consisting of parts A1 and A2 and the test consisting of the parts B3 and B4 have no items in common, but a link is forged by the students administered C2 and C3. Again, the standard error can be computed as the square root of the associated variance, which can be computed by pre-and post-multiplying the covariance matrix of the parameter estimates by a vector of weights, that is, w t Cov(δ,δ)w, where w has elements (11.15) A third method to assess a equating error is based on the position of the cutoff point on the latent scale. This approach gives a more precise estimate of the equating error of the cutoff point, but below it becomes clear that it is somewhat more complicated to compute. Suppose θ 0 is the cutoff point on the latent scale. On both tests, we choose an observed cutoff score, say S A and S B , that are associated with the same (mean) proficiency level θ 0 . Then an equating error index can be defined as whereθ S A andθ S B are the estimates of the positions on the latent scale with the two observed cutoffs.
To define this standard error, we augment the log-likelihood given the observed data with two observations, one for each of the sum scores S A and S B . So the complete likelihood becomes L(δ, θ ) = L(δ) + L s (θ ), and the information matrix becomes (11.17) As above, the standard error of the difference betweenθ S A andθ S B can be computed as the square root of the associated variance, which can be computed by pre-and post-multiplying the covariance matrix by a vector of weights, that is, w t Cov(δ,δ)w. In this case, the vector w has elements (11.18) Examples will be given below.

EAP estimates and another approach to the reliability of number-correct scores.
In test theory we distinguish between global reliability and local reliability. Global reliability is related to the precision with which we can distinguish two randomly drawn students from some well-defined population, while local reliability relates to the precision given a specific test score. We discuss these two concepts in the framework of IRT in turn.
One of the ways in which global reliability can be defined is as the ratio of the true variance relative to the total variance. For the framework of IRT, consider the variance decomposition (11.19) where y is an observed response pattern, var(θ ) is the population variance of the latent variable, var[E(θ | y)] is the posterior variance of the expected person parameters (say, the EAP estimates of θ ). So this EAP estimate is the error variance averaged over the values that can be observed weighted with their probability of their occurrence under the model. Further, E[var(θ | y)] is the expected posterior variance of the EAP estimate. Then reliability is given by the ratio Bechger et al. 2003). The middle expression in (11.20) is the variance of the estimates of the person parameters relative to the 'true' variance, and the right-hand expression in (11.15) is one minus the average variance of the estimates of the student parameters, say, the error variance, relative to the 'true' variance. The generalization to number-correct scores s is straightforward. If the observations are restricted from y to s, a student's proficiency can be estimated by the EAP E(θ |s), that is, the posterior expectation of θ given s, and the precision of the estimate is given by the posterior variance var(θ |s). Then global reliability generalizes to If the 1PLM holds, s is a sufficient statistic for θ . Therefore, it is easily verified that E(θ |s) ≡ E(θ | y) and the expressions (11.20) and (11.21) are equivalent. In all other cases, computation of the posterior distribution involves a summation over all possible response patterns resulting in a number-correct score s, and, as already noticed above, this can be done using the recursive algorithm by Lord and Wingersky (1984). If the 1PLM does not hold, there is variance in E(θ | y) conditional on s. This leads to the interesting question how much extra error variance is created by using s as the basis for estimating θ . That is, we are interested in the contribution of V ar(E(θ | y)|s) to the total error variance, that is, to the posterior variance V ar(θ |s). This contribution can be worked out by using an identity analogous to Expression (11.21), that is,

V ar(θ |s) = E(V ar(θ | y)|s) + V ar(E(θ | y)|s)).
( 11.22) Note that E(V ar(θ | y)|s) is the squared measurement error given y averaged over the distribution of y given s, and V ar(E(θ | y)|s)) is the variance of the EAP estimates, also over the distribution of y given s. In the next section, examples of local reliability estimates will be given.

Examples of Reliability Estimates
In this section, two simulated examples are presented to show the kind of results that the local reliability indices presented above produce.
The first example is created by simulating 1000 response patterns on a 20-item test. The data were created with the 2PLM, with the θ -values drawn from a standard normal distribution. The 20 item parameters were the product of a set of four discrimination parameters a = {0.8, 0.9, 1.10, 1.20} and five difficulty parameters b = {−1.0, −0.5, 0.0, 0.5, 1.0}. MML estimates (i.e., Bayes modal estimates) were computed with a standard normal distribution for the θ -values. The results are displayed in Table 11.2.
Note that the MAP estimates and the EAP estimates are very similar, as are their standard deviations displayed in the columns labeled Sd M AP (θ |s) and Sd E AP (θ |s). The last three columns give the variance decomposition as defined in Expression (11.22). It can be seen that V ar(E(θ | y)|s) is relatively small compared to E(V ar(θ | y)|s)). So the potential bias in a student's proficiency estimate when using number-correct scores is much less than the inflation of the precision of the estimate. A final observation that can be made from this simulation study is that the global reliability when switching from scoring using the complete response patters to using the number-correct scores dropped from 0.788 to 0.786. So the loss in global reliability was negligible.
It is expected that if the variability of the discrimination parameters is enlarged, V ar(E(θ | y)|s) increases. The reason is that if the discrimination parameters are considered known, the weighted sum score i d i a i y i is a sufficient statistic for θ . If  all discrimination parameters are equal to 1.0, the 2PLM becomes the 1PLM, and then the number-correct score becomes a sufficient statistic. So the more variance in the discrimination parameters, the greater the violation of the 1PLM and the depreciation of the appropriateness of the scoring rule.

V ar(θ|s) V ar(E(θ| y)|s) E(V ar(θ| y)|s
To investigate this effect, the discrimination parameters of the simulation were changed to parameters a = {0.40, 0.60, 1.40, 1.60}. The results are displayed in Table 11.3. It can be seen that the standard deviations in the columns labeled Sd M AP (θ |s) and Sd E AP (θ |s) blew up a bit, but the effect was not very large. Further, in the column labeled V ar(E(θ | y)|s) the values clearly increased, while this is less the case in the column labeled E (V ar(θ | y)|s)). For instance, if we consider a number-correct score 10, we observe that the initial values 0.01 and 0.19 changed to 0.04 and 0.17. The net effect was a change in V ar(θ |s) from 0.20 to 0.21. So the increase in variance of θ -estimates (that is, of expectations E(θ | y)) was counterbalanced by an increase of the overall precision V ar(θ | y).

Simulation Study of Equating Errors
In this section, two sets of simulation studies will be presented. The first study was based on the design displayed in Panel b of Fig. 11.1, which displays a design with a link via common items. The simulation was carried out to study the effect of the size of the anchor. The second set of simulations was based on the design of Panel c of Fig. 11.1, which displays a design with common students. These simulations were carried out to study the effect of the number of students in the anchor.
The studies were carried out using the 2PLM. To create realistic data, the item parameters were sampled from the pool of item parameters used in the final tests in primary education in the Netherlands. Also the means of proficiency distributions and cutoff scores were chosen to create a realistic representation of the targeted application, that entailed equating several versions and cycles of the tests.
For the first set of simulations, two tests were simulated with 2000 students each. The proficiency parameters for the first sample of students were drawn from a standard normal distribution, while the proficiency parameters for the second sample of students were drawn from a normal distribution that was either standard normal or normal with a mean 0.5 and a variance equal to 1.0. Cutoff points were varied as θ 0 = −0.5 or θ 0 = 0.0. The results are displayed in Table 11.4. The first column gives the length of the two tests; the tests were of equal size. 50 items is considered realistic for a high-stakes test, tests of 20 and 10 items were simulated to investigate the effects of decreasing the test length.
The second column gives the size of the anchor. The total number of items in the design displayed in the third column follows from the length of the two tests and the size of the anchor. 100 replications were made for every one of the 24 conditions. For every replication, the item parameters were redrawn from the complete pool of all item parameters of all (five) test providers. The complete pool consisted of approximately 2000 items. The last three columns give the three equating errors    defined above. Note that Sd(θ ) was always equal to 1.0, so the equating errors were equal to the analogous standard errors. The results are generally as expected. Note first that there was always a substantial main effect of the test length for all three indices. For a test length of 50 items, decreasing the size of the anchor increased the equating errors for the average item difficulties Se(b A − b B ) and the proficiency means Se(μ A −μ B ). The effect on Se(θ S A −θ S B ) was small. This pattern was sustained for a test length of 20 items, but in that case also Se(θ S A −θ S B ) increased slightly when the anchor was decreased from 10 to 5. Finally, there were no marked effects of varying the position of the cutoff points and the differences between the two proficiency distributions.
The second set of simulations was based on the design of panel c of Fig. 11.1, the design with common students. The general setup of the study was analogous to the first one, with some exceptions. All samples of students were drawn from standard normal distributions and the cutoff point was always equal to θ 0 = 0.0. There were three tests in the design: two tests to be equated and a test given to the linking group. As can be seen in the first column of Table 11.5, the tests to be equated had either 40 or 20 items. In the second column, it can be seen that the linking groups were either administered tests of 20, 10, or 4 items. These linking tests always comprised of an equal number of items from the two tests to be equated. The third column shows how the size of the sample of the linking group was varied. The two tests to be equated were always administered to 2000 students. In general, the results are much worse than those displayed in Table 11.4. In fact, only the combination of two tests of 40 items with a linking group of 1600 students administered a test of 20 items comes close to the results displayed in Table 11.4. Note that linking tests of 40 items with linking groups administered 4 items completely breaks down, especially the results for Se(θ S A −θ S B ) with 100, 400 or 800 students in the linking groups become extremely poor.

Conclusion
Transparency of scoring is one of the major requirements for the acceptance of an assessment by stakeholders such as students, teachers and parents. This is probably the reason why number-correct scores are still prominent in education. The logic of such scoring is evident: the higher the number of correct responses, the higher the student's proficiency. The alternative of using the proficiency estimates emanating from an IRT model as test scores is more complicated to explain. In some settings, such as in the setting of computerized adaptive testing, it can be made acceptable that students that respond to more difficult items get a higher proficiency estimate than students with an analogous score on more easy items. However, explaining the dependence of proficiency estimates on item-discrimination parameters is more cumbersome.
A potential solution to the problem is using the 1PLM model, where all items are assumed to have the same discrimination index, and the proficiency estimate only depends on the number of correct responses to the items. However, the 1PLM seldom fits educational test data and using the 1PLM to utilize all the advantages of IRT leads to notable loss of precision. Therefore, the 2PLM and 3PLM have become the standard models for analyzing educational test data. In this chapter, a method to combine number-correct scoring with the 2PLM and 3PLM was suggested and methods for relating standards on the number-correct scale to standards on the latent IRT scale were outlined. Indices for both the global and local reliability of numbercorrect scores were introduced. It was shown that the error variance for numbercorrect scoring can be decomposed into two components. The first component is the variance of the proficiency estimates given the response patterns conditional on number-correct scores. This component can be viewed as a measure for the bias introduced by using number-correct scores as estimates for proficiency rather than estimating the proficiency under the 2PLM or 3PLM based on a student's complete response pattern. The second component can be interpreted as the average error variance when using the number-correct score. The presented simulation studies indicate that, relative to the second component, the first component is small. When equating two tests, say an older version and a newer version, it is not only the standard error of the proficiency estimates on the two tests which is important, but also the standard error of differences between proficiency estimates on the two tests. To obtain a realistic estimate of the standard errors of these differences, the whole covariance matrix of the estimates of all item and population parameters in the model must be taken into account. The size of these standard errors depends on the strength of the link between the two tests, that is, on the number of items and students in the design and the sizes of the overlap between, respectively, items and students. The simulation studies presented in this chapter give an indication of the standard errors of these differences for various possible designs.
The procedure for number-correct scoring was presented in the framework of unidimensional IRT models for dichotomously scored items. It can be generalized in various directions. First of all, a sum score can also be defined for a test with polytomously scored items by adding the scores on the individual items in the test. These sum scores can then be related to a unidimensional IRT model for polytomously scored items such as the generalized partial credit model (Muraki 1992), the graded response model (Samejima 1969) or the sequential model (Tutz 1990) in an manner that is analogous to the procedure presented above. Also multidimensional versions of these models (Reckase 1985) present no fundamental problems: the proficiency distributions and response probabilities introduced above just become multivariate distributions in multivariate θ parameters. For the generalized definitions of reliabilities refer to van Lier et al. (2018).
A final remark concerns the statistical framework of this chapter, which was the related Bayes modal and marginal maximum likelihood framework. In the preliminaries section of this chapter, it was already mentioned that this framework has an alternative in the framework of fully Bayesian estimation supported by Markov chain Monte Carlo computational methods (Albert 1992;Johnson and Albert 1999). Besides with dedicated samplers, the IRT models discussed here can also be estimated using general purpose samplers such as Bugs (Lunn et al. 2009) and JAGS (Plummer 2003). But details of the generalizations to other models and another computational framework remain points for further study.