In recent years, a comprehensive body of research on computerized adaptive testing (CAT) has been built up. Kingsbury and Zara (1989) refined item selection with a method for content control, Sympson and Hetter (1985) dealt with item overexposure, Revuelta and Ponsoda (1998) addressed underexposure, and Eggen and Verschoor (2006) proposed a method for difficulty control.
Still, developing an operational CAT is an expensive and time-consuming task. Many items need to be developed and pretested with relatively high accuracy. Data collection from a large sample of test takers is complex, and it is crucial that these test items can be calibrated efficiently and economically (Stout et al. 2003). Item calibration is especially challenging when developing a CAT that targets multiple school grades (Tomasik et al. 2018), or when the target population is small and participation is low. In those cases, several rounds of pretesting may be necessary to arrive at accurate estimates of the item characteristics. Unfortunately, a threat to pretesting is that motivational effects can disturb the estimation process (e.g., Mittelhaëuser et al. 2015).
There are thus good reasons to look for alternative procedures that can diminish the burden of pretesting in digital testing, especially in CAT. In this chapter, strategies for on-line calibration that can replace pretesting entirely, or at least partly, are evaluated. Ideas for extending a calibrated item pool are not new. They range from extending an existing item pool with a limited number of new items – replenishment – to periodic calibration of all items and persons in real time: on-the-fly calibration. Until recently, the latter possibility was limited by computational power and infrastructure, but in view of the increased demand for CAT, on-the-fly calibration is drawing attention.
Replenishment Strategies and On-the-Fly Calibration
Replenishment strategies were developed to extend existing item pools or to replace outdated items. A precondition for replenishment strategies is thus that the majority of the items have been calibrated previously; only a relatively small portion of the items is pretested. Test takers usually take both operational and pretest items, carefully merged so that they cannot distinguish between the two types. The responses to the pretest items are often not used to estimate the ability of the test takers. This procedure is referred to as seeding. Usually, but not necessarily, responses to the seeding items are collected, and the item pool is calibrated off-line only once a certain minimum number of observations has been reached. In on-line methods, the seeding items are calibrated at regular intervals, using the provisional parameter estimates of those seeding items to optimize their assignment to the test takers.
Stocking (1988) was the first to investigate on-line calibration methods, computing a maximum likelihood estimate of a test taker’s ability from the responses to the operational items and their parameter estimates. Wainer and Mislevy (1990) described an on-line MML estimation procedure with one EM iteration (OEM) for calibrating seeding items. These methods can be applied to replenish the item pool, that is, to add a relatively small set of new items in order to expand the pool or to replace outdated items.
Makransky (2009) took a different approach and started from the assumption that no operational items are available yet. He investigated strategies for transitioning from a phase in which all items are selected randomly to phases in which items are selected by CAT item selection algorithms. Fink et al. (2018) took this approach a step further by postponing calibration until the end of an assessment period and delaying the reporting of ability estimates, which makes the method suitable for high-stakes testing in specialist target populations, such as university examinations.
Usually, however, an operational CAT project is not started in isolation. Frequently, items are reused from other assessments, or reporting must take place in relation to an existing scale. In those circumstances, data from these calibrated items and scales may be used as a starting point for collecting data for the CAT under consideration. With this approach, it is interesting to investigate how many, and which type of, calibrated items are required to serve as reference items. Furthermore, we assume a situation in which instant parameter updating and reporting is a crucial element, so that the time needed for calibration must remain within a pre-specified limit.
On-the-Fly Calibration Methods
In this chapter we concentrate on the 1PL or Rasch model (Rasch 1960), for which the probability that a test taker with ability parameter \(\theta \) gives a correct answer to an item with difficulty parameter \(\beta \) is given by
$$\begin{aligned} P(X=1|\theta ) = \frac{e^{\theta - \beta }}{1 + e^{\theta - \beta }}. \end{aligned}$$
(16.1)
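For concreteness, Equation (16.1) translates directly into code; the following minimal sketch (the function name is ours) computes this response probability:

```python
import math

def rasch_prob(theta: float, beta: float) -> float:
    """Probability of a correct response under the Rasch model, Eq. (16.1)."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))

# Example: rasch_prob(0.5, -0.2) is approximately 0.668.
```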
For item calibration, we consider three methods:
- a rating scheme by Elo (1978), adopted by the USCF in 1960 and by FIDE in 1970;
- the joint maximum likelihood (JML) procedure described by Birnbaum (1968);
- the marginal maximum likelihood (MML) method proposed by Bock and Aitkin (1981).
Elo Rating
According to the Elo rating system, player A with rating \(r_{A}\) has an expected score of
$$\begin{aligned} E_{AB} = \frac{10^{(r_{A} - r_{B})/400}}{1 + 10^{(r_{A} - r_{B})/400}} \end{aligned}$$
(16.2)
against player B with rating \(r_{B}\) in a chess game. After the game has been played, the ratings of both players are updated according to
$$\begin{aligned} r_{A}^{'} = r_{A} + K (S - E_{AB}) \end{aligned}$$
(16.3)
and
$$\begin{aligned} r_{B}^{'} = r_{B} - K (S - E_{AB}), \end{aligned}$$
(16.4)
where S is the observed outcome of the game from player A’s perspective (1 for a win, 0.5 for a draw, and 0 for a loss) and K is a scaling factor. From Equations (16.2)–(16.4) it can be seen that Elo updates can be regarded as a calibration under the Rasch model, albeit not one based on maximization of the likelihood function. Several variations exist, especially since K has been only loosely defined as a function decreasing in the number of observations. Brinkhuis and Maris (2009) have shown, however, that conditions exist under which the parameters assume a stationary distribution.
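A minimal sketch of a single Elo update follows (function names are ours; in a calibration context, ‘player B’ plays the role of an item and S the scored response). Note that the base-10, divide-by-400 scaling differs from the logit scale of Equation (16.1) only by a constant factor:

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score of player A against player B, Eq. (16.2)."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a: float, r_b: float, s: float, k: float) -> tuple:
    """Update both ratings after one game, Eqs. (16.3) and (16.4).

    s is the outcome from A's perspective (1, 0.5, or 0); k is passed in
    explicitly because the text leaves its exact form open.
    """
    e_ab = elo_expected(r_a, r_b)
    return r_a + k * (s - e_ab), r_b - k * (s - e_ab)
```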
It is clear that the Elo rating scheme is computationally very simple and fast, and therefore ideally suited for instantaneous on-line calibration. However, little is known about the statistical properties of the method, such as its rate of convergence, or even whether it can recover the parameters with acceptable accuracy at all. The Elo rating method is widely used in situations in which the parameters may change rapidly during the collection of responses, such as in sports and games. It has been applied by Oefenweb (2009) in an educational setting in which students exercise frequently within a gaming environment.
JML
JML maximizes the likelihood L of a data matrix in which each of N test takers j has responded to \(n_{j}\) items with item scores \(x_{ij}\). With \(P_{ij}\) denoting the probability of a correct response of test taker j on item i according to Equation (16.1), and \(Q_{ij}= 1 - P_{ij}\) the probability of an incorrect response, the likelihood function can be formulated as
$$\begin{aligned} L = \prod _{j=1}^{N}\prod _{i=1}^{n_{j}}P_{ij}^{x_{ij}}Q_{ij}^{1-x_{ij}}. \end{aligned}$$
(16.5)
By maximizing the likelihood with respect to a single parameter while fixing all other parameters, and cycling through all parameters in turn, the likelihood is expected to reach its global maximum. The maximization follows a Newton–Raphson procedure that takes the form
$$\begin{aligned} \beta _{i}^{t+1} = \beta _{i}^{t} + \frac{\sum _{j=1}^{N}(P_{ij}-x_{ij})}{\sum _{j=1}^{N}P_{ij}Q_{ij}} \end{aligned}$$
(16.6)
for updating item parameter \(\beta _{i}\), and
$$\begin{aligned} \theta _{j}^{t+1} = \theta _{j}^{t} - \frac{\sum _{i=1}^{n_{j}}(P_{ij} - x_{ij})}{\sum _{i=1}^{n_{j}}P_{ij}Q_{ij}} \end{aligned}$$
(16.7)
for updating person parameter \(\theta _{j}\).
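The following sketch performs one sweep of these updates over a response matrix (all names are ours; for clarity it updates all persons and items from the same provisional values, whereas an actual implementation would alternate and iterate until convergence):

```python
import numpy as np

def jml_sweep(x: np.ndarray, theta: np.ndarray, beta: np.ndarray):
    """One Newton-Raphson sweep over Eqs. (16.6) and (16.7).

    x is an N x n response matrix with np.nan marking unobserved
    person-item combinations. A bare sketch: perfect and zero scores,
    unanswered items, and scale identification are not handled.
    """
    obs = ~np.isnan(x)
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - beta[None, :])))
    resid = np.where(obs, p - np.nan_to_num(x), 0.0)  # P_ij - x_ij
    info = np.where(obs, p * (1.0 - p), 0.0)          # P_ij * Q_ij
    beta_new = beta + resid.sum(axis=0) / info.sum(axis=0)    # Eq. (16.6)
    theta_new = theta - resid.sum(axis=1) / info.sum(axis=1)  # Eq. (16.7)
    return theta_new, beta_new
```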
Like the Elo rating, JML is a simple and fast-to-compute calibration method. However, there are a few downsides. In the JML method, the item parameters are structural parameters: their number remains constant as more observations are acquired. The person parameters, on the other hand, are incidental parameters, whose number increases with the sample size. Neyman and Scott (1948) showed in their paradox that, when structural and incidental parameters are estimated simultaneously, the estimates of the structural parameters need not be consistent as the sample size increases. This implies that even when the number of observations per item grows to infinity, the item parameter estimates are not guaranteed to converge to their true values. In practical situations, JML might still be useful, as the effects of non-convergence may become apparent only with extreme numbers of observations per item. The biggest issue with the use of JML, however, is a violation of the assumption regarding the ignorability of missing data: Eggen (2000) has shown that under a regime of item selection based on ability estimates, a systematic error builds up, and that only for MML does violation of the ignorability assumption have no effect.
MML
MML is based on the assumption that the test takers form a random sample from a population whose ability is distributed according to a density function \(g(\theta |\tau )\) with parameter vector \(\tau \). The essence of MML is integration over the ability distribution, while the sample of test takers is used to estimate the distribution parameters. The likelihood function
$$\begin{aligned} L=\prod _{j=1}^{N}\int P(x_{j}|\theta _{j},\xi )g(\theta _{j}|\tau )d\theta _{j} \end{aligned}$$
(16.8)
is maximized using the expectation-maximization (EM) algorithm, where \(\xi \) denotes the vector of item parameters. EM is an iterative algorithm for maximizing a likelihood function for models with unobserved random variables. Each iteration consists of an E-step, in which the expectation of the unobserved data for the entire population is calculated, and an M-step, in which the parameters that maximize the likelihood for this expectation are estimated. The main advantage of the MML calibration method is that violation of the ignorability assumption does not lead to biased estimators. A disadvantage, however, is that MML is usually more time-consuming than the Elo rating or JML, and thus it might be too slow in situations where an instantaneous update is needed.
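To make the E- and M-steps concrete, the following bare-bones sketch implements a Bock–Aitkin EM for Rasch item difficulties. For brevity it fixes \(g(\theta |\tau )\) to a standard normal instead of re-estimating \(\tau \), and it omits the underflow guards a production routine would need; all names are ours:

```python
import numpy as np

def mml_em_rasch(x: np.ndarray, n_iter: int = 50, n_quad: int = 21) -> np.ndarray:
    """Bock-Aitkin EM for Rasch difficulties; x is a complete N x n 0/1 matrix."""
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_quad)
    weights = weights / weights.sum()  # discrete approximation of N(0, 1)
    beta = np.zeros(x.shape[1])
    for _ in range(n_iter):
        # E-step: posterior weight of each quadrature node for each person.
        p = 1.0 / (1.0 + np.exp(-(nodes[:, None] - beta[None, :])))  # Q x n
        loglik = x @ np.log(p).T + (1.0 - x) @ np.log(1.0 - p).T     # N x Q
        post = np.exp(loglik) * weights
        post /= post.sum(axis=1, keepdims=True)
        n_q = post.sum(axis=0)   # expected number of persons at each node
        r_qi = post.T @ x        # expected number correct per node and item
        # M-step: one Newton step per item on the expected complete-data likelihood.
        grad = (n_q[:, None] * p - r_qi).sum(axis=0)
        hess = (n_q[:, None] * p * (1.0 - p)).sum(axis=0)
        beta += grad / hess
    return beta
```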
In order to replenish the item pool, Ban et al. (2001) proposed the MEM algorithm, in which the parameters of the population ability distribution are estimated in the first iteration of the EM algorithm from the operational items only. However, if the number of operational items is very small, this procedure results in highly inaccurate estimates. For the problem at hand, where few operational items are available, the MEM method is not applicable, and regular MML might be considered instead.
The Use of Reference Items in Modelling Bias
Although bias cannot be avoided when calibrating CAT data with JML, previously collected responses for a part of the items can be used to model this bias. We assume that the bias is linear, i.e., that a linear transformation \(\hat{\beta }_{i}^{'} = a \hat{\beta }_{i} + b\) yields estimators \(\hat{\beta }_{i}^{'}\) in which the bias is eliminated. When JML is used to calibrate all items in the pool, the reference items, whose previously estimated parameters we trust, can be used to estimate the transformation coefficients a and b. Let \(\mu _{r}\) and \(\sigma _{r}\) be the mean and standard deviation of the previously estimated parameters of the reference items, and let \(\mu _{c}\) and \(\sigma _{c}\) be the mean and standard deviation of the estimates for the same items taken from the current JML calibration; then the transformation coefficients are \(a = \frac{\sigma _{r}}{\sigma _{c}}\) and \(b=\mu _{r} - a\mu _{c}\), so that the transformed estimates of the reference items match their trusted values in mean and standard deviation. All reference items retain their trusted parameter values, while all new items in the calibration are updated to their transformed estimates.
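A sketch of this mean–sigma style rescaling (function and variable names are ours):

```python
import numpy as np

def rescale_to_reference(ref_trusted, ref_current, new_current):
    """Map current JML estimates onto the trusted reference scale.

    ref_trusted: previously calibrated difficulties of the reference items;
    ref_current: current JML estimates of those same items;
    new_current: current JML estimates of the new items to be transformed.
    """
    mu_r, sd_r = np.mean(ref_trusted), np.std(ref_trusted)
    mu_c, sd_c = np.mean(ref_current), np.std(ref_current)
    a = sd_r / sd_c
    b = mu_r - a * mu_c
    return a * np.asarray(new_current) + b
```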
Even though MML calibrations do not incur bias when applied to CAT data, it is technically possible to follow a similar procedure, although this is not expected to result in substantially improved parameter estimates.
The Need for Underexposure Control
A solution must be found for a complication that generally occurs only in the very first phases, when few observations are available. For some response patterns, the associated JML and MML estimates assume extreme values; in the case of perfect and zero scores, the estimates are even plus or minus infinity. At the same time, items are usually selected by maximizing the Fisher information. This means that once an item has assumed an extreme parameter value, it will only be selected for test takers whose ability estimate is likewise extreme, and thus such items tend to be selected very rarely. If the parameter estimate is based on many observations, this is fully justified. But if this situation arises when only a few observations have been collected, these items are effectively excluded from the item pool without due cause.
In previous studies, this situation was prevented by defining various phases in an operational CAT project, starting with either linear test forms (Fink et al. 2018) or random item selection from the pool (Makransky 2009). When the number of observations allowed for a more adaptive approach, a manual decision to transition to the next phase was taken. Since we assume a situation in which decisions are taken on the fly and implemented instantaneously, manual decisions are to be avoided. Therefore, a rigid system to prevent underexposure must be in place, so that items with extreme estimates but very few observations continue to be selected frequently. In line with the findings of Veldkamp et al. (2010), we propose an underexposure control scheme based on the eligibility function
$$\begin{aligned} f(n_{i}) = \left\{ \begin{array}{ll} X - n_{i}(X-1) / M, \qquad &{} n_{i} < M \\ 1, &{} \text {else} \end{array} \right. \end{aligned}$$
(16.9)
where \(n_{i}\) is the number of observations for item i, M is the maximum number of observations for which the underexposure control should be active, and X is the advantage that an item without any observations receives over an item with M or more observations. Effectively, this means that there is no transition between phases for the entire item pool; instead, each transition takes place at the level of the individual items.
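The eligibility function is a one-liner; the values of M and X in the example below are illustrative settings, not values prescribed by the text. One plausible use, also an assumption on our part, is to multiply each candidate item’s Fisher information by \(f(n_{i})\) before selecting the maximum:

```python
def eligibility(n_i: int, m: int, x: float) -> float:
    """Eligibility value f(n_i) of Eq. (16.9): decreases linearly from x to 1."""
    return x - n_i * (x - 1.0) / m if n_i < m else 1.0

# Example with illustrative settings M = 300 and X = 5:
# an unseen item gets weight eligibility(0, 300, 5.0) == 5.0,
# a well-observed item gets eligibility(300, 300, 5.0) == 1.0.
```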
Overexposure control may further improve the estimation procedure. However, items with only a few observations impede the interplay between the estimation procedure and item selection more than the loss of efficiency caused by overexposure does, so overexposure control is out of scope for this study.
A Combination of Calibration Methods
Although JML calibration would meet the operational requirements for running a CAT in terms of computational performance, its bias and its inability to estimate parameters for perfect or zero scores make it unsuitable, on its own, as a strategy for on-the-fly calibration. To overcome these issues, we propose a combination of Elo rating and JML. The Elo rating is used at the very beginning of the administration to ensure that the estimation procedure converges for all answer patterns, after which JML takes over. Modelling the bias in JML with the help of a relatively small but representative set of calibrated items is proposed to eliminate that bias. Although MML estimation is not known to give biased estimates, it may turn out that this method also benefits from modelling the bias and from starting with an Elo rating scheme.
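Read as a dispatch rule per update, the combined strategy might look as follows. The switching threshold and the exact switching criterion are our illustrative assumptions; the text only requires that Elo is used until JML is safe for all answer patterns:

```python
def update_method(n_obs: int, perfect_or_zero_score: bool, min_obs_for_jml: int = 100) -> str:
    """Choose the calibration update for one item or person.

    min_obs_for_jml is a hypothetical threshold; perfect_or_zero_score
    flags response patterns for which JML estimates would be infinite.
    """
    if perfect_or_zero_score or n_obs < min_obs_for_jml:
        return "elo"  # cheap and always defined, used early on
    return "jml"      # afterwards, rescaled to the reference scale via a and b
```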
In addition to the use of well-calibrated reference items, underexposure control is needed to ensure that an acceptable minimum number of observations is collected for all items, so that calibration remains feasible.