1. Introduction

Unidimensional adaptive testing operates under a response model with a scalar ability parameter. Research on this type of adaptive testing has been ample. Among the topics that have been examined are the statistical aspects of ability estimation and item selection, item selection with large sets of content constraints on the test, randomized control of item exposure, removal of differential speededness, and detection of aberrant response behavior. For reviews of this research, see Chang (2004), van der Linden (2005, Chap. 9), van der Linden and Glas (2000, 2007), and Wainer (2000). Due to this research, methods of unidimensional adaptive testing are well developed, and testing organizations are now fully able to control their implementation in their testing programs.

Multidimensional item response theory (IRT) has developed gradually since its inception (e.g., McDonald, 1967, 1997; Reckase, 1985, 1997; Samejima, 1974). Its statistical tractability has improved considerably of late, and it is now possible to use several of its models for operational testing. Although multidimensional response models have traditionally been considered a fallback for applications in which unidimensional models do not show a satisfactory fit to the response data, their use has recently been motivated more positively by a renewed interest in performance-based testing and testing for diagnosis. Performance-based items typically require a range of more practical abilities. In testing for diagnosis, the goal is to extract as much information as possible on the multiple abilities required to solve the test items (e.g., Boughton, Yao, & Lewis, 2006; Yao & Boughton, 2007). Several admission and certification boards are in the process of enhancing their regular high-stakes tests with web-based diagnostic services that allow candidates to log on and obtain more informative diagnostic profiles of their abilities. The adaptive testing format is particularly useful for this application because it is low stakes and therefore unlikely to suffer from the security threats typical of admission and certification tests.

The first to address multidimensional adaptive testing (MAT) were Bloxom and Vale (1987), who generalized Owen’s (1969, 1975) approximate procedure of Bayesian item selection to the multidimensional case. Their research did not resonate immediately with others. The only later research on multidimensional adaptive testing known to the authors is reported in Fan and Hsu (1996), Luecht (1996), Segall (1996, 2000), van der Linden (1996, 1999, 2005, Chap. 9), and Veldkamp and van der Linden (2002). Luecht and Segall based item selection for MAT on the determinant of either the information matrix evaluated at the vector of current ability estimates or the posterior covariance matrix of the abilities. In van der Linden (1999), the trace of the (asymptotic) covariance matrix of the MLEs of the abilities was minimized, and the option of weighting the individual variances to control for the relative importance of the abilities was explored. The possibilities of imposing extensive sets of constraints on the item selection to deal with the content specifications of the test were examined in van der Linden (2005, Chap. 9) and Veldkamp and van der Linden (2002). The former used a criterion of minimum weighted variances for item selection; the latter used the posterior expectation of a multivariate version of the Kullback-Leibler information.

Use of the determinant or trace of an information matrix or a covariance matrix as a criterion of optimality in statistical inference is standard practice in the optimal design literature, where these criteria are known as D-optimality and A-optimality, respectively (e.g., Silvey, 1980, p. 10). In this more general area of statistics, such criteria are used to map inferences with respect to the unknown parameters in a multiple-parameter problem onto a single dimension. Berger and Wong (2005) describe a variety of areas, such as medical research and educational testing, in which optimal design studies have proven to be useful. The Fisher information matrix plays a central role in these applications because it measures the information about the unknown parameters in the observations. In educational testing, for instance, the information matrix associated with a test can be optimized using the criterion of D-optimality to select the set of items from a bank with the smallest generalized variance of the ability estimators for a population of examinees. These items yield the smallest confidence region for the ability parameters. Using A-optimality instead of D-optimality yields a different selection of items because the former focuses only on the variances of the ability estimators.

The answer to the question of which criterion is best is directly related to the goal of testing. As described more extensively later in this paper, different goals of MAT can be distinguished. For example, a test may be designed to measure each of its abilities accurately. But we may also be interested in only a subset of the abilities and want to ignore the others. Examples of the second goal are analytic abilities in a test whose primary goal is to measure reading comprehension, or mathematical ability in a test of knowledge of physics that appears to be sensitive to it. In more statistical parlance, the pertinent distinction between the two cases is that between intentional and nuisance parameters. A special case of MAT with intentional ability parameters arises when the test scores have to be optimized with respect to a linear combination of the abilities. This may happen, for instance, when the practice of having single test scores summarize the performances on a familiar scale was established long before an IRT model was introduced in the testing program. If the item domain requires a multidimensional model, it then makes sense to optimize the test scores with respect to a linear combination of the abilities in the model, with a choice of weights based on an explicit policy, rather than to fit a unidimensional model and accept a less than satisfactory fit to the response data.

For each of these cases, a different optimal design criterion for item selection in MAT seems most appropriate. For example, as shown later in this paper, when the goal is to estimate an intentional subset of the abilities, application of the criterion of D s -optimality (Silvey, 1980, p. 11) to the Fisher information matrix seems to lead to the best item selection. The motivation for this research was to find such matches between the different cases of MAT and the performance of optimal design criteria. In addition to the D- and A-optimality criteria, we included a few other criteria from the optimal design literature that are less well known but have some intuitive attractiveness for application in adaptive testing.

Another goal of this research was to investigate the preferences of the optimality criteria for items in the pool with specific patterns of parameter values. The results should help to answer such questions as: Will the criterion for selection in a MAT program with nuisance abilities select only items that are informative about the intentional abilities? Or are there circumstances in which it also selects items that are mainly sensitive to a nuisance ability? Understanding the preferences of item-selection criteria for different patterns of parameter values is important for the assembly of optimal item pools for the different cases of MAT when there exists a choice of items. In principle, such information could help us prevent overexposure and underexposure of the items in the pool and reduce the need for more conventional measures of item-exposure control (Sympson & Hetter, 1985; van der Linden & Veldkamp, 2007).

Finally, we report some features of the Fisher information matrix and its use in adaptive testing that have hitherto hardly been noticed, and we illustrate the use of the criteria empirically using simulated response data.

2. Response Model

The response model used in this paper is the multidimensional 3-parameter logistic (3PL) model for dichotomously scored responses. The model gives the probability of a correct response to item i by an examinee with p-dimensional ability vector θ = (θ 1, …, θ p ) as

$$P_i(\theta) \equiv P(U_i = 1\mid\theta,\mathbf{a}_i,b_i,c_i) \equiv c_i + \frac{1-c_i}{1+\exp(-\mathbf{a}_i\cdot\theta + b_i)}, \tag{1}$$

where a i is a vector with the item-discrimination parameters corresponding to the abilities in θ, b i is a scalar representing the difficulty of item i, and c i is known as the guessing parameter of item i (i.e., the height of the lower asymptote of the response function). Note that b i is not a difficulty parameter in the same sense as in a unidimensional IRT model; in the current parameterization, it is a function of both the difficulties and the discriminating power of the item along each of the ability dimensions. Further, note that, due to the rotational indeterminacy of the θ-space, the components of θ do not automatically represent the desired psychological constructs. However, such issues are dealt with when the item pool is calibrated, and we can assume that a meaningful orientation of the ability space has been chosen. Finally, note that (1) is just a model for the probability of a correct answer by a fixed test taker. In particular, it is not used as part of a hierarchical model in which θ is a vector of random effects. Therefore, the model should not be taken to imply anything with respect to a possible correlation structure between the abilities in some population of test takers; for instance, it does not force us to decide between what are known as orthogonal and oblique factor structures in factor analysis.

The vector of discrimination parameters, a i , can be interpreted as indicating the relative importance of each ability for answering the item correctly. As is often done, we assume that the item parameters have been estimated with enough precision to treat them as known. The probability of an incorrect response will be denoted as Q i (θ) = 1 − P i (θ). Because the model is not yet identifiable, additional restrictions are necessary to fix the scale, origin, and orientation of θ. In practical applications of IRT models, testing organizations maintain a standard parameterization through the use of parameter-linking techniques that carry the restrictions from the calibration of one generation of test items to the next. For more details about the model, see Reckase (1985, 1997) and Samejima (1974). Other multidimensional response models are available in the literature, but the model in (1) is a direct generalization of the most popular unidimensional logistic model in the testing industry. In addition, its choice allows us to compare our results with those in such a key reference on item selection in MAT as Segall (1996).
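Since the rest of the paper builds on (1), a minimal numerical sketch may help. This is our own illustration in Python/NumPy (the function name and parameter values are illustrative, not from the paper):

```python
import numpy as np

def prob_correct(theta, a, b, c):
    """Multidimensional 3PL response probability, Eq. (1):
    c + (1 - c) / (1 + exp(-(a . theta) + b))."""
    theta, a = np.asarray(theta, float), np.asarray(a, float)
    return c + (1.0 - c) / (1.0 + np.exp(-(a @ theta) + b))

# Two ability vectors with the same value of a . theta give the same
# probability -- the invariance exploited in Section 3.1 below.
a, b, c = np.array([1.0, 0.3]), 0.0, 0.0
print(prob_correct([0.3, 0.0], a, b, c))  # ~0.574
print(prob_correct([0.0, 1.0], a, b, c))  # ~0.574 (a . theta = 0.3 in both cases)
```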

The following additional notation will be used throughout this article:

N: size of the item pool;

n: length of the adaptive test;

l = 1,…, p: components of ability vector θ;

i = 1,…, N: items in the pool;

k = 1,…, n: items in the test;

i k : item in the pool administered as the kth item in the test;

S k−1 : set of the first k − 1 administered items;

R k : {1, …, N} \ S k−1 , i.e., the set of remaining items in the pool.

For a vector u k−1 of responses on the first k − 1 items, the maximum likelihood estimate (MLE) of the ability, denoted by \(\hat\theta^{k-1}\), is defined as

$$\hat\theta^{k-1} \equiv \arg\max_{\theta} f(\mathbf{u}_{k-1}\mid\theta), \tag{2}$$

where

$$f(\mathbf{u}_{k-1}\mid\theta) = \prod_{j=1}^{k-1} P_{i_j}^{u_{i_j}}(\theta)\, Q_{i_j}^{1-u_{i_j}}(\theta) \tag{3}$$

is the likelihood function, with the item responses modeled as conditionally independent given θ. The MLE can be found by setting the derivative of the logarithm of (3) equal to zero and solving the system for θ using a numerical method such as Newton-Raphson (e.g., Segall, 1996) or an EM algorithm (Tanner, 1993, Chap. 4). The likelihood function may not have a maximum (e.g., when only correct or only incorrect item responses have been observed), or a local instead of a global maximum may be found. Such problems are rare for adaptive tests of typical length, though.
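As an illustration of this estimation step, here is a small Fisher-scoring sketch (a common variant of Newton-Raphson that uses the expected instead of the observed Hessian). It is our own hedged example: it assumes known item parameters and includes none of the safeguards against nonexistent maxima mentioned above.

```python
import numpy as np

def prob_correct(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + np.exp(-(a @ theta) + b))

def mle(u, A, B, C, n_iter=50, tol=1e-8):
    """Fisher scoring for the MLE in (2)-(3).
    u: 0/1 responses; A: (k-1, p) array of discrimination vectors;
    B, C: difficulty and guessing parameters."""
    theta = np.zeros(A.shape[1])
    for _ in range(n_iter):
        score = np.zeros_like(theta)
        info = np.zeros((len(theta), len(theta)))
        for u_j, a, b, c in zip(u, A, B, C):
            P = prob_correct(theta, a, b, c)
            # Gradient of the log-likelihood contribution of one item:
            score += (u_j - P) * (P - c) / (P * (1.0 - c)) * a
            # Expected (Fisher) information of the item; see Eq. (4) below:
            info += (1.0 - P) * (P - c) ** 2 / (P * (1.0 - c) ** 2) * np.outer(a, a)
        step = np.linalg.solve(info, score)
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta

# Example with five administered items and mixed responses:
A = np.array([[1.0, 0.2], [0.2, 1.0], [0.7, 0.7], [1.0, 0.4], [0.3, 0.9]])
B = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
C = np.zeros(5)
print(mle(np.array([1, 1, 0, 1, 0]), A, B, C))
```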

3. Fisher Information

The Fisher information matrix is a convenient measure of the information in the observable response variables about the vector of ability parameters θ. For item i, the matrix is defined as

$$\mathrm{I}_i(\theta) \equiv -E\left[\frac{\partial^2}{\partial\theta\,\partial\theta^T}\log f(U_i\mid\theta)\right] = \frac{Q_i(\theta)\left[P_i(\theta)-c_i\right]^2}{P_i(\theta)(1-c_i)^2}\,\mathbf{a}_i\mathbf{a}_i^T, \tag{4}$$

with \(\mathbf{a}_i^T\) the transpose of the (column) vector of discrimination parameters. This expression reveals some interesting features of the item information matrix:

  • The item information matrix depends on the ability parameters only through the response function P i (θ).

  • The matrix has rank one.

  • Each element in the matrix has a common factor, which will be denoted as

$$g(\theta;\mathbf{a}_i,b_i,c_i) = \frac{Q_i(\theta)\left[P_i(\theta)-c_i\right]^2}{P_i(\theta)(1-c_i)^2}. \tag{5}$$

    This function of θ will be discussed in the following section.

  • The sum of the elements of the matrix is equal to

    $$g(\theta ;{{\text{a}}_i},{b_i},{c_i}){\left( {\sum\limits_{l = 1}^p {{a_{il}}} } \right)^2}$$

    This equality shows the important role played by the sum of the discrimination parameters in the total amount of information in the response to an item.

The information matrix of a set of S items is equal to the sum of the item information matrices, i.e.,

$$\mathrm{I}_S(\theta) = \sum_{i\in S} \mathrm{I}_i(\theta). \tag{6}$$

The additivity follows from the conditional independence of the responses given θ already used in (3). Although the item information matrix I i (θ) of each item in S has rank one, the rank of I S (θ) is equal to p, provided the vectors of discrimination parameters of the items in S span all p dimensions (if, for instance, all of them are proportional to each other, the rank remains one).
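A short numerical check of the additivity and rank properties just described (our own sketch; the item values are arbitrary):

```python
import numpy as np

def item_info(theta, a, b, c):
    """Item information matrix, Eqs. (4)/(9): g(theta; a, b, c) * a a^T."""
    P = c + (1.0 - c) / (1.0 + np.exp(-(a @ theta) + b))
    g = (1.0 - P) * (P - c) ** 2 / (P * (1.0 - c) ** 2)
    return g * np.outer(a, a)

theta = np.zeros(2)
items = [(np.array([1.0, 0.3]), 0.0, 0.0),
         (np.array([0.2, 0.9]), 0.5, 0.1)]
I_S = sum(item_info(theta, a, b, c) for a, b, c in items)  # additivity, Eq. (6)

print([np.linalg.matrix_rank(item_info(theta, a, b, c)) for a, b, c in items])  # [1, 1]
print(np.linalg.matrix_rank(I_S))  # 2: the two a-vectors are linearly independent
```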

The use of the information matrix is mainly motivated by the large-sample behavior of the MLE of θ, which is known to be distributed asymptotically as

$$\hat\theta \sim N\left(\theta_0, \mathrm{I}_S^{-1}(\theta_0)\right), \tag{7}$$

with θ 0 the true ability vector and \(\mathrm{I}_S^{-1}(\theta_0)\) the inverse of the information matrix evaluated at θ 0 . More generally, it holds for the covariance matrix Σ(θ 0 ) of any unbiased estimator of θ 0 that Σ(θ 0 ) − \(\mathrm{I}_S^{-1}(\theta_0)\) is positive semi-definite. The inverse information matrix can thus be considered the multivariate generalization of the Cramér-Rao lower bound on the variance of estimators (Lehmann, 1999, Section 7.6).

In test theory, it is customary to consider I S (θ) and I i (θ) as functions of θ and refer to them as the test and item information matrix, respectively. By substituting the estimate \(\hat\theta\) for θ in (6), an estimate of these matrices is obtained. When evaluating the selection of the kth item in the adaptive test using (6), the amount of information about θ can be expressed as the sum of the test information matrix for the k − 1 items already administered and the item information matrix for candidate item i k ,

$$\mathrm{I}_{S_{k-1}}\left(\hat\theta^{k-1}\right) + \mathrm{I}_{i_k}\left(\hat\theta^{k-1}\right). \tag{8}$$

Criteria of optimal item selection should thus be applied to (8). For example, Segall (1996) proposed to select the item that maximizes the determinant of (8). This candidate gives the largest decrement in the volume of the confidence ellipsoid of the MLE \(\hat\theta^{k-1}\) after k − 1 observed responses. As already noted, maximizing the determinant of an information matrix is known as D-optimality in the optimal design literature. Before dealing with such criteria in more detail, we take a closer look at the item information matrix.

3.1. Item Information Matrix in Multidimensional IRT

The item information matrix in (4) can be written as

$$\mathrm{I}_i(\theta) = g(\theta;\mathbf{a}_i,b_i,c_i)\,\mathbf{a}_i\mathbf{a}_i^T, \tag{9}$$

with the function g given in (5). Thus, the information matrix consists of two factors: (i) the function g and (ii) the matrix \(\mathbf{a}_i\mathbf{a}_i^T\) with elements based on the discrimination parameters.

The focus of the next sections is on the comparison of different optimality criteria for item selection in different cases of MAT. Each of these criteria maps the item information matrix onto a one-dimensional scale. Because g(θ; a i , b i , c i ) is a common factor in all elements of the information matrix, the criteria basically differ in how they deal with the second factor \(\mathbf{a}_i\mathbf{a}_i^T\). On the other hand, g is a function of θ, and it is instructive to analyze its shape, which is done in this section. As an example, Figure 1 plots g for an item with parameters a i = (1, 0.3), b i = 0, and c i = 0 over a two-dimensional ability space.

Figure 1. Surface of g(θ; a, b, c) with a = (1, 0.3), b = 0, and c = 0. (Note: \(\tilde g\) is a cross-section of g perpendicular to a 1 θ 1 + a 2 θ 2 = 0.)

Observe that g depends on θ only through the response function P i (θ) in (1). Because P i (θ) is constant whenever θ · a i is, the same applies to g. For example, for the item displayed in Figure 1, the values of g do not depend on the abilities as long as θ 1 + 0.3θ 2 is constant. This feature can be used to reparameterize g(θ; a i , b i , c i ) into a one-dimensional function \(\tilde g(\theta;\mathbf{a}_i,b_i,c_i)\) with a new (scalar) θ measured along the direction perpendicular to the line θ · a i = 0. As shown in Figure 1, \(\tilde g\) is just a cross-section of g.

The new function \(\tilde g\) is obtained by substituting

$$\theta = (\theta_1,\ldots,\theta_p) = \left(\frac{a_{i1}}{\|\mathbf{a}_i\|}\,\theta,\;\ldots,\;\frac{a_{ip}}{\|\mathbf{a}_i\|}\,\theta\right)$$

into the response function in (1), which results in

$$\tilde P_i(\theta) = c_i + \frac{1-c_i}{1+\exp(-\|\mathbf{a}_i\|\,\theta + b_i)}, \tag{10}$$

where \(\|\mathbf{a}_i\| = \sqrt{a_{i1}^2 + \cdots + a_{ip}^2}\) is the Euclidean norm of a i . Thus, the reparameterization leads to a new unidimensional response model that differs from the regular unidimensional 3PL model only in the replacement of its discrimination parameter by the Euclidean norm of the vector of discrimination parameters from the multidimensional model.
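The equivalence between the multidimensional response surface and its unidimensional cross-section is easy to verify numerically; the following is our own sketch (parameter values are arbitrary):

```python
import numpy as np

a, b, c = np.array([1.0, 0.3]), 0.0, 0.0
norm_a = np.linalg.norm(a)

def P_multi(theta_vec):  # response function in (1)
    return c + (1.0 - c) / (1.0 + np.exp(-(a @ theta_vec) + b))

def P_tilde(t):          # unidimensional model with slope ||a||, Eq. (10)
    return c + (1.0 - c) / (1.0 + np.exp(-norm_a * t + b))

# Moving along the direction of a with scalar coordinate t reproduces
# the unidimensional model exactly:
for t in (-1.0, 0.0, 0.8):
    assert np.isclose(P_multi((a / norm_a) * t), P_tilde(t))
```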

By definition, the maximizer of \(\tilde g\), denoted here as θ max , is the ability value at which the item provides the most information. It can be determined by solving \(\frac{\partial}{\partial\theta}\tilde g(\theta;\mathbf{a}_i,b_i,c_i) = 0\) for θ. The result is

$$\theta_{\max} = \begin{cases}\dfrac{b_i - \log\left(\dfrac{-1+\sqrt{1+8c_i}}{4c_i}\right)}{\|\mathbf{a}_i\|} & \text{for } c_i > 0,\\[2ex] \dfrac{b_i}{\|\mathbf{a}_i\|} & \text{for } c_i = 0. \end{cases} \tag{11}$$

In addition,

$$\tilde g(\theta_{\max};\mathbf{a}_i,b_i,c_i) = \begin{cases}\dfrac{16c_i(1-c_i)\left(-1+\sqrt{1+8c_i}\right)}{\left(3+\sqrt{1+8c_i}\right)\left(4c_i-1+\sqrt{1+8c_i}\right)^2} & \text{for } c_i > 0,\\[2ex] 0.25 & \text{for } c_i = 0. \end{cases} \tag{12}$$

These results enable us to use the intuitions developed for the unidimensional 3PL model for its current multidimensional generalization. First, from (11), it follows that θ max increases with the guessing parameter of the item. Hence, an item with a higher probability of a correct guess is most informative for more able test takers. Second, the difficulty parameter serves as a location parameter for the item in the direction perpendicular to θ · a i = 0, scaled by the Euclidean norm of the discrimination parameters of the item. Third, (12) shows that the maximum value of \(\tilde g(\theta;\mathbf{a}_i,b_i,c_i)\), and hence also of g(θ; a i , b i , c i ), depends only on the guessing parameter. Fourth, \(\tilde g(\theta_{\max};\mathbf{a}_i,b_i,c_i)\) decreases with increasing c i ; this can be shown by calculating its derivative with respect to c i , which is omitted here. Consequently, for a given guessing parameter, the maximum values of the elements in the item information matrix depend on the discrimination parameters only through the matrix \(\mathbf{a}_i\mathbf{a}_i^T\). This conclusion reconfirms the critical role of this matrix as the second factor of (9).
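A quick numerical check of (11) and (12) against a brute-force grid search; this is our own sketch, with arbitrary parameter values:

```python
import numpy as np

def g_tilde(t, norm_a, b, c):
    P = c + (1.0 - c) / (1.0 + np.exp(-norm_a * t + b))
    return (1.0 - P) * (P - c) ** 2 / (P * (1.0 - c) ** 2)

def theta_max(norm_a, b, c):
    """Maximizer of g~, Eq. (11)."""
    if c == 0.0:
        return b / norm_a
    return (b - np.log((-1.0 + np.sqrt(1.0 + 8.0 * c)) / (4.0 * c))) / norm_a

def g_tilde_max(c):
    """Maximum of g~, Eq. (12); a function of the guessing parameter only."""
    if c == 0.0:
        return 0.25
    r = np.sqrt(1.0 + 8.0 * c)
    return 16.0 * c * (1.0 - c) * (-1.0 + r) / ((3.0 + r) * (4.0 * c - 1.0 + r) ** 2)

norm_a, b, c = np.linalg.norm([1.0, 0.3]), 0.5, 0.2
t = np.linspace(-4.0, 4.0, 200001)
v = g_tilde(t, norm_a, b, c)
print(t[np.argmax(v)], theta_max(norm_a, b, c))  # both ~0.735
print(v.max(), g_tilde_max(c))                   # both ~0.170
```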

4. Item Selection Criteria for MAT

When testing multiple abilities, different cases of multidimensional testing should be distinguished (van der Linden, 1999; 2005, Section 8.1). This article focuses on three cases; the others can be considered minor variations of them:

  1. All abilities in the ability space are intentional. The goal of the test is to obtain the most accurate estimates for all abilities.

  2. Some abilities are intentional and the others are nuisances. This case arises, for instance, when a test of knowledge of physics has items that also require language skills, but the goal of the test is not to estimate any language skill.

  3. All abilities measured by the test are intentional, but the interest is only in a specific linear combination of them. As already explained, this case occurs in practice when the test is multidimensional but, for historic reasons, the examinees’ performances are to be reported in the form of single scores.

Different optimal design criteria based on the item information matrix rank the same set of items differently for test takers of equal ability. The choice of criterion for item selection in adaptive testing should therefore be in agreement with the goal of the MAT program. However, it is not immediately clear which criterion is best for each of the above cases of MAT. For the first case, both D-optimality and A-optimality seem reasonable choices. The former seeks to minimize the generalized variance of the ability estimators, the latter the sum of their variances. But it is unclear how they will behave in the two other cases of MAT. Both criteria will be analyzed in more detail below. In addition, the usefulness of the less well-known criteria of E-, D s -, and c-optimality will be investigated. In order to obtain expressions that are relatively easy to interpret, the criteria are derived for a three-dimensional ability space. The conclusions for ability spaces of higher dimensionality are similar. Wherever possible, for notational simplicity, the argument \(\hat\theta^{k-1}\) of the information matrices in (8) is omitted.

4.1. All Abilities Intentional

The goal is to obtain accurate estimates of the separate abilities in θ. For this case, the following three optimality criteria are likely candidates.

4.1.1. D-Optimality

This criterion maximizes the determinant of (8). Hence, it selects the kth item to be

$$\arg\max_{i_k\in R_k} \det\left(\mathrm{I}_{S_{k-1}} + \mathrm{I}_{i_k}\right). \tag{13}$$

. Using the factorization in (9), the criterion can be expressed as

$$\begin{aligned}\arg\max_{i_k\in R_k}\; g\bigl(\hat\theta^{k-1};\mathbf{a}_{i_k},b_{i_k},c_{i_k}\bigr)\Bigl(\,&a_{i_k1}^2\det\bigl(\mathrm{I}_{S_{k-1}[1,1]}\bigr)+a_{i_k2}^2\det\bigl(\mathrm{I}_{S_{k-1}[2,2]}\bigr)+a_{i_k3}^2\det\bigl(\mathrm{I}_{S_{k-1}[3,3]}\bigr)\\&-2a_{i_k1}a_{i_k2}\det\bigl(\mathrm{I}_{S_{k-1}[1,2]}\bigr)-2a_{i_k2}a_{i_k3}\det\bigl(\mathrm{I}_{S_{k-1}[2,3]}\bigr)+2a_{i_k1}a_{i_k3}\det\bigl(\mathrm{I}_{S_{k-1}[1,3]}\bigr)\Bigr),\end{aligned}\tag{14}$$

where \(\mathrm{I}_{S_{k-1}[l_1,l_2]}\) is the submatrix of \(\mathrm{I}_{S_{k-1}}\) obtained by omitting row l 1 and column l 2 .

In matrix algebra, the determinant of the submatrix \(\mathrm{I}_{S_{k-1}[l_1,l_2]}\) is known as a minor of \(\mathrm{I}_{S_{k-1}}\). Observe that the square of the discrimination parameter corresponding to θ 1 is multiplied by \(\det(\mathrm{I}_{S_{k-1}[1,1]})\), which can be interpreted as the current amount of information about the two other abilities, θ 2 and θ 3 . Similar relationships hold for \(a_{i_k2}\) and \(a_{i_k3}\). Consequently, the criterion tends to select items with a large discrimination parameter for the ability with a relatively large (asymptotic) variance of its current estimator. The criterion thus has a built-in “minimax mechanism”: it tends to pick the items that minimize the variance of the estimator lagging behind most. The same behavior has been observed for D-optimal item calibration designs (Berger & Wong, 2005, p. 15). As a result, the differences between the sampling variances of the estimators of the abilities tend to be negligible toward the end of the test. This is precisely what we may want when all abilities are intentional.

From (14), it can also be concluded that items with large discrimination parameters for more than one ability are generally less informative than items that discriminate mainly with respect to a single ability. Consequently, the criterion of D-optimality tends to prefer items that are sensitive to a single ability over items sensitive to multiple abilities.
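In computational form, the selection step in (13) is a one-line search over the remaining items. The following is our own sketch; the pool and all names are illustrative:

```python
import numpy as np

def item_info(theta, a, b, c):
    P = c + (1.0 - c) / (1.0 + np.exp(-(a @ theta) + b))
    g = (1.0 - P) * (P - c) ** 2 / (P * (1.0 - c) ** 2)
    return g * np.outer(a, a)

def select_d_optimal(I_S, theta_hat, pool, remaining):
    """Index of the remaining item maximizing det(I_S + I_i), Eq. (13)."""
    return max(remaining,
               key=lambda i: np.linalg.det(I_S + item_info(theta_hat, *pool[i])))

pool = [(np.array([1.0, 0.2]), 0.0, 0.0),   # mainly sensitive to theta_1
        (np.array([0.6, 0.6]), 0.0, 0.1)]   # equally sensitive to both
print(select_d_optimal(np.eye(2), np.zeros(2), pool, range(2)))  # 0
```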

Segall (1996, 2000) proposed using a Bayesian version of D-optimality for MAT that evaluates the determinant of the posterior covariance matrix at the posterior modes of the abilities (instead of the determinant of the information matrix at the MLEs). Assuming a multivariate normal posterior, he showed the result to be

$$\arg\max_{i_k\in R_k} \det\left(\mathrm{I}_{S_{k-1}}\left(\tilde\theta^{k-1}\right) + \mathrm{I}_{i_k}\left(\tilde\theta^{k-1}\right) + \Sigma_0^{-1}\right), \tag{15}$$

where Σ 0 is the prior covariance matrix of θ and \(\tilde\theta^{k-1}\) is the posterior mode after k − 1 items have been administered.
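Segall’s Bayesian variant only adds the prior precision matrix to the objective; a minimal sketch under the same conventions as above (our own code, not Segall’s):

```python
import numpy as np

def item_info(theta, a, b, c):
    P = c + (1.0 - c) / (1.0 + np.exp(-(a @ theta) + b))
    g = (1.0 - P) * (P - c) ** 2 / (P * (1.0 - c) ** 2)
    return g * np.outer(a, a)

def bayesian_d_value(I_S, theta_mode, item, Sigma0):
    """Objective in (15): determinant of the information plus prior
    precision, evaluated at the posterior mode."""
    a, b, c = item
    return np.linalg.det(I_S + item_info(theta_mode, a, b, c)
                         + np.linalg.inv(Sigma0))
```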

4.1.2. A-Optimality

This criterion seeks to minimize the sum of the (asymptotic) sampling variances of the MLEs of the abilities, which is equivalent to selecting the item that minimizes the trace of the inverse of the information matrix:

$$\arg\min_{i_k\in R_k}\operatorname{trace}\Bigl(\bigl(\mathrm{I}_{S_{k-1}}+\mathrm{I}_{i_k}\bigr)^{-1}\Bigr)=\arg\max_{i_k\in R_k}\frac{\det\bigl(\mathrm{I}_{S_{k-1}}+\mathrm{I}_{i_k}\bigr)}{\sum_{l=1}^{3}\det\Bigl(\bigl[\mathrm{I}_{S_{k-1}}+\mathrm{I}_{i_k}\bigr]_{[l,l]}\Bigr)}. \tag{16}$$

A-optimality results in an item-selection criterion that contains the determinant of the information matrix as an important factor. Its behavior should thus be largely similar to that of D-optimality.
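The det-based identity in (16), trace(M⁻¹) = Σ l det(M [l,l] ) / det(M), can be checked directly; our own sketch with an arbitrary symmetric matrix:

```python
import numpy as np

def a_opt_value(M):
    """Trace of the inverse information matrix (minimized by A-optimality)."""
    return np.trace(np.linalg.inv(M))

M = np.array([[3.0, 0.4],
              [0.4, 1.5]])
minors = sum(np.linalg.det(np.delete(np.delete(M, l, 0), l, 1)) for l in range(2))
assert np.isclose(a_opt_value(M), minors / np.linalg.det(M))  # identity in (16)
```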

Analogous to Segall’s (1996, 2000) proposal, a Bayesian version of A-optimality could be formulated by adding the inverse of a prior covariance matrix to (16) and evaluating the result at a Bayesian point estimate of θ instead of the MLE. But this option is not pursued here any further.

4.1.3. E-Optimality

The criterion of E-optimality maximizes the smallest eigenvalue of the information matrix or, equivalently, minimizes the variance of the ability estimators along the longest axis of their confidence ellipsoid. The criterion has gained some popularity in the literature on optimal regression design; for an application to optimal temperature input in microbiological studies, where the criterion has been shown to work efficiently, see Bernaerts, Servaes, Kooyman, Versyck, and Van Impe (2002). A disadvantage of the criterion might be its lack of robustness in applications with sparse data. Due to its complexity, the expression for the smallest eigenvalue of the matrix \(\mathrm{I}_{S_{k-1}} + \mathrm{I}_{i_k}\) is omitted here.

In spite of the popularity of the criterion in other applications, it may behave unfavorably when used for item selection in MAT. As shown in Appendix A, the contribution of an item with equal discrimination parameters to the value of the criterion vanishes when the sampling variances of the ability estimators have become equal to each other. This contradicts the fundamental rule that the average sampling variance of an MLE should always decrease after a new observation. Using E-optimality for item selection in MAT may therefore occasionally result in poor item selection, and its use is not recommended. The simulation studies later in this paper confirm this conclusion.
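The degenerate case just described is easy to reproduce; the following is our own sketch following the argument of Appendix A:

```python
import numpy as np

def item_info(theta, a, b, c):
    P = c + (1.0 - c) / (1.0 + np.exp(-(a @ theta) + b))
    g = (1.0 - P) * (P - c) ** 2 / (P * (1.0 - c) ** 2)
    return g * np.outer(a, a)

def e_opt_value(M):
    return np.linalg.eigvalsh(M)[0]  # smallest eigenvalue (maximized by E-optimality)

# Equal current variances (I_S proportional to the identity) plus an item
# with equal discrimination parameters: the item only raises the eigenvalue
# along (1, 1), so the smallest eigenvalue -- and hence the criterion --
# does not change, although the item clearly carries information.
I_S = 2.0 * np.eye(2)
I_i = item_info(np.zeros(2), np.array([0.8, 0.8]), 0.0, 0.0)
print(e_opt_value(I_S), e_opt_value(I_S + I_i))  # 2.0 2.0
```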

4.1.4. Graphical Example

In order to get a first impression of the behavior of the three optimality criteria, their surfaces are plotted for two two-dimensional items. As indicated earlier, the discrimination parameters play a crucial role. We therefore ignore possible differences between the difficulty and guessing parameters and consider the following two items:

$$\textit{Item 1: } \mathbf{a}_1 = (0.5, 1),\; b_1 = 0,\; c_1 = 0, \tag{17}$$
$$\textit{Item 2: } \mathbf{a}_2 = (0.8, 0),\; b_2 = 0,\; c_2 = 0. \tag{18}$$

Item 1 is sensitive to both abilities, but for Item 2 the second ability does not play any role in answering the item correctly. The current information matrix is fixed at

((19))

for all ability values. This choice enables us to see the differences between the surfaces clearly.

The surfaces for the three criteria are shown in the left-hand panels of Figure 2, whereas the right-hand panels display some of their contours as a function of the discrimination parameters for abilities fixed at θ = 0. (Note that for A-optimality, we plotted the argument of the right-hand side of (16), so that a higher surface is equivalent to a more informative item.) The shapes of the surfaces seem to be entirely determined by the common factor g(θ; a i , b i , c i ) of the elements of the item information matrix. The differences between the three criteria are caused only by the different ways in which they map \(\mathbf{a}_i\mathbf{a}_i^T\) onto a one-dimensional scale. Each criterion finds Item 2, which tests a single ability, most informative. But the preference for this item is strongest for E-optimality and weakest for D-optimality. This conclusion follows from comparing an item with discrimination parameters (a 1 , 0) to one with parameters (a, a) but the same criterion value; for instance, a = (0.5, 0) matches a = (0.64, 0.64) under D-optimality and a = (1.36, 1.36) under A-optimality.

Figure 2. Surfaces of the criteria of D-, A-, and E-optimality for Item 1 and Item 2 (left-hand panels) and contours of the same criteria as a function of the discrimination parameters (a 1 , a 2 ) for a person with average ability θ = 0 (right-hand panels). (Note: b = 0 and c = 0.)

For the more extreme ability values θ = (−2, 2) and (2, 2), the contours in Figure 3 show some surprising shapes. For instance, if a 2 = 0, an increase of a 1 does not always result in an increase of the criterion. Thus, for items that do not discriminate with respect to one of the abilities, the occurrence of extreme values of the MLEs of θ 1 and θ 2 at the beginning of an adaptive test is likely to result in inappropriate item selection under the criteria of D- and A-optimality. Obviously, such items should not be admitted to the pool. The contours also demonstrate that the criterion of E-optimality is independent of the discrimination parameters when they are equal (a 1 = a 2 ). As already indicated, this behavior of the criterion of E-optimality does not meet our intuitive idea of information in an item.

Figure 3. Contours of the criteria of D-, A-, and E-optimality as a function of the discrimination parameters of the item for θ = (−2, 2) (left-hand panels) and θ = (2, 2) (right-hand panels). (Note: b = 0 and c = 0.)

4.2. Nuisance Abilities

When the first s abilities of the ability vector θ are intentional and the last p − s abilities are nuisances, D s -optimality (Silvey, 1980, p. 11) seems to reflect the goal of this case of MAT. Our interest is then in the vector A T θ with A T = [I s 0], where I s is an s × s identity matrix. D s -optimality selects the item

$$\arg\max_{i_k\in R_k}\det\Bigl(\mathbf{A}^T\bigl(\mathrm{I}_{S_{k-1}}+\mathrm{I}_{i_k}\bigr)^{-1}\mathbf{A}\Bigr)^{-1}. \tag{20}$$

Instead of maximizing the determinant of \(\bigl(\mathbf{A}^T\bigl(\mathrm{I}_{S_{k-1}}+\mathrm{I}_{i_k}\bigr)^{-1}\mathbf{A}\bigr)^{-1}\), the trace of its inverse could be minimized. The criterion would then be called A s -optimality. Below, we first give a small computational sketch of the criterion and then consider two instances of this case for a three-dimensional ability vector θ.
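With A T = [I s 0], the matrix A T M⁻¹A in (20) is just the leading s × s block of M⁻¹, which makes the criterion simple to compute. This is our own sketch; the matrix values are arbitrary:

```python
import numpy as np

def ds_value(M, s):
    """D_s-optimality objective in (20) for the first s (intentional)
    abilities: det((A^T M^-1 A)^-1) with A^T = [I_s 0], i.e., the
    reciprocal determinant of the leading s x s block of M^-1."""
    M_inv = np.linalg.inv(M)
    return 1.0 / np.linalg.det(M_inv[:s, :s])

M = np.array([[4.0, 0.8, 0.3],
              [0.8, 3.0, 0.5],
              [0.3, 0.5, 2.0]])
print(ds_value(M, 2))  # theta_1, theta_2 intentional, theta_3 a nuisance; cf. (21)
print(ds_value(M, 1))  # only theta_1 intentional: 1 / [M^-1]_(1,1); cf. (22)
```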

4.2.1. θ1 and θ2 Intentional and θ3 a Nuisance Ability

Let θ 1 and θ 2 be the intentional abilities and θ 3 a nuisance ability. Hence,

$$\mathbf{A}^T = \begin{bmatrix} 1 & 0 & 0\\ 0 & 1 & 0 \end{bmatrix}.$$

The criterion in (20) can then be expressed as

$$\arg\max_{i_k\in R_k}\Bigl(\det\Bigl(\bigl[\bigl(\mathrm{I}_{S_{k-1}}+\mathrm{I}_{i_k}\bigr)^{-1}\bigr]_{[3,3]}\Bigr)\Bigr)^{-1}. \tag{21}$$

Note that θ 3 is not ignored in (21) because the criterion is based on the inverse of the information matrix \({{\rm{I}}_{{S_{k - 1}}}} + {{\rm{I}}_{{i_k}}}\) instead of the matrix itself. As a result of taking the determinant, items that mainly test a single ability are generally most informative.

However, the criterion does not always select items that discriminate highly with respect to only one of the intentional abilities. This point is elaborated in Appendix B, where we show that, when the amount of information about the intentional abilities is high relative to the amount of information about all abilities, i.e., \(\det\bigl(\bigl(\mathrm{I}_{S_{k-1}}\bigr)_{[3,3]}\bigr) > \det\bigl(\mathrm{I}_{S_{k-1}}\bigr)\), the criterion reveals a tendency to select items that discriminate highly with respect to the nuisance ability. Under these conditions, the sampling variance of the estimator of the nuisance ability is relatively large, and the selection of such items results in the largest decrease of the generalized variance of the intentional abilities. This type of behavior was also observed in the study with simulated data reported later in this paper. The behavior of the criterion of A s -optimality can be shown to be similar to that of D s -optimality.

4.2.2. θ1 Intentional and θ2 and θ3 Nuisance Abilities

Because θ 1 is the only intentional ability, A T = [1 0 0]. Consequently, (20) selects the item that minimizes the sampling variance of the estimator of θ 1 , that is,

$$\arg\min_{i_k\in R_k}\bigl[\bigl(\mathrm{I}_{S_{k-1}}+\mathrm{I}_{i_k}\bigr)^{-1}\bigr]_{(1,1)}, \tag{22}$$

where [·] (1,1) denotes element (1, 1) of the matrix. In Appendix B, we show that this criterion generally selects items that discriminate highly with respect to the intentional ability, θ 1 , except when the amount of information about the nuisance abilities is relatively low, i.e., when \(\det\bigl(\bigl(\mathrm{I}_{S_{k-1}}\bigr)_{[1,1]}\bigr)\) is small. In that case, (22) prefers an item that discriminates highly with respect to the nuisance abilities. Similar behavior was observed for the case of two intentional abilities and one nuisance ability.

Observe that the criterion of A s -optimality selects the same items as (22).

4.3. Composite Ability

This case of MAT occurs when the items in the item pool measure multiple abilities but only an estimate of a specific linear combination of them,

$$\theta_c = \lambda\cdot\theta = \sum_{l=1}^{p}\lambda_l\theta_l, \tag{23}$$

is required, with λ a vector of (nonnegative) weights reflecting the importance of the separate abilities. In order to maintain a standardized scale for θ c , the weights are often constrained by \(\sum_{l=1}^{p}\lambda_l = 1\).

Because the response probability in (1) depends on the abilities only through the linear combination a i · θ, an item is informative about a composite θ c when a i = α i λ for some large constant α i > 0. This claim becomes immediately clear when the multidimensional model after the substitution of a i = α i λ in (1) is compared with the unidimensional IRT model:

$$P_i(\theta) \equiv c_i + \frac{1-c_i}{1+\exp(-\alpha_i\lambda\cdot\theta + b_i)} = c_i + \frac{1-c_i}{1+\exp(-\alpha_i\theta_c + b_i)},$$

where α i plays the role of the discrimination parameter in the unidimensional model.

According to Silvey (1980, p. 11), c-optimality applies when we wish to obtain an accurate estimate of a linear combination of unknown parameters. For the current application to MAT, the criterion can be shown to be equal to

$$\arg\min_{i_k\in R_k}\lambda^T\Bigl(\mathrm{I}_{S_{k-1}}\bigl(\hat\theta^{k-1}\bigr)+\mathrm{I}_{i_k}\bigl(\hat\theta^{k-1}\bigr)\Bigr)^{-1}\lambda=\arg\max_{i_k\in R_k}\Bigl(\lambda^T\Bigl(\mathrm{I}_{S_{k-1}}\bigl(\hat\theta^{k-1}\bigr)+\mathrm{I}_{i_k}\bigl(\hat\theta^{k-1}\bigr)\Bigr)^{-1}\lambda\Bigr)^{-1}. \tag{24}$$

Indeed, this criterion prefers items with discrimination parameters that reflect the weights of importance in the composite ability, i.e., with a i ∝ λ. The preference is demonstrated for a two-dimensional ability vector with equal weights λ 1 = λ 2 = 1 in Figure 4. (Note that we plotted the argument of the right-hand side of (24), so that a larger outcome can be interpreted as a more informative item.) Item 1 is generally more informative because λ · a 1 is larger than λ · a 2 . Furthermore, unlike the criteria of D-, A-, and E-optimality, which yielded concave contours (see Figure 2), the contours in Figure 4 are convex. Thus, for this criterion, an item that tests several abilities simultaneously with a i ∝ λ is generally more informative than an item with a preference for a single ability.

Figure 4. Surfaces of the criterion of c-optimality for Items 1 and 2 (left-hand panels) and contours of the criterion for the same items at θ = 0 (right-hand panels). (Note: b = 0 and c = 0.)
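For c-optimality, the quantity in (24) is the asymptotic variance of the composite estimator. The following sketch (our own, with an arbitrary current information matrix) reproduces the preference for Item 1 under equal weights:

```python
import numpy as np

def item_info(theta, a, b, c):
    P = c + (1.0 - c) / (1.0 + np.exp(-(a @ theta) + b))
    g = (1.0 - P) * (P - c) ** 2 / (P * (1.0 - c) ** 2)
    return g * np.outer(a, a)

def c_opt_value(M, lam):
    """lambda^T M^-1 lambda, the quantity minimized in (24)."""
    return lam @ np.linalg.solve(M, lam)

lam = np.array([0.5, 0.5])
I_S = np.eye(2)
for a in (np.array([0.5, 1.0]),   # Item 1 from (17)
          np.array([0.8, 0.0])):  # Item 2 from (18)
    M = I_S + item_info(np.zeros(2), a, 0.0, 0.0)
    print(a, c_opt_value(M, lam))  # Item 1: ~0.393 (preferred); Item 2: ~0.466
```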

5. Simulation Study

In order to assess the influence of the item-selection criteria on the accuracy of the ability estimates, we conducted a study with simulated data. Each of the three cases of MAT discussed in the previous section was simulated to check whether the proposed selection criterion indeed performed best. Also, we were interested in seeing whether the proposed selection rule for adaptive testing with a nuisance ability would result in more accurate estimation of the intentional ability than treating all abilities as intentional. A similar interest existed for the estimation of a specific linear combination of the abilities using the criterion in (24).

The second goal of this study was to find out which combinations of discrimination parameters made an item generally most informative for each of the three cases of adaptive testing. This was done by counting how often each item was administered under each selection criterion. This information is helpful when designing an item pool for a given case of multidimensional adaptive testing and item-selection criterion.

Finally, we were interested in seeing whether each of the optimality criteria resulted in the best value of the specific quantity it optimizes; for instance, whether the determinant of the covariance matrix (that is, the generalized variance) at the end of the test was actually smallest for the criterion of D-optimality.

All simulations were done for the case of a two-dimensional vector θ. Analysis of the higher-dimensional expressions for the optimality criteria in the previous sections shows that the dimensionality of the ability space is reflected only in the order of the information matrices and not in the structure of the expressions. For instance, the reader may verify that the argument in Appendix B that revealed the peculiar behavior of the criterion of D s -optimality for a three-dimensional space with intentional and nuisance abilities holds equally well for a two-dimensional space with one intentional and one nuisance ability. We therefore assume the behavior of the item-selection criteria to be similar for multidimensional adaptive testing with any number of dimensions.

5.1. Design of the Study


The item pool consisted of 200 items with parameters generated according to a 1 ∼ N(1, 0.3), a 2 ∼ N(1, 0.3), b ∼ N(0, 3), and 10c ∼ Bin(3, 0.5). None of the items had negative discrimination parameters. The MLEs of the abilities were calculated using a Newton-Raphson algorithm. To ensure the existence of the MLEs, each adaptive test began with the five fixed items displayed in Table 1. Once an MLE could be obtained, 30 items were selected adaptively from the pool using the item-selection rules described in this article. For each combination of θ 1 = −1, 0, 1 and θ 2 = −1, 0, 1, a total of 100 adaptive test administrations was simulated. Hence, a total of 900 tests was administered for each selection rule. The final MLEs of the abilities of interest were compared with the test takers’ true abilities by calculating their average bias,

$$\mathrm{Bias}(\hat\theta_l) = \frac{1}{100}\sum_{j=1}^{100}\left(\hat\theta_{j,l} - \theta_l\right),$$

and mean squared error (MSE),

$$\mathrm{MSE}(\hat\theta_l) = \frac{1}{100}\sum_{j=1}^{100}\left(\hat\theta_{j,l} - \theta_l\right)^2,$$

where \(\hat\theta_{j,l}\) is the final estimate of ability l = 1, 2 for the jth simulated test taker and θ l is this test taker’s true value for ability l.

Table 1 First five items administered for the simulated adaptive tests.
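A minimal sketch of the pool generation and the two accuracy measures follows (our own code; the text does not state whether the second argument of N(·, ·) is a standard deviation or a variance, nor how negative discriminations were avoided, so both choices below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

# a_1, a_2 ~ N(1, 0.3) (0.3 read as the standard deviation), b ~ N(0, 3)
# (3 read as the variance), 10c ~ Bin(3, 0.5); negative discrimination
# parameters are excluded here by clipping at zero.
N = 200
a = np.clip(rng.normal(1.0, 0.3, size=(N, 2)), 0.0, None)
b = rng.normal(0.0, np.sqrt(3.0), size=N)
c = rng.binomial(3, 0.5, size=N) / 10.0

def bias_and_mse(theta_hat, theta_true):
    """Average bias and MSE over the 100 replications at one true theta;
    theta_hat is a (100, 2) array of final MLEs."""
    err = theta_hat - theta_true
    return err.mean(axis=0), (err ** 2).mean(axis=0)
```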

In order to have a baseline for our comparisons, we also simulated test administrations in which the adaptive selection of the 30 items was replaced by random selection from the pool. Table 2 shows the estimated bias and Table 3 the MSE for these administrations (columns labeled “R”). The MSEs reveal close to uniform precision for the estimation of θ 1 and θ 2 across their range of values, with (−1, −1) as an exception. This finding indicates that the range of difficulty of the items in the pool was wide enough to cover the abilities in this study. Apparently, as a consequence of the specific combinations of the randomly generated item parameters, the items in the pool tended to be more effective for estimating θ 2 than θ 1 .

Table 2 Bias of the estimates of θ 1 and θ 2 for different item selection rules: Random selection (Column R), D-optimality (Column D), A-optimality (Column A), E-optimality (Column E), and D s -optimality (Column D s ) with θ 1 intentional and θ 2 a nuisance ability.
Table 3 MSE of the estimates of θ 1 and θ 2 for different item selection rules: Random selection (Column R), D-optimality (Column D), A-optimality (Column A), E-optimality (Column E), and D s -optimality (Column D s ) with θ 1 intentional and θ 2 a nuisance ability.

5.2. θ1 and θ2 Intentional

Our analysis of the behavior of the selection criteria for this case of MAT suggested using either D- or A-optimality. For completeness, we also included the less favorable criterion of E-optimality in the study. The results for these criteria are given in Tables 2 and 3 (columns labeled “D,” “A,” and “E”). As Table 3 shows, the criteria of D- and A-optimality resulted in substantial improvements in ability estimation over random selection. Furthermore, the criterion of A-optimality resulted on average in slightly more accurate ability estimation than that of D-optimality. As expected, the criterion of E-optimality performed badly (even worse than the baseline). In fact, this finding definitively disqualifies E-optimality as a criterion for item selection in adaptive testing.

The poor results for E-optimality are explained by the inappropriate behavior of the criterion described in Appendix A. For examinees of extreme ability, the criterion tended to select items with difficulties opposite to their abilities; hence its low efficiency.

The behavior of these criteria is further illustrated by the empirical frequencies of item selection plotted against the discrimination parameters of the items in Figure 5. In these graphs, the size of each circle is proportional to the number of times the item was selected for administration. Items that were never selected are marked “×.” The preference for items with a high discrimination parameter for a single ability was stronger for A- than for D-optimality. This difference might explain why the former slightly outperformed the latter in terms of accuracy of ability estimation in this case. The frequencies of the difficulty and guessing parameters are omitted because they were as expected for A- and D-optimality: for both criteria, the distributions of the difficulty parameters were close to uniform, and smaller guessing parameters were selected much more frequently than larger ones. We also prepared the same plots as in Figure 5 for the conditional distributions of the frequencies of item selection given the abilities in this study; since they were generally similar to the marginal distributions, they are omitted here.

Figure 5. Empirical frequencies of the discrimination parameters of the items selected for the different optimality criteria. (Note: the size of the circles is proportional to the frequency of selection; × denotes items that were never selected.)

5.3. θ1 Intentional and θ2 a Nuisance

For this case, the criterion of D s -optimality selects the items that minimize the asymptotic variance of the estimator of the intentional ability θ 1 . Tables 2 and 3 (columns labeled “D s ”) display the results from our study. The MSE for the estimator of θ 1 was much more favorable than for the criteria of A- and D-optimality in the preceding section (θ 1 and θ 2 both intentional). As expected, these results were obtained at the price of a much larger MSE for the estimator of θ 2 . This finding points to the fact that the presence of a second intentional ability introduces a trade-off between the two estimators and, consequently, leads to less favorable behavior for either of them.

The preferences of the current criterion for the pairs of discrimination parameters of the items were as expected (see Figure 5). The majority of the selected items mainly tested the intentional ability; only a few had a preference for the nuisance ability. Generally, the preferences for the difficulty and guessing parameters under this criterion appeared to be very similar to those for the earlier criteria of D- and A-optimality.

5.4. Composite Ability

Two different composite abilities were addressed. For the first, the criterion of c-optimality with weights (½, ½) was used as the selection rule, i.e., θ c = ½θ 1 + ½θ 2 . The bias and MSE of the estimator of this composite ability are given in Table 4. For comparison, we also calculated the bias and MSE of a plug-in estimator obtained by substituting the MLEs of θ 1 and θ 2 from the earlier simulations with D- and A-optimality into the linear composite, the reason being a similar interpretation of these criteria in terms of weights of importance of θ 1 and θ 2 . On average, c-optimality with weights (½, ½) yielded the highest accuracy for the estimates of θ c . Of course, all results for the estimators of the composite were obtained at the price of a larger MSE for the estimators of the separate abilities. (The latter are not shown here; their averages were 0.649 and 0.616 for the estimators of θ 1 and θ 2 , respectively.)

Table 4 Bias and MSE of the estimate of θ c = ½θ 1 + ½θ 2 for adaptive testing with D-, A-, and c-optimality with weights (½, ½) as item selection criterion.

Second, a composite ability with unequal weights was considered: θ c = ¾θ 1 + ¼θ 2 . In this composite, the first ability is considered to be more important than the second. Again, the items were selected using c-optimality, now with weights (¾, ¼). The bias and MSE of the estimator are given in Table 5, which also shows the results for the plug-in estimator based on the MLEs of θ 1 and θ 2 from the earlier simulations with D s -optimality and c-optimality with weights (½, ½). Note that D s -optimality is equivalent to c-optimality with weights (1, 0). Table 5 shows that c-optimality with the weights (¾, ¼) resulted in the smallest average MSE for this composite ability.

Table 5 Bias and MSE of the estimate of θ c = ¾θ 1 + ¼θ 2 for adaptive testing with D s -optimality, c-optimality with weights (¾, ¼), and c-optimality with weights (½, ½) as item-selection criterion. Note that D s -optimality is equivalent to c-optimality with weights (1, 0).

Figure 5 also displays the empirical frequencies of the discrimination parameters of the items selected in these simulations with the composite abilities. (The distributions of the difficulty and guessing parameters are omitted because they were similar to those for the previous cases.) The criterion of c-optimality with the weights (½, ½) had a strong preference for items with a large value of a 1 + a 2 . This finding reflects the fact that we tested the simple sum of θ 1 and θ 2 . Consequently, when an item was sensitive to θ 1 or θ 2 only, it tended to be ignored by the criterion. For c-optimality with the weights (¾, ¼), the distribution of the discrimination parameters was similar to that for D s -optimality. This result makes sense because the weights were now closer to the case of (1, 0) implied by the use of D s -optimality. It also explains the small difference between the MSEs for D s -optimality and c-optimality with the weights (¾, ¼) in Table 5.

5.5. Average Values of Optimality Criteria

Table 6 shows the average determinant, trace, largest eigenvalue, first diagonal element, weighted sum with λ 1 = (½, ½) T , and weighted sum with λ 2 = (¾, ¼) T of the final covariance matrix \(\mathrm{I}_{S_n}^{-1}(\hat\theta)\) at the end of the simulated adaptive tests for each of the selection criteria. Except for E-optimality, each of the criteria produced the smallest average value for the specific quantity it optimizes. For instance, the criterion of D-optimality resulted in the smallest average determinant of the final covariance matrix (i.e., the smallest generalized variance) among all criteria.

Table 6 Average value of the quantities optimized by each of the selection criteria at the end of the adaptive tests.

6. Conclusions

Both our theoretical analyses and the results from the study with simulated data allow us to draw the following conclusions:

  1. When all abilities are intentional, the criterion of A-optimality tends to result in the most accurate MLEs for the separate abilities, with the results for D-optimality a close second. The most informative items mainly measure one ability, i.e., have one large discrimination parameter and small parameters for the other abilities. Furthermore, both criteria tend to “minimax”: when the estimator of one of the abilities has a small sampling variance, they develop a preference for items that are highly informative about the other abilities. Consequently, the accuracies of the final estimates for a sufficiently long test are approximately equal.

  2. When one of the abilities is of interest and the others should be considered nuisances, item selection based on D s -optimality (or A s -optimality) seems to result in the most accurate estimates of the intentional ability. The accuracy of the estimator of the intentional ability tends to be higher than when all abilities are considered intentional. The advantage is obtained at the price of less accurate estimation of the nuisance abilities. (But, of course, this is a price we should be willing to pay.) Again, items that measure only the intentional ability are generally most informative. But when the current inaccuracy of the estimator of a nuisance ability becomes too large relative to that of the intentional abilities, the dependency of the latter on the former becomes manifest and an occasional preference for an item mainly sensitive to the nuisance ability emerges.

  3. When the goal is to estimate a linear combination of the abilities, c-optimality with weights λ proportional to the coefficients in the composite ability results in the most accurate MLE of the combination. The criterion prefers items whose discrimination parameters are proportional to the weights in the combination.

All conclusions were based on analyses of the criteria for a three-dimensional ability space. As already indicated, generalization to higher dimensionality does not involve any new obstacles. However, it should be observed that these conclusions hold only for an item pool that allows free selection from all possible combinations of item parameters (as in our simulation study). For instance, if a two-dimensional item pool consisted only of items with a small discrimination parameter for θ 1 and a large parameter for θ 2 , but only the former were intentional, the MSE of its estimator might be larger than that of the nuisance ability, even when our suggestions for the choice of criterion are followed.

In fact, even when the item pool has no constraints, we often are forced to impose constraints on the item selection that may have the same effect. One obvious example is when we need to constrain the item selection to guarantee that the content specifications for the test are satisfied (van der Linden, 2005, Chap. 9) and the content attributes of the items correlate with their statistical parameters. Another example is when item selection is constrained to deal with potential overexposure of some of the items in the pool, for instance, when using the Sympson-Hetter (1985) exposure control method or the selection is constrained more directly using item-ineligibility constraints (van der Linden & Veldkamp, 2007). Because the item-exposure rates are typically correlated with the discrimination parameters of the items, exposure control is expected to have an even stronger impact on our conclusions.

As a next step, it would be interesting to investigate item selection more closely when using information measures other than Fisher’s. The most likely candidate is Kullback-Leibler information (Chang & Ying, 1996; Veldkamp & van der Linden, 2002). For example, it would be interesting to see whether an application of the criterion of A-optimality would then also prefer items that mainly test a single ability and whether c-optimality would prefer items with large sums of discrimination parameters. A confirmation of the findings in this article for other information measures would make them more robust. However, we do not expect the criterion of E-optimality to show improved behavior for other measures. Both theoretically and empirically, we found its behavior to be too erratic to warrant application in real-world adaptive testing.