Cognitive diagnostic computerized adaptive testing (CD-CAT; Cheng, 2009; McGlohen & Chang, 2008) is computerized adaptive testing (CAT) built upon a cognitive diagnostic model (CDM; Rupp & Templin, 2008; Rupp et al., 2010). CDMs are statistical tools that link item responses to latent cognitive profiles, which capture the strengths and weaknesses of each respondent in terms of their mastery of discrete knowledge points or attributes. Hence, testing programs built on CDMs have features of both model-based measurement and formative assessment (Embretson, 2001).

In a typical adaptive testing system, items are sequentially selected from an item bank, tailored to each respondent according to certain item selection rules, for example, maximizing test information or minimizing the standard error of measurement of the latent trait. In CD-CAT, the goal is to efficiently estimate the latent cognitive profiles by sequentially choosing the most suitable items for each candidate (Cheng, 2009; Dai et al., 2016; Yu et al., 2019; Zheng & Chang, 2016; Zheng & Wang, 2017). Given a well-designed item bank, continuous testing can be offered through CD-CAT, which means that efficient formative assessment can be provided to students continuously.

In real applications, any CAT system that offers continuous testing needs to replenish its item bank periodically, because repeated use of items may pose a risk to test security and validity. Therefore, retiring flawed, obsolete, or overexposed items and replacing them with newly calibrated items, a process called item replenishment, is important for continuous testing (Chen et al., 2012; Chen et al., 2015; Chen & Xin, 2011; Ren et al., 2017). For this reason, new items constantly need to be developed, reviewed, and calibrated for CAT programs.

Online calibration in CAT refers to estimating the parameters of new items that are administered to respondents during the course of their operational testing along with previously calibrated items (Wainer & Mislevy, 2000). Ren et al. (2017) pointed out several main advantages of online calibration. First, new items are calibrated under the exact same condition as for their future operational use. Second, the item parameters of the new items are calibrated on the same scale as the operational items, which means linking or rescaling is no longer required. Commonly used methods that have been proposed to calibrate new items include Method A and Method B (Stocking, 1988), marginal maximum likelihood estimation with one expectation maximization (OEM) iteration (Wainer & Mislevy, 2000), and marginal maximum likelihood estimation with multiple EM (MEM) iterations (Ban et al., 2001; Ban et al., 2002).

New items for CD-CAT need to be calibrated in terms of both item parameters and attribute vectors, whereas in traditional CAT, item calibration refers only to the estimation of item parameters. Online item calibration is therefore even more challenging in CD-CAT than in regular CAT. Chen et al. (2012) considered the online calibration of only the item parameters in CD-CAT and proposed three methods, namely Cognitive Diagnostic-Method A (CD-MA), Cognitive Diagnostic-One EM cycle (CD-OEM), and Cognitive Diagnostic-Multiple EM cycles (CD-MEM). These methods assume known attribute vectors and are analogous to the methods described in the preceding paragraph. The literature on online calibration of both item parameters and attribute vectors is relatively scarce. Chen and Xin (2011) proposed a joint estimation algorithm (JEA), which jointly estimates the attribute vectors and the item parameters under the DINA (Deterministic Input, Noisy "AND" gate; see Junker & Sijtsma, 2001; de la Torre, 2009) model. Their results indicated that the JEA can perform well. Chen et al. (2015) considered two Bayesian variations of the JEA: the single item estimation (SIE) method and the simultaneous item estimation (SimIE) method. As their names suggest, SIE calibrates a single new item at a time, while SimIE calibrates multiple new items at a time. With sample sizes larger than 800, Chen et al. (2015) showed that SIE and SimIE outperform the JEA in the estimation of both attribute vectors and item parameters; due to their iterative nature, SIE and SimIE showed very similar performance in estimating attribute vectors and item parameters. For all three methods (JEA, SIE, and SimIE), the estimation of the item parameters depends heavily on the estimation of the attribute vectors. However, if the sample size is relatively small (e.g., 400 or fewer), item parameters cannot be estimated well even with known attribute vectors, let alone with unknown attribute vectors (Chen et al., 2015).

Given the limitations of existing methods, in this paper we propose an iterative two-step procedure to estimate both attribute vectors and item parameters with relatively small sample sizes. First, we propose to use a residual-based statistic to estimate the attribute vectors in the context of CD-CAT. This step does not require known or precisely estimated item parameters. In the second step, we treat the estimated attribute vector as true, and estimate the item parameters based on CD-Method A, CD-OEM, or CD-MEM. The procedure proceeds iteratively until convergence is reached.

The rest of this paper is organized as follows. First, we review the existing methods on this topic, which involve two main lines of research: online calibration of the item parameters only, and online calibration of both the item parameters and attribute vectors. Next, we introduce in detail a new method of attribute vector estimation using a residual-based statistic, and the iterative two-step procedure for estimating both item parameters and attribute vectors. A simulation study to assess the performance of the proposed estimation methods is then described. A real-data analysis is provided to illustrate the application of one of the proposed methods (RMEM) in practice. Discussions and implications of the results are given in the last section.

Online calibration methods in CD-CAT

In this section, we briefly review several existing methods. For convenience but without loss of generality, we first introduce the terms and notation used throughout the remainder of the paper. As discussed earlier, new items are items whose attribute vectors and item parameters are unknown, in contrast to the operational items that have been previously calibrated in the item bank. Suppose an existing item bank contains J operational items, and the item parameters and attribute vectors of M new items need to be estimated. Consider a CD-CAT that targets a total of K attributes. Each of the J operational items requires a specific subset of the K attributes (denoted qj, j = 1, 2, …, J) to be answered correctly. The stacked qj's form the item-attribute association matrix for the item bank, namely the Q-matrix, a binary J × K matrix. The Q-matrix for the M new items is denoted Qnew. The mastery status of each of N test takers is captured by αi (i = 1, 2, …, N), the attribute mastery pattern (AMP) vector. L denotes the fixed test length, and an N × L matrix X denotes the item response matrix with binary elements Xij, where Xij = 1 indicates a correct response of test taker i to item j and Xij = 0 an incorrect response. Let nm be the total number of respondents responding to the mth new item.

As a parsimonious and popular CDM, the DINA model is used here as an example (de la Torre, 2009). An expected or ideal response under the DINA model is characterized by an indicator variable, \({\eta}_{ij}={\prod}_{k=1}^K{\alpha}_{ik}^{q_{jk}}\), which indicates whether the ith respondent possesses all the attributes required by the jth item. Unexpected responses are accounted for by the slipping and guessing parameters, sj = P(Xij = 0| ηij = 1) and gj = P(Xij = 1| ηij = 0), respectively. The probability of a correct response to the jth item by the ith respondent under the DINA model is therefore defined as

$$P\left({X}_{ij}=1|{\boldsymbol{\alpha}}_i\right)={\left(1-{s}_j\right)}^{\eta_{ij}}{g_j}^{1-{\eta}_{ij}}.$$
(1)

For a new item m, its attribute vector qm and item parameters (sm, gm) are of key interest in online calibration.
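To make the notation concrete, the following minimal sketch computes the DINA response probability of Eq. 1 for one respondent-item pair. The sketches in this paper use Python with NumPy; all function and variable names are ours, not from the original studies.

```python
import numpy as np

def dina_prob(alpha, q, s, g):
    """P(X = 1 | alpha) under the DINA model (Eq. 1).

    alpha : (K,) binary mastery profile of one respondent
    q     : (K,) binary attribute vector of the item
    s, g  : slipping and guessing parameters
    """
    eta = int(np.all(alpha >= q))   # 1 iff all required attributes are mastered
    return (1.0 - s) ** eta * g ** (1 - eta)

# e.g., dina_prob(np.array([1, 0, 1]), np.array([1, 0, 0]), s=0.1, g=0.2) -> 0.9
```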

Online calibration of item parameters

The following three methods are based on the assumption that the attribute vectors of the new items are known (i.e., the qm's are available, perhaps through content experts who label each item with the attributes it measures), and only their item parameters need to be estimated.

CD-Method A For a new item m, suppose that there are nm respondents responding to the item. CD-Method A treats the estimated AMP \({\hat{\boldsymbol{\alpha}}}_i\), obtained from the operational items answered by the ith respondent, as the true αi, and then estimates the slipping and guessing parameters through maximum likelihood (de la Torre, 2009) by solving

$$\frac{\partial {l}_m}{\partial {\textrm{s}}_m}=0,$$
(2)
$$\frac{\partial {l}_m}{\partial {g}_m}=0,$$
(3)

where \({l}_m\left({\textbf{x}}_i|{\textbf{q}}_m,{s}_m,{g}_m\ \right)=\log \left(\prod_{i=1}^{n_m}{P}_{s_m,{g}_m}{\left({\textbf{q}}_m,{\hat{\boldsymbol{\alpha}}}_i\right)}^{x_{im}}{\left[1-{P}_{s_m,{g}_m}\left({\textbf{q}}_m,{\hat{\boldsymbol{\alpha}}}_i\right)\right]}^{1-{x}_{im}}\right)\) is the log-likelihood function, qm is the attribute vector of item m, xim ∈ {0, 1} is the response of respondent i to the mth new item, and \({P}_{s_m,{g}_m}\left({\textbf{q}}_m,{\hat{\boldsymbol{\alpha}}}_i\right)\) is the probability of a correct response to new item m under the DINA model evaluated at \({\hat{\boldsymbol{\alpha}}}_i\).
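Because the AMPs are plugged in as known, the score equations (2)-(3) have a closed-form solution under DINA: \({\hat{s}}_m\) is the proportion of incorrect answers among respondents whose ideal response is 1, and \({\hat{g}}_m\) the proportion of correct answers among those whose ideal response is 0. A minimal sketch, assuming complete responses to the new item:

```python
import numpy as np

def cd_method_a(alpha_hat, q_m, x_m):
    """Closed-form ML solution of Eqs. (2)-(3) for one new item.

    alpha_hat : (n_m, K) estimated AMPs, treated as true
    q_m       : (K,) known attribute vector of the new item
    x_m       : (n_m,) 0/1 responses to the new item
    """
    eta = np.all(alpha_hat >= q_m, axis=1)               # ideal responses
    s_hat = np.mean(x_m[eta] == 0) if eta.any() else np.nan
    g_hat = np.mean(x_m[~eta] == 1) if (~eta).any() else np.nan
    return s_hat, g_hat
```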

CD-OEM CD-OEM applies a single cycle of the EM algorithm (Chen et al., 2012; de la Torre, 2009) to estimate the item parameters of each new item. For the mth new item, the E-step uses the posterior distribution of the AMPs to obtain the expected proportion of respondents with AMP \({\boldsymbol{\alpha}}_v\) among those who answered the new item, where \({\boldsymbol{\alpha}}_v\) is one of the 2^K possible attribute profiles and \(\sum_{v=1}^{2^K}{P}_m\left({\boldsymbol{\alpha}}_v\right)=1\). The M-step then finds the \({\hat{s}}_m\) and \({\hat{g}}_m\) that maximize the logarithm of the corresponding expected likelihood.

CD-MEM By allowing multiple EM cycles, the CD-OEM becomes the CD-MEM. The first EM cycle in CD-MEM is the same as in the CD-OEM method, and the obtained item parameters and attribute vectors are taken as the initial values of the second EM cycle. From the second EM cycle onward, the CD-MEM method uses the responses to both the operational items and the new items to calculate the posterior distribution of the AMPs in the E-step, fixes the item parameters of the operational items, and adopts the same M-step as the CD-OEM method to update the item parameters of the new items (refer to Chen et al., 2012 for further details). The EM cycles are repeated until a stopping criterion is met.
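The following sketch illustrates one such EM cycle for a single new item, given posterior weights over the 2^K profiles; CD-MEM would recompute the posterior from both operational and new items and repeat the cycle. The weighted M-step below follows from maximizing the expected complete-data log-likelihood under DINA; the interface is our own simplification.

```python
import numpy as np

def oem_cycle(post, profiles, q_m, x_m):
    """One EM cycle for a new item (CD-OEM style), a sketch.

    post     : (n_m, 2**K) posterior P(alpha_v | operational responses)
    profiles : (2**K, K) all candidate attribute profiles
    q_m      : (K,) assumed attribute vector of the new item
    x_m      : (n_m,) 0/1 responses to the new item
    """
    eta_v = np.all(profiles >= q_m, axis=1).astype(float)  # (2**K,) ideal responses
    w_eta = post @ eta_v                                   # P(eta_im = 1) per respondent
    # M-step: weighted analogs of the closed-form CD-Method A estimates
    s_hat = np.sum(w_eta * (1 - x_m)) / np.sum(w_eta)
    g_hat = np.sum((1 - w_eta) * x_m) / np.sum(1 - w_eta)
    return s_hat, g_hat
```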

Results of Chen et al. (2012) showed that CD-Method A, CD-OEM, and CD-MEM can recover item parameters accurately with large sample sizes; CD-Method A performs best when items have small slipping and guessing parameters, but its performance is strongly affected by the magnitude of the item parameters.

Online calibration of both item parameters and attribute vectors

The Joint Estimation Algorithm (JEA)

Based on the DINA model, Chen and Xin (2011) proposed the JEA to jointly estimate the attribute vectors and the item parameters; it is the analog of joint maximum likelihood estimation (JMLE; Baker & Kim, 2004) in item response theory (IRT). As an extension of CD-Method A, the JEA treats the AMPs estimated from the operational items as true, and then estimates the item parameters and attribute vectors of the new items, one item at a time. For the mth new item, the JEA maximizes lm(qm, sm, gm) with respect to qm given (sm, gm), then treats the estimated qm as true and maximizes lm(qm, sm, gm) with respect to (sm, gm). This is repeated until convergence, which can be defined as a sufficiently small change in the log-likelihood from one iteration to the next.
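A minimal sketch of this alternating scheme for one new item, assuming plug-in AMP estimates and complete responses (the clipping of parameter updates to (0, 0.5) for numerical stability is our own detail):

```python
import numpy as np
from itertools import product

def item_loglik(alpha_hat, q, s, g, x_m):
    """Plug-in log-likelihood l_m of one new item under the DINA model."""
    eta = np.all(alpha_hat >= q, axis=1).astype(float)
    p = (1 - s) ** eta * g ** (1 - eta)          # P(X_im = 1 | alpha_i-hat)
    return np.sum(x_m * np.log(p) + (1 - x_m) * np.log(1 - p))

def jea_single_item(alpha_hat, x_m, K, s=0.2, g=0.2, tol=1e-6, max_iter=100):
    """Alternating maximization over q_m and (s_m, g_m) for one new item."""
    candidates = [np.array(c) for c in product([0, 1], repeat=K) if any(c)]
    old_ll = -np.inf
    for _ in range(max_iter):
        # (a) best attribute vector at the current item parameters
        q = max(candidates, key=lambda c: item_loglik(alpha_hat, c, s, g, x_m))
        # (b) closed-form ML update of (s, g) at the chosen q (cf. Eqs. 2-3)
        eta = np.all(alpha_hat >= q, axis=1)
        if eta.any():
            s = float(np.clip(np.mean(x_m[eta] == 0), 1e-3, 0.5 - 1e-3))
        if (~eta).any():
            g = float(np.clip(np.mean(x_m[~eta] == 1), 1e-3, 0.5 - 1e-3))
        new_ll = item_loglik(alpha_hat, q, s, g, x_m)
        if abs(new_ll - old_ll) < tol:                   # convergence check
            break
        old_ll = new_ll
    return q, s, g
```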

The SIE and SimIE methods are two Bayesian versions of the JEA that account for the uncertainty in the estimated AMPs.

The Single Item Estimation Method (SIE) Instead of plugging in the estimates of the AMPs of the respondents who answered the mth new item, the SIE method considers the expected log likelihood

$${\displaystyle \begin{array}{l}\textrm{E}\left({l}_m\left({\textbf{x}}_i|{\textbf{q}}_m,{s}_m,{g}_m\ \right)\right)\\ {}=\sum\limits_{i=1}^{n_m}\sum\limits_{{\boldsymbol{\alpha}}_i}{\pi}_i\left({\boldsymbol{\alpha}}_i;{s}_m,{g}_m\right)\left[{x}_{im}\log {P}_{s_m,{g}_m}\left({\textbf{q}}_m,{\boldsymbol{\alpha}}_i\right)+\left(1-{x}_{im}\right)\log \left(1-{P}_{s_m,{g}_m}\left({\textbf{q}}_m,{\boldsymbol{\alpha}}_i\right)\right)\right],\end{array}}$$
(4)

where πi(αi; sm, gm) is the posterior distribution of αi based on the operational items (in the first EM cycle) or on both the operational and new items (in subsequent EM cycles). In this way, SIE takes the uncertainty in \({\hat{\boldsymbol{\alpha}}}_i\) into account. SimIE further calibrates multiple new items at a time.
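A compact sketch of evaluating Eq. 4 for a candidate qm, given a matrix of posterior weights (the vectorized layout is our choice):

```python
import numpy as np

def expected_loglik(post, profiles, q_m, s, g, x_m):
    """Expected log-likelihood of Eq. (4) for one new item (a sketch).

    post     : (n_m, 2**K) posterior pi_i(alpha_v) per respondent
    profiles : (2**K, K) all candidate attribute profiles
    """
    eta_v = np.all(profiles >= q_m, axis=1).astype(float)
    p_v = (1 - s) ** eta_v * g ** (1 - eta_v)            # P(X = 1 | alpha_v)
    ll_v = np.outer(x_m, np.log(p_v)) + np.outer(1 - x_m, np.log(1 - p_v))
    return np.sum(post * ll_v)                           # posterior-weighted sum
```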

The Simultaneous Item Estimation Method (SimIE)

As noted by Chen et al. (2015), the more accurate the information about the AMPs, the better the calibration. The motivation of SimIE is therefore to borrow useful information from the new items to improve the estimation of the unknown AMPs. However, borrowing information from inadequately calibrated items may harm the estimation of the AMPs. To address this issue, Chen et al. (2015) proposed an index, here denoted ωm (denoted ηj in the original paper; ωm is used here to avoid confusion), to evaluate the confidence in the fit of \({\hat{\textbf{q}}}_m\). ωm is defined as the difference between the log-likelihood values of the two most probable \({{\hat{\textbf{q}}}_m}^{\prime }s\) for the mth item. Half of the 95th percentile of the χ2 distribution with one degree of freedom, i.e., 1.92, was chosen as the empirical cutoff for "good" new items in Chen et al. (2015). Treating the first chosen new item, which has the maximum ωm with ωm > 1.92, as an additional operational item, SimIE updates the posterior distribution of the respondents' AMPs based on all operational items and recalibrates the second chosen new item. This process is repeated until all the chosen new items have been treated as additional operational items. The new items not selected in the preceding step are then calibrated one at a time. This constitutes one estimation cycle; the algorithm proceeds until the chosen items do not change over two consecutive cycles.
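As an illustration, ωm can be sketched as the gap between the two best candidate q-vectors. Here we score candidates with the expected log-likelihood function from the SIE sketch above; the original paper's exact likelihood bookkeeping may differ.

```python
import numpy as np
from itertools import product

def omega_index(post, profiles, x_m, s, g, K):
    """Confidence index for a calibrated q-vector (a sketch; cutoff 1.92).

    Returns the gap in expected log-likelihood between the two most
    probable candidate q-vectors; gaps above 1.92 flag a "good" item.
    """
    candidates = [np.array(c) for c in product([0, 1], repeat=K) if any(c)]
    # reuses expected_loglik() from the SIE sketch above
    lls = sorted(expected_loglik(post, profiles, q, s, g, x_m)
                 for q in candidates)
    return lls[-1] - lls[-2]
```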

Attribute vector estimation based on a residual-based statistic

In this section, we first briefly introduce the residual-based statistic (see Yu & Cheng, 2020, for more details) used to measure the appropriateness of the attribute vector of an item. We then present a theoretical proof that, under the DINA model and certain assumptions, the proposed residual-based statistic identifies the true attribute vector of the mth new item with arbitrarily chosen item parameters. This may relieve the dependence of existing methods on large sample sizes.

Let E(Xim| αi) be the expected score of the ith respondent with AMP αi, and let P(Xim = xim| αi), denoted P(xim| αi) for short, be the probability of the respondent obtaining score xim, with xim being 0 or 1. The appropriateness index of the attribute vector for the mth item can then be defined as

$${R}_m\left(\boldsymbol{\alpha}, {\textbf{q}}_m,{s}_m,{g}_m\right)=\sum_{i=1}^{n_m}\log {\left[\frac{x_{im}-E\left({X}_{im}|{\boldsymbol{\alpha}}_i\right)}{P\left({x}_{im}|{\boldsymbol{\alpha}}_i\right)}\right]}^2\quad \textrm{or}\quad \sum_{i=1}^{n_m}\log \left|\frac{x_{im}-E\left({X}_{im}|{\boldsymbol{\alpha}}_i\right)}{P\left({x}_{im}|{\boldsymbol{\alpha}}_i\right)}\right|,$$
(5)

where α is the matrix of vertically stacked \({\boldsymbol{\upalpha}}_i^{\prime}\textrm{s},\) i.e., the attribute profiles of the respondents who answered the mth new item. The squared form is numerically twice the absolute form, so the performance of the method based on the two forms is equivalent; the squared form is used in all our simulation conditions for coding consistency. Under the DINA model, according to the values of ηim and the response xim, each respondent is classified into one of four groups, G1, G2, G3, and G4: respondents in G1 have ηim = 1 and xim = 1, those in G2 have ηim = 1 and xim = 0, those in G3 have ηim = 0 and xim = 1, and those in G4 have ηim = 0 and xim = 0. Hence, Eq. 5 can be expanded to

$${\displaystyle \begin{array}{l}{R}_m\left(\boldsymbol{\upalpha}, {\textbf{q}}_m,{s}_m,{g}_m\right)\\ {}=2\sum\limits_{i=1}^{n_m}\log \left[{\eta}_{im}{\left(\frac{s_m}{1-{s}_m}\right)}^{x_{im}}{\left(\frac{1-{s}_m}{s_m}\right)}^{1-{x}_{im}}+\left(1-{\eta}_{im}\right){\left(\frac{g_m}{1-{g}_m}\right)}^{1-{x}_{im}}{\left(\frac{1-{g}_m}{g_m}\right)}^{x_{im}}\right],\end{array}}$$
(6)

where \({\eta}_{im}=\prod_{k=1}^K{\alpha_{ik}}^{q_{mk}}\) is the ideal response of the ith examinee (with attribute profile αi) to the mth item (with attribute vector qm). We expect that, given \(\hat{\boldsymbol{\alpha}}\) from the operational items, \({R}_m\left(\hat{\boldsymbol{\alpha}},{\textbf{q}}_m,{s}_m,{g}_m\right)\) as a function of qm is minimized when qm is at its true value.
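A direct transcription of Eq. 6, assuming plug-in AMP estimates (function name ours):

```python
import numpy as np

def residual_stat(alpha_hat, q_m, s_m, g_m, x_m):
    """Residual-based statistic R_m of Eq. (6) under the DINA model.

    Smaller values indicate a more appropriate attribute vector q_m.
    """
    eta = np.all(alpha_hat >= q_m, axis=1).astype(float)
    term = (eta * (s_m / (1 - s_m)) ** x_m * ((1 - s_m) / s_m) ** (1 - x_m)
            + (1 - eta) * (g_m / (1 - g_m)) ** (1 - x_m) * ((1 - g_m) / g_m) ** x_m)
    return 2.0 * np.sum(np.log(term))
```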

Theorem 1. Consider an infinite sample, that is, N → ∞, with true item parameters sm, gm ∈ (0, 0.5), and assume the true AMP matrix α is known in advance. Given provisional item parameters \(\left({s}_m^0,{g}_m^0\right)\) for the mth item, where \({s}_m^0\) and \({g}_m^0\) are two arbitrarily chosen real numbers in (0, 0.5), and denoting \({\hat{R}}_m^0\left(\boldsymbol{\alpha}, {\textbf{q}}_m,{s}_m^0,{g}_m^0\right)\) as the value of the residual-based statistic evaluated at \(\left({s}_m^0,{g}_m^0\right)\), \({\hat{R}}_m^0\left(\boldsymbol{\alpha}, {\textbf{q}}_m,{s}_m^0,{g}_m^0\right)\) reaches its minimum only when qm is correctly specified.

Theorem 1 is the basis of our proposed iterative two-step online calibration method leveraging the residual statistic. According to Theorem 1, we can obtain the attribute vector for each new item by arbitrarily assigning item parameters to it, e.g., \({s}_m^0=0.25\), \({g}_m^0=0.25\), and minimizing the residual statistic. In other words, it is not necessary to jointly estimate the attribute vector and the item parameters of each new item: the attribute vector can be obtained with fixed item parameters as long as α is known, and the item parameters can then be estimated based on the attribute vector obtained in the preceding step. This is very useful in situations where existing joint estimation methods suffer, e.g., when the sample size is small, which is often the case for a diagnostic test. The proof is presented in Appendix A.

The iterative two-step online item calibration method

Based on the preceding theorem, we propose an iterative two-step method for online item calibration. A flow chart describing the procedure is presented in Fig. 1. By fixing the new item parameters at 0.25 (or any value between 0 and 0.5), the attribute vector of the mth new item can be estimated based on the attribute profiles estimated from responses to the operational items. In the second step, treating the estimated attribute vector of each new item as true, CD-MA, CD-OEM, or CD-MEM can be applied to calibrate the item parameters as described in Chen et al. (2012). The resulting three variations of the iterative two-step online calibration method based on the residual statistic are denoted RMA, ROEM, and RMEM, respectively. Let \(\hat{R}\left(\hat{\boldsymbol{\alpha}},{\hat{\boldsymbol{Q}}}_{\boldsymbol{new}},\hat{\boldsymbol{s}},\hat{\boldsymbol{g}}\right)\) denote the sum of R over all new items, that is, \(\hat{R}\left(\hat{\boldsymbol{\alpha}},{\hat{\boldsymbol{Q}}}_{\boldsymbol{new}},\hat{\boldsymbol{s}},\hat{\boldsymbol{g}}\right)=\sum_{m=1}^M{\hat{R}}_m\left(\hat{\boldsymbol{\alpha}},{\hat{\textbf{q}}}_m,{\hat{s}}_m,{\hat{g}}_m\ \right)\), and let \({\hat{R}}_{{\hat{\textbf{Q}}}_{new}}^t\) be shorthand for its value in the tth iteration. The iterative algorithm stops when the number of iterations reaches a prespecified maximum or when the difference between two adjacent iterations, \({\hat{R}}_{{\hat{\textbf{Q}}}_{new}}^t\) and \({\hat{R}}_{{\hat{\textbf{Q}}}_{new}}^{t-1}\), is smaller than a preset threshold.

Fig. 1

The flow chart of the iterative two-step online item calibration method. Note. \({\hat{\textbf{Q}}}_{new}^t\) is the attribute vector specification of the new items in the tth iteration. \({\hat{R}}_{{\hat{\textbf{Q}}}_{new}}^t\) and \({\hat{R}}_{{\hat{\textbf{Q}}}_{new}}^{t-1}\) refer to the sum of the R statistic over all new items in the tth and (t − 1)th iteration, respectively. \({\textbf{Q}}_{q_m}\) refers to the set of possible attribute vectors of the mth new item, and \({\hat{\textbf{q}}}_m\) is the estimate of the attribute vector of item m. \(\hat{\boldsymbol{\alpha}}\) refers to the AMP estimates of the respondents who were administered the new item. \({\hat{s}}_m\) and \({\hat{g}}_m\) are the estimates of the slipping and guessing parameters, and \({s}_m^0\) and \({g}_m^0\) are their initial values. In the context of cognitive diagnosis, CD-MA, CD-OEM, and CD-MEM refer to online calibration of item parameters based on Method A, OEM, and MEM, respectively

The calibration of the mth item proceeds as follows (a code sketch follows the list):

  • Step 1: Estimate the attribute vector for the mth new item:

    (1) Obtain \(\hat{\boldsymbol{\alpha}}\) for each examinee based on their responses to the operational items;

    (2) Assigning the initial slipping and guessing parameters as 0.25, estimate the attribute vector of each new item based on the proposed R statistic.

  • Step 2: Based on the estimated attribute vectors obtained from the preceding step, apply the CD-MA, CD-OEM, or CD-MEM method to update the slipping and guessing parameters of the mth item.
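Below is a minimal sketch of the whole loop, reusing residual_stat() from the earlier sketch and the closed-form CD-Method A update as the second step (CD-OEM or CD-MEM could be substituted). For simplicity it assumes every respondent answered every new item; in CD-CAT, each new item would use only the nm respondents who actually saw it. Parameter updates are clipped to (0, 0.5) per the assumption of Theorem 1.

```python
import numpy as np
from itertools import product

def two_step_calibration(alpha_hat, x_new, K, s0=0.25, g0=0.25,
                         tol=1e-4, max_iter=50):
    """Sketch of the iterative two-step method for M new items.

    alpha_hat : (N, K) AMPs estimated from the operational items
    x_new     : (N, M) 0/1 responses to the new items
    """
    M = x_new.shape[1]
    candidates = [np.array(c) for c in product([0, 1], repeat=K) if any(c)]
    s, g = np.full(M, s0), np.full(M, g0)
    q_hat = [candidates[0]] * M
    old_R = np.inf
    for _ in range(max_iter):
        for m in range(M):
            # Step 1: q_m minimizing R at the current (fixed) item parameters
            q_hat[m] = min(candidates,
                           key=lambda c: residual_stat(alpha_hat, c,
                                                       s[m], g[m], x_new[:, m]))
            # Step 2: CD-Method A update of (s_m, g_m) at the chosen q_m
            eta = np.all(alpha_hat >= q_hat[m], axis=1)
            if eta.any():
                s[m] = np.clip(np.mean(x_new[eta, m] == 0), 0.01, 0.49)
            if (~eta).any():
                g[m] = np.clip(np.mean(x_new[~eta, m] == 1), 0.01, 0.49)
        # stop when the summed R statistic stabilizes
        new_R = sum(residual_stat(alpha_hat, q_hat[m], s[m], g[m], x_new[:, m])
                    for m in range(M))
        if abs(new_R - old_R) < tol:
            break
        old_R = new_R
    return q_hat, s, g
```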

Two practical concerns arise when using the iterative two-step procedure in real applications. One is that the true AMPs are unknown, and the AMPs estimated from responses to the operational items are used in their place. The other is that Theorem 1 holds only when N → ∞. Therefore, the robustness of the proposed procedure in the presence of unknown AMPs and limited sample size remains to be examined, and a simulation study is conducted to evaluate it. According to the results of Chen et al. (2015), SIE and SimIE have almost the same performance with sample sizes smaller than 1600. Since our main goal is to compare online item calibration methods in the context of CD-CAT with relatively small sample sizes, only the JEA, SIE, and the three residual-based methods are included in the following simulation study. The purpose of this article is thus twofold: (a) to introduce three residual-based methods implemented in an iterative algorithm for online calibration in cognitive diagnostic assessment, and (b) to examine how the performance of these methods compares to that of the JEA and SIE under a wide range of conditions by means of a simulation study.

Simulation study

Diagnostic assessment holds great promise for classroom use, which calls for consideration of small sample sizes and short test lengths. Furthermore, the AMP distributions most likely differ across classes. Therefore, in a comprehensive simulation study, we evaluate the performance of the proposed methods under various conditions, e.g., different sample sizes, test lengths, distributions of AMPs, and proportions of new items to operational items. The performance of the proposed methods is compared against two existing methods, JEA and SIE. Each condition is replicated 1000 times. Following Chen et al. (2012), the number of attributes measured by the test is set to K = 6, so the number of possible AMPs is 2^6 = 64. The comparison is made in terms of the accuracy of the estimation of the attribute vectors of the new items, the slipping and guessing parameters, and the respondents' AMPs.

Sample Size

Five sample sizes (200, 400, 600, 800, and 1000) are considered. The first three are small sample sizes, and the last two are medium sample sizes.

Test Length

Three test lengths (20, 30, and 40) are considered, each test consisting of a mix of operational items and new items. For each test length, the ratio of new to operational items (denoted λ) is 1:4, 1:3, or 1:2. For example, at a test length of 30, there could be six new and 24 operational items, roughly eight new and 22 operational items, or ten new and 20 operational items.

Respondent Generation

We generate the AMPs of respondents in a similar way to Chen et al. (2012) and Chen et al. (2015). Two independent groups of respondents are simulated. In the first group, each respondent has a 50% probability of mastering each attribute, i.e., all attributes are equally "difficult". In the second group, the probability of mastery varies across attributes: it is set at 0.65, 0.25, 0.75, 0.45, 0.55, and 0.35 for attributes 1 to 6, where 0.65 and 0.75 represent low difficulty, 0.45 and 0.55 medium difficulty, and 0.25 and 0.35 high difficulty.
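A minimal sketch of generating the two respondent groups (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # arbitrary seed
N, K = 200, 6                         # sample size and number of attributes

# Group 1: every attribute mastered with probability .5
amp_same = rng.binomial(1, 0.5, size=(N, K))

# Group 2: attribute-specific mastery probabilities
p_master = np.array([0.65, 0.25, 0.75, 0.45, 0.55, 0.35])
amp_diff = rng.binomial(1, p_master, size=(N, K))
```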

Item Bank Generation

Similar to Chen et al. (2012) and Chen et al. (2015), two item banks are simulated based on the ranges of the item parameters. The slipping and guessing parameters are randomly drawn from U(0.05, 0.25) for the first item bank, which features items with high discrimination (Kaplan et al., 2015), and from U(0.15, 0.35) for the second, resulting in an item bank of low discrimination (Kaplan et al., 2015). A total of 360 items with the same Q-matrix as in Chen et al. (2012) are generated. Typically, highly discriminating items involve less noise (as represented by slipping and guessing) and lead to better measurement outcomes.
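Drawing the two banks' parameters is straightforward; the Q-matrix itself is taken from Chen et al. (2012) and is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(seed=2)   # arbitrary seed
J = 360                               # bank size, as in the study

s_high = rng.uniform(0.05, 0.25, size=J)   # high-discrimination bank
g_high = rng.uniform(0.05, 0.25, size=J)
s_low = rng.uniform(0.15, 0.35, size=J)    # low-discrimination bank
g_low = rng.uniform(0.15, 0.35, size=J)
```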

New Item Generation

As in Chen et al. (2012) and Chen et al. (2015), the number of new items is set to 20, i.e., Qnew contains 20 items, whose attribute vectors are randomly drawn from those of the operational item bank. The set of new items is drawn either from the low-discrimination bank or from the high-discrimination bank, denoted New1 and New2, respectively. Table 1 presents detailed information on the new items.

Table 1 The settings of the new items

Simulation of CD-CAT and Online Calibration

For each respondent, the CD-CAT and the online calibration proceed as follows: (1) generate the initial AMP estimate randomly, with each attribute having an equal probability of being mastered or not; (2) select the next item based on the most recent AMP estimate; (3) generate the response to the selected item and update the AMP estimate according to the responses to all previously administered items. Steps 2 and 3 are repeated until the stopping rule is satisfied. During the process, a certain number of new items (1/3, 1/4, or 1/5 of the test length) are randomly seeded into each respondent's test. Three fixed test lengths, L = 20, 30, and 40, are simulated, and the item selection strategy for operational items is the Shannon entropy method (SHE; Cheng, 2009; Tatsuoka, 2002; Xu et al., 2003). The prior distribution of the AMP is assumed to be uniform. It should be noted that the AMP estimates of CD-Method A, CD-OEM, and CD-MEM are based on the operational items only, while those of SIE and SimIE are based on both the operational and new items.
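A sketch of SHE selection for one respondent: for each unused operational item, compute the entropy of the updated posterior under each possible response, weight by the marginal response probabilities, and pick the item with the smallest expected entropy (implementation details ours):

```python
import numpy as np

def she_select(post, profiles, bank_q, bank_s, bank_g, administered):
    """Shannon entropy (SHE) item selection for one respondent (a sketch).

    post : (2**K,) current posterior over all attribute profiles
    Picks the unused operational item minimizing expected posterior entropy.
    """
    best_h, best_j = np.inf, None
    for j in range(len(bank_q)):
        if j in administered:
            continue
        eta = np.all(profiles >= bank_q[j], axis=1).astype(float)
        p1 = (1 - bank_s[j]) ** eta * bank_g[j] ** (1 - eta)  # P(X_j=1 | alpha_v)
        exp_h = 0.0
        for p_x in (p1, 1.0 - p1):                 # possible responses 1 and 0
            marg = np.sum(post * p_x)              # P(X_j = x)
            upd = post * p_x / marg                # posterior given X_j = x
            exp_h += marg * -np.sum(upd * np.log(upd + 1e-12))
        if exp_h < best_h:
            best_h, best_j = exp_h, j
    return best_j
```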

Update of the AMP

In the simulation, the maximum a posteriori (MAP; Huebner & Wang, 2011) method is used to update the AMP estimates of respondents:

$${\hat{\boldsymbol{\alpha}}}_i=\underset{v=1,2,\cdots, {2}^K}{\textrm{argmax}}P\left({\boldsymbol{\alpha}}_v|{\textbf{X}}_i\right),$$
(7)

where Xi refers to the response pattern of the ith respondent. As noted by Chen et al. (2012), the AMP is re-estimated after each operational item is answered. The test terminates as soon as the test length reaches L.
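A minimal sketch of the MAP update, accumulating the log-likelihood contributions of the administered operational items over all 2^K profiles (uniform prior by default):

```python
import numpy as np

def map_estimate(profiles, resp, item_q, item_s, item_g, prior=None):
    """MAP update of one respondent's AMP (Eq. 7), a sketch.

    resp   : (J_admin,) 0/1 responses to the administered operational items
    item_q : (J_admin, K) attribute vectors of those items
    """
    V = len(profiles)
    log_post = np.log(np.full(V, 1.0 / V) if prior is None else prior)
    for x, q, s, g in zip(resp, item_q, item_s, item_g):
        eta = np.all(profiles >= q, axis=1).astype(float)
        p1 = (1 - s) ** eta * g ** (1 - eta)       # P(X=1 | alpha_v) for this item
        log_post += np.log(p1 if x == 1 else 1 - p1)
    return profiles[np.argmax(log_post)]           # argmax over all 2**K profiles
```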

Evaluation Criteria

For each condition, the following seven criteria are applied to evaluate the performance of the online calibration methods. The first three indices evaluate the estimation of the AMPs, while the remaining indices address the estimation accuracy of the item parameters and attribute vectors of the new items.

Person Pattern Accuracy Rate (PPAR)

The PPAR represents the proportion of respondents whose AMPs are correctly estimated, which is defined as follows:

$$PPAR=\frac{\sum_{i=1}^NI\left({\boldsymbol{\alpha}}_i={\hat{\boldsymbol{\alpha}}}_i\right)}{N},$$
(8)

where \(I\left({\boldsymbol{\alpha}}_i={\hat{\boldsymbol{\alpha}}}_i\right)\) is an indicator function that equals 1 if the estimated AMP \({\hat{\boldsymbol{\alpha}}}_i\) of the ith respondent equals its true value αi, and 0 otherwise.

Person Attribute Accuracy Rate (PAAR)

The PAARk quantifies the estimation accuracy rate for attribute k:

$${PAAR}_k=\frac{\sum_{i=1}^NI\left({\alpha}_{ik}={\hat{\alpha}}_{ik}\right)}{N}.$$
(9)

Average Person Attribute Accuracy Rate (APAAR)

The APAAR summarizes the average attribute estimation accuracy at the person level for the CD-CAT, which can be determined as follows

$$APAAR=\frac{\sum_{i=1}^N\sum_{k=1}^KI\left({\alpha}_{ik}={\hat{\alpha}}_{ik}\right)}{NK}.$$
(10)

The following four indices evaluate the estimation of the new items.

Root Mean Squared Error (RMSE)

The RMSE summarizes the overall performance of the calibration accuracy of the slipping and guessing parameters of the M new items (Chen et al., 2012; Chen et al., 2015):

$${s}_{RMSE}=\sqrt{\frac{1}{M}\sum_{m=1}^M{\left({s}_m-{\hat{s}}_m\right)}^2},$$
(11)
$${g}_{RMSE}=\sqrt{\frac{1}{M}\sum_{m=1}^M{\left({g}_m-{\hat{g}}_m\right)}^2}.$$
(12)

Item Pattern Accuracy Rate (IPAR)

The IPAR indicates the calibration accuracy for the attribute vector of the new items, which is defined as follows:

$$IPAR=\frac{\sum_{m=1}^MI\left({\hat{\textbf{q}}}_m={\textbf{q}}_m\right)}{M},$$
(13)

where I(∙) is an indicator function: \(I\left({\hat{\textbf{q}}}_m={\textbf{q}}_m\right)\) returns a value of 1 when \({\hat{\textbf{q}}}_m\) and qm are equal, and returns a 0 otherwise.

Item Attribute Accuracy Number (IAAN)

The IAAN quantifies the average number of attributes per item that are specified correctly for the new items:

$$IAAN=\frac{\sum_{m=1}^M\sum_{k=1}^KI\left({\hat{q}}_{mk}={q}_{mk}\right)}{M}.$$
(14)

Among the preceding indices, PPAR, PAAR, and APAAR summarize the estimation accuracy of the AMPs; higher values indicate better estimation. sRMSE and gRMSE evaluate the item parameter estimation accuracy for the new items; smaller values indicate more accurate estimation. IPAR and IAAN quantify the attribute vector estimation accuracy for the new items, with larger values representing more accurate estimation.
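For reference, all of these indices (Eqs. 8-14) are one-liners; a minimal sketch, with array shapes as in the earlier sketches:

```python
import numpy as np

def ppar(alpha, alpha_hat):
    """Proportion of respondents with fully correct AMP estimates (Eq. 8)."""
    return np.mean(np.all(alpha == alpha_hat, axis=1))

def paar(alpha, alpha_hat):
    """Per-attribute accuracy rates (Eq. 9); their mean is APAAR (Eq. 10)."""
    return np.mean(alpha == alpha_hat, axis=0)

def rmse(true, est):
    """RMSE of slipping or guessing parameters (Eqs. 11-12)."""
    return np.sqrt(np.mean((np.asarray(true) - np.asarray(est)) ** 2))

def ipar(Q, Q_hat):
    """Proportion of new items with fully correct attribute vectors (Eq. 13)."""
    return np.mean(np.all(Q == Q_hat, axis=1))

def iaan(Q, Q_hat):
    """Average number of correctly specified attributes per new item (Eq. 14)."""
    return np.mean(np.sum(Q == Q_hat, axis=1))
```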

Results

Figure 2 and Table 2 provide the indices of AMP estimation accuracy for the CD-CAT (PPAR, PAAR, and APAAR) under the sample size of 200. (Results for other sample sizes show similar patterns and are omitted to save space; they are available upon request.) It should be noted that these three indices are calculated based only on the operational items. The two uppercase letters in the first column of the tables refer to the range of the item parameters and the attribute mastery probability: "L" and "H" denote low- and high-discrimination items with parameter ranges [0.15, 0.35] and [0.05, 0.25], respectively, and "S" and "D" refer to respondents with the same and different mastery probabilities, respectively. Results indicate that tests with highly discriminating items are indeed better for estimating respondents' attribute profiles, consistent with expectation. For example, a test with 13 high-discrimination operational items (i.e., the 20-item highly discriminating test with a 1:2 new-to-operational ratio) can reach a PPAR comparable to that of a test with 22 low-discrimination operational items (i.e., the low-discrimination test with a test length of 30 and a 1:4 new-to-operational ratio). Similar results between HS and HD, as well as between LS and LD, suggest that the attribute mastery probabilities have little effect on the estimation of respondents' attribute profiles. Because the test length of the CD-CAT is fixed, AMP estimation precision decreases as the number of seeded new items increases, since AMP estimation depends on the responses to the operational items. For example, for the 20-item test with new-to-operational ratios of 1:4, 1:3, and 1:2, the PPARs are 0.944, 0.906, and 0.810, respectively.

Fig. 2

The PPAR (Person Pattern Accuracy Rate) of the new items. Note. The first letter ‘H’ or ‘L’ in the labels for the x-axis refers to items with high- or low-discrimination, the second letter ‘S’ or ‘D’ refer to respondents with the same or different attribute mastery probability (ies). The number after the underscore refers to the test length. For example, HS_20 refers to the test with highly discriminative items and test length of 20. The numbers in the legend refer to the ratio of the number of seeded new items to the number of operational items

Table 2 Estimation accuracy of the respondents under the sample size of 200

The six columns below PAAR in Table 2 give the estimation accuracy for each of the six attributes. They indicate that tests with high-discrimination items result in higher PAAR, and tests with more operational items also lead to higher accuracy, as can easily be seen for the PPAR in Fig. 2. When the test length reaches 40, the difference caused by the ratio of new to operational items becomes less pronounced (see Fig. 2). On the other hand, the distribution of the attribute mastery probability shows only a small effect on the estimation of respondents' attribute profiles. Table 2 also shows that the PPAR and APAAR indices follow the same trend as the PAAR.

Tables 3, 4, 5, 6 and 7 present the IPAR index for the new items. Based on the results, more discriminating items, i.e., items with lower guessing and slipping parameters, are beneficial for online calibration. The proposed residual-based (R-based) methods outperform the JEA and SIE methods in estimating the attribute vectors of the new items. When all attributes are equally likely to be mastered, RMEM has the highest IPAR in most cases. Between JEA and SIE, there does not seem to be a consistent winner in terms of the IPAR index, suggesting that the Bayesian version of the JEA cannot always borrow enough information to help the item calibration. Among the R-based methods, RMA and ROEM perform similarly. Results also suggest that a higher IPAR can be obtained with more seeded new items. For example, with a sample size of 200 and a test length of 20, the IPAR for RMEM under the three ratios of seeded new items to operational items is 0.464, 0.524, and 0.538, respectively (see Table 3). Seeding more new items yields more responses to each new item and subsequently better estimation of the new items' attribute vectors. Consider a sample size of 400 and a 20-item test: if five new items are seeded (a 1:3 new-to-operational ratio), about 400 × 5/20 = 100 respondents answer each new item on average, whereas if seven new items are seeded (a 1:2 ratio), about 400 × 7/20 = 140 respondents answer each new item on average. Meanwhile, fewer operational items lead to a lower PPAR, which is harmful to calibration. Therefore, a trade-off between the numbers of seeded new items and operational items needs to be considered.

Table 3 The IPAR (Item Pattern Accuracy Rate) for the new items with the sample size of 200
Table 4 The IPAR (Item Pattern Accuracy Rate) for the new items with the sample size of 400
Table 5 The IPAR (Item Pattern Accuracy Rate) for the new items with the sample size of 600
Table 6 The IPAR (Item Pattern Accuracy Rate) for the new items with the sample size of 800
Table 7 The IPAR (Item Pattern Accuracy Rate) for the new items with the sample size of 1000

All five methods perform better with more discriminating items, which is consistent with the findings of Chen et al. (2012). For example, for the RMEM method in the 20-item test with 200 respondents, the IPAR values for the HS and LS conditions with a 1:4 new-to-operational ratio are .790 and .464, respectively. Regarding the two distributions of respondents' attribute mastery probability, each method performs better in terms of IPAR when respondents have the same attribute mastery probability of 0.5. Again taking the 20-item test with 200 respondents as an example, under the HS and HD conditions with a 1:4 new-to-operational ratio, the IPARs of the RMEM method are .790 and .708, respectively.

The same trend in the IAAN index is observed across the five sample sizes. Hence, we only provide the results for the sample sizes of 200 and 400, which are presented in Tables 8 and 9. For this index, 6 means that all attributes of an item are estimated correctly, and the closer to 6 the better. As we can see, RMEM performs best in most of the conditions, while RMA and ROEM have comparable IAAN in some cases. For example, 4.897 attributes are correctly recovered on average under the condition of a 20-item test with 1/4 seeded new items and respondents with uniform attribute mastery probability.

Table 8 The IAAN (Item Attribute Accuracy Number) for the new items with the sample size of 200
Table 9 The IAAN (Item Attribute Accuracy Number) for the new items with the sample size of 400

Regarding the item parameter estimation of the new items, RMA and RMEM lead to comparable RMSEs for both the slipping and guessing parameters, and together they outperform the other three methods. As shown in Tables 10, 11, 12, 13 and 14, ROEM yields higher sRMSE and gRMSE than RMEM and RMA. As discussed before, the information borrowed from the respondents' posterior distributions may not be enough to improve the online item calibration, and in most cases the JEA has the largest sRMSE and gRMSE. As with the attribute vector estimation, each method performs better or comparably when respondents have the same attribute mastery probability. With more seeded new items, the estimation of the new items improves, as more seeded new items per respondent mean more responses collected for each new item. Although the estimation accuracy of the respondents' AMPs decreases with more seeded new items, the increase in the number of respondents per new item can improve the calibration of the new items, again pointing to a trade-off.

Table 10 The RMSE (root mean squared error) of the item parameters for the new items with the sample size of 200
Table 11 The RMSE (root mean squared error) of the item parameters for the new items with the sample size of 400
Table 12 The RMSE (root mean squared error) of the item parameters for the new items with the sample size of 600
Table 13 The RMSE (root mean squared error) of the item parameters for the new items with the sample size of 800
Table 14 The RMSE (root mean squared error) of the item parameters for the new items with the sample size of 1000

Figure 3 illustrates the IPAR for the sample size of 200 under different test lengths. On one hand, the IPAR improves with more seeded new items (moving from a 1:4 to a 1:2 new-to-operational ratio within each test length). On the other hand, the IPAR increases with the test length, and its full range gets tighter. Figure 4 shows the IPAR in the 20-item test with a 1:4 new-to-operational ratio under different sample sizes. The R-based methods clearly have higher IPAR; JEA outperforms SIE when the sample size is smaller than 600, and SIE has an equal or higher IPAR than JEA when the sample size is 600 or larger. Figure 5 shows the IPAR for the RMEM method in the 20-item test under different sample sizes, indicating that the proposed method performs better both when the items are highly discriminating and when the attribute mastery probability is uniform across attributes.

Fig. 3

The IPAR (Item Pattern Accuracy Rate) in different test lengths with 200 respondents. Note. The first letter ‘H’ or ‘L’ in the legend refer to items with high- or low-discrimination, the second letter ‘S’ or ‘D’ refer to respondents with the same or different attribute mastery probability (ies), and \(\frac{1}{2}\), \(\frac{1}{3},\) or \(\frac{1}{4}\) denote the rate of new to operational items. RMA, ROEM, and RMEM are variations of CD-MA, CD-OEM, and CD-MEM, respectively. JEA and SIE refer to the joint estimation algorithm and the single item estimation method, respectively

Fig. 4

The IPAR (Item Pattern Accuracy Rate) in the 20-item test with 1/4 seeded new items under different sample sizes. Note. The first letter ‘H’ or ‘L’ in the legend refer to items with high- or low-discrimination, and the second letter ‘S’ or ‘D’ refer to respondents with the same or different attribute mastery probability (ies). RMA, ROEM, and RMEM are variations of CD-MA, CD-OEM, and CD-MEM, respectively. JEA, and SIE refer to the joint estimation algorithm and the single item estimation method, respectively

Fig. 5

The IPAR (Item Pattern Accuracy Rate) for the RMEM method with different sample sizes in the 20-item test. Note. The first letter ‘H’ or ‘L’ in the legend refer to items with high- or low-discrimination, the second letter ‘S’ or ‘D’ refer to respondents with the same or different attribute mastery probability (ies), and \(\frac{1}{2}\), \(\frac{1}{3},\) or \(\frac{1}{4}\) denotes the rate of new to operational items. RMEM is a variation of the CD-MEM method

It is worth pointing out that although the proposed method performs well in calibrating new items in small samples and theoretically does not depend on the initial values of the item parameters, it relies on accurate estimation of the respondents' AMPs. In other words, its independence of the initial item parameters is premised on the AMPs being estimated sufficiently well from the operational items. For that reason, neither the number of operational items taken by each respondent nor the number of respondents taking each new item should be too small.

Real data example

Because a real CD-CAT dataset was unavailable, a dataset collected from a non-adaptive test is used to illustrate the proposed iterative two-step method. It is important to note that this does not mean that the proposed method is restricted to non-adaptive testing. The application to non-adaptive testing can be viewed as a special case in which the attribute profiles of test takers are obtained from responses to items with known attribute vectors (corresponding to the operational items in adaptive testing), and the items to be estimated correspond to the new items in online calibration of CD-CAT. In fact, although the motivation for this approach was to develop an online calibration method for adaptive testing, the method can be used for both adaptive and non-adaptive tests.

The real dataset used here was collected in a learning experiment at the University of Tuebingen in Germany. It contains responses from 504 examinees to 12 elementary probability theory problems that measure the following four attributes: (A1) calculate the classic probability of an event, (A2) calculate the probability of the complement of an event, (A3) calculate the probability of the union of two disjoint events, and (A4) calculate the probability of two independent events. The Q-matrix was initially produced by content experts, and the response data are available in the R package pks (Heller & Wickelmaier, 2013). Wang et al. (2020) applied several methods to estimate the Q-matrix by treating eight of the 12 items as operational and the remaining four as new. Here we follow a similar strategy: we take items 1, 2, 3, 4, 6, 7, 9, and 11 as operational items and the remaining four (items 5, 8, 10, and 12) as new. The Q-matrix for the operational items and the original Q-matrix for the new items from the pks package are given in Table 15.

Table 15 The Q-matrix for the operational items, and the original and suggested Q-matrix for the new items

Responses to the eight operational items are denoted XO, and responses to the four new items XN. Based on the two-step online item calibration method, we follow the process below to obtain the Q-matrix for the new items:

  (1) Obtain the estimate of the attribute profile \(\hat{\boldsymbol{\alpha}}\) of each examinee based on XO;

  (2) Assigning the initial slipping and guessing parameters as 0.25, estimate the attribute vector of each new item based on the proposed R statistic;

  (3) Based on the attribute vectors obtained from the preceding step, apply the CD-MEM method to estimate the slipping and guessing parameters;

  (4) Repeat steps 2 and 3 until the convergence criterion is met.

The estimated Q-matrix for the new items is presented at the bottom of Table 15. The proposed method suggests four changes to the original Q-matrix, all from 1 to 0, which seems to indicate that the method tends to assign fewer attributes to each new item. Take the first new item (named p105 in the pks package) as an example; its stem is "Given a standard deck containing 32 different cards, what is the probability of not drawing a heart?" The RMEM suggests that it measures only attribute A1, calculating the classic probability of an event, whereas the original specification lists A1 and A2, where A2 refers to "calculate the probability of the complement of an event". Based on our analysis, answering this item does not seem to require mastery of A2. The estimated Q-matrix could serve as a reference for domain experts, who can further review the changes.

Conclusions and further discussion

In this paper, we proposed a method based on a residual statistic to estimate the attribute vectors of new items in the online calibration of CD-CAT. The rationale for using the residual-based statistic in online calibration is presented in Appendix A: essentially, the residual statistic is minimized when the attribute vector of a new item is at its true value, regardless of the item parameters. An iterative two-step online calibration method was thus developed in the context of CD-CAT, in which the attribute vectors and item parameters are estimated in separate steps iteratively. By coupling CD-MA, CD-OEM, and CD-MEM with the residual-based statistic, three new online calibration methods, RMA, ROEM, and RMEM, were developed. The analytical result in Appendix A holds when N → ∞ and the AMPs of respondents are known. When the AMPs must be estimated and the sample size is limited, the performance of RMA, ROEM, and RMEM is not guaranteed to be optimal, but can still be superior to that of existing methods.

The results from the simulation study indicate that the methods based on the proposed statistic work well in terms of both item parameter recovery and attribute vector recovery, even with a small sample size. Compared to the JEA and SIE methods, the residual-based methods show clear advantages, especially with small samples. Results also suggest that RMA and ROEM perform similarly in estimating the attribute vectors of the new items, and that RMA and SIE perform similarly in estimating the item parameters of the new items, especially in tests with highly discriminating items. For a CD-CAT system, the quality of items (operational and new) is very important, because it strongly affects the efficiency and accuracy of the test, as well as the online calibration.

Several future research directions should be considered. First, the Q-matrix in this study was generated assuming that attributes are independent. In more realistic conditions, relationships may exist among the attributes, such as hierarchical relationships (Leighton et al., 2004). Non-independence may affect the performance of the proposed methods, which is worth investigating in the future. Second, the proposed methods were evaluated under the DINA model; they could be adapted to many other CDMs, such as the RRUM (Hartz, 2002), DINO (Templin & Henson, 2006), and more general models (e.g., Ma & de la Torre, 2016, 2019) such as the G-DINA model (de la Torre, 2011). Under the G-DINA model, each respondent is classified into one of \({2}^{k_m^{\ast }}\) groups, where \({k}_m^{\ast }=\sum_{k=1}^K{q}_{mk}\). The residual statistic defined in Eq. (6) can then be adapted for the G-DINA model as follows:

$${R}_m\left(\boldsymbol{\alpha}, {\textbf{q}}_m,{s}_m,{g}_m\right)=2\sum\nolimits_{l=1}^{2^{k_m^{\ast }}}\sum\nolimits_{i=1}^{n_{lm}}\log \left\{{\left[\frac{1-p\left({\alpha}_{lm}^{\ast}\right)}{p\left({\alpha}_{lm}^{\ast}\right)}\right]}^{x_{im}}{\left[\frac{p\left({\alpha}_{lm}^{\ast}\right)}{1-p\left({\alpha}_{lm}^{\ast}\right)}\right]}^{1-{x}_{im}}\right\},$$
(15)

where nlm refers to the number of respondents with reduced attribute vector \({\boldsymbol{\alpha}}_{lm}^{\ast }=\left({\alpha}_{l{m}_1},\cdots, {\alpha}_{l{m}_{k_m^{\ast }}}\right)\), and \(p\left({\boldsymbol{\alpha}}_{lm}^{\ast}\right)=p\left({X}_{im}=1|{\boldsymbol{\alpha}}_{lm}^{\ast}\right)\) denotes the probability that respondents with attribute pattern \({\boldsymbol{\alpha}}_{lm}^{\ast }\) answer item m correctly. By defining an appropriate residual statistic, the proposed method is thus potentially applicable to other models. That said, it remains to be investigated how well the adapted residual statistic works, and whether statistical properties such as the one demonstrated in Theorem 1 still hold for other models.
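As an illustration, a sketch of this adapted statistic, assuming the respondents answering item m have already been grouped by their reduced attribute patterns (the multiplicative form, which follows the definition in Eq. 5, is used; names are ours):

```python
import numpy as np

def residual_stat_gdina(p_group, x_groups):
    """Adapted residual statistic of Eq. (15) for G-DINA (a sketch).

    p_group  : (2**k_star,) P(X = 1) for each reduced attribute pattern
    x_groups : list of 0/1 response arrays, one per reduced pattern
    """
    R = 0.0
    for p, x in zip(p_group, x_groups):
        # |x - E(X)| / P(x) equals (1-p)/p for x = 1 and p/(1-p) for x = 0
        term = ((1 - p) / p) ** x * (p / (1 - p)) ** (1 - x)
        R += 2.0 * np.sum(np.log(term))
    return R
```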

Third, this study assumes that the attribute vectors and item parameters of all operational items are known. In reality, they must have been estimated or specified by content experts at some point. How will the proposed methods perform when the attribute vectors, the item parameters, or both are misspecified for some operational items? How badly will different methods react to such misspecification? These issues are yet to be investigated. Finally, the recent popularity of online learning environments has prompted advances in continuous item calibration for CAT that may not require any operational items to begin with (Fink et al., 2018). The same philosophy may be applicable to CD-CAT and is certainly an interesting direction to pursue.