Computer adaptive testing (CAT) is a measurement approach that uses item response theory (IRT; Lord & Novick, 1968) to generate tailored tests for examinees in real time on the basis of their responses to previous items (Lord, 1971, 1980). By administering the items that are most informative for an examinee, CATs provide precise measurement of an examinee’s proficiency with relatively few items. However, CATs require the consideration of many practical components, making their implementation relatively complicated when compared to paper-and-pencil administrations (Parshall, Spray, Kalohn, & Davey, 2002; Wainer, Dorans, Flaugher, Green, & Mislevy, 2000). These required CAT components include: (a) an item pool with known item characteristics, (b) a response model appropriate for the item type, (c) an item selection algorithm, (d) an ability estimation procedure, and (e) a termination criterion to end item administration (Dodd, De Ayala, & Koch, 1995; Reckase, 1989; Weiss & Kingsbury, 1984). In addition, the algorithm used for item selection may be constrained to include content balancing and exposure control mechanisms, particularly in high-stakes testing (Boyd, Dodd, & Choi, 2010). Content-balancing methods ensure that each test administered covers multiple domains according to predetermined specifications. Exposure control procedures are designed to enhance test security by protecting items from overexposure. Both procedures place constraints on maximum-information item selection; these constraints consequently increase test length and decrease measurement efficiency (Weiss, 2004). Therefore, CAT components must be carefully selected so that appropriate content coverage and test security are provided while maintaining a psychometrically sound estimate of an examinee’s ability (Weiss, 2004).

CATs have two primary advantages over conventional linear test forms, which administer the same items to all examinees. One advantage is that CATs provide precise measurement of all examinees throughout the proficiency range (Lord, 1971). This is due to the item selection algorithm used in CATs, which selects the most informative item for an examinee after each item administration. All examinees have an initial ability estimate (usually the population mean) that is updated after the administration of each item. This interim ability estimate is used in the item selection procedure, so that an appropriate item is selected from the item pool as the next item to be administered. Another benefit of CATs is increased measurement efficiency in comparison to conventional tests. Efficient measurement uses the fewest items possible to gain the most information about an examinee. The efficiency of CATs derives from the selection of informative items on the basis of an examinee’s current ability estimate, such that items that are too easy or too hard are not administered. This item selection procedure has the potential to reduce test length by 80% (De Ayala, 2009).

These two fundamental benefits define the goal of CAT administration—to simultaneously maximize measurement precision and efficiency. Though measurement precision and efficiency are both increased by the information provided by selected items, they are inversely related to one another because of their differing relationships with test length. Whereas measurement precision increases with test length, efficiency requires that traits be measured with as few items as possible. Measurement precision and efficiency must therefore be balanced so that only as many items as are needed to gain a sufficiently reliable estimate of an examinee’s proficiency are administered.

Thus, an important consideration in CATs is how many items to administer before estimating the final ability level. This is determined by a termination criterion, or stopping rule, which ends each examinee’s test once they have been assessed equivalently according to some prespecified standard. There are several stopping rules for CATs, which differ in the criteria used to indicate that an examinee has been measured sufficiently; however, their relative performance under the generalized partial-credit model (GPCM; Muraki, 1992) has not been studied. This simulation study provides a comprehensive examination of the performance of several variable-length stopping rules (i.e., standard error, minimum information, and absolute change in theta estimate) under the GPCM. Importantly, in contrast to previous research examining the performance of stopping rules in CAT, we employ content balancing and item exposure controls to simulate high-stakes testing conditions.

Stopping rules

Stopping rules can be categorized as either fixed-length (FL) or variable-length, referring to whether test length is equal or varies across examinees. FL stopping rules are the most straightforward, as they end an examinee’s test after a predetermined number of items have been administered, regardless of the examinee’s ability. However, using an FL test results in inconsistent measurement precision (i.e., the standard error of the θ ability estimate) across the range of abilities (Leroux & Dodd, 2014). These imprecise ability estimates are problematic, especially in high-stakes testing (e.g., licensure or certification testing), in which they can have detrimental implications. Although administering a greater number of items can provide more precise ability estimates, doing so decreases the efficiency of the CAT and can lead to item overexposure. Conversely, some examinees may be measured with high precision after responding to only a few items, so administering additional items unnecessarily exposes items and increases examinee burden.

Variable-length stopping rules are designed to provide equivalent measurement precision across examinees by ending item administration after a prespecified measurement standard has been satisfied. These tests are of variable length because examinees may take a different number of items before the criterion for test termination is met. Researchers have developed several variable-length stopping rules, which all aim to administer as few items as needed to obtain a psychometrically sound estimate of an examinee’s ability but differ in the criteria used to indicate that an examinee’s ability has been measured adequately. One variable-length stopping rule is the standard error (SE) stopping rule, which terminates item administration when a prespecified standard error of the present ability estimate has been reached (Dodd, Koch, & De Ayala, 1989). After each interim ability level has been estimated, the standard error associated with the examinee’s current ability estimate is evaluated. If this standard error is below some prespecified value, then item administration ends. If not, item administration continues until the standard error associated with the interim ability estimate falls below the criterion value. Once item administration terminates, the most recent interim ability estimate becomes the final ability estimate.

Rather than evaluating the precision of the ability estimate of the examinee, the minimum-information (MI; Dodd et al., 1989) stopping rule focuses on the quality of the items in terms of the information they provide. This stopping rule determines when a test is completed by evaluating the information of the available items remaining in the pool after each item is administered. If eligible items in the item pool provide some sufficiently high level of information on the basis of the interim ability estimate, then item administration continues. When the information of the remaining items falls below the specified level, item administration ends.

A more recently developed stopping rule is the absolute-change-in-theta (CT) stopping rule. This variable-length stopping rule regulates test length using the absolute change in an examinee’s theta estimate (\( \widehat{\uptheta} \)) after an item is administered (Babcock & Weiss, 2012). During a CAT administration, an examinee’s \( \widehat{\uptheta} \) generally changes with each additional item administered, though the size of this change lessens as the number of administered items increases. The CT stopping rule evaluates the absolute change in \( \widehat{\uptheta} \) after each item is administered to an examinee. When this change falls below a specified value, item administration ends.

FL and variable-length stopping rules are often combined, such that the variable-length termination criterion is used until the examinee has been administered a maximum number of items. Using a maximum test length in addition to a variable-length stopping rule is beneficial when there is a mismatch between the item pool distribution and an examinee’s ability level; without a maximum, such an examinee could be administered every item in the item pool because no items remain that can satisfy the variable-length stopping criterion. Thus, in order to keep CATs efficient, as well as to reduce examinee burden and item exposure, a secondary FL termination criterion is often used to stop item administration after a certain number of items.
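To make the termination logic concrete, the sketch below implements the three variable-length checks and their combination with minimum and maximum test-length constraints. This is an illustrative Python rendering (the simulations in this study were run in SAS), and the function names and array-style inputs are our own conventions rather than any published implementation.

```python
import numpy as np

def se_rule_met(se_theta, criterion=0.30):
    """SE rule: stop once the standard error of the interim theta
    estimate falls below the prespecified criterion."""
    return se_theta < criterion

def mi_rule_met(remaining_item_info, criterion=0.42):
    """MI rule: stop once no eligible item remaining in the pool
    provides at least the criterion information at the interim theta."""
    return len(remaining_item_info) == 0 or np.max(remaining_item_info) < criterion

def ct_rule_met(theta_now, theta_prev, criterion=0.02):
    """CT rule: stop once the absolute change in the theta estimate
    produced by the last item falls below the criterion."""
    return abs(theta_now - theta_prev) < criterion

def should_stop(rule_met, n_administered, min_items=0, max_items=None):
    """Combine a variable-length rule with optional minimum and maximum
    (fixed-length) test-length constraints."""
    if max_items is not None and n_administered >= max_items:
        return True  # FL component reached
    return n_administered >= min_items and rule_met

# Example: an SE-with-FL condition (stop when SE < 0.30, or at 20 items)
print(should_stop(se_rule_met(0.28), n_administered=14, max_items=20))
```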

Performance of CAT stopping rules

Previous research has examined the performance of different stopping rules for polytomous IRT models. The performance of a stopping rule can differ on the basis of its interaction with other aspects of CAT administration, such as item pool characteristics, the match of the item pool distribution to the examinee distribution, and whether items are dichotomously or polytomously scored (Boyd et al., 2010). CAT has greater measurement efficiency when used with polytomous items because each response category within an item provides additional information (Dodd et al., 1995). Because of the greater amount of information that polytomous items provide, fewer polytomous items are needed to obtain the same level of measurement precision as with dichotomous items.

Much of the previous research on stopping rules with polytomously scored items has examined the performance of the SE and MI stopping rules (with FL as a secondary termination criterion). Dodd, Koch, and De Ayala (1989) examined these stopping rules with the graded-response model (GRM; Samejima, 1969) and later with the partial-credit model (PCM; Masters, 1982; Dodd, Koch, & De Ayala, 1993), and Dodd (1990) used Andrich’s (1978) rating scale model (RSM). In these studies the SE stopping rule generally outperformed the MI stopping rule—fewer items were administered, the correlations between known and estimated ability levels were higher, and fewer nonconvergent cases resulted. The only simulation conditions in which the MI stopping rule outperformed the SE rule were those in which the information distribution of the item pool did not align with the trait distribution of the test-taker population. This was because the SE stopping rule administered more items than the MI stopping rule to examinees with ability levels far from the peak of the item pool’s information distribution.

More recently, Babcock and Weiss (2012) examined the performance of 14 different stopping rules on dichotomously scored items using the three-parameter logistic (3PL) IRT model. The researchers examined various cutoff values for the SE stopping rule, the MI stopping rule, combinations of SE and MI stopping rules, the CT stopping rule, and several fixed-length stopping rules. The authors concluded that the SE stopping rule works well in most cases, but recommended including a minimum number of items or using it in combination with an MI stopping rule. The information structure of the item bank had less impact on the performance of the CT stopping rule in comparison to other termination methods, meaning the CT rule might have increased utility when the information distribution of an item bank does not cover the range of examinee abilities in a sample (Babcock & Weiss, 2012).

Content balancing and exposure control

Exposure control and content balancing methods are commonly implemented in large-scale, high-stakes testing scenarios, which often require that test-takers be measured across multiple domains within a single ability continuum, such as addition, subtraction, multiplication, and division within an arithmetic achievement assessment (Weiss, 2004). Content balancing methods ensure that examinees answer a sufficient number of items from each domain, so that items are distributed across the domains in the same way for all examinees and their ability estimates are determined from similar content.

Large-scale assessments frequently use content balancing concurrently with an exposure control procedure. During CAT administration, different items in the item bank will naturally be used at different rates. Reducing item exposure rates is important in high-stakes testing scenarios, in which there is incentive for examinees to gain access to items prior to testing. Exposure control methods improve test security by controlling the exposure rates of items across a group of examinees. Exposure control and content balancing methods are both implemented by placing constraints on the maximum-information item selection procedure, restricting which items can be administered (Weiss, 2004). The incorporation of these constraints can decrease measurement precision (Davis, 2004; Davis & Dodd, 2003). Implementing exposure control and content balancing may also increase test length and reduce the efficiency of CATs, as more items are required to obtain the degree of precision specified by variable-length stopping rules (Leroux & Dodd, 2016; Leroux, Lopez, Hembry, & Dodd, 2013).

Despite the wide use of content balancing and exposure control methods in large-scale assessments, no prior research has compared variable-length stopping rules under their implementation. CATs incorporating both exposure control and content balancing have been studied with fixed-length tests (Boyd, Dodd, & Fitzpatrick, 2013). There has also been research comparing exposure control methods under content balancing for dichotomous (i.e., Leroux et al., 2013) and polytomous (i.e., Davis, Pastor, Dodd, Chiang, & Fitzpatrick, 2003; Leroux & Dodd, 2016) IRT models that included separate fixed-length and SE-plus-fixed-length stopping rule conditions but did not compare variable-length stopping rules. Variable-length stopping rules have been compared only without the use of content balancing or exposure control (e.g., Choi, Grady, & Dodd, 2010; Dodd et al., 1989, 1993; Leroux & Dodd, 2014). Given the interplay between stopping rules and the exposure control and content balancing mechanisms, it is important to extend previous research on CAT stopping rules to the frequently encountered scenarios in which these constraints are implemented.

Purpose of study

The CT stopping rule is a recently developed approach to CAT termination that has demonstrated strong performance under a dichotomous IRT model (the 3PL; Babcock & Weiss, 2012), but the generalizability of this finding is limited to the specific set of conditions previously researched. The present study continues the examination of the performance of the SE, MI, and CT stopping rules by extending it to a scoring model for polytomous items, the GPCM (Muraki, 1992). Furthermore, we utilize exposure control and content balancing in our study to mirror the conditions of high-stakes testing, as previous research examining variable-length stopping rules has excluded both of these CAT components, and stopping rules may be sensitive to their use.

Method

Study design overview

In the present study, we compare variable-length CAT stopping rules under the GPCM (Muraki, 1992), shown in Eq. 1. The GPCM models the probability of scoring in category x on item i, out of \( m_i + 1 \) response categories, for an individual with trait level θ as:

$$ {P}_{ix}\left(\uptheta \right)=\frac{\exp \left[\sum \limits_{j=0}^x{a}_i\left(\uptheta -{b}_{ij}\right)\right]}{\sum \limits_{r=0}^{m_i}\exp \left[\sum \limits_{j=0}^r{a}_i\left(\uptheta -{b}_{ij}\right)\right]} $$
(1)

where \( a_i \) is the discrimination, or slope, of the item; \( b_{ij} \) is the step difficulty parameter associated with score category j (j = 1, . . . , \( m_i \)); and the j = 0 term of the sum is defined to be zero, \( {a}_i\left(\uptheta -{b}_{i0}\right)\equiv 0 \). Item discrimination, \( a_i \), varies across items but not within items across categories. The GPCM requires that the steps within an item be completed in order, though the step difficulties of the ordered categories, \( b_{ij} \), are not required to be in sequential order.
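As a concrete illustration of Eq. 1, the short function below evaluates the GPCM category probabilities for a single item. This is an illustrative Python sketch (the study’s own computations used SAS); it exploits the fact that the exponent for category x is the cumulative sum of \( a_i(\uptheta - b_{ij}) \) up to j = x, with the j = 0 term fixed at zero.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    """GPCM category probabilities (Eq. 1) for one item.

    theta : trait level
    a     : item discrimination a_i
    b     : step difficulties (b_i1, ..., b_im)
    Returns P(X = x | theta) for x = 0, ..., m.
    """
    # Exponents: cumulative sums of a*(theta - b_ij); the j = 0 term is 0.
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b, float)))))
    z -= z.max()            # stabilize exp() against overflow
    num = np.exp(z)
    return num / num.sum()  # denominator of Eq. 1

# Example: a four-category item (three steps)
print(gpcm_probs(theta=0.5, a=0.92, b=[-0.99, 0.18, 1.20]))
```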

We examined 21 different stopping rules, which consisted of variations of SE, MI, and CT stopping rules, as well as their combination with a FL stopping rule. These are discussed in detail in a following section. This study used a repeated measures design in which each simulated examinee was administered the CAT 21 times, once using each stopping rule. All data generating procedures and analyses were conducted in SAS statistical software (version 9.4 for Windows).

Item pool description and data generation

The item pool for this study was generated using the item parameter values of a large-scale educational assessment previously calibrated using the GPCM (Davis, 2004). The means (with SD, minimum, and maximum in parentheses) of the item discrimination (a) and step difficulty (b1–b4) parameters were as follows: a = 0.92 (0.19, 0.54, 1.52), b1 = – 0.99 (0.90, – 3.13, 1.50), b2 = 0.18 (0.99, – 1.81, 3.57), b3 = – 0.19 (0.76, – 1.48, 1.51), b4 = – 0.12 (0.90, – 2.36, 2.34). The pool consisted of 157 polytomously scored items with varying numbers of response categories—99 items with three response categories, 29 items with four categories, and 29 items with five categories. Each item assessed one of three content areas—61 items assessed content area A, 59 items assessed B, and 37 items assessed C. The numbers and proportions of items assessing each content area by the number of item categories are presented in Table 1. Item and test information based on the GPCM were calculated using the SAS macro IRTINFO (Fitzpatrick, Choi, Chen, Hou, & Dodd, 1994). Figure 1 shows the information function of the 157-item pool, which indicates adequate information coverage across the range of θ values, with maximal information at θ = – 0.6.
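IRTINFO is a SAS macro; for readers who want to reproduce a curve like Fig. 1, the sketch below computes GPCM item information using the standard result that, under the GPCM, \( I_i(\uptheta) = a_i^2\,\mathrm{Var}(X \mid \uptheta) \), where X is the category score, and sums item information over a pool. The two-item pool shown is illustrative, not the operational pool.

```python
import numpy as np

def gpcm_probs(theta, a, b):
    # GPCM category probabilities (as sketched above)
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b, float)))))
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def gpcm_item_info(theta, a, b):
    """GPCM item information: a_i^2 times the conditional variance of
    the category score X at theta."""
    p = gpcm_probs(theta, a, b)
    x = np.arange(len(p))
    mean = (x * p).sum()
    return a ** 2 * ((x ** 2 * p).sum() - mean ** 2)

# Test information is the sum of item informations over the pool;
# evaluating it on a theta grid reproduces a curve like Fig. 1.
thetas = np.linspace(-3, 3, 61)
pool = [(0.92, [-0.99, 0.18]), (1.10, [-0.50, 0.30, 1.00])]  # illustrative items
test_info = [sum(gpcm_item_info(t, a, b) for a, b in pool) for t in thetas]
```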

Table 1 Numbers and proportions of items by number of categories and content area
Fig. 1 Item pool information curve

Item responses for a sample of 1,000 simulees were generated from the true generating item parameter values using the SAS macro IRTGEN (Whittaker, Fitzpatrick, Williams, & Dodd, 2003). The true θ levels for each sample of simulees were drawn from a normal distribution with a mean of 0 and a standard deviation of 1. This simulation study used 1,000 replications; thus, each of the 1,000 generated datasets (of N = 1,000 simulees) underwent 21 CATs (one per stopping rule), for a grand total of 21,000 CAT simulations.
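IRTGEN is likewise a SAS macro; the following sketch mirrors its basic logic under the GPCM, drawing each simulee’s response to each item from the model-implied category probabilities. The item parameter values shown are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def gpcm_probs(theta, a, b):
    # GPCM category probabilities (as sketched above)
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b, float)))))
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

def generate_responses(thetas, items):
    """Simulate GPCM item responses: for each simulee and item, draw a
    category from the model-implied category probabilities."""
    data = np.empty((len(thetas), len(items)), dtype=int)
    for i, theta in enumerate(thetas):
        for j, (a, b) in enumerate(items):
            p = gpcm_probs(theta, a, b)
            data[i, j] = rng.choice(len(p), p=p)
    return data

thetas = rng.standard_normal(1000)                           # true thetas ~ N(0, 1)
items = [(0.92, [-0.99, 0.18]), (1.10, [-0.50, 0.30, 1.00])]  # illustrative items
responses = generate_responses(thetas, items)
```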

CAT procedure

CATs were administered to the simulees in SAS software using a program similar to the commercially available software SIMPOLYCAT (Chen & Cook, 2009). We used a modified version of a SAS program that simulated CAT administration under the GPCM (Davis, 2004), altered to implement the 21 stopping rule conditions. For each CAT administration, all simulees started with an initial θ estimate of zero. A variable step size procedure (Koch & Dodd, 1989) was used to estimate θ until the simulee had responded in two different categories, after which maximum likelihood estimation (MLE) was used to obtain ability estimates. The variable step size approach changed \( \widehat{\uptheta} \) by half the distance between the current \( \widehat{\uptheta} \) and one of the two extreme step-difficulty parameter estimates, depending on whether the response to the previous item was in the lower or upper half of the response scale. Only items that satisfied the content-balancing constraints were eligible for selection during the variable step size phase.
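A minimal sketch of the variable step size update as described above; the exact treatment of responses at the midpoint of the response scale, and the use of the pool’s extreme step-difficulty estimates as the anchors, are our assumptions about the procedure.

```python
def variable_step_update(theta_hat, response, n_categories, b_low, b_high):
    """Variable step size update (after Koch & Dodd, 1989), used before
    MLE is possible: move theta halfway toward the lowest (b_low) or
    highest (b_high) step-difficulty estimate, depending on whether the
    response fell in the lower or upper half of the response scale."""
    if response < (n_categories - 1) / 2:   # lower half of the scale
        return theta_hat - (theta_hat - b_low) / 2.0
    return theta_hat + (b_high - theta_hat) / 2.0

# Example: a low response on a five-category item pulls theta halfway down
print(variable_step_update(0.0, response=1, n_categories=5,
                           b_low=-3.13, b_high=3.57))
```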

Content balancing was employed using Kingsbury and Zara’s (1989) constrained CAT (CCAT) content-balancing method. The goal of this procedure is to have the proportions of items that an examinee answers closely match the prespecified desired proportions for each content area (i.e., the target proportions). After an item was administered to a simulee, the proportion of administered items belonging to each content area was computed, and the discrepancy between these actual proportions and the target proportions was calculated. The content area with the largest discrepancy was selected for the subsequent item administration. Two item characteristics were used to define the areas and their target proportions—content area and the number of response categories. The combination of these two factors stratified the item pool into nine target areas. The target proportion for each of the nine item types was set equal to the proportion of items in the 157-item pool that belonged to that combination of the content area and scale length factors, as presented in Table 1. Item administration then proceeded using the CCAT procedure, as sketched below.
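A sketch of the CCAT selection step (the data structures and the three-stratum example are illustrative; the study used nine strata defined by content area crossed with number of response categories):

```python
import numpy as np

def next_stratum(counts, targets):
    """CCAT content balancing (Kingsbury & Zara, 1989): pick the stratum
    whose observed proportion of administered items lags furthest behind
    its target proportion.

    counts  : number of items administered so far from each stratum
    targets : target proportion of each stratum (sums to 1)
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    observed = counts / total if total > 0 else np.zeros_like(counts)
    return int(np.argmax(np.asarray(targets) - observed))

# Example with three strata targeted at 39%, 38%, and 23% of the test
print(next_stratum(counts=[4, 2, 1], targets=[0.39, 0.38, 0.23]))
```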

The randomesque exposure control procedure (Kingsbury & Zara, 1989) was also incorporated into item selection. In this method, the item administered is randomly selected from a group of items that are the most informative given an examinee’s current \( \widehat{\uptheta} \). This study used a group size of three, which provides high measurement precision and exposure control (Davis, 2004). Content balancing was given precedence over exposure control. Therefore, the content area of the item to be administered to a simulee was first identified. Then, the three most informative items in that content area based on the simulee’s current \( \widehat{\uptheta} \) were identified. The CAT program then randomly selected one of those three items and administered it to the simulee. The program then calculated the discrepancy in actual and target content area proportions to determine the next content area and selected an informative item for the simulee using the randomesque procedure based on their updated \( \widehat{\uptheta} \). This item selection and administration process continued until the stopping rule was satisfied and the CAT program terminated, giving the simulee a final θ estimate for each of the 21 stopping rule conditions.
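A sketch of the randomesque step, which assumes content balancing has already fixed the eligible set of items (function name and inputs are ours):

```python
import numpy as np

rng = np.random.default_rng(2024)

def randomesque_select(item_info, eligible, group_size=3):
    """Randomesque exposure control (Kingsbury & Zara, 1989): randomly
    administer one of the `group_size` most informative eligible items.

    item_info : information of every pool item at the interim theta
    eligible  : indices of unadministered items in the stratum chosen
                by content balancing (which has precedence)
    """
    item_info = np.asarray(item_info, dtype=float)
    eligible = np.asarray(eligible)
    ranked = eligible[np.argsort(item_info[eligible])[::-1]]  # best first
    return int(rng.choice(ranked[:group_size]))

# Example: items 5, 9, and 12 are the most informative of the eligible set
info = np.array([0.1, 0.3, 0.2, 0.15, 0.05, 0.6, 0.4,
                 0.1, 0.2, 0.55, 0.3, 0.25, 0.5])
print(randomesque_select(info, eligible=[1, 5, 9, 12]))
```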

Implemented stopping rules

We investigated an FL stopping rule and several variations of three variable-length stopping rules: (a) SE, (b) MI, and (c) CT (Table 2). The FL stopping rule administered 20 items to all simulees. This is a sufficient test length for obtaining an accurate estimate of θ, as was indicated by prior simulation research using the same item pool (Davis, 2004; Leroux & Dodd, 2016) and across CAT research in general (Dodd, 1990; Dodd et al., 1989, 1993; Gorin, Dodd, Fitzpatrick, & Shieh, 2005; Koch & Dodd, 1989; Lee & Dodd, 2012).

Table 2 Summary of stopping rule conditions

To increase the generalizability of our findings, each variable-length stopping rule was implemented twice, using different prespecified criterion values. The SE stopping rule ended a test when the SE of the simulee’s \( \widehat{\uptheta} \) was less than the criterion value—either 0.30 (in the SE[.30] conditions) or 0.35 (in the SE[.35] conditions)—both of which are commonly used SE criteria (e.g., Dodd, 1990; Dodd et al., 1993; Leroux & Dodd, 2014; Leroux et al., 2013). Equivalent SE and MI criteria were used to aid comparisons between these stopping rules. The MI criterion values were selected using the well-known relationship between the standard error of θ and total information—specifically, that the SE is equal to the inverse of the square root of the information at a given θ. The MI stopping rule ended the CAT program when all nonadministered items had Fisher information less than either 0.42 (MI[.42] conditions) or 0.56 (MI[.56] conditions) at the simulee’s current \( \widehat{\uptheta} \); these values are equivalent to the SE[.35] and SE[.30] conditions, respectively. Finally, the CT stopping rule ended a simulee’s test when the absolute change in the simulee’s \( \widehat{\uptheta} \) was less than either 0.02 (CT[.02] conditions) or 0.05 (CT[.05] conditions). These values were selected because of their use in previous CT research (i.e., Babcock & Weiss, 2012), to aid cross-study comparisons. The SE[.30], MI[.42], and CT[.02] conditions represented the more stringent criteria, in that they required more items to be satisfied than the lenient criteria (SE[.35], MI[.56], CT[.05]), which produced relatively shorter tests.
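To make the correspondence between the SE and MI criteria concrete: the standard error of \( \widehat{\uptheta} \) is the inverse square root of the total test information,

$$ \mathrm{SE}\left(\widehat{\uptheta}\right)=\frac{1}{\sqrt{I\left(\uptheta \right)}}, $$

so 20 items (the FL reference length) each contributing 0.42 units of information give \( I(\uptheta) = 20\times 0.42 = 8.4 \) and \( \mathrm{SE}=1/\sqrt{8.4}\approx 0.35 \), whereas \( 20\times 0.56 = 11.2 \) gives \( \mathrm{SE}\approx 0.30 \). (Reading the equivalence through the 20-item reference length is our reconstruction of the criterion choice; the pairing of values is as stated above.)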

The variable-length stopping rules were studied both in isolation and in combination with the FL stopping rule (i.e., a maximum number of items). When used in isolation, as previously described, the CAT program continued until the termination criterion was met or until no items remained. Variable-length stopping rules are frequently paired with a maximum number of items in real-world CAT applications, to prevent the administration of more items than necessary to obtain estimates with high measurement precision (Boyd et al., 2010; Dodd et al., 1995). Additional conditions were therefore included that combined each variable-length stopping rule and criterion value with an FL maximum of 20 items (SEFL[.30], SEFL[.35], MIFL[.42], MIFL[.56], CTFL[.02], and CTFL[.05]); these ended the CAT program when either the variable-length criterion was reached or 20 items had been administered.

Preliminary CAT trials revealed that conditions using the MI and CT stopping rules had high rates of nonconvergent cases, particularly the MI conditions. We found that these nonconvergent cases had typically answered an average of only four items, indicating that the criteria for these stopping rules were being satisfied before the program could obtain an acceptable θ estimate. Therefore, we included additional conditions requiring that at least nine items be administered (MI9[.42], MIFL9[.42], MI9[.56], MIFL9[.56], CT9[.02], CTFL9[.02], CT9[.05], and CTFL9[.05]). Nine was selected as the minimum test length because it was equal to the number of target content areas (the nine content-by-category strata), meaning that all target areas could potentially be covered before the termination of a test. The SE stopping rules were not studied with a minimum test length because they already delivered at least nine items across all simulees and replications. The addition of these eight minimum test length conditions produced a total of 21 stopping rule conditions.

Data analyses

We compared the CAT stopping rules using several criteria that are important in adaptive testing. We recorded the number of nonconvergent cases and calculated descriptive statistics of their frequency for each stopping rule. Only the estimates of converged cases were used in the following analyses. We also examined summary statistics of the final trait estimates (\( \widehat{\uptheta} \)), the standard errors of the trait estimates (\( {\mathrm{SE}}_{\widehat{\uptheta}} \)), and the numbers of items administered. The descriptive statistics for these criteria were calculated by finding their average within each replication across simulees and then calculating the minimum, maximum, and mean (i.e., grand mean) of these averages across the 1,000 replications of each stopping rule condition. We also computed descriptive statistics for the Pearson correlations between the true and estimated θ values, bias, and root mean square error (RMSE). Bias and RMSE were calculated using the following formulas:

$$ \mathrm{Bias}=\frac{\sum_{k=1}^n\left({\widehat{\uptheta}}_k-{\uptheta}_k\right)}{n} $$
(2)

and

$$ \mathrm{RMSE}=\sqrt{\frac{\sum_{k=1}^n{\left({\widehat{\uptheta}}_k-{\uptheta}_k\right)}^2}{n}}, $$
(3)

where \( {\widehat{\uptheta}}_k \) is the estimated trait level for simulee k, \( {\uptheta}_k \) is the known trait level for simulee k, and n is the total number of simulees. In addition, we examined plots of test length, \( {\mathrm{SE}}_{\widehat{\uptheta}} \), bias, and RMSE conditional on θ, to detect whether the stopping rules differed in their parameter recovery depending on a simulee’s true θ. The values in these plots were created by grouping simulees into 13 groups along the continuum of known θ from – 3 to + 3 and plotting the average test length, \( {\mathrm{SE}}_{\widehat{\uptheta}} \), bias, and RMSE for each θ group.
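The per-replication evaluation statistics are straightforward to compute; the sketch below implements Eqs. 2 and 3 along with the 13-bin conditional averaging used for the plots (function names are ours):

```python
import numpy as np

def recovery_stats(theta_hat, theta_true):
    """Bias (Eq. 2), RMSE (Eq. 3), and the Pearson correlation between
    estimated and known trait levels for one replication."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta_true = np.asarray(theta_true, dtype=float)
    err = theta_hat - theta_true
    bias = err.mean()
    rmse = np.sqrt((err ** 2).mean())
    r = np.corrcoef(theta_hat, theta_true)[0, 1]
    return bias, rmse, r

def conditional_means(theta_true, values, n_bins=13, lo=-3.0, hi=3.0):
    """Average a per-simulee statistic within 13 bins of known theta,
    as used for the conditional plots (empty bins yield nan)."""
    theta_true, values = np.asarray(theta_true), np.asarray(values)
    edges = np.linspace(lo, hi, n_bins + 1)
    idx = np.clip(np.digitize(theta_true, edges) - 1, 0, n_bins - 1)
    return np.array([values[idx == k].mean() for k in range(n_bins)])
```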

We also evaluated the stopping rules in relation to the exposure control and content balancing constraints imposed across all conditions. The exposure rate for each item was calculated by dividing the number of times the item was administered by the total number of simulees; the minimum, maximum, and mean rates were then computed across the 157-item pool to evaluate the relative exposure rates of each stopping rule. Pool utilization was examined via the percentage of items that were never administered within each replicated sample of 1,000 simulees; we report the minimum, maximum, and mean percentages of items not administered across replications. In addition, we calculated the differences between the proportions of items administered in each content area and the targeted content area proportions, to examine adherence to the content balancing constraints across stopping rule conditions.
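A sketch of the exposure and pool-utilization summaries described above (names and the example counts are illustrative):

```python
import numpy as np

def exposure_summary(admin_counts, n_simulees):
    """Per-item exposure rate = times administered / number of simulees;
    returns (min, max, mean) rates plus the percentage of the pool
    never administered (pool utilization)."""
    rates = np.asarray(admin_counts, dtype=float) / n_simulees
    pct_unused = 100.0 * np.mean(rates == 0)
    return rates.min(), rates.max(), rates.mean(), pct_unused

# Example: 157-item pool, 1,000 simulees
counts = np.random.default_rng(3).integers(0, 300, size=157)
print(exposure_summary(counts, n_simulees=1000))
```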

Results

Number of nonconvergent cases

Table 3 presents descriptive statistics for the frequency of nonconvergent cases across the 1,000 replications. The lowest nonconvergence rates occurred when using a fixed-length test of 20 items (FL) and in conditions using SE-based stopping rules (i.e., SE and SEFL), in which nonconvergence occurred for an average of 1.5% of simulees. As was previously noted, stopping criteria that relied on minimum information (MI and MIFL) or absolute change in \( \widehat{\uptheta} \) (CT and CTFL) produced excessive numbers of inestimable traits when a criterion for the minimum number of items administered was not included. The MI and MIFL conditions had the greatest numbers of nonconvergent cases, with an average of 6.8% of simulees when using a minimum information value of 0.42, and an average of 11.4% when using a value of 0.56. Investigation of these nonconvergent cases revealed that the majority of these simulees were administered only three or four items and that maximum likelihood estimation had never been reached.

Table 3 Descriptive statistics of trait estimation and numbers of items administered

The average numbers of nonconvergent cases for CT and CTFL were less than half those for the MI and MIFL conditions, though they were still higher than would usually be expected, at about 2.8% of simulees for both 0.02 and 0.05 change in \( \widehat{\uptheta} \) conditions. Nonconvergent cases in the CT and CTFL conditions usually answered five to six items before the CAT program ended, but all of the simulees’ responses were in either the highest or the lowest response category, meaning that MLE could not be used (indicated by \( \widehat{\uptheta} \) ≤ – 4 or \( \widehat{\uptheta} \) ≥ 4). The addition of a requirement that a minimum of nine items be administered mitigated these convergence issues, reducing the number of nonconvergent cases to an average of 1.8% of simulees in these conditions (i.e., MI9, MIFL9, CT9, and CTFL9).

Trait estimation and number of items administered

Table 3 also presents descriptive statistics for the averages of trait estimates (\( \widehat{\uptheta} \)), standard errors of trait estimates (\( {\mathrm{SE}}_{\widehat{\uptheta}} \)), and numbers of items administered (NIA) by CATs for each stopping rule. The grand means of \( \widehat{\uptheta} \) were uniformly close to 0 across conditions, though conditions using MI stopping rules tended to overestimate θ, particularly when using the more lenient MI value (MI = 0.56) and no minimum NIA requirement (\( \widehat{\uptheta} \) = 0.08). As would be expected, when more stringent values were used for the stopping rules—that is, when the MI criterion was lower (MI = 0.42), the CT criterion was lower (CT = 0.02), or the SE criterion was lower (SE = 0.30)—the average NIA was higher and \( {\mathrm{SE}}_{\widehat{\uptheta}} \) was lower than under the comparative stopping rule condition using a more lenient value (i.e., MI = 0.56, CT = 0.05, SE = 0.35). The results also indicate that including a fixed-length component in a variable-length stopping rule successfully decreased test length, most dramatically in the MI conditions using the 0.42 information value, for which NIA dropped sharply (MIFL[.42], NIA = 16.13; MIFL9[.42], NIA = 18.49) from very high values when a maximum of 20 items was not in place (MI[.42], NIA = 45.79; MI9[.42], NIA = 51.13). Except for these two conditions, attaching a restriction of a maximum of 20 items to a variable-length stopping rule increased \( {\mathrm{SE}}_{\widehat{\uptheta}} \) by only about 0.01.

Figure 2 depicts the average NIA (i.e., test length) for each variable-length stopping rule, conditional on the simulees’ known θ. The MI stopping rules administer the largest numbers of items to simulees with θs in the center of the distribution, where the majority of informative items exist. The SE stopping rules display the opposite behavior, giving more items to simulees with extreme θs, whereas those in the center of the distribution are measured precisely with fewer items. The addition of a fixed-length component curtails the tendency of the MI and SE stopping rules to deliver long tests, though the CT stopping rule performs similarly with and without an FL component. When using a CT criterion of 0.05, the CT stopping rule ends item administration before ever reaching 20 items. The SE and CT stopping rules deliver similar numbers of items across the θ scale when a maximum number of items is incorporated in the stopping rule.

Fig. 2 Mean numbers of items administered (NIA), conditional on known trait level (θ), by stopping rule condition, separated by criterion value and by whether a maximum-NIA component was included. The FL condition is included in each panel for comparison. Lower criteria: CT = 0.02, MI = 0.42, SE = 0.30. Higher criteria: CT = 0.05, MI = 0.56, SE = 0.35

MI[.56] and MIFL[.56] had the poorest measurement precision of the stopping rules, as is demonstrated by the standard errors of their θ estimates in Table 3. Though the average \( {\mathrm{SE}}_{\widehat{\uptheta}} \) of the MI conditions was lower when the stricter MI criterion value (0.42) was used, the \( {\mathrm{SE}}_{\widehat{\uptheta}} \) remained fairly high in the MIFL[.42] condition. Of all the stopping rule conditions, MI9[.42] had the greatest measurement precision, owing to this stopping criterion delivering a very high number of items. CT stopping rules also had relatively large \( {\mathrm{SE}}_{\widehat{\uptheta}} \)s, with all conditions except CT9[.02] and CTFL9[.02] having grand mean \( {\mathrm{SE}}_{\widehat{\uptheta}} \) ≥ 0.37. These two conditions had the highest average NIAs of the CT conditions, delivering averages of 16 and 15 items, respectively. The CT conditions had the poorest measurement precision when using a 0.05 criterion and no minimum number of items, a result of these conditions administering an average of only nine items. The CT stopping rules generally produced the shortest tests, and the MI stopping rules produced the most variable test lengths. SE conditions also produced relatively short tests, particularly when using the higher SE criterion. As would be expected, the average \( {\mathrm{SE}}_{\widehat{\uptheta}} \)s in the SE conditions were close to the SE criteria used to end the CAT program. The FL condition always administered 20 items, which resulted in a low standard error.

Figure 3 displays the average \( {\mathrm{SE}}_{\widehat{\uptheta}} \) of each variable-length stopping rule conditional on known θ. Although the MI and MI9 conditions produced low average \( {\mathrm{SE}}_{\widehat{\uptheta}} \)s for simulees with θ between – 1.0 and 0.0, the \( {\mathrm{SE}}_{\widehat{\uptheta}} \)s in all MI conditions increased rapidly the farther a simulee’s true θ was from the peak of the item pool’s information function (θ = – 0.6), and were excessively high for simulees in the upper and lower regions of the θ range. SE conditions without a maximum number of items were consistently at the criterion SE value used for this stopping rule. When an FL component was used, the \( {\mathrm{SE}}_{\widehat{\uptheta}} \) increased when the maximum of 20 items was reached before the standard error criterion was met. CT conditions generally had higher \( {\mathrm{SE}}_{\widehat{\uptheta}} \)s in the center of the θ range than did the other conditions, though their \( {\mathrm{SE}}_{\widehat{\uptheta}} \)s were comparable to those in the SE conditions toward the θ extremes.

Fig. 3 Grand means of the standard errors of trait estimates (\( \widehat{\uptheta} \)), conditional on known trait level (θ), by stopping rule condition, separated by criterion value and by whether a maximum-NIA component was included. The FL condition is included in each panel for comparison. Lower criteria: CT = 0.02, MI = 0.42, SE = 0.30. Higher criteria: CT = 0.05, MI = 0.56, SE = 0.35

Latent trait recovery

Table 4 presents descriptive statistics for bias, RMSE, and the correlations between the known and estimated θs. The biases of the lower and higher stopping criterion values were nearly identical for all stopping rules, and bias decreased slightly when an FL component was attached to a variable-length rule. All FL, SE, and CT stopping rule conditions produced very low bias. Positive bias was present in all MI stopping rule variations. Overestimation of ability was a particular issue in MI conditions that did not include a minimum NIA requirement. Including this requirement in MI conditions decreased bias by about 50%, producing lower bias when used in combination with an FL stopping rule than when there was no maximum-number-of-items requirement. Though the mean bias was very low across all CT conditions, the greatest negative bias found across all simulated datasets occurred in the lenient CT and CTFL conditions. There was zero average bias when using an SEFL stopping rule, as well as when using a CT stopping rule paired with the lower criterion or a minimum and/or maximum NIA component (i.e., CT9[.02], CTFL[.02], and CTFL9[.02]). Figure 4 presents plots of the mean bias for each stopping rule conditional on known θ. The MI-based conditions had the greatest fluctuation in bias throughout the θ range. Conditions using SE and CT stopping rules demonstrated low levels of bias across θ, particularly when used with the lower criteria. The conditions with a minimum number of items had smaller bias for the CT and MI stopping rules across the θ continuum. Though the FL stopping rule produced very little bias, the CT9 and SE stopping rules produced even less bias across the θ scale.

Table 4 Latent trait parameter recovery statistics by stopping rule
Fig. 4 Bias conditional on known trait level (θ), by stopping rule condition, separated by criterion value and by whether a maximum-NIA component was included. The FL condition is included in each panel for comparison. Lower criteria: CT = 0.02, MI = 0.42, SE = 0.30. Higher criteria: CT = 0.05, MI = 0.56, SE = 0.35

As would be expected because of increased test length, variable-length stopping rules using more stringent (i.e., lower) values as termination criteria had lower average RMSEs than did equivalent conditions using more lenient criteria. The RMSEs of the variable-length stopping rules were also lower when they were not used in conjunction with a fixed-length component (i.e., a maximum NIA), as well as when a minimum-NIA requirement was in place in the MI and CT stopping rule conditions. The lowest RMSEs were observed when using the FL stopping rule or the 0.30 SE stopping rule without a maximum-number-of-items component, though MI9[.42] had a similarly low RMSE. The additional constraint of a maximum of 20 items in MIFL9 increased the RMSE, though it remained low, and the RMSE increased further when the lenient termination criterion was used. The highest RMSEs were seen in MI conditions without a minimum-NIA requirement, particularly when using the lenient termination criterion. Though using the lower MI criterion improved parameter recovery, the RMSEs of these MI conditions remained quite high. As in the MI conditions, the CT stopping rules performed better (i.e., had lower RMSEs) when used with minimum and maximum test length components, particularly with the lower criterion. SE stopping rules yielded low RMSEs across conditions, with the largest average RMSE produced in the lenient SEFL condition.

Figure 5 depicts the average RMSE conditional on a simulee’s known θ. As in the conditional-bias plots (Fig. 4), the θ recovery of the MI stopping rule varied greatly depending on a simulee’s true θ. When used without a maximum number of items, the SE stopping rule produced low and consistent RMSEs across the proficiency range, though it was outperformed by the MI9 and FL conditions in the center of the θ distribution, because those conditions administered greater numbers of items. SE and CT9 had the best performance across θs of the variable-length stopping rule conditions, though SE led to slightly lower RMSEs than did CT9.

Fig. 5 RMSE conditional on known trait level (θ), by stopping rule condition, separated by criterion value and by whether a maximum-NIA component was included. The FL condition is included in each panel for comparison. Lower criteria: CT = 0.02, MI = 0.42, SE = 0.30. Higher criteria: CT = 0.05, MI = 0.56, SE = 0.35

Examination of the correlations between known and estimated θs revealed the same pattern of results detected for RMSE, in terms of relative parameter recovery ability across stopping rule conditions, and the influence of more stringent (i.e., lower) criteria as well as minimum and maximum NIA. Again, it is apparent that the CT and MI stopping rules are improved by a minimum NIA, especially when more lenient termination criteria are used. MI stopping rules without this requirement produced the lowest θ correlations, particularly when using a higher MI criterion. However, when MI was paired with a minimum NIA, it performed equivalently to the SE stopping rules, whose correlations ranged from .94 to .96. The highest correlations found across CT conditions were only marginally lower than those of the best-performing MI and SE conditions. Coinciding with previous results, the highest correlations were seen in the FL condition as well as in the SE[.30], SEFL[.30], and MI9[.42] variable-length conditions.

Item exposure, pool utilization, and content coverage

Table 5 presents descriptive statistics for item exposure, pool utilization, and content area coverage. The majority of stopping rule conditions produced mean and maximum item exposure rates below the commonly used target maximum exposure rates of 0.2 (e.g., Cheng, Diao, & Behrens, 2017; Wang, Chang, & Douglas, 2012) and 0.3 (e.g., Leroux & Dodd, 2014; Moyer, Galindo, & Dodd, 2012). The exposure rates closely aligned with the numbers of items administered (Table 3). The CT stopping rules, which had the shortest average test lengths, produced the lowest rates of item exposure. The only exceptions to these low exposure rates were the MI[.42] and MI9[.42] stopping rules, which had maximum exposure rates surpassing 0.30, as well as the longest test lengths of the studied stopping rules.

Table 5 Descriptive statistics of exposure control and content balancing

Pool utilization was assessed by examining the percentage of the item pool that was not administered across all simulees for each stopping rule condition. The SE stopping rule conditions without a fixed-length component had very high item pool utilization and administered virtually all 157 items across examinees. Using an FL stopping rule led to an average of 24.3% of the item pool not being administered. All variable-length stopping rules with a maximum test length exceeded this percentage, because of their shorter average test lengths. SEFL and CTFL conditions behaved similarly, though the SEFL stopping rules had the greatest pool utilization of the fixed-length stopping rules. MI and CT stopping rules without a maximum test length had low mean percentages of nonadministered items when they were used with the lower criterion values. However, the maximum percentages of unutilized items reveal that these conditions (MI[.42], MI9[.42], CT[.02], and CT9[.02]) did not perform uniformly well across replications, at times administering fewer than 95% of the items in the pool. The stopping rules that led to the greatest numbers of nonadministered items were the MIFL conditions without a minimum-number-of-items requirement.

Coverage of content areas was evaluated by finding the discrepancy between the targeted and actual proportions of items administered from each content area across all simulees and replications within each stopping rule condition. Table 5 presents the largest discrepancies between the targeted and actual content area proportions across the nine content areas. The average difference between the targeted and actual content area proportions is not presented because it was zero across conditions. The CCAT content balancing procedure appears to have worked well across stopping rule conditions, since the absolute value of the greatest difference in proportions was less than .05 for each stopping rule. The largest discrepancies between targeted and actual content area coverage were in the CT conditions using the more lenient criterion value (i.e., CT = 0.05) and a minimum test length, which had maximum content area discrepancies around .05. The lowest content area discrepancies were in the conditions that had the longest tests, MI[.42] and MI9[.42], which had a maximum difference in content area proportions around .01.

Discussion

This study compared the performance of several stopping rule variations—minimum information, minimum standard error, and absolute change in \( \widehat{\uptheta} \)—under the GPCM. Our results extend prior research on these termination criteria using a dichotomous model (i.e., Babcock & Weiss, 2012) to a polytomous model and include content balancing and exposure control procedures. Developers of high-stakes tests have strong incentives to create fair assessments and maintain test security. Therefore, they frequently use content balancing and exposure control procedures in combination, which ensure equal representation of content areas across examinees and limit item use to decrease the likelihood of item disclosure, respectively. The constraints these procedures place on the item selection process prevent the most informative item from always being selected, thereby increasing the number of items needed to satisfy the precision required by variable-length stopping rules. Despite the wide use of these techniques and their effect on measurement precision and efficiency, limited research has implemented both constraints simultaneously, and the present study is, to our knowledge, the first to compare variable-length stopping rules while controlling for item exposure and content balancing.

In general, most CAT stopping rule procedures demonstrated that they could arrive at precise and accurate estimates of θ, though the MI stopping rule demonstrated either inefficiency or poor θ recovery in the majority of conditions investigated. Furthermore, it is apparent that in order to reach optimal results, the MI- and CT-based stopping rules required the additional prerequisite that a minimum number of items be administered. All variable-length stopping rules were more efficient when paired with a fixed-length component (i.e., a maximum number of items), as there were only negligible differences in ability parameter recovery, accompanied by often large decreases in test length. Terminating a test on the basis of combined variable- and fixed-length termination criteria allows for equal measurement precision across the majority of examinees, while also preventing item overexposure. Our study used 20 items for the FL conditions because previous research indicated that this number produces low levels of bias and high precision while also providing a short and efficient test (Dodd, 1990; Dodd et al., 1989, 1993; Gorin et al., 2005; Koch & Dodd, 1989; Lee & Dodd, 2012). However, if a larger number of items were used as the maximum test length, differences in efficiency would be less pronounced. There are a number of considerations when deciding on a maximum test length, most importantly the relative importance of limiting item exposure for test security and the degree of precision desired in the testing scenario. Appropriate test length will vary across item pools and testing contexts, which may require empirically supported maximum test lengths determined through simulation (see Thompson & Weiss, 2011).

The importance of including a minimum number of items for the MI and CT conditions is apparent throughout the results. Although this inevitably increased the number of items administered and item exposure, these conditions saw meaningful gains in θ measurement. This was particularly true in the MI conditions, which saw dramatic decreases in \( {\mathrm{SE}}_{\widehat{\uptheta}} \), bias, and RMSE and increases in θ correlations. This can be partially attributed to increased average test length, but it also indicates that these stopping rules tended to be satisfied before an accurate and reliable estimate of an individual’s ability had been obtained. Though using the higher MI criterion decreased the average test length, it produced unacceptable decreases in measurement precision and accuracy. The MI stopping rule administers fewer items to examinees the farther they are from the center of the θ distribution, since increasingly few items provide the information required to meet the MI criterion (Dodd et al., 1989). This proclivity toward shorter tests for examinees with extreme θs is amplified when exposure control and content-balancing constraints are placed on item selection, as the most informative item in the pool may not be available for administration. Item exposure rates, pool utilization, and content area coverage were closely aligned with test length. Although shorter tests produced lower exposure rates, they also left a greater proportion of the item pool unused and produced greater discrepancies between actual and targeted content area proportions. Ideally, all the items in the pool would be used and would have equal exposure rates, to prevent a waste of resources and to enhance test security, respectively. The CT and SE stopping rules generally maintained an equilibrium between item exposure and pool utilization, though SE stopping rules without a fixed-length component clearly excelled in having the fewest unused items.

As can be seen in the conditional θ plots, the measurement efficiency and θ recovery of the MI stopping rules fluctuate greatly depending on the location of an examinee’s true ability on the θ scale. Without a minimum-number-of-items requirement, the MI rules frequently ended item administration before obtaining an accurate and precise θ estimate, particularly for examinees with θs farther from the center of the θ distribution. These overly short tests also contributed to the high nonconvergence rates. Though the inclusion of minimum and maximum numbers of items improved the balance of measurement efficiency and precision on average, great disparity remained in test lengths and θ recovery for examinees across the θ distribution. The MI stopping rule can have consequences for examinees both in the center of the θ distribution and at the extremes: those at the center face increased testing burden and are administered more items than necessary for measurement, and those at the extremes are likely to receive an inaccurate and imprecise proficiency score. Our results indicate that the MI stopping rule should be used with caution, due to the difficulty of finding a criterion value that balances efficiency and quality of measurement, especially when exposure control and content balancing are used.

CT9 and CTFL9 produced nearly identical results when used with a 0.05 criterion, indicating that the change-in-\( \widehat{\uptheta} \) criterion was usually satisfied before 20 items had been administered, as was seen in previous research (Babcock & Weiss, 2012). CT conditions with a 0.05 criterion had the shortest average test lengths, though it appears that in many cases the tests were shorter than required for accurate and reliable θ estimates. CT performed better with a 0.02 criterion, particularly when combined with a minimum-number-of-items component. Attaching a maximum number of items to the CT stopping rule had little impact and produced only slight differences in measurement when used with the lower criterion. Babcock and Weiss (2012) used the same CT criteria but found very different average test lengths, bias, and RMSE, owing to their use of a different item pool, a dichotomous IRT model, and a uniformly distributed simulee θ distribution, as well as their lack of content balancing and exposure control constraints on item selection. Those authors stated that CT may be preferred over the SE stopping rule for exams that include a variety of item banks with varying information structures, but the differences in our results indicate that the performance of the CT stopping rule depends on the components of the CAT, including the response model, item bank, and item selection procedure.

When viewed holistically, the SE stopping rules generally maintained the greatest balance of efficiency and precision across conditions. This result is consistent with previous research comparing SE and MI stopping rules under different IRT models (e.g., Babcock & Weiss, 2012; Dodd, 1990; Dodd et al., 1989, 1993), indicating that the SE stopping rule is preferable to the MI stopping rule across multiple scoring models, as well as when exposure control and content balancing are implemented. Unlike the MI and CT conditions, SE and SEFL had no convergence issues requiring an additional minimum-number-of-items criterion, and the SE rule performed similarly with and without the FL component. As expected, administering a fixed length of 20 items provided more precise measurement of θ than the variable-length stopping rules, due to the increased test length. However, FL gained this precision in the center of the θ distribution, where examinees were administered more items than necessary for an acceptably precise \( {\mathrm{SE}}_{\widehat{\uptheta}} \), meaning that using an FL stopping rule is inefficient for the majority of examinees. SE and SEFL administered fewer items to these examinees while maintaining their low predefined \( {\mathrm{SE}}_{\widehat{\uptheta}} \).

Combining SE with a fixed-length component led to an efficient test in which only the small number of examinees with relatively high or low ability saw increases in \( {\mathrm{SE}}_{\widehat{\uptheta}} \). Both SE criterion values performed well, with the 0.30 criterion producing more precise \( \widehat{\uptheta} \)s but slightly longer tests. A researcher’s decision between a higher and a lower SE termination criterion should be based on whether efficiency or precision is the more desirable property. If testing burden is the primary concern, a higher criterion should be used to produce a shorter average test length. A lower criterion may be used if optimal measurement precision and accuracy are of greater importance. However, the differences between these criteria were quite minimal, and both provided an excellent balance between efficiency and quality of measurement.

Though the SE stopping rule generally outperformed the CT stopping rule, the CT stopping rule performed well when a 0.02 criterion was paired with minimum and maximum test length constraints. Furthermore, as described by Babcock and Weiss (2012), a change-in-θ termination criterion may be more appropriate than an SE stopping rule when maintaining a low standard error of measurement across the majority of examinees is unlikely. Such circumstances could arise from using a small item bank or from a mismatch between the item pool and trait distributions. Babcock and Weiss’s work indicated that the CT stopping rule is less affected by changes in the item pool’s information structure, a property not manipulated in this study. Given that CT is a relatively new stopping rule, there is considerable room for research on its utility under different simulated conditions. Another avenue for future research is the study of variable-length stopping rules used in combination. As was suggested by Babcock and Weiss (2012), the CT and SE stopping rules may work well together, with the CT stopping rule ending the test for examinees in ranges of θ where the SE criterion cannot be reached.

Several limitations may restrict the generalizability of our results. Although numerous stopping rule components were manipulated, we used a single item pool and simulee population across conditions, and their information and trait distributions were similar in shape. It is possible that the relative performance of the stopping criteria may change with different item banks and θ distributions. Future research possibilities include examining the effects of differently shaped and sized item pools on stopping rule performance, as well as the degree of mismatch between the item pool information and the trait distribution.

Though exposure control and content balancing were not a primary focus, the results of our study may depend on the specific methods used to constrain item selection. Future research may investigate the effectiveness of the SE, MI, and CT stopping rules across various methods designed to reduce item exposure and match targeted content area distributions. Though several stopping rule modifications were used in this study, our findings are limited to the specific criteria examined. Although the test information function of an item pool can be helpful in determining the criteria used for the MI and SE stopping rules, no analogous method exists for CT. Therefore, it may be difficult to judge the expected measurement quality and test length when choosing among CT criteria. Further investigation should consider alternative CT criteria, as different values may give this index superior performance relative to the SE stopping rule. Future research could also conduct a more thorough investigation of the impact of various termination criterion values across different IRT models, item pools, and examinee populations. As was described by Thompson and Weiss (2011), simulation studies are necessary to determine the CAT properties (e.g., test length and termination criteria) required to ensure the degree of precision and test efficiency needed in operational testing scenarios.