
Does Standard Deviation Matter? Using “Standard Deviation” to Quantify Security of Multistage Testing

Psychometrika

Abstract

With the advent of web-based technology, online testing is becoming a mainstream mode in large-scale educational assessments. Most online tests are administered continuously in a testing window, which may pose test security problems because examinees who take the test earlier may share information with those who take the test later. Researchers have proposed various statistical indices to assess test security, and one of the most often used indices is the average test-overlap rate, which was further generalized to the item pooling index (Chang & Zhang, 2002, 2003). These indices, however, are all defined as means (that is, the expected proportion of common items among examinees), and they were originally proposed for computerized adaptive testing (CAT). Recently, multistage testing (MST) has become a popular alternative to CAT. The unique features of MST make it important to report not only the mean but also the standard deviation (SD) of the test overlap rate, as we advocate in this paper. The SD of the test overlap rate adds important information to the test security profile because, for the same mean, a large SD indicates that certain groups of examinees share more common items than other groups. In this study, we analytically derived the lower bounds of the SD under MST, with the results under CAT as a benchmark. It is shown that when the mean overlap rate is the same between MST and CAT, the SD of test overlap tends to be larger in MST. A simulation study was conducted to provide empirical evidence. We also compared the security of MST under the single-pool versus the multiple-pool designs; both analytical and simulation studies show that the non-overlapping multiple-pool design will slightly increase the security risk.




Notes

  1. Suppose we can assemble all \(\prod_{t=1}^{T}\prod_{i=1}^{S_{t}}P_{it}\) possible panels from the item bank by fully mixing and matching all possible parallel forms. Then, under the nonpanel assumption, the probability that one form is assigned to a test taker is \(\frac{1}{N_{t}}\), whereas with a typical panel structure, the probability that the same form is assigned to a test taker is \(\frac{1}{P_{it}}\frac{P_{it}}{N_{t}}=\frac{1}{N_{t}}\), where \(\frac{1}{P_{it}}\) represents the probability that a randomly chosen panel contains that particular form, and \(\frac{P_{it}}{N_{t}}\) denotes the probability that the form will be chosen at stage t. Clearly, the two probabilities are the same.

References

  • Ariel, A., Veldkamp, B.P., & van der Linden, W.J. (2004). Constructing rotating item pools for constrained adaptive testing. Journal of Educational Measurement, 41, 345–359.

  • Barrada, J.R., Olea, J., & Abad, F.J. (2008). Rotating item banks versus restriction of maximum exposure rates in computerized adaptive testing. The Spanish Journal of Psychology, 11, 618–625.

  • Breithaupt, K., & Hare, D.R. (2007). Automated simultaneous assembly of multistage testlets for a high-stakes licensing examination. Educational and Psychological Measurement, 67(1), 5–20.

  • Chang, H.-H. (2004). Understanding computerized adaptive testing: from Robbins-Monro to Lord and beyond. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 117–133). Thousand Oaks: Sage.

  • Chang, H., Wang, S., & Ying, Z. (1997). Three dimensional visualization of item/test information. Paper presented at the annual meeting of American Educational Research Association, Chicago, IL.

  • Chang, H.-H., & Ying, Z. (1999). Alpha-stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23, 211–222.

  • Chang, H.-H., & Zhang, J. (2002). Hypergeometric family and item overlap rates in computerized adaptive testing. Psychometrika, 67, 387–398.

  • Chang, H., & Zhang, J. (2003, April). Assessing CAT security breaches by the item pooling index—to compromise a CAT item bank, how many thieves are needed? Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.

  • Cheng, Y., & Chang, H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical & Statistical Psychology, 62, 369–383.

  • Chen, S.Y., Ankenmann, R.D., & Spray, J.A. (2003). The relationship between item exposure and test overlap in computerized adaptive testing. Journal of Educational Measurement, 40, 129–145.

  • Davey, T., & Nering, M. (2002). Controlling item exposure and maintaining item security. In C. Mills, M.T. Potenza, J.J. Fremer, & W.C. Ward (Eds.), Computer-based testing: building the foundation for future assessments. Mahwah: Lawrence Erlbaum Associates.

  • Dean, V., & Martineau, J. (2012). A state perspective on enhancing assessment and accountability systems through systematic implementation of technology. In R.W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessment: recent history and predictions for the future (pp. 55–77). Charlotte: Information Age Publisher.

  • Finkelman, M., Nering, M.L., & Roussos, L.A. (2009). A conditional exposure control method for multidimensional adaptive testing. Journal of Educational Measurement, 46(1), 84–103.

  • Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement, Issues and Practice, 26, 44–52.

  • Kim, H., & Plake, B. (1993, April). Monte Carlo simulation comparison of two-stage testing and computerized adaptive testing. Paper presented at the meeting of the National Council on Measurement in Education, Atlanta, GA.

  • Lim, E. (2010). The effectiveness of using multiple item pools to increase test security in computerized adaptive testing. Unpublished doctoral thesis, University of Illinois at Urbana-Champaign.

  • Luecht, R.M., & Nungester, R.J. (1998). Some practical examples of computer-adaptive sequential testing. Journal of Educational Measurement, 35(3), 229–249.

  • Mills, C.N., & Steffen, M. (2000). The GRE computer adaptive test: operational issues. In W.J. van der Linden & C.A.W. Glas (Eds.), Computerized adaptive testing: theory and practice (pp. 75–99). Dordrecht: Kluwer.

  • Stocking, M.L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23, 57–75.

  • Wang, C., & Chang, H.-H. (2008, June). Continuous a-stratification index in computerized item selection. Paper presented at the annual meeting of the Psychometric Society, Durham, NH.

  • Way, W.D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement, Issues and Practice, 17, 17–27.

  • Way, W., Zara, A., & Leahy, J. (1996, April). Modifying the NCLEX™ CAT item selection algorithm to improve item exposure. Paper presented at the Annual Meeting of the American Educational Research Association, New York, NY.

  • Yi, Q., Zhang, J., & Chang, H.-H. (2008). Severity of organized item theft in computerized adaptive testing: a simulation study. Applied Psychological Measurement, 32(7), 543–558.

  • Zhang, J., Chang, H.-H., & Yi, Q. (2012). Comparing single-pool and multiple-pool designs regarding test security in computerized testing. Behavior Research Methods, 44, 742–752.


Author information


Corresponding author

Correspondence to Chun Wang.

Appendix


A.1 Proof of Theorem 1

(1) MST. The average overlap rate for MST under the random routing rule is

$$\frac{\sum_{t = 1}^{T} \sum_{f = 1}^{N_{t}} \binom{m_{f}}{2} l}{(Tl)\binom{p}{2}} = \frac{\sum_{t = 1}^{T} \sum_{f = 1}^{N_{t}} \binom{m_{f}}{2}}{T\binom{p}{2}} = \frac{\sum_{t = 1}^{T} \sum_{f = 1}^{N_{t}} m_{f}(m_{f} - 1)}{Tp(p - 1)} \mathop{\to}\limits^{p \to \infty} \frac{1}{T}\sum_{t = 1}^{T} \sum_{f = 1}^{N_{t}} r_{f}^{2}, $$

where \(m_{f}\) is the number of times form f is administered, l is the number of items within each form, p is the total number of examinees, and \(r_{f} = \frac{m_{f}}{p}\) is the form exposure rate.

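This derivation shows, first, that the number of items within each form, l, cancels out of the average overlap rate. The module length (and hence the test length \(L = Tl\) for a fixed number of stages) therefore does not affect this measure of test security.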

To decrease the average test overlap rate, the objective is to minimize \(\frac{1}{T}\sum_{t = 1}^{T} \sum_{f = 1}^{N_{t}} r_{f}^{2}\) subject to the constraint that \(\bar{r}_{f} = \frac{1}{N_{t}}\).

Notice that

$$\begin{aligned} \frac{1}{T}\sum_{t = 1}^{T} \sum_{f = 1}^{N_{t}} r_{f}^{2} &= \frac{1}{T}\sum_{t = 1}^{T} \sum_{f = 1}^{N_{t}} \bigl[ (r_{f} - \bar{r}_{f})^{2} + 2r_{f}\bar{r}_{f} - \bar{r}_{f}^{2} \bigr] \\ &= \frac{1}{T}\sum_{t = 1}^{T} \Biggl[ N_{t}\operatorname{var}(r_{f}) + \frac{2}{N_{t}} \sum_{f = 1}^{N_{t}} r_{f} - \frac{1}{N_{t}} \Biggr] \\ &= \frac{1}{T}\sum_{t = 1}^{T} \biggl[ N_{t}\operatorname{var}(r_{f}) + \frac{1}{N_{t}} \biggr] \quad \text{because } \sum_{f = 1}^{N_{t}} r_{f} = 1 \\ &\mathop{\to}\limits^{p \to \infty} \frac{1}{T}\sum_{t = 1}^{T} \frac{1}{N_{t}} \quad \text{because, if each module within a stage is equally likely to be selected, } \operatorname{var}(r_{f}) = 0. \end{aligned}$$

In conclusion, under random selection only \(N_{t}\), the number of alternative forms at each stage, affects the result. When a more informative routing rule is considered, for fixed \(N_{t}\) the values of \(P_{it}\) and \(S_{t}\) will influence the estimation precision.
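As an illustrative sketch (not part of the original proof), the finite-sample overlap rate and its large-p limit can be computed directly from form administration counts. The function names and the example counts are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def mean_overlap_finite(counts_by_stage, p):
    """Finite-p average overlap rate: sum of m_f (m_f - 1) over T p (p - 1)."""
    T = len(counts_by_stage)
    shared_pairs = sum(m * (m - 1) for stage in counts_by_stage for m in stage)
    return Fraction(shared_pairs, T * p * (p - 1))

def mean_overlap_limit(counts_by_stage, p):
    """Large-p limit: (1/T) times the sum of squared exposure rates r_f = m_f / p."""
    T = len(counts_by_stage)
    return sum(Fraction(m, p) ** 2 for stage in counts_by_stage for m in stage) / T

# Hypothetical example: p = 1200 examinees, T = 3 stages with N_t = 2, 3, 4 forms.
p = 1200
uniform = [[600, 600], [400, 400, 400], [300, 300, 300, 300]]  # var(r_f) = 0 in every stage
skewed = [[900, 300], [400, 400, 400], [300, 300, 300, 300]]   # unequal exposure at stage 1

print(mean_overlap_limit(uniform, p))          # reduces to (1/T) * sum of 1/N_t
print(float(mean_overlap_finite(uniform, p)))  # close to the limit for large p
print(mean_overlap_limit(skewed, p))           # larger, because var(r_f) > 0 at stage 1
```

With uniform exposure the limit depends only on the \(N_{t}\); any imbalance in exposure within a stage raises the average overlap rate.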

(2) CAT. When the CAT item pool is composed of all the items in the MST panels and item selection is assumed to be purely random, the average overlap rate is (Chen et al., 2003):

$$\bar{r} = \frac{Tl}{\sum_{t = 1}^{T} P_{t}S_{t}l} = \frac{T}{\sum_{t = 1}^{T} N_{t}}. $$
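To compare the two means numerically, the following sketch evaluates \(\frac{1}{T}\sum_{t} \frac{1}{N_{t}}\) for MST and \(\frac{T}{\sum_{t} N_{t}}\) for CAT from a list of \(N_{t}\) values. The function names and example designs are hypothetical and serve only as an illustration.

```python
from fractions import Fraction

def mst_mean_overlap(n_forms):
    """MST mean overlap under random routing: (1/T) * sum of 1/N_t."""
    T = len(n_forms)
    return sum(Fraction(1, n) for n in n_forms) / T

def cat_mean_overlap(n_forms):
    """CAT mean overlap with purely random selection: T / sum of N_t."""
    return Fraction(len(n_forms), sum(n_forms))

# Hypothetical designs: equal versus unequal numbers of forms per stage.
for design in [(4, 4, 4), (2, 4, 6)]:
    print(design, mst_mean_overlap(design), cat_mean_overlap(design))
```

When every stage has the same number of forms, the two means coincide. Otherwise the MST mean is the larger one, which follows from the same Cauchy inequality used in A.2.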

A.2 Proof of Theorem 2

(1) CAT. The probability that two randomly selected examinees have x items in common is

$$ p(x) = \frac{\binom{S}{L}\binom{L}{x}\binom{S - L}{L - x}}{\binom{S}{L}\binom{S}{L}} = \frac{\binom{L}{x}\binom{S - L}{L - x}}{\binom{S}{L}}, $$
(A.1)

where S is the total number of items in the bank and L is the test length.

The variance of the overlap rate, \(x/L\), is the variance of this hypergeometric distribution divided by \(L^{2}\):

$$\frac{{\mathop{\operatorname{var}}} (x)}{L^{2}} = \frac{(S - L)^{2}}{S^{2}(S - 1)} $$
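The closed form can be checked against the probability mass function in (A.1). The sketch below is an illustration only: the function names and the values S = 30, L = 6 are assumptions chosen for the example, and the arithmetic is exact.

```python
from fractions import Fraction
from math import comb

def cat_overlap_pmf(S, L):
    """Hypergeometric pmf from (A.1): probability of x common items, for each feasible x."""
    total = comb(S, L)
    lowest = max(0, 2 * L - S)  # smallest feasible number of common items
    return {x: Fraction(comb(L, x) * comb(S - L, L - x), total)
            for x in range(lowest, L + 1)}

def cat_overlap_rate_variance(S, L):
    """Variance of the overlap rate x / L computed directly from the pmf."""
    pmf = cat_overlap_pmf(S, L)
    mean = sum(x * prob for x, prob in pmf.items())
    second_moment = sum(x * x * prob for x, prob in pmf.items())
    return (second_moment - mean ** 2) / L ** 2

def cat_overlap_rate_variance_closed(S, L):
    """Closed form (S - L)^2 / (S^2 (S - 1))."""
    return Fraction((S - L) ** 2, S ** 2 * (S - 1))

S, L = 30, 6  # hypothetical bank size and test length
print(cat_overlap_rate_variance(S, L) == cat_overlap_rate_variance_closed(S, L))
```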

(2) MST. First, \(S = \sum_{t = 1}^{T} N_{t}l\) and \(\sum_{t = 1}^{T} l = Tl = L\). The probability that two randomly selected examinees have one form in common at stage t is

$$ p(X = 1) = \frac{\binom{N_{t}}{1}\binom{1}{1}}{\binom{N_{t}}{1}\binom{N_{t}}{1}} = \frac{1}{N_{t}}. $$
(A.2)

The probability that two randomly selected examinees have x forms in common during the entire test is

$$ p(X = x) = \frac{\binom{T}{x}\binom{N_{t}}{1}^{x}\prod_{t \in \bar{\varOmega}_{x}} P\binom{N_{t}}{2}}{\bigl[ \prod_{t = 1}^{T} \binom{N_{t}}{1} \bigr]^{2}}, $$
(A.3)

where P represents permutation (the number of ordered pairs of distinct forms at stage t), \(\varOmega_{x}\) is the set of stages at which the two examinees have a form in common, and \(\bar{\varOmega}_{x}\) is its complement. The variance of the pairwise overlap rate is \(\operatorname{var}\bigl( \frac{xl}{L} \bigr) = \frac{1}{T^{2}}\operatorname{var}(x)\), where \(\operatorname{var}(x)\) can be derived from the probability mass function in (A.3). In closed form, \(\operatorname{var}(x) = \sum_{t = 1}^{T} \frac{N_{t} - 1}{N_{t}^{2}}\), so that \(\sigma_{\mathrm{MST}}^{2} = \frac{1}{T^{2}}\sum_{t = 1}^{T} \frac{N_{t} - 1}{N_{t}^{2}}\).
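As a brute-force check, the mean and variance of the pairwise overlap rate can be obtained by enumerating every pair of routes. This is a sketch only: it assumes each examinee draws one form per stage independently and uniformly at random, as under the random routing rule, and the function name and the design (2, 3, 4) are hypothetical.

```python
from fractions import Fraction
from itertools import product

def mst_overlap_rate_moments(n_forms):
    """Exact mean and variance of the pairwise overlap rate under random routing."""
    T = len(n_forms)
    # A route is one form index per stage; all routes are equally likely.
    routes = list(product(*(range(n) for n in n_forms)))
    weight = Fraction(1, len(routes) ** 2)
    mean = Fraction(0)
    second_moment = Fraction(0)
    for a in routes:
        for b in routes:
            shared_stages = sum(fa == fb for fa, fb in zip(a, b))
            rate = Fraction(shared_stages, T)  # equals x * l / L, since L = T * l
            mean += weight * rate
            second_moment += weight * rate * rate
    return mean, second_moment - mean ** 2

n_forms = (2, 3, 4)  # hypothetical numbers of forms per stage
mean, variance = mst_overlap_rate_moments(n_forms)
closed_form = sum(Fraction(n - 1, n * n) for n in n_forms) / len(n_forms) ** 2
print(mean, variance, variance == closed_form)
```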

(3) Proof of the inequality

To prove \(\sigma_{\mathrm{CAT}}^{2} \le \sigma_{\mathrm{MST}}^{2}\), we need to show

$$\Biggl( \sum_{t = 1}^{T} N_{t} - T \Biggr)^{2} \le \Biggl( \sum_{t = 1}^{T} \frac{N_{t} - 1}{N_{t}^{2}} \Biggr) \Biggl( \sum_{t = 1}^{T} \frac{N_{t}}{T} \Biggr)^{2} \Biggl( \sum _{t = 1}^{T} N_{t}l - 1 \Biggr). $$

First, note that

$$\sum_{t = 1}^{T} \frac{N_{t} - 1}{N_{t}^{2}} \ge \sum_{t = 1}^{T} \biggl( \frac{1}{N_{t}} - \frac{1}{2N_{t}} \biggr) = \sum_{t = 1}^{T} \frac{1}{2N_{t}}\quad \mbox{if } N_{t} \ge 2. $$

By the Cauchy–Schwarz inequality, \(\sum_{t = 1}^{T} N_{t} \sum_{t = 1}^{T} \frac{1}{N_{t}} \ge T^{2}\); therefore,

$$ \sum_{t = 1}^{T} \frac{1}{2N_{t}} \ge \frac{1}{2}\frac{T^{2}}{\sum_{t = 1}^{T} N_{t}}. $$
(A.4)

Moreover,

$$ \frac{ ( \sum_{t = 1}^{T} N_{t} - T )^{2}}{\sum_{t = 1}^{T} N_{t} ( l\sum_{t = 1}^{T} N_{t} - 1 )} \le \frac{1}{2}\quad \mbox{if } l \ge 2. $$
(A.5)

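Combining (A.4) and (A.5) gives

$$\Biggl( \sum_{t = 1}^{T} \frac{N_{t} - 1}{N_{t}^{2}} \Biggr) \Biggl( \sum_{t = 1}^{T} \frac{N_{t}}{T} \Biggr)^{2} \Biggl( \sum_{t = 1}^{T} N_{t}l - 1 \Biggr) \ge \frac{T^{2}}{2\sum_{t = 1}^{T} N_{t}} \cdot \frac{ ( \sum_{t = 1}^{T} N_{t} )^{2}}{T^{2}} \Biggl( l\sum_{t = 1}^{T} N_{t} - 1 \Biggr) = \frac{1}{2}\sum_{t = 1}^{T} N_{t} \Biggl( l\sum_{t = 1}^{T} N_{t} - 1 \Biggr) \ge \Biggl( \sum_{t = 1}^{T} N_{t} - T \Biggr)^{2}, $$

so the inequality holds.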
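The inequality can also be spot-checked numerically from the two closed-form variances. The sketch below is an illustration, not a substitute for the proof: the function names are hypothetical, and the grid (three stages, 2 to 5 forms per stage, module lengths 2 to 4) is an arbitrary assumption that respects the conditions \(N_{t} \ge 2\) and \(l \ge 2\).

```python
from fractions import Fraction
from itertools import product

def var_cat(n_forms, l):
    """CAT overlap-rate variance (S - L)^2 / (S^2 (S - 1)), with S = l * sum(N_t), L = T * l."""
    S = l * sum(n_forms)
    L = len(n_forms) * l
    return Fraction((S - L) ** 2, S ** 2 * (S - 1))

def var_mst(n_forms):
    """MST overlap-rate variance (1 / T^2) * sum of (N_t - 1) / N_t^2."""
    T = len(n_forms)
    return sum(Fraction(n - 1, n * n) for n in n_forms) / T ** 2

# Collect every design on the grid where the CAT variance would exceed the MST variance.
violations = [
    (n_forms, l)
    for n_forms in product(range(2, 6), repeat=3)
    for l in range(2, 5)
    if var_cat(n_forms, l) > var_mst(n_forms)
]
print(len(violations))
```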

A.3 Proof of Theorem 3

(1) CAT. For CAT, the mean and variance of the overlap follow from the results above by substituting the new item bank size \(\sum_{t = 1}^{T} (N_{t}l - n_{t})\).

(2) MST. Let \(x_{t}\) denote the number of common items between two randomly selected examinees at stage t.

$$ x_{t} = \begin{cases} l & \text{with probability } \frac{1}{N_{t}}\text{: two examinees pick exactly the same form,}\\[4pt] \frac{n_{t}}{\binom{N_{t}}{2}} & \text{with probability } \frac{P\binom{N_{t}}{2}}{\bigl[ \binom{N_{t}}{1} \bigr]^{2}} = \frac{N_{t} - 1}{N_{t}}\text{: average number of common items between two different forms.} \end{cases} $$
(A.6)

Let X denote the number of common items between two randomly selected examinees over the whole test; then

$$X = \sum_{t = 1}^{T} x_{t}. $$

It is straightforward to compute that

$$E(X) = \sum_{t = 1}^{T} \biggl( \frac{l}{N_{t}} + \frac{2n_{t}}{N_{t}^{2}} \biggr). $$

Because the \(x_{t}\) are independent across stages,

$${\mathop{\operatorname{var}}} (X) = \sum_{t = 1}^{T} \biggl[ \frac{l^{2}}{N_{t}} + \frac{4n_{t}^{2}}{N_{t}^{3}(N_{t} - 1)} - \frac{l^{2}}{N_{t}^{2}} - \frac{4n_{t}^{2}}{N_{t}^{4}} - \frac{4n_{t}l}{N_{t}^{3}} \biggr]. $$
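The per-stage terms in \(E(X)\) and \(\operatorname{var}(X)\) can be verified directly from the two-point distribution in (A.6). The sketch below is illustrative: the function names and the stage settings are hypothetical, and \(N_{t} \ge 2\) is assumed so that \(\binom{N_{t}}{2} > 0\).

```python
from fractions import Fraction
from math import comb

def stage_moments(l, N, n):
    """Mean and variance of x_t computed directly from the two-point distribution (A.6)."""
    outcomes = [
        (Fraction(l), Fraction(1, N)),                   # same form: l common items
        (Fraction(n, comb(N, 2)), Fraction(N - 1, N)),   # different forms: average overlap
    ]
    mean = sum(value * prob for value, prob in outcomes)
    second_moment = sum(value * value * prob for value, prob in outcomes)
    return mean, second_moment - mean ** 2

def stage_moments_closed(l, N, n):
    """Per-stage summands of the closed-form E(X) and var(X)."""
    mean = Fraction(l, N) + Fraction(2 * n, N ** 2)
    variance = (Fraction(l * l, N) + Fraction(4 * n * n, N ** 3 * (N - 1))
                - Fraction(l * l, N ** 2) - Fraction(4 * n * n, N ** 4)
                - Fraction(4 * n * l, N ** 3))
    return mean, variance

# Hypothetical stages given as (l, N_t, n_t); independence lets the stage terms be summed.
stages = [(10, 2, 1), (10, 3, 3), (10, 4, 6)]
expected_X = sum(stage_moments(*s)[0] for s in stages)
variance_X = sum(stage_moments(*s)[1] for s in stages)
print(expected_X, variance_X)
```

Setting \(n_{t} = 0\) recovers the non-overlapping case, in which each stage contributes \(l^{2}(N_{t} - 1)/N_{t}^{2}\) to the variance.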


Wang, C., Zheng, Y. & Chang, HH. Does Standard Deviation Matter? Using “Standard Deviation” to Quantify Security of Multistage Testing. Psychometrika 79, 154–174 (2014). https://doi.org/10.1007/s11336-013-9356-y


  • DOI: https://doi.org/10.1007/s11336-013-9356-y
