Abstract
With the advent of web-based technology, online testing is becoming a mainstream mode of delivery in large-scale educational assessments. Most online tests are administered continuously within a testing window, which may pose test security problems because examinees who take the test earlier may share information with those who take it later. Researchers have proposed various statistical indices to assess test security; one of the most frequently used is the average test-overlap rate, which was further generalized to the item pooling index (Chang & Zhang, 2002, 2003). These indices, however, are all defined as means (that is, the expected proportion of items shared among examinees), and they were originally proposed for computerized adaptive testing (CAT). Recently, multistage testing (MST) has become a popular alternative to CAT. As we advocate in this paper, the unique features of MST make it important to report not only the mean but also the standard deviation (SD) of the test overlap rate. The SD adds important information to the test security profile because, for the same mean, a large SD indicates that certain groups of examinees share more items in common than other groups. In this study, we analytically derive lower bounds of the SD under MST, with the results under CAT as a benchmark. It is shown that when the mean overlap rate is the same between MST and CAT, the SD of the test overlap rate tends to be larger in MST. A simulation study provides empirical evidence. We also compare the security of MST under single-pool versus multiple-pool designs; both analytical and simulation results show that the non-overlapping multiple-pool design slightly increases the security risk.
Notes
Suppose we can assemble all \(\prod_{t=1}^{T}\prod_{i=1}^{S_{t}}P_{it}\) possible panels from the item bank by fully mixing and matching all possible parallel forms. Then, under the nonpanel assumption, the probability that a given form is assigned to a test taker is \(\frac{1}{N_{t}}\), whereas with a typical panel structure, the probability that the same form is assigned to a test taker is \(\frac{1}{P_{it}}\cdot\frac{P_{it}}{N_{t}}=\frac{1}{N_{t}}\), where \(\frac{1}{P_{it}}\) represents the probability that a randomly chosen panel contains that particular form, and \(\frac{P_{it}}{N_{t}}\) denotes the probability that the form will be chosen at stage \(t\). The two probabilities are therefore the same.
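The equality of the two assignment probabilities can be checked with exact arithmetic; a minimal sketch (the values of \(P_{it}\) and \(N_{t}\) below are arbitrary illustrations):

```python
from fractions import Fraction

# Hypothetical counts: P_it panels contain the given form; N_t parallel forms at stage t.
P_it, N_t = 6, 24

# Nonpanel assumption: the form is drawn uniformly from the N_t forms.
p_nonpanel = Fraction(1, N_t)

# Panel structure: a random panel contains the form with probability 1/P_it,
# and the form is then chosen at stage t with probability P_it/N_t.
p_panel = Fraction(1, P_it) * Fraction(P_it, N_t)

assert p_nonpanel == p_panel == Fraction(1, 24)
```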
References
Ariel, A., Veldkamp, B.P., & van der Linden, W.J. (2004). Constructing rotating item pools for constrained adaptive testing. Journal of Educational Measurement, 41, 345–359.
Barrada, J.R., Olea, J., & Abad, F.J. (2008). Rotating item banks versus restriction of maximum exposure rates in computerized adaptive testing. The Spanish Journal of Psychology, 11, 618–625.
Breithaupt, K., & Hare, D.R. (2007). Automated simultaneous assembly of multistage testlets for a high-stakes licensing examination. Educational and Psychological Measurement, 67(1), 5–20.
Chang, H.-H. (2004). Understanding computerized adaptive testing: from Robbins-Monro to Lord and beyond. In D. Kaplan (Ed.), The Sage handbook of quantitative methodology for the social sciences (pp. 117–133). Thousand Oaks: Sage.
Chang, H., Wang, S., & Ying, Z. (1997). Three dimensional visualization of item/test information. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.
Chang, H.-H., & Ying, Z. (1999). Alpha-stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23, 211–222.
Chang, H.-H., & Zhang, J. (2002). Hypergeometric family and item overlap rates in computerized adaptive testing. Psychometrika, 67, 387–398.
Chang, H., & Zhang, J. (2003, April). Assessing CAT security breaches by the item pooling index—to compromise a CAT item bank, how many thieves are needed? Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago.
Cheng, Y., & Chang, H. (2009). The maximum priority index method for severely constrained item selection in computerized adaptive testing. British Journal of Mathematical & Statistical Psychology, 62, 369–383.
Chen, S.Y., Ankenmann, R.D., & Spray, J.A. (2003). The relationship between item exposure and test overlap in computerized adaptive testing. Journal of Educational Measurement, 40, 129–145.
Davey, T., & Nering, M. (2002). Controlling item exposure and maintaining item security. In C. Mills, M.T. Potenza, J.J. Fremer, & W.C. Ward (Eds.), Computer-based testing: building the foundation for future assessments. Mahwah: Lawrence Erlbaum Associates.
Dean, V., & Martineau, J. (2012). A state perspective on enhancing assessment and accountability systems through systematic implementation of technology. In R.W. Lissitz & H. Jiao (Eds.), Computers and their impact on state assessment: recent history and predictions for the future (pp. 55–77). Charlotte: Information Age Publisher.
Finkelman, M., Nering, M.L., & Roussos, L.A. (2009). A conditional exposure control method for multidimensional adaptive testing. Journal of Educational Measurement, 46(1), 84–103.
Hendrickson, A. (2007). An NCME instructional module on multistage testing. Educational Measurement, Issues and Practice, 26, 44–52.
Kim, H., & Plake, B. (1993, April). Monte Carlo simulation comparison of two-stage testing and computerized adaptive testing. Paper presented at the meeting of the National Council on Measurement in Education, Atlanta, GA.
Lim, E. (2010). The effectiveness of using multiple item pools to increase test security in computerized adaptive testing. Unpublished doctoral thesis, University of Illinois at Urbana-Champaign.
Luecht, R.M., & Nungester, R.J. (1998). Some practical examples of computer-adaptive sequential testing. Journal of Educational Measurement, 35(3), 229–249.
Mills, C.N., & Steffen, M. (2000). The GRE computer adaptive test: operational issues. In W.J. van der Linden & C.A.W. Glas (Eds.), Computerized adaptive testing: theory and practice (pp. 75–99). Dordrecht: Kluwer.
Stocking, M.L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23, 57–75.
Wang, C., & Chang, H.-H. (2008, June). Continuous a-stratification index in computerized item selection. Paper presented at the annual meeting of the Psychometric Society, Durham, NH.
Way, W.D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement, Issues and Practice, 17, 17–27.
Way, W., Zara, A., & Leahy, J. (1996, April). Modifying the NCLEX TM CAT item selection algorithm to improve item exposure. Paper presented at the Annual Meeting of the American Educational Research Association, New York, NY.
Yi, Q., Zhang, J., & Chang, H.-H. (2008). Severity of organized item theft in computerized adaptive testing: a simulation study. Applied Psychological Measurement, 32(7), 543–558.
Zhang, J., Chang, H.-H., & Yi, Q. (2012). Comparing single-pool and multiple-pool designs regarding test security in computerized testing. Behavior Research Methods, 44, 742–752.
Appendix
A.1 Proof of Theorem 1
(1) MST. Under the random routing rule, the average overlap rate for MST is
\[
\bar{O}_{\mathrm{MST}} = \frac{1}{L}\sum_{t=1}^{T}\sum_{f=1}^{N_{t}} l\,\frac{m_{f}(m_{f}-1)}{p(p-1)} = \frac{1}{T}\sum_{t=1}^{T}\sum_{f=1}^{N_{t}}\frac{m_{f}(m_{f}-1)}{p(p-1)} \approx \frac{1}{T}\sum_{t=1}^{T}\sum_{f=1}^{N_{t}} r_{f}^{2},
\]
where \(m_{f}\) is the number of times form \(f\) is administered, \(l\) is the number of items within each form, \(p\) is the total number of examinees, and \(r_{f} = \frac{m_{f}}{p}\) is the form exposure rate.
According to this derivation, first, the number of items within each form (i.e., the form length \(l\)) will not affect the average test overlap rate.
To decrease the average test overlap rate, the objective is to minimize \(\frac{1}{T}\sum_{t = 1}^{T} \sum_{f = 1}^{N_{t}} r_{f}^{2}\) subject to the constraint that \(\bar{r}_{f} = \frac{1}{N_{t}}\) within each stage.
Notice that, by the Cauchy–Schwarz inequality,
\[
\sum_{f=1}^{N_{t}} r_{f}^{2} \ge N_{t}\,\bar{r}_{f}^{2} = \frac{1}{N_{t}},
\]
with equality if and only if \(r_{f} = \frac{1}{N_{t}}\) for every form \(f\); hence the minimum average overlap rate is \(\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_{t}}\).
The conclusion is that, when random routing is used, only \(N_{t}\), the number of alternative forms at each stage, affects the result. When a more informative routing rule is used, then for fixed \(N_{t}\), the values of \(P_{it}\) and \(S_{t}\) will influence the estimation precision.
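Under random routing, the objective \(\frac{1}{T}\sum_{t}\sum_{f} r_{f}^{2}\) is minimized by uniform form exposure within each stage. A minimal numeric sketch (the stage configuration \(N_t\) below is an arbitrary illustration):

```python
# Check that uniform form exposure minimizes (1/T) * sum_t sum_f r_f^2,
# subject to the exposure rates at each stage summing to 1.
N = [3, 4, 5]          # illustrative N_t for T = 3 stages
T = len(N)

def avg_overlap(rates):
    """(1/T) times the sum of squared form exposure rates over all stages."""
    return sum(r * r for stage in rates for r in stage) / T

uniform = [[1.0 / n] * n for n in N]
skewed = [[0.5] + [0.5 / (n - 1)] * (n - 1) for n in N]  # one over-exposed form per stage

# Uniform exposure attains the lower bound (1/T) * sum_t 1/N_t ...
assert abs(avg_overlap(uniform) - sum(1.0 / n for n in N) / T) < 1e-12
# ... and any skewed exposure profile does worse.
assert avg_overlap(uniform) < avg_overlap(skewed)
```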
(2) CAT. When the CAT item pool comprises all the items in the MST panels and item selection is purely random, the average overlap rate is (Chen et al., 2003)
\[
\bar{O}_{\mathrm{CAT}} \approx \frac{L}{S} = \frac{Tl}{l\sum_{t=1}^{T}N_{t}} = \frac{T}{\sum_{t=1}^{T}N_{t}}.
\]
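With purely random selection of \(L\) items from a bank of size \(S\), the expected pairwise overlap rate is approximately \(L/S\); a Monte Carlo sketch (bank and test sizes are illustrative):

```python
import random

# Monte Carlo sketch: two examinees each receive L items drawn at random
# from a bank of S items; the expected proportion of shared items is L/S.
random.seed(1)
S, L, reps = 120, 12, 20000

total = 0.0
for _ in range(reps):
    a = set(random.sample(range(S), L))
    b = set(random.sample(range(S), L))
    total += len(a & b) / L   # proportion of items the pair shares

print(total / reps, L / S)   # the two values should be close
```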
A.2 Proof of Theorem 2
(1) CAT. The probability that two randomly selected examinees have \(x\) items in common follows the hypergeometric distribution
\[
P(X = x) = \frac{\binom{L}{x}\binom{S-L}{L-x}}{\binom{S}{L}}, \quad x = 0, 1, \ldots, L,
\]
where \(S\) is the total number of items in the bank and \(L\) is the test length.
The variance of the overlap rate \(X/L\) is simply the scaled variance of the hypergeometric distribution, expressed as
\[
\sigma_{\mathrm{CAT}}^{2} = \frac{1}{L^{2}}\operatorname{var}(X) = \frac{(S-L)^{2}}{S^{2}(S-1)}.
\]
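The hypergeometric mean and variance of the overlap rate can be verified exactly from the probability mass function; a sketch with illustrative sizes:

```python
from math import comb

# Pairwise overlap count X is hypergeometric (L draws, L "special" items,
# bank of size S); check E(X/L) = L/S and var(X/L) = (S-L)^2 / (S^2 (S-1)).
S, L = 40, 8

pmf = [comb(L, x) * comb(S - L, L - x) / comb(S, L) for x in range(L + 1)]
mean = sum(x * p for x, p in enumerate(pmf)) / L                # E[X/L]
var = sum((x / L - mean) ** 2 * p for x, p in enumerate(pmf))   # var(X/L)

closed_form = (S - L) ** 2 / (S ** 2 * (S - 1))
assert abs(sum(pmf) - 1) < 1e-12       # pmf sums to one
assert abs(mean - L / S) < 1e-12       # mean overlap rate is L/S
assert abs(var - closed_form) < 1e-12  # variance matches the closed form
```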
(2) MST. First, \(S = \sum_{t = 1}^{T} N_{t}l\) and \(\sum_{t = 1}^{T} l = Tl = L\). The probability that two randomly selected examinees are administered the same form at stage \(t\) is \(\frac{1}{N_{t}}\).
The probability that two randomly selected examinees have \(x\) forms in common during the entire test is
\[
P(X = x) = \sum_{\Omega_{x}} \prod_{t \in \Omega_{x}} \frac{1}{N_{t}} \prod_{t \notin \Omega_{x}} \Bigl(1 - \frac{1}{N_{t}}\Bigr),
\tag{A.3}
\]
where the sum runs over all sets \(\Omega_{x}\) of \(x\) stages at which the common forms occur. The variance of the pair-wise overlap rate is \(\operatorname{var} ( \frac{xl}{L} ) = \frac{1}{T^{2}}\operatorname{var}(x)\), where \(\operatorname{var}(x)\) can be derived easily from the probability mass function in (A.3). Because the stage-wise indicators are independent Bernoulli variables with success probabilities \(\frac{1}{N_{t}}\), the final closed form is \(\operatorname{var}(x) = \sum_{t = 1}^{T} \frac{N_{t} - 1}{N_{t}^{2}}\), so that \(\sigma_{\mathrm{MST}}^{2} = \frac{1}{T^{2}}\sum_{t = 1}^{T} \frac{N_{t} - 1}{N_{t}^{2}}\).
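The closed form \(\operatorname{var}(x) = \sum_{t}\frac{N_{t}-1}{N_{t}^{2}}\) can be checked by simulation; a Monte Carlo sketch with an illustrative stage configuration:

```python
import random

# Under random routing, two examinees share a form at stage t with
# probability 1/N_t, so the number of shared forms x has
# var(x) = sum_t (N_t - 1) / N_t^2 (divide by T^2 for the overlap rate).
random.seed(7)
N = [2, 3, 4]           # illustrative numbers of parallel forms per stage
reps = 50000

counts = []
for _ in range(reps):
    # x = number of stages at which two independent random routes coincide
    counts.append(sum(random.randrange(n) == random.randrange(n) for n in N))

mean_x = sum(counts) / reps
var_x = sum((c - mean_x) ** 2 for c in counts) / reps

theory = sum((n - 1) / n ** 2 for n in N)   # closed-form var(x)
print(var_x, theory)                        # should be close
```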
(3) Proof of the inequality
To prove \(\sigma_{\mathrm{CAT}}^{2} \le \sigma_{\mathrm{MST}}^{2}\), we need to show
\[
\frac{(S-L)^{2}}{S^{2}(S-1)} \le \frac{1}{T^{2}}\sum_{t=1}^{T}\frac{N_{t}-1}{N_{t}^{2}},
\]
because
By the Cauchy–Schwarz inequality, \(\sum_{t = 1}^{T} N_{t} \sum_{t = 1}^{T} \frac{1}{N_{t}} \ge T^{2}\); therefore,
Because
Combining (A.4) and (A.5), the inequality holds.
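The inequality of Theorem 2 can be spot-checked numerically, assuming the closed forms \(\sigma_{\mathrm{CAT}}^{2} = (S-L)^{2}/(S^{2}(S-1))\) with \(S = l\sum_{t}N_{t}\), \(L = lT\), and \(\sigma_{\mathrm{MST}}^{2} = \frac{1}{T^{2}}\sum_{t}\frac{N_{t}-1}{N_{t}^{2}}\); the sketch draws random stage configurations:

```python
import random

# Numeric spot-check of sigma^2_CAT <= sigma^2_MST over random configurations
# (assumed closed forms; N_t = forms per stage, l = items per form).
random.seed(3)

def var_cat(N, l):
    S, L = l * sum(N), l * len(N)
    return (S - L) ** 2 / (S ** 2 * (S - 1))

def var_mst(N, l):
    T = len(N)
    return sum((n - 1) / n ** 2 for n in N) / T ** 2

for _ in range(1000):
    T = random.randint(1, 6)
    N = [random.randint(2, 20) for _ in range(T)]
    l = random.randint(1, 30)
    assert var_cat(N, l) <= var_mst(N, l) + 1e-15
```

Note the boundary case \(T = 1\), \(l = 1\): both variances reduce to \((N_{1}-1)/N_{1}^{2}\), so the bound is tight there.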
A.3 Proof of Theorem 3
(1) CAT. For CAT, the mean and variance of the overlap rate are obtained by substituting the new item bank size \(\sum_{t = 1}^{T} (N_{t}l - n_{t})\) into the formulas above.
(2) MST. Let \(x_{t}\) denote the number of common items between two randomly picked examinees at stage \(t\).
Let \(X\) denote the number of common items between two randomly selected examinees over the entire test; then \(X = \sum_{t = 1}^{T} x_{t}\).
It is straightforward to compute that
Because the \(x_{t}\)'s are independent across different stages, \(E(X) = \sum_{t = 1}^{T} E(x_{t})\) and \(\operatorname{var}(X) = \sum_{t = 1}^{T} \operatorname{var}(x_{t})\).
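The additivity of means and (under independence) variances can be verified exactly; a sketch in which the per-stage distributions are arbitrary stand-ins for the stage-wise common-item counts:

```python
from fractions import Fraction
from itertools import product

# Exact check: for independent stage-wise counts x_t, E(X) = sum E(x_t)
# and var(X) = sum var(x_t). The pmfs below are illustrative, not the
# actual multiple-pool distributions.
pmfs = [
    {0: Fraction(1, 2), 1: Fraction(1, 2)},
    {0: Fraction(2, 3), 1: Fraction(1, 3)},
    {0: Fraction(3, 4), 2: Fraction(1, 4)},
]

def mean_var(pmf):
    m = sum(x * p for x, p in pmf.items())
    v = sum((x - m) ** 2 * p for x, p in pmf.items())
    return m, v

# Build the joint distribution of X = x_1 + x_2 + x_3 under independence.
joint = {}
for combo in product(*(pmf.items() for pmf in pmfs)):
    x = sum(k for k, _ in combo)
    p = Fraction(1)
    for _, q in combo:
        p *= q
    joint[x] = joint.get(x, Fraction(0)) + p

m_joint, v_joint = mean_var(joint)
assert m_joint == sum(mean_var(pmf)[0] for pmf in pmfs)
assert v_joint == sum(mean_var(pmf)[1] for pmf in pmfs)
```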
Cite this article
Wang, C., Zheng, Y. & Chang, HH. Does Standard Deviation Matter? Using “Standard Deviation” to Quantify Security of Multistage Testing. Psychometrika 79, 154–174 (2014). https://doi.org/10.1007/s11336-013-9356-y