
Sequential Generalized Likelihood Ratio Tests for Online Item Monitoring

  • Theory and Methods
  • Published in Psychometrika

Abstract

This study presents statistical procedures for monitoring the functioning of test items over time. We propose generalized likelihood ratio tests that monitor multiple item parameters jointly and can be combined with various sampling schemes to perform continuous or intermittent monitoring. The procedures examine the stability of item parameters over time and signal compromise as soon as a significant parameter shift is identified. The performance of the monitoring procedures was evaluated using simulated and real-assessment data. The empirical evaluation suggests that the proposed procedures identify parameter drift effectively: they achieved satisfactory detection power, gave timely signals, and kept error rates reasonably low. The procedures also outperformed existing methods. These findings suggest that multivariate parametric monitoring can provide an efficient and powerful quality-control tool for maintaining item pools. The procedures allow joint monitoring of multiple item parameters and achieve sufficient power through likelihood-ratio tests. Based on the empirical findings, we suggest practical strategies for performing online item monitoring.
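
To make the monitoring scheme concrete, the sketch below implements a generic sequential GLR chart for a mean shift in a stream of item-parameter estimates. It is a minimal illustration, not the paper's implementation: it assumes approximately Gaussian calibration errors with known covariance, and the window size, threshold, and function names are all placeholders.

```python
import numpy as np

def glr_statistic(window, mu0, sigma_inv):
    """Gaussian GLR statistic for a mean shift on one candidate segment.

    For a post-change segment `window` (n x p), the supremum over the
    unknown post-change mean is attained at the segment sample mean,
    giving (n / 2) * (xbar - mu0)' Sigma^{-1} (xbar - mu0).
    """
    n = window.shape[0]
    diff = window.mean(axis=0) - mu0
    return 0.5 * n * diff @ sigma_inv @ diff

def sequential_glr(estimates, mu0, sigma, threshold, max_lag=30):
    """Scan recent candidate change points; signal when the GLR exceeds the limit.

    estimates : (T x p) NumPy array of sequential item-parameter estimates
    mu0       : calibrated (null) parameter values, length p
    Returns the first time index at which a shift is flagged, or None.
    """
    sigma_inv = np.linalg.inv(sigma)
    for t in range(1, len(estimates) + 1):
        # maximize the likelihood ratio over recent candidate change points k
        for k in range(max(0, t - max_lag), t):
            if glr_statistic(estimates[k:t], mu0, sigma_inv) > threshold:
                return t - 1  # flag at the current event time
    return None
```

For instance, feeding a stream of (discrimination, difficulty) estimates recalibrated at each review point would flag the first review at which the accumulated likelihood-ratio evidence exceeds the limit; the limit itself must be calibrated to control the false-alarm rate (see Note 4).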

Notes

  1. Previous studies defined the reference sample as \({\mathcal {R}} = \{i: \, i = 1 , \, \ldots \, , \,t - m \}\), so that it grows as monitoring progresses. This study uses a fixed reference sample to mitigate the potential impact of false negatives accumulating in an expanding reference sample.

  2. Under the simulation design in Sect. 4, sequential testing based on \({\mathcal {T}}^2\) showed an average false positive rate of 16.88%. The procedure also tended to flag drifted items prematurely, before the actual parameter shift, with an early detection rate of 9.70%. The chart appeared overly sensitive to small fluctuations arising from sampling and calibration error. Note that, unlike standard Shewhart control charts, which examine manifest variables, the Shewhart chart based on \({\mathcal {T}}^2\) examines estimated parameters and can therefore be influenced by sampling and estimation error (a minimal computation of the \({\mathcal {T}}^2\) statistic is sketched after these notes).

  3. We note that there are other ways of constructing a multivariate chart (e.g., Healy, 1987; Pignatiello & Runger, 1990; Woodall & Ncube, 1985). These procedures, however, either rest on impractical assumptions (e.g., known shift directions or multiple univariate charts) or make little difference to the monitoring statistics in the present setting, because only one observation is evaluated at each time point.

  4. In sequential testing, the Type I error can be defined in three ways: across event times, across items, and across both events and items (one possible operationalization is sketched after these notes).

  5. Recall that the charting statistics are obtained by subtracting the reference value (k). The larger the k, the smaller the null charting statistics and, thus, the smaller the decision limit (see the CUSUM sketch after these notes).

  6. We also considered obtaining the threshold values by simulation. The resulting values, however, did not generally accord with the statistics observed in the real data, possibly due to disparities in sampling (e.g., content balancing and item-exposure control in real testing). A Monte Carlo calibration sketch follows these notes.
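
As referenced in Note 2, a Shewhart-type chart based on Hotelling's \({\mathcal {T}}^2\) evaluates a single vector of parameter estimates at each event time. The sketch below shows the plug-in computation; the variable names and the fixed null covariance are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def hotelling_t2(theta_hat, theta_ref, cov_null):
    """Hotelling's T^2 for one vector of item-parameter estimates.

    theta_hat : current estimates of the item parameters (length p)
    theta_ref : reference (calibrated) parameter values
    cov_null  : covariance of the estimates under no drift
    Each event time contributes a single estimate vector, so the chart
    inherits all of the sampling and calibration noise in theta_hat,
    which is why it can be overly sensitive to small fluctuations.
    """
    diff = np.asarray(theta_hat) - np.asarray(theta_ref)
    return diff @ np.linalg.inv(cov_null) @ diff
```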
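Note 4 distinguishes three ways of defining the Type I error. One plausible operationalization, given a record of flags raised on non-drifted items, is sketched below; the paper's exact definitions may differ.

```python
import numpy as np

def false_alarm_rates(flags):
    """Summarize false alarms three ways from a boolean flag matrix.

    flags : (n_items, n_times) array; True where a non-drifted item
            was incorrectly signaled at an event time.
    """
    across_times = flags.any(axis=0).mean()  # event times with any false signal
    across_items = flags.any(axis=1).mean()  # items ever falsely signaled
    across_both = flags.mean()               # all item-by-time decisions
    return across_times, across_items, across_both
```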
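The role of the reference value k in Note 5 is visible in the classical one-sided CUSUM recursion (Page, 1954), sketched below; this is the textbook update, not necessarily the paper's exact charting statistic.

```python
def cusum_update(prev_stat, score, k):
    """Classical one-sided CUSUM update: S_t = max(0, S_{t-1} + score_t - k).

    Each increment is reduced by the reference value k before accumulating,
    so a larger k yields stochastically smaller null statistics and hence
    a smaller decision limit for the same false-alarm rate.
    """
    return max(0.0, prev_stat + score - k)
```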
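The simulation-based alternative mentioned in Note 6 amounts to Monte Carlo calibration of the decision limit. A minimal sketch under idealized sampling follows; it ignores content balancing and item-exposure control, which is precisely the mismatch the note describes, and all names are illustrative.

```python
import numpy as np

def simulate_decision_limit(simulate_null_run, charting_statistic,
                            n_reps=10_000, alpha=0.05, seed=None):
    """Monte Carlo decision limit: the (1 - alpha) quantile of the maximum
    charting statistic over replicated no-drift monitoring runs.

    simulate_null_run  : callable(rng) -> simulated response data, no drift
    charting_statistic : callable(data) -> maximum charting statistic of a run
    Assumes simple random sampling of examinees, unlike operational testing.
    """
    rng = np.random.default_rng(seed)
    stats = [charting_statistic(simulate_null_run(rng)) for _ in range(n_reps)]
    return float(np.quantile(stats, 1.0 - alpha))
```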

References

  • Armstrong, R. D., & Shi, M. (2009). A parametric cumulative sum statistic for person fit. Applied Psychological Measurement, 33, 391–410.

  • Ban, J. C., Hanson, B. A., Wang, T., Yi, Q., & Harris, D. J. (2001). A comparative study of on-line pretest item-calibration/scaling methods in computerized adaptive testing. Journal of Educational Measurement, 38(3), 191–212.

  • Basseville, M., & Nikiforov, I. V. (1993). Detection of abrupt changes: Theory and applications. Prentice-Hall Inc.

  • Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Reading, MA: Addison-Wesley.

  • Bock, R., Muraki, E., & Pfeiffenberger, W. (1988). Item pool maintenance in the presence of item parameter drift. Journal of Educational Measurement, 25, 275–285.

  • Choe, E. M., Zhang, J., & Chang, H.-H. (2018). Sequential detection of compromised items using response times in computerized adaptive testing. Psychometrika, 83, 650–673.

  • Clark, A. (2013). Review of parameter drift methodology and implications for operational testing. Retrieved from https://www.ncbex.org/statistics-and-research/covington-award

  • Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.

  • Crosier, R. B. (1988). Multivariate generalizations of cumulative sum quality-control schemes. Technometrics, 30, 291–303.

  • DeMars, C. E. (2004). Detection of item parameter drift over multiple test administrations. Applied Measurement in Education, 17, 265–300.

  • Donoghue, J. R., & Isham, S. P. (1998). A comparison of procedures to detect item parameter drift. Applied Psychological Measurement, 22(1), 33–51.

  • Goldstein, H. (1983). Measuring changes in educational attainment over time: Problems and possibilities. Journal of Educational Measurement, 20, 369–377.

  • Guo, H., Robin, F., & Dorans, N. (2017). Detecting item drift in large-scale testing. Journal of Educational Measurement, 54, 265–284.

  • Healy, J. D. (1987). A note on multivariate CUSUM procedures. Technometrics, 29, 409–412.

  • Hotelling, H. (1931). The generalization of Student’s ratio. Annals of Mathematical Statistics, 2, 360–378. https://doi.org/10.1214/aoms/1177732979

  • Huggins-Manley, A. C. (2017). Psychometric consequences of subpopulation item parameter drift. Educational and Psychological Measurement, 77, 143–164. https://doi.org/10.1177/0013164416643369

  • Kang, H.-A., Zheng, Y., & Chang, H.-H. (2020). Online calibration of a joint model of item responses and response times in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 45, 175–208.

  • Klein Entink, R. H., Kuhn, J.-T., Hornke, L. F., & Fox, J.-P. (2009). Evaluating cognitive theory: A joint modeling approach using responses and response times. Psychological Methods, 14, 54–75.

  • Lai, T. (1991). Asymptotic optimality of generalized sequential likelihood ratio tests in some classical sequential testing problems. In B. K. Ghosh & P. K. Sen (Eds.), Handbook of sequential analysis (pp. 121–144). New York: Marcel Dekker Inc.

  • Lee, Y.-H., & Lewis, C. (2021). Monitoring item performance with CUSUM statistics in continuous testing. Journal of Educational and Behavioral Statistics, 46, 611–648. https://doi.org/10.3102/1076998621994563

  • Liu, C., Han, K. T., & Li, J. (2019). Compromised item detection for computerized adaptive testing. Frontiers in Psychology, 10, 829. https://doi.org/10.3389/fpsyg.2019.00829

  • Lowry, C. A., Woodall, W. H., Champ, C. W., & Rigdon, S. E. (1992). A multivariate EWMA control chart. Technometrics, 34, 46–53.

  • Marianti, S., Fox, J.-P., Avetisyan, M., Veldkamp, B. P., & Tijmstra, J. (2014). Testing for aberrant behavior in response time modeling. Journal of Educational and Behavioral Statistics, 39, 426–451.

  • Page, E. S. (1954). Continuous inspection schemes. Biometrika, 41, 100–115. https://doi.org/10.1093/biomet/41.1-2.100

  • Pignatiello, J. J., & Runger, G. C. (1990). Comparisons of multivariate CUSUM charts. Journal of Quality Technology, 22, 173–186.

  • Segall, D. O. (2002). An item response model for characterizing test compromise. Journal of Educational and Behavioral Statistics, 27, 163–179.

  • Segall, D. O. (2004). A sharing item response theory model for computerized adaptive testing. Journal of Educational and Behavioral Statistics, 29, 439–460.

  • Shu, Z., Henson, R., & Luecht, R. (2013). Using deterministic, gated item response theory model to detect test cheating due to item compromise. Psychometrika, 78, 481–497.

  • Sinharay, S., & Johnson, M. S. (2020). The use of item scores and response times to detect examinees who may have benefited from item preknowledge. British Journal of Mathematical and Statistical Psychology, 73, 397–419.

  • Tendeiro, J. N., Meijer, R. R., Schakel, L., & Maij-de Meij, A. M. (2013). Using cumulative sum statistics to detect inconsistencies in unproctored internet testing. Educational and Psychological Measurement, 73, 143–161.

  • van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31, 181–204.

  • van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287–308.

  • van der Linden, W. J., & Guo, F. (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73(3), 365–384.

  • van Krimpen-Stoop, E. M. L. A., & Meijer, R. R. (2001). CUSUM-based person-fit statistics for adaptive testing. Journal of Educational and Behavioral Statistics, 26, 199–218.

  • Veerkamp, W. J. J., & Glas, C. A. W. (2000). Detection of known items in adaptive testing with a statistical quality control method. Journal of Educational and Behavioral Statistics, 25, 373–389.

  • Wang, X., & Liu, Y. (2020). Detecting compromised items using information from secure items. Journal of Educational and Behavioral Statistics, 45, 667–689.

  • Wells, C. S., Subkoviak, M. J., & Serlin, R. C. (2002). The effect of item parameter drift on examinee ability estimates. Applied Psychological Measurement, 26, 77–87.

  • Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. The Annals of Mathematical Statistics, 9, 60–62.

  • Woodall, W. H., & Ncube, M. M. (1985). Multivariate CUSUM quality control procedures. Technometrics, 27, 285–292.

  • Yang, Y., Ferdous, A., & Chin, T. Y. (2007). Exposed items detection in personnel selection assessment: An exploration of new item statistic. Paper presented at the annual meeting of the National Council on Measurement in Education, Chicago, IL.

  • Zhang, J. (2014). A sequential procedure for detecting compromised items in the item pool of CAT system. Applied Psychological Measurement, 38, 87–104.

  • Zhang, J., & Li, J. (2016). Monitoring items in real time to enhance CAT security. Journal of Educational Measurement, 53, 131–151.

  • Zhang, J., Li, Z., & Wang, Z. (2010). A multivariate control chart for simultaneously monitoring process mean and variability. Computational Statistics and Data Analysis, 54, 2244–2252.

  • Zopluoglu, C. (2019). Detecting examinees with item preknowledge in large-scale testing using extreme gradient boosting (XGBoost). Educational and Psychological Measurement, 79, 931–961.

Author information

Correspondence to Hyeon-Ah Kang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Kang, HA. Sequential Generalized Likelihood Ratio Tests for Online Item Monitoring. Psychometrika 88, 672–696 (2023). https://doi.org/10.1007/s11336-022-09871-9
