Comparison of Outlier Detection Methods in NEAT Design

  • Conference paper
Quantitative Psychology (IMPS 2020)

Part of the book series: Springer Proceedings in Mathematics & Statistics (PROMS, volume 353)

Abstract

In equating practice, outliers among the anchor items can degrade equating accuracy and threaten the validity of test scores. This simulation study compared the performance of three outlier detection methods for equating: the t-test method, the logit difference method, and the robust z statistic. The factors investigated were sample size, proportion of outliers, direction of item difficulty drift, and group difference. Across all simulated conditions, the t-test method outperformed the other methods in sensitivity (flagging true outliers), specificity (leaving true non-outliers unflagged), bias of the translation constant, and root mean square error of the estimated examinee ability.
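
To make two of the named methods and the sensitivity and specificity criteria concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each anchor item has a difficulty difference d (new-form minus old-form estimate, in logits), that robust z takes the commonly used form (d − median) / (0.74 × IQR), and that the logit difference method flags items whose d lies more than a fixed threshold from the mean difference. The cutoffs 2.7 and 0.3, the function names, and the data are illustrative assumptions. The t-test method also requires item standard errors and is omitted here.

```python
from statistics import median, quantiles


def robust_z(diffs):
    """Robust z per item: (d - median) / (0.74 * IQR). Assumed form."""
    q1, _, q3 = quantiles(diffs, n=4, method="inclusive")
    scale = 0.74 * (q3 - q1)
    if scale == 0:
        raise ValueError("IQR is zero; robust z is undefined")
    med = median(diffs)
    return [(d - med) / scale for d in diffs]


def flag_robust_z(diffs, critical=2.7):
    """Flag items whose |robust z| exceeds an assumed critical value."""
    return [abs(z) > critical for z in robust_z(diffs)]


def flag_logit_difference(diffs, threshold=0.3):
    """Flag items whose difference deviates from the mean by more than
    an assumed threshold (in logits)."""
    mean_d = sum(diffs) / len(diffs)
    return [abs(d - mean_d) > threshold for d in diffs]


def sensitivity_specificity(flags, truth):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(f and t for f, t in zip(flags, truth))
    fn = sum((not f) and t for f, t in zip(flags, truth))
    tn = sum((not f) and (not t) for f, t in zip(flags, truth))
    fp = sum(f and (not t) for f, t in zip(flags, truth))
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical difficulty differences for 10 anchor items; the last one
# is simulated as a drifted (outlier) item.
diffs = [0.05, -0.10, 0.02, 0.08, -0.04, 0.00, 0.11, -0.07, 0.03, 0.90]
truth = [False] * 9 + [True]

rz_flags = flag_robust_z(diffs)
ld_flags = flag_logit_difference(diffs)
print(sensitivity_specificity(rz_flags, truth))
print(sensitivity_specificity(ld_flags, truth))
```

In this sketch the mean used by the logit difference rule is itself pulled toward the drifted item, whereas the median and IQR used by robust z are not; that contrast is one reason robust statistics are considered for anchor screening.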

Author information

Corresponding author

Correspondence to Chunyan Liu.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, C., Jurich, D. (2021). Comparison of Outlier Detection Methods in NEAT Design. In: Wiberg, M., Molenaar, D., González, J., Böckenholt, U., Kim, JS. (eds) Quantitative Psychology. IMPS 2020. Springer Proceedings in Mathematics & Statistics, vol 353. Springer, Cham. https://doi.org/10.1007/978-3-030-74772-5_20
