Abstract
In equating practice, outliers among the anchor items can degrade equating accuracy and threaten the validity of test scores. This simulation study compared the performance of three outlier detection methods in equating: the t-test method, the logit difference method, and the robust z statistic. The factors investigated were sample size, proportion of outliers, direction of item difficulty drift, and group difference. Overall, across all simulated conditions, the t-test method outperformed the other methods in sensitivity (flagging true outliers), specificity (leaving true non-outliers unflagged), bias of the translation constant, and root mean square error of the estimated examinee ability.
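As a rough illustration of one of the compared methods and two of the evaluation criteria, the sketch below computes a robust z statistic for anchor-item difficulty differences and then scores the resulting flags by sensitivity and specificity. It is not the authors' implementation. The robust z is written in one common form, (d - median) / (0.74 × IQR). The difficulty differences, the true-outlier labels, and the 2.7 cutoff are all hypothetical values chosen for the example.

```python
import numpy as np

def robust_z(d):
    """Robust z for anchor-item difficulty differences (new form minus old form).

    Uses one common formulation: (d - median) / (0.74 * IQR).
    """
    d = np.asarray(d, dtype=float)
    q1, med, q3 = np.percentile(d, [25, 50, 75])
    return (d - med) / (0.74 * (q3 - q1))

def sensitivity_specificity(flagged, truth):
    """Sensitivity: share of true outliers that were flagged.
    Specificity: share of true non-outliers that were left unflagged."""
    flagged = np.asarray(flagged, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    true_pos = np.sum(flagged & truth)
    true_neg = np.sum(~flagged & ~truth)
    return true_pos / truth.sum(), true_neg / (~truth).sum()

# Hypothetical difficulty differences (logits) for 10 anchor items;
# item 9 (index 8) is simulated as a drifted item.
diffs = np.array([0.05, -0.10, 0.02, 0.08, -0.04, 0.00, -0.06, 0.03, 0.90, -0.02])
truth = np.array([False] * 8 + [True] + [False])

CUTOFF = 2.7  # assumed critical value, for illustration only
z = robust_z(diffs)
flagged = np.abs(z) > CUTOFF
sens, spec = sensitivity_specificity(flagged, truth)
print(flagged, sens, spec)
```

A stricter or looser cutoff trades sensitivity against specificity, which is why both criteria are reported together when detection methods are compared.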
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, C., Jurich, D. (2021). Comparison of Outlier Detection Methods in NEAT Design. In: Wiberg, M., Molenaar, D., González, J., Böckenholt, U., Kim, JS. (eds) Quantitative Psychology. IMPS 2020. Springer Proceedings in Mathematics & Statistics, vol 353. Springer, Cham. https://doi.org/10.1007/978-3-030-74772-5_20
DOI: https://doi.org/10.1007/978-3-030-74772-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-74771-8
Online ISBN: 978-3-030-74772-5
eBook Packages: Mathematics and Statistics (R0)