Abstract
With the increased use of constructed response items in large-scale assessments, the cost of scoring has become a major consideration (Noh et al. in KICE Report RRE 2012-6, 2012; Wainer and Thissen in Applied Measurement in Education 6:103–118, 1993). In response to these cost concerns, various automated systems for scoring constructed response items have been developed and deployed. The purpose of this research is to provide a comprehensive analysis of the generalizability of automated scoring results and to compare it with the generalizability of scores produced by human raters. The results support the argument that the automated scoring system produces outcomes nearly as reliable as those of human scoring. Based on these findings, automated scoring appears to be a promising alternative to human scoring, particularly for short factual answer items.
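The comparison described above rests on generalizability (G) theory: variance in observed scores is decomposed into components for persons, raters, and their interaction, and a generalizability coefficient is computed from those components. As an illustrative sketch only (the study itself used a multivariate design fitted with mGENOVA; the function name and the toy score matrix below are invented for the example), the following Python snippet estimates variance components for a simple fully crossed persons × raters design via the classic ANOVA expected-mean-squares method:

```python
import numpy as np

def g_study(scores, n_raters_d=None):
    """Estimate variance components for a fully crossed persons x raters
    (p x r) design and the relative generalizability coefficient.

    scores: 2-D array, rows = persons, columns = raters.
    n_raters_d: number of raters assumed in the decision study
                (defaults to the number observed).
    """
    n_p, n_r = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)
    r_means = scores.mean(axis=0)

    # Sums of squares for the two main effects and the residual.
    ss_p = n_r * ((p_means - grand) ** 2).sum()
    ss_r = n_p * ((r_means - grand) ** 2).sum()
    ss_pr = ((scores - grand) ** 2).sum() - ss_p - ss_r  # interaction + error

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    # Expected-mean-squares estimators; negative estimates truncated at 0.
    var_pr = ms_pr
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)

    # Relative G coefficient: only the p x r component contributes to
    # relative error for a crossed design.
    n_d = n_raters_d or n_r
    g_coef = var_p / (var_p + var_pr / n_d)
    return {"var_p": var_p, "var_r": var_r, "var_pr": var_pr, "g_coef": g_coef}
```

Running `g_study` separately on a matrix of human ratings and a matrix that substitutes the automated score for one rater, then comparing the resulting G coefficients, mirrors in miniature the kind of comparison the paper reports at full scale.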
References
Baxter, G. P., Shavelson, R. J., Goldman, S. R., & Pine, J. (1992). Evaluation of procedure-based scoring for hands-on science assessment. Journal of Educational Measurement, 29(1), 1–17.
Bejar, I. I. (2011). A validity-based approach to quality control and assurance of automated scoring. Assessment in Education: Principles, Policy and Practice, 18(3), 319–341.
Brennan, R. L. (2001a). Generalizability theory. New York: Springer.
Brennan, R. L. (2001b). Manual for mGENOVA (Version 2.1). Iowa City, IA: Iowa Testing Programs, University of Iowa.
Brennan, R. L., Gao, X., & Colton, D. A. (1995). Generalizability analyses of work keys listening and writing tests. Educational and Psychological Measurement, 55, 157–176.
Clauser, B. E., Swanson, D. B., & Clyman, S. G. (1999). A comparison of the generalizability of scores produced by expert raters and automated scoring systems. Applied Measurement in Education, 12, 281–299.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Custer, M., Sharairi, S., & Swift, D. (2012). A comparison of scoring options for omitted and not-reached items through the recovery of IRT parameters when utilizing the Rasch model and joint maximum likelihood estimation. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, British Columbia.
Gerard, L. F., & Linn, M. C. (2016). Using automated scores of student essays to support teacher guidance in classroom inquiry. Journal of Science Teacher Education, 27(1), 111–129.
Hearst, M. A. (2000). The debate on automated essay grading. IEEE Intelligent Systems and their Applications, 15(5), 22–37.
Hui, S. K. F., Brown, G. T. L., & Chan, S. W. M. (2017). Assessment for learning and for accountability in classrooms: The experience of four Hong Kong primary school curriculum leaders. Asia Pacific Education Review, 18(1), 41–51.
Jeon, M., Lee, G., Hwang, J., & Kang, S. J. (2009). Estimating reliability of school-level scores using multilevel and generalizability theory models. Asia Pacific Education Review, 10(2), 149–158.
Karami, H. (2013). An investigation of the gender differential performance on a high-stakes language proficiency test in Iran. Asia Pacific Education Review, 14(3), 435–444.
Kim, S., & Feldt, L. S. (2010). The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics. Asia Pacific Education Review, 11(2), 179–188.
Kuechler, W. L., & Simkin, M. G. (2010). Why is performance on multiple-choice tests and constructed-response tests not more closely related? Theory and an empirical test. Decision Sciences Journal of Innovative Education, 8(1), 55–73.
Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and Humanities, 37(4), 389–405.
Lee, G., & Park, I. (2012). A comparison of the approaches of generalizability theory and item response theory in estimating the reliability of test scores for testlet-composed tests. Asia Pacific Education Review, 13(1), 47–54.
Livingston, S. A. (2009). Constructed-response test questions: Why we use them; how we score them. R & D connections. Retrieved from http://www.ets.org/Media/Research/pdf/RD_Connections11.pdf.
Noh, E. H., Kim, M. H., Sung, K. H., & Kim, H. S. (2013). Improvement and application of an automatic scoring program for short answer Korean items in large-scale assessments. KICE Report RRE 2013-5.
Noh, E. H., Sim, J. H., Kim, M. H., & Kim, J. H. (2012). Developing an automatic content scoring program for short answer Korean items in large-scale assessments. KICE Report RRE 2012-6.
Reckase, M. D. (1995). Portfolio assessment: A theoretical estimate of score reliability. Educational Measurement: Issues and Practice, 14(1), 12–14.
Shavelson, R. J., & Webb, N. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76.
Sukkarieh, J. Z., & Blackmore, J. (2009). C-rater: Automatic content scoring for short constructed responses. In Proceedings of the Twenty-Second International FLAIRS Conference (pp. 290–295).
Sukkarieh, J. Z., Pulman, S. G., & Raikes, N. (2003). Auto-marking: Using computational linguistics to score short, free text responses. Paper presented at the 29th annual conference of the International Association for Educational Assessment (IAEA), Manchester, UK.
Topol, B., Olson, J., & Roeber, E. (2011). The cost of new higher quality assessments: A comprehensive analysis of the potential costs for future state assessments. Stanford, CA: Stanford Center for Opportunity Policy in Education.
Topol, B., Olson, J., & Roeber, E. (2014). Pricing study: Machine scoring of student essays. Retrieved from http://cdno4.gettingsmart.com/wp-content/uploads/2014/02/ASAP-Pricing-Study-Final.pdf.
Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6, 103–118.
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.
World Class Arena Limited (n.d.). Short answer marking engines. Retrieved from http://www.worldclassarena.net/doc/file5.pdf.
Yin, P. (2005). A multivariate generalizability analysis of the multistate bar examination. Educational and Psychological Measurement, 65, 668–686.
Zhang, M. (2013). Contrasting automated and human scoring of essays. R & D Connections. ETS. Retrieved from http://www.ets.org/Media/Research/pdf/RD_Connections_21.pdf.
Zuo, Y. (2007). A multivariate generalizability analysis of student style questionnaire. Unpublished thesis, University of Florida.
Cite this article
Sung, K.H., Noh, E.H. & Chon, K.H. Multivariate generalizability analysis of automated scoring for short answer items of social studies in large-scale assessment. Asia Pacific Educ. Rev. 18, 425–437 (2017). https://doi.org/10.1007/s12564-017-9498-1