Multivariate generalizability analysis of automated scoring for short answer items of social studies in large-scale assessment

Abstract

With the increased use of constructed-response items in large-scale assessments, the cost of scoring has become a major consideration (Noh et al. in KICE Report RRE 2012-6, 2012; Wainer and Thissen in Applied Measurement in Education 6:103–118, 1993). In response to these cost concerns, various automated systems for scoring constructed-response items have been developed and deployed. The purpose of this research is to provide a comprehensive analysis of the generalizability of automated scoring results and to compare it with the generalizability of scores produced by human raters. The results provide evidence that the automated scoring system yields outcomes nearly as reliable as those produced by human scoring. Based on these findings, the automated scoring system appears to be a promising alternative to human scoring, particularly for short factual-answer items.
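
For readers unfamiliar with generalizability (G) theory, the sketch below shows how variance components and a generalizability coefficient are estimated for a simple crossed persons × raters design. This is a minimal univariate illustration of the technique only, using simulated data: the 10 × 3 design, the score values, and all variable names are illustrative assumptions, and it is not the study's actual multivariate analysis (which, per the reference list, would typically be carried out with software such as Brennan's mGENOVA).

import numpy as np

# Hypothetical data: 10 examinees (rows) scored by 3 raters (columns).
# Simulated scores stand in for ratings of a short-answer item.
rng = np.random.default_rng(0)
true_scores = rng.normal(loc=5.0, scale=1.5, size=(10, 1))
scores = true_scores + rng.normal(loc=0.0, scale=0.8, size=(10, 3))

n_p, n_r = scores.shape                 # persons, raters
grand_mean = scores.mean()
person_means = scores.mean(axis=1)
rater_means = scores.mean(axis=0)

# Sums of squares for the crossed p x r design.
ss_p = n_r * np.sum((person_means - grand_mean) ** 2)
ss_r = n_p * np.sum((rater_means - grand_mean) ** 2)
ss_total = np.sum((scores - grand_mean) ** 2)
ss_pr = ss_total - ss_p - ss_r          # interaction confounded with error

# Mean squares and expected-mean-square solutions for variance components.
ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
var_pr = ms_pr                          # sigma^2(pr,e)
var_p = max((ms_p - ms_pr) / n_r, 0.0)  # sigma^2(p)
var_r = max((ms_r - ms_pr) / n_p, 0.0)  # sigma^2(r)

# Generalizability (relative) and dependability (absolute) coefficients
# for a mean score taken over n_r raters.
g_coef = var_p / (var_p + var_pr / n_r)
phi_coef = var_p / (var_p + (var_r + var_pr) / n_r)

print(f"sigma^2(p)={var_p:.3f}  sigma^2(r)={var_r:.3f}  sigma^2(pr,e)={var_pr:.3f}")
print(f"G coefficient = {g_coef:.3f}, Phi coefficient = {phi_coef:.3f}")

In this framework, comparing automated and human scoring amounts to comparing the estimated variance components and resulting coefficients when the rater facet is the automated system versus a panel of human raters. A multivariate generalizability analysis extends this building block by treating several fixed variables jointly and estimating covariance as well as variance components.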

References

  • Baxter, G. P., Shavelson, R. J., Goldman, S. R., & Pine, J. (1992). Evaluation of procedure-based scoring for hands-on science assessment. Journal of Educational Measurement, 29(1), 1–17.

  • Bejar, I. I. (2011). A validity-based approach to quality control and assurance of automated scoring. Assessment in Education: Principles, Policy and Practice, 18(3), 319–341.

  • Brennan, R. L. (2001a). Generalizability theory. New York: Springer.

  • Brennan, R. L. (2001b). Manual for mGENOVA (Version 2.1). Iowa City, IA: Iowa Testing Programs, University of Iowa.

  • Brennan, R. L., Gao, X., & Colton, D. A. (1995). Generalizability analyses of Work Keys listening and writing tests. Educational and Psychological Measurement, 55, 157–176.

  • Clauser, B. E., Swanson, D. B., & Clyman, S. G. (1999). A comparison of the generalizability of scores produced by expert raters and automated scoring systems. Applied Measurement in Education, 12, 281–299.

  • Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.

  • Custer, M., Sharairi, S., & Swift, D. (2012). A comparison of scoring options for omitted and not-reached items through the recovery of IRT parameters when utilizing the Rasch model and joint maximum likelihood estimation. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, British Columbia.

  • Gerard, L. F., & Linn, M. C. (2016). Using automated scores of student essays to support teacher guidance in classroom inquiry. Journal of Science Teacher Education, 27(1), 111–129.

  • Hearst, M. A. (2000). The debate on automated essay grading. IEEE Intelligent Systems and their Applications, 15(5), 22–37.

  • Hui, S. K. F., Brown, G. T. L., & Chan, S. W. M. (2017). Assessment for learning and for accountability in classrooms: The experience of four Hong Kong primary school curriculum leaders. Asia Pacific Education Review, 18(1), 41–51.

  • Jeon, M., Lee, G., Hwang, J., & Kang, S. J. (2009). Estimating reliability of school-level scores using multilevel and generalizability theory models. Asia Pacific Education Review, 10(2), 149–158.

  • Karami, H. (2013). An investigation of the gender differential performance on a high-stakes language proficiency test in Iran. Asia Pacific Education Review, 14(3), 435–444.

  • Kim, S., & Feldt, L. S. (2010). The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics. Asia Pacific Education Review, 11(2), 179–188.

  • Kuechler, W. L., & Simkin, M. G. (2010). Why is performance on multiple-choice tests and constructed-response tests not more closely related? Theory and an empirical test. Decision Sciences Journal of Innovative Education, 8(1), 55–73.

  • Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4), 389–405.

  • Lee, G., & Park, I. (2012). A comparison of the approaches of generalizability theory and item response theory in estimating the reliability of test scores for testlet-composed tests. Asia Pacific Education Review, 13(1), 47–54.

  • Livingston, S. A. (2009). Constructed-response test questions: Why we use them; how we score them. R&D Connections. Retrieved from http://www.ets.org/Media/Research/pdf/RD_Connections11.pdf.

  • Noh, E. H., Kim, M. H., Sung, K. H., & Kim, H. S. (2013). Improvement and application of an automatic scoring program for short answer of Korean items in large-scale assessments. KICE Report RRE 2013-5.

  • Noh, E. H., Sim, J. H., Kim, M. H., & Kim, J. H. (2012). Developing an automatic content scoring program for short answer Korean items in large-scale assessments. KICE Report RRE 2012-6.

  • Reckase, M. D. (1995). Portfolio assessment: A theoretical estimate of score reliability. Educational Measurement: Issues and Practice, 14(1), 12–14.

  • Shavelson, R. J., & Webb, N. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.

  • Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76.

  • Sukkarieh, J. Z., & Blackmore, J. (2009). C-rater: Automatic content scoring for short constructed response. In Proceedings of the Twenty-Second International FLAIRS Conference (pp. 290–295).

  • Sukkarieh, J. Z., Pulman, S. G., & Raikes, N. (2003). Auto-marking: Using computational linguistics to score short, free text responses. Paper presented at the 29th annual conference of the International Association for Educational Assessment (IAEA), Manchester, UK.

  • Topol, B., Olson, J., & Roeber, E. (2011). The cost of new higher quality assessments: A comprehensive analysis of the potential costs for future state assessments. Stanford, CA: Stanford Center for Opportunity Policy in Education.

  • Topol, B., Olson, J., & Roeber, E. (2014). Pricing study: Machine scoring of student essays. Retrieved from http://cdno4.gettingsmart.com/wp-content/uploads/2014/02/ASAP-Pricing-Study-Final.pdf.

  • Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6, 103–118.

  • Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.

  • World Class Arena Limited (n.d.). Short answer marking engines. Retrieved from http://www.worldclassarena.net/doc/file5.pdf.

  • Yin, P. (2005). A multivariate generalizability analysis of the multistate bar examination. Educational and Psychological Measurement, 65, 668–686.

  • Zhang, M. (2013). Contrasting automated and human scoring of essays. R&D Connections. ETS. Retrieved from http://www.ets.org/Media/Research/pdf/RD_Connections_21.pdf.

  • Zuo, Y. (2007). A multivariate generalizability analysis of student style questionnaire. Unpublished thesis, University of Florida.

Author information

Correspondence to Kyong Hee Chon.

About this article

Cite this article

Sung, K.H., Noh, E.H. & Chon, K.H. Multivariate generalizability analysis of automated scoring for short answer items of social studies in large-scale assessment. Asia Pacific Educ. Rev. 18, 425–437 (2017). https://doi.org/10.1007/s12564-017-9498-1

