Abstract
With the increased use of constructed response items in large-scale assessments, the cost of scoring has become a major consideration (Noh et al. in KICE Report RRE 2012-6, 2012; Wainer and Thissen in Applied Measurement in Education 6:103–118, 1993). In response to these cost concerns, various automated systems for scoring constructed response items have been developed and deployed. The purpose of this research is to provide a comprehensive analysis of the generalizability of automated scoring results and to compare it with the generalizability of scores produced by human raters. The results support the argument that the automated scoring system produces outcomes nearly as reliable as those of human scoring. Based on these findings, automated scoring appears to be a promising alternative to human scoring, particularly for short factual answer items.
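The comparison described above rests on generalizability (G) theory: variance in observed scores is decomposed into components for persons, raters, and their interaction, and a generalizability coefficient is computed from those components. As an illustrative sketch only (the study itself used a multivariate design fitted with mGENOVA; the function name and the toy score matrix below are invented for the example), the following Python snippet estimates variance components for a simple fully crossed persons × raters design via the classic ANOVA expected-mean-squares method:

```python
import numpy as np

def g_study(scores, n_raters_d=None):
    """Estimate variance components for a fully crossed persons x raters
    (p x r) design and the relative generalizability coefficient.

    scores: 2-D array, rows = persons, columns = raters.
    n_raters_d: number of raters assumed in the decision study
                (defaults to the number observed).
    """
    n_p, n_r = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)
    r_means = scores.mean(axis=0)

    # Sums of squares for the two main effects and the residual.
    ss_p = n_r * ((p_means - grand) ** 2).sum()
    ss_r = n_p * ((r_means - grand) ** 2).sum()
    ss_pr = ((scores - grand) ** 2).sum() - ss_p - ss_r  # interaction + error

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))

    # Expected-mean-squares estimators; negative estimates truncated at 0.
    var_pr = ms_pr
    var_p = max((ms_p - ms_pr) / n_r, 0.0)
    var_r = max((ms_r - ms_pr) / n_p, 0.0)

    # Relative G coefficient: only the p x r component contributes to
    # relative error for a crossed design.
    n_d = n_raters_d or n_r
    g_coef = var_p / (var_p + var_pr / n_d)
    return {"var_p": var_p, "var_r": var_r, "var_pr": var_pr, "g_coef": g_coef}
```

Running `g_study` separately on a matrix of human ratings and a matrix that substitutes the automated score for one rater, then comparing the resulting G coefficients, mirrors in miniature the kind of comparison the paper reports at full scale.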
References
Baxter, G. P., Shavelson, R. J., Goldman, S. R., & Pine, J. (1992). Evaluation of procedure-based scoring for hands-on science assessment. Journal of Educational Measurement, 29(1), 1–17.
Bejar, I. I. (2011). A validity-based approach to quality control and assurance of automated scoring. Assessment in Education: Principles, Policy and Practice, 18(3), 319–341.
Brennan, R. L. (2001a). Generalizability theory. New York: Springer.
Brennan, R. L. (2001b). Manual for mGENOVA (Version 2.1). Iowa City, IA: Iowa Testing Programs, University of Iowa.
Brennan, R. L., Gao, X., & Colton, D. A. (1995). Generalizability analyses of work keys listening and writing tests. Educational and Psychological Measurement, 55, 157–176.
Clauser, B. E., Swanson, D. B., & Clyman, S. G. (1999). A comparison of the generalizability of scores produced by expert raters and automated scoring systems. Applied Measurement in Education, 12, 281–299.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Custer, M., Sharairi, S., & Swift, D. (2012). A comparison of scoring options for omitted and not-reached items through the recovery of IRT parameters when utilizing the Rasch model and joint maximum likelihood estimation. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, British Columbia.
Gerard, L. F., & Linn, M. C. (2016). Using automated scores of student essays to support teacher guidance in classroom inquiry. Journal of Science Teacher Education, 27(1), 111–129.
Hearst, M. A. (2000). The debate on automated essay grading. IEEE Intelligent Systems and their Applications, 15(5), 22–37.
Hui, S. K. F., Brown, G. T. L., & Chan, S. W. M. (2017). Assessment for learning and for accountability in classrooms: The experience of four Hong Kong primary school curriculum leaders. Asia Pacific Education Review, 18(1), 41–51.
Jeon, M., Lee, G., Hwang, J., & Kang, S. J. (2009). Estimating reliability of school-level scores using multilevel and generalizability theory models. Asia Pacific Education Review, 10(2), 149–158.
Karami, H. (2013). An investigation of the gender differential performance on a high-stakes language proficiency test in Iran. Asia Pacific Education Review, 14(3), 435–444.
Kim, S., & Feldt, L. S. (2010). The estimation of the IRT reliability coefficient and its lower and upper bounds, with comparisons to CTT reliability statistics. Asia Pacific Education Review, 11(2), 179–188.
Kuechler, W. L., & Simkin, M. G. (2010). Why is performance on multiple-choice tests and constructed-response tests not more closely related? Theory and an empirical test. Decision Sciences Journal of Innovative Education, 8(1), 55–73.
Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and Humanities, 37(4), 389–405.
Lee, G., & Park, I. (2012). A comparison of the approaches of generalizability theory and item response theory in estimating the reliability of test scores for testlet-composed tests. Asia Pacific Education Review, 13(1), 47–54.
Livingston, S. A. (2009). Constructed-response test questions: Why we use them; how we score them. R & D connections. Retrieved from http://www.ets.org/Media/Research/pdf/RD_Connections11.pdf.
Noh, E. H., Kim, M. H., Sung, K. H., & Kim, H. S. (2013). Improvement and application of an automatic scoring program for short answer Korean items in large-scale assessments. KICE Report RRE 2013-5.
Noh, E. H., Sim, J. H., Kim, M. H., & Kim, J. H. (2012). Developing an automatic content scoring program for short answer Korean items in large-scale assessments. KICE Report RRE 2012-6.
Reckase, M. D. (1995). Portfolio assessment: A theoretical estimate of score reliability. Educational Measurement: Issues and Practice, 14(1), 12–14.
Shavelson, R. J., & Webb, N. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Shermis, M. D. (2014). State-of-the-art automated essay scoring: Competition, results, and future directions from a United States demonstration. Assessing Writing, 20, 53–76.
Sukkarieh, J. Z., & Blackmore, J. (2009). C-rater: Automatic content scoring for short constructed responses. In Proceedings of the Twenty-Second International FLAIRS Conference (pp. 290–295).
Sukkarieh, J. Z., Pulman, S. G., & Raikes, N. (2003). Auto-marking: Using computational linguistics to score short, free text responses. Paper presented at the 29th annual conference of the International Association for Educational Assessment (IAEA), Manchester, UK.
Topol, B., Olson, J., & Roeber, E. (2011). The cost of new higher quality assessments: A comprehensive analysis of the potential costs for future state assessments. Stanford, CA: Stanford Center for Opportunity Policy in Education.
Topol, B., Olson, J., & Roeber, E. (2014). Pricing study: Machine scoring of student essays. Retrieved from http://cdno4.gettingsmart.com/wp-content/uploads/2014/02/ASAP-Pricing-Study-Final.pdf.
Wainer, H., & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6, 103–118.
Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2–13.
World Class Arena Limited (n.d.). Short answer marking engines. Retrieved from http://www.worldclassarena.net/doc/file5.pdf.
Yin, P. (2005). A multivariate generalizability analysis of the multistate bar examination. Educational and Psychological Measurement, 65, 668–686.
Zhang, M. (2013). Contrasting automated and human scoring of essays. R & D Connections. ETS. Retrieved from http://www.ets.org/Media/Research/pdf/RD_Connections_21.pdf.
Zuo, Y. (2007). A multivariate generalizability analysis of student style questionnaire. Unpublished thesis, University of Florida.
Cite this article
Sung, K.H., Noh, E.H. & Chon, K.H. Multivariate generalizability analysis of automated scoring for short answer items of social studies in large-scale assessment. Asia Pacific Educ. Rev. 18, 425–437 (2017). https://doi.org/10.1007/s12564-017-9498-1