Automated Short Answer Scoring Using an Ensemble of Neural Networks and Latent Semantic Analysis Classifiers

Published in the International Journal of Artificial Intelligence in Education.

Abstract

We introduce a short answer scoring engine made up of an ensemble of deep neural networks and a Latent Semantic Analysis-based model to score short constructed responses for a large suite of questions from a national assessment program. We evaluate the performance of the engine and show that it achieves above-human-level performance on a large set of items. Items are scored using 2-point and 3-point holistic rubrics. We outline the items, data, hand-scoring methods, engine, and results. We also provide an overview of performance for key student groups, including: gender, ethnicity, English language proficiency, disability status, and economically disadvantaged status.
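The abstract describes combining an ensemble of neural networks with an LSA-based classifier to produce a single holistic score. The paper does not specify the combination rule here; a common and minimal approach is to average each model's per-score-point probabilities and take the argmax. The sketch below illustrates that assumed scheme with hypothetical model outputs (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def ensemble_score(prob_lists):
    """Average per-model class probabilities and return the winning score point.

    prob_lists: one array per model, each holding that model's probability
    for score points 0..K on a holistic rubric.
    """
    mean_probs = np.mean(np.asarray(prob_lists), axis=0)
    return int(np.argmax(mean_probs))

# Hypothetical probabilities from two neural models and an LSA classifier
# for a 3-point rubric (score points 0, 1, 2).
nn_a = np.array([0.1, 0.3, 0.6])
nn_b = np.array([0.2, 0.5, 0.3])
lsa = np.array([0.1, 0.4, 0.5])
print(ensemble_score([nn_a, nn_b, lsa]))  # averaged probabilities favor score 2
```

Averaging probabilities (rather than majority-voting hard labels) lets a confident model outvote two weakly confident ones, which is one reason probability-level ensembling is a common default.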


Fig. 1


Data Availability

The data provided may contain personally identifiable information and is not publicly available.

Code Availability

The code used to model these responses is not generally available.

Notes

  1. https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/word_embeddings.ipynb

  2. https://nlp.stanford.edu/projects/glove/

  3. A limitation of the student and item-specific embeddings was that they were trained predominantly on a corpus containing many spelling mistakes. Approximately 90 thousand words in the 1.12 million-word vocabulary could be found in a large dictionary of correctly spelled words. Ad hoc inspection indicated that many incorrectly spelled words have a very high cosine similarity with their correctly spelled variants.
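Footnote 3 notes that misspelled words end up with very high cosine similarity to their correctly spelled variants in the learned embedding space. As a minimal sketch of the measurement being described (the vectors below are made-up 4-dimensional stand-ins, not the paper's actual embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings for a word and a frequent misspelling of it.
# Because both occur in near-identical contexts, their vectors are close.
correct = np.array([0.8, 0.1, 0.3, 0.5])
misspelled = np.array([0.79, 0.12, 0.28, 0.52])
print(cosine_similarity(correct, misspelled))  # close to 1.0
```

A similarity near 1.0 is why embeddings trained on noisy student text can remain usable: the model treats a common misspelling as nearly synonymous with the intended word.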


Author information


Corresponding author

Correspondence to Christopher Ormerod.

Ethics declarations

Conflicts of Interest/Competing Interests

Not Applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ormerod, C., Lottridge, S., Harris, A.E. et al. Automated Short Answer Scoring Using an Ensemble of Neural Networks and Latent Semantic Analysis Classifiers. Int J Artif Intell Educ 33, 467–496 (2023). https://doi.org/10.1007/s40593-022-00294-2
