Automated Short Answer Scoring Using an Ensemble of Neural Networks and Latent Semantic Analysis Classifiers

Published in the International Journal of Artificial Intelligence in Education.

Abstract

We introduce a short answer scoring engine made up of an ensemble of deep neural networks and a Latent Semantic Analysis-based model to score short constructed responses for a large suite of questions from a national assessment program. We evaluate the performance of the engine and show that it achieves above-human-level performance on a large set of items. Items are scored using 2-point and 3-point holistic rubrics. We outline the items, data, hand-scoring methods, engine, and results. We also provide an overview of performance for key student groups, including: gender, ethnicity, English language proficiency, disability status, and economically disadvantaged status.
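The abstract describes combining an ensemble of neural networks with an LSA-based classifier to produce a single holistic score. The paper does not specify the combination rule here; a common and minimal approach is to average each model's per-score-point probabilities and take the argmax. The sketch below illustrates that assumed scheme with hypothetical model outputs (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def ensemble_score(prob_lists):
    """Average per-model class probabilities and return the winning score point.

    prob_lists: one array per model, each holding that model's probability
    for score points 0..K on a holistic rubric.
    """
    mean_probs = np.mean(np.asarray(prob_lists), axis=0)
    return int(np.argmax(mean_probs))

# Hypothetical probabilities from two neural models and an LSA classifier
# for a 3-point rubric (score points 0, 1, 2).
nn_a = np.array([0.1, 0.3, 0.6])
nn_b = np.array([0.2, 0.5, 0.3])
lsa = np.array([0.1, 0.4, 0.5])
print(ensemble_score([nn_a, nn_b, lsa]))  # averaged probabilities favor score 2
```

Averaging probabilities (rather than majority-voting hard labels) lets a confident model outvote two weakly confident ones, which is one reason probability-level ensembling is a common default.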


Fig. 1


Data Availability

The data provided may contain personally identifiable information and is not publicly available.

Code Availability

The code used to model these responses is not generally available.

Notes

  1. https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/word_embeddings.ipynb

  2. https://nlp.stanford.edu/projects/glove/

  3. A limitation of the student and item-specific embeddings was that they were trained predominantly on a corpus containing many spelling mistakes. Approximately 90 thousand words in the 1.12 million-word vocabulary could be found in a large dictionary of correctly spelled words. Ad hoc inspection indicated that many incorrectly spelled words have a very high cosine similarity with their correctly spelled variants.
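Footnote 3 notes that misspelled words end up with very high cosine similarity to their correctly spelled variants in the learned embedding space. As a minimal sketch of the measurement being described (the vectors below are made-up 4-dimensional stand-ins, not the paper's actual embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings for a word and a frequent misspelling of it.
# Because both occur in near-identical contexts, their vectors are close.
correct = np.array([0.8, 0.1, 0.3, 0.5])
misspelled = np.array([0.79, 0.12, 0.28, 0.52])
print(cosine_similarity(correct, misspelled))  # close to 1.0
```

A similarity near 1.0 is why embeddings trained on noisy student text can remain usable: the model treats a common misspelling as nearly synonymous with the intended word.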


Author information


Corresponding author

Correspondence to Christopher Ormerod.

Ethics declarations

Conflicts of Interest/Competing Interests

Not Applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ormerod, C., Lottridge, S., Harris, A.E. et al. Automated Short Answer Scoring Using an Ensemble of Neural Networks and Latent Semantic Analysis Classifiers. Int J Artif Intell Educ 33, 467–496 (2023). https://doi.org/10.1007/s40593-022-00294-2
