A novel automated essay scoring approach for reliable higher educational assessments


E-learning is gaining prominence in higher education as universities expand their provision and more students enroll. Automated essay scoring (AES) therefore appeals strongly to universities as a way to manage growing demand and to reduce the costs associated with human raters. The growth of e-learning systems in higher education and the demand for consistent writing assessment have spurred research into improving the accuracy of AES systems. This paper presents a transformer-based neural network model for improved AES performance that combines a Bi-LSTM with the RoBERTa language model, using Kaggle’s ASAP dataset. The proposed model places a Bi-LSTM over the pre-trained RoBERTa language model to address essay coherence, which is ignored by traditional essay scoring methods, including traditional NLP pipelines, deep learning-based methods, and mixtures of both. A comparison of the model's essay scores with those of human raters shows that the proposed model outperforms existing methods in terms of quadratic weighted kappa (QWK). The comparative analysis demonstrates the applicability of the proposed model to automated essay scoring at the higher education level.
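For readers unfamiliar with the QWK metric named in the abstract, the following is a minimal sketch of how agreement between human and model scores can be computed. It is not the authors' evaluation code: the function name `quadratic_weighted_kappa` and the assumption that scores are integers in a known range `[min_score, max_score]` are illustrative choices made here.

```python
from collections import Counter
from typing import Sequence


def quadratic_weighted_kappa(
    human: Sequence[int], model: Sequence[int], min_score: int, max_score: int
) -> float:
    """Agreement between two integer score lists; larger gaps are penalised quadratically."""
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n_levels = max_score - min_score + 1
    if n_levels < 2:
        raise ValueError("need at least two score levels")
    n = len(human)

    # Observed joint counts and each rater's marginal counts (assumed setup).
    observed = Counter(zip(human, model))
    human_hist = Counter(human)
    model_hist = Counter(model)

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / (n_levels - 1) ** 2
            expected = human_hist[i] * model_hist[j] / n
            numerator += weight * observed[(i, j)]
            denominator += weight * expected

    if denominator == 0.0:
        # Both raters assigned the same single score to every essay.
        return 1.0
    return 1.0 - numerator / denominator
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. For example, with human scores `[1, 2, 3, 4]` and model scores `[1, 2, 3, 3]` on a 1 to 4 scale, the sketch returns 0.875.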





Author information



Corresponding author

Correspondence to Majdi Beseiso.


About this article


Cite this article

Beseiso, M., Alzubi, O.A. & Rashaideh, H. A novel automated essay scoring approach for reliable higher educational assessments. J Comput High Educ (2021). https://doi.org/10.1007/s12528-021-09283-1



  • Automated essay scoring (AES)
  • Deep learning
  • Essay scoring
  • Long short-term memory
  • Neural network
  • Transfer learning