
The use of semantic similarity tools in automated content scoring of fact-based essays written by EFL learners


This study searched for open-source semantic similarity tools and evaluated their effectiveness in automated content scoring of fact-based essays written by English-as-a-Foreign-Language (EFL) learners. Fifty writing samples were collected from a fact-based writing task in an academic English course at a Japanese university, and a native English-speaking expert produced a gold-standard sample. Six shortlisted tools (InferSent, spaCy, DKPro, ADW, SEMILAR and Latent Semantic Analysis) generated semantic similarity scores between each student sample and the expert sample. Three teachers, all lecturers on the course, manually graded the student samples on content. To ensure the validity of the human grades, samples on which the raters' grades were discrepant were excluded, and inter-rater reliability on the remaining samples was tested with quadratic weighted kappa. Once the grades of the remaining samples had been validated, a Pearson correlation analysis between the semantic similarity scores and the human grades showed that InferSent was the most effective tool in predicting the human grades. The study further pointed to the limitations of the six tools and suggested three alternatives to traditional methods of turning semantic similarity scores into reported grades on content.
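The two statistics named in the abstract can be illustrated with a minimal sketch. It assumes integer grades on a fixed scale and computes quadratic weighted kappa for one pair of raters; how the study combined its three raters is not stated here. The function names and all sample numbers are hypothetical and are not the study's data or code.

```python
from math import sqrt


def quadratic_weighted_kappa(rater_a, rater_b, min_grade, max_grade):
    """Agreement between two raters, penalising disagreements by squared distance.

    Assumes at least two grade levels and that the ratings are not all
    concentrated in one category (otherwise the denominator is zero).
    """
    n = max_grade - min_grade + 1
    total = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_grade][b - min_grade] += 1

    # Marginal grade counts for each rater.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            expected = hist_a[i] * hist_b[j] / total  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical grades (1-5 scale) and similarity scores, for illustration only.
grades_rater_1 = [3, 4, 2, 5, 4]
grades_rater_2 = [3, 4, 3, 5, 4]
similarity_scores = [0.61, 0.74, 0.48, 0.90, 0.70]

kappa = quadratic_weighted_kappa(grades_rater_1, grades_rater_2, 1, 5)
r = pearson_r(similarity_scores, grades_rater_1)
print(f"QWK = {kappa:.3f}, Pearson r = {r:.3f}")
```

In this sketch a kappa near 1 indicates strong agreement between the two raters, and a higher r indicates that a tool's similarity scores track the human grades more closely.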



Data Availability

Data will be made available upon reasonable request.




The author would like to thank the three raters in this study for their kind assistance in grading student writing samples.

Author information

Authors and Affiliations



Not applicable.

Corresponding author

Correspondence to Qiao Wang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Appendix 1 The writing task

Fig. 3 The writing task in the textbook (page 1)

Fig. 4 The writing task in the textbook (page 2)

Appendix 2 The gold standard sample for the writing task

Making ethanol and its by-products from corn involves ten steps. First, corn is delivered by rail or truck to the receiving building. There, rocks and other debris are removed from the corn, which is then stored in storage bins. Next, the corn is sent to a hammer mill or grinder where it is ground into a fine powder called meal. After that, the meal is delivered to a slurry tank. Recycled, fresh water is added to the tank to mix with the meal. Following this step, the meal is heated with steam in order to kill bacteria. The meal is then sent to a processing tank. In the processing tank, ammonia is added to control acidity and break down the meal into a liquid called mash. Next, the mash is cooled and transferred to fermenters. In the fermenters, ammonia is added again along with yeast in order to begin the process of turning sugar in the corn into ethanol and carbon dioxide. After this step, the product is transferred to another tank where the liquid ethanol is separated from the stillage. Now that the ethanol is 95% pure, a special system is used to remove any water. To complete the process, 5% gasoline is added to the ethanol.
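To make the idea of scoring a student sample against this gold standard concrete, the following is a minimal sketch that uses bag-of-words cosine similarity. This is a simple lexical stand-in chosen for illustration and is not the method of any of the six tools evaluated in the study. The function names and the student sentence are hypothetical; the gold sentence is taken from the sample above.

```python
import re
from collections import Counter
from math import sqrt


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))

    # Dot product over the words the two texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = sqrt(sum(c * c for c in counts_a.values()))
    norm_b = sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no similarity to anything
    return dot / (norm_a * norm_b)


# One sentence from the gold standard and a hypothetical student sentence.
gold = "First, corn is delivered by rail or truck to the receiving building."
student = "At first, trucks or trains bring the corn to a building."

score = cosine_similarity(gold, student)
print(f"similarity = {score:.3f}")
```

A word-count measure like this ignores synonyms, paraphrase, and word order, so it only rewards shared vocabulary; tools built on sentence or word embeddings aim to capture similarity in meaning as well.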

Appendix 3 Percentage agreement of human grading results


Table 6 Percentage agreement of human grading results



Cite this article

Wang, Q. The use of semantic similarity tools in automated content scoring of fact-based essays written by EFL learners. Educ Inf Technol (2022).



  • Automated writing evaluation
  • Automated content scoring
  • Fact-based writing
  • EFL learners
  • Semantic similarity
  • Open-source semantic similarity tools