Assessing Students’ Use of Evidence and Organization in Response-to-Text Writing: Using Natural Language Processing for Rubric-Based Automated Scoring

  • Zahra Rahimi
  • Diane Litman
  • Richard Correnti
  • Elaine Wang
  • Lindsay Clare Matsumura


This paper presents an investigation of score prediction based on natural language processing for two targeted constructs within analytic text-based writing: 1) students’ effective use of evidence and 2) their organization of ideas and evidence in support of their claim. With the long-term goal of producing feedback for students and teachers, we designed a task-dependent model for each dimension that aligns with the scoring rubric and makes use of the source material. We believe the model will be meaningful and easy to interpret given the writing task. We used two datasets of essays written by students in grades 5–6 and 6–8. Our experimental results show that our task-dependent model (consistent with the rubric) performs as well as, if not better than, competitive baselines. We also show the potential generalizability of the rubric-based model by performing cross-corpus experiments. Finally, we show that the predictive utility of different feature groups in our rubric-based modeling approach is related to how much of a rubric’s criteria each feature group covers.


Automatic essay assessment · Analytical writing in response to text · Evidence · Organization · Task-dependent · Feedback · Natural language processing



This work was supported by the Learning Research and Development Center at the University of Pittsburgh.



Copyright information

© International Artificial Intelligence in Education Society 2017

Authors and Affiliations

  • Zahra Rahimi (1)
  • Diane Litman (1)
  • Richard Correnti (1)
  • Elaine Wang (1)
  • Lindsay Clare Matsumura (1)

  1. Learning Research and Development Center, Pittsburgh, USA
