Wise Crowd Content Assessment and Educational Rubrics

  • Rebecca J. Passonneau
  • Ananya Poddar
  • Gaurav Gite
  • Alisa Krivokapic
  • Qian Yang
  • Dolores Perin


Abstract

Development of reliable rubrics for educational intervention studies that address reading and writing skills is labor-intensive and could benefit from an automated approach. We compare a main ideas rubric used in a successful writing intervention study to a highly reliable wise-crowd content assessment method developed to evaluate machine-generated summaries. The ideas in the educational rubric were extracted from a source text that students were asked to summarize. The wise-crowd content assessment model is derived from summaries written by an independent group of proficient students who read the same source text and followed the same instructions to write their summaries. The resulting content model includes a ranking over the derived content units. All main ideas in the rubric appear prominently in the wise-crowd content model. We present two methods that automate the content assessment. Scores based on the wise-crowd content assessment, both manual and automated, have high correlations with the main ideas rubric. The automated content assessment methods have several advantages over related methods, including high correlations with corresponding manual scores, a need for only half a dozen models instead of hundreds, and interpretable scores that independently assess content quality and coverage.
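The abstract describes a content model in which units derived from a small pool of proficient ("wise crowd") summaries are ranked, and a student summary receives separate quality and coverage scores. The sketch below is an illustrative reconstruction of that style of scoring, not the authors' implementation: the function names are invented, and summaries are represented as pre-annotated sets of content-unit labels rather than the free text the paper actually processes.

```python
from collections import Counter

def build_content_model(model_summaries):
    """Build a wise-crowd content model: each content unit is weighted by
    how many model summaries express it, which induces the ranking over
    content units mentioned in the abstract.  Input is a simplification:
    each summary is a set of content-unit labels."""
    weights = Counter()
    for units in model_summaries:
        weights.update(set(units))
    return weights

def score_summary(weights, observed_units, avg_model_length=None):
    """Return (quality, coverage) for one student summary.

    quality  : observed weight / best weight achievable with the SAME
               number of units (did the student pick important ideas?)
    coverage : observed weight / best weight achievable with the AVERAGE
               number of units in a model summary (did the student
               express enough content?)
    """
    ranked = sorted(weights.values(), reverse=True)
    observed = sum(weights[u] for u in observed_units if u in weights)
    n = len(observed_units)
    max_quality = sum(ranked[:n]) or 1
    if avg_model_length is None:
        avg_model_length = len(ranked)
    max_coverage = sum(ranked[:avg_model_length]) or 1
    return observed / max_quality, observed / max_coverage
```

Separating the two normalizations is what makes the scores interpretable: a short summary of only top-ranked ideas scores high on quality but low on coverage, while a long summary of minor ideas shows the opposite pattern.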


Keywords: Automated content analysis · Writing intervention · Wise-crowd content assessment · Writing rubrics



Acknowledgments

This paper is an extended version of an oral presentation made at an NSF-funded workshop held May 7-8, 2015, entitled MARWiSE: Multidisciplinary Advances in Reading and Writing for Science Education (Award IIS-1455533). The authors thank members of the workshop for their constructive feedback. We also thank Weiwei Guo for input regarding his Weighted Matrix Factorization method, and his suggestions for related work. Finally, we thank three anonymous reviewers for their constructive criticism.



Copyright information

© International Artificial Intelligence in Education Society 2016

Authors and Affiliations

  • Rebecca J. Passonneau (1)
  • Ananya Poddar (1)
  • Gaurav Gite (1)
  • Alisa Krivokapic (1)
  • Qian Yang (2)
  • Dolores Perin (3)

  1. Columbia University, New York, USA
  2. Tsinghua University, Beijing, China
  3. Teachers College, Columbia University, New York, USA
