Wise Crowd Content Assessment and Educational Rubrics


Development of reliable rubrics for educational intervention studies that address reading and writing skills is labor-intensive, and could benefit from an automated approach. We compare a main ideas rubric used in a successful writing intervention study to a highly reliable wise-crowd content assessment method developed to evaluate machine-generated summaries. The ideas in the educational rubric were extracted from a source text that students were asked to summarize. The wise-crowd content assessment model is derived from summaries written by an independent group of proficient students who read the same source text, and followed the same instructions to write their summaries. The resulting content model includes a ranking over the derived content units. All main ideas in the rubric appear prominently in the wise-crowd content model. We present two methods that automate the content assessment. Scores based on the wise-crowd content assessment, both manual and automated, have high correlations with the main ideas rubric. The automated content assessment methods have several advantages over related methods, including high correlations with corresponding manual scores, a need for only half a dozen models instead of hundreds, and interpretable scores that independently assess content quality and coverage.
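To illustrate how a wise-crowd content model can yield separate quality and coverage scores, the following minimal Python sketch follows the pyramid method of Nenkova and Passonneau (2004) as it is commonly described. Each content unit is weighted by the number of model summaries that express it, and a student summary's raw score is normalized in two ways. The function names, the toy content-unit labels, and the exact normalizations are assumptions made for illustration and are not taken from the authors' implementation. The sketch also assumes that content units have already been identified and matched, which is the step that manual annotation or the automated methods would perform.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def build_content_model(model_summaries: List[Set[str]]) -> Dict[str, int]:
    """Weight each content unit by how many model summaries express it."""
    weights: Counter = Counter()
    for units in model_summaries:
        weights.update(set(units))  # a summary counts at most once per unit
    return dict(weights)


def ranked_units(weights: Dict[str, int]) -> List[str]:
    """Content units ordered from highest to lowest weight (ties by label)."""
    return sorted(weights, key=lambda u: (-weights[u], u))


def max_weight_sum(weights: Dict[str, int], n: int) -> int:
    """Largest total weight attainable with n distinct content units."""
    return sum(sorted(weights.values(), reverse=True)[:max(n, 0)])


def score_summary(
    weights: Dict[str, int], summary_units: Iterable[str], target_size: int
) -> Dict[str, float]:
    """Raw, quality, and coverage scores for one summary (illustrative only)."""
    units = set(summary_units)
    # Units that are absent from the content model contribute zero weight.
    raw = sum(weights.get(u, 0) for u in units)
    # Quality: compare with the best summary that has the same number of units.
    best_same_size = max_weight_sum(weights, len(units))
    # Coverage: compare with the best summary of a typical model-summary size.
    best_target = max_weight_sum(weights, target_size)
    return {
        "raw": float(raw),
        "quality": raw / best_same_size if best_same_size else 0.0,
        "coverage": raw / best_target if best_target else 0.0,
    }


# Hypothetical data: four model summaries, each reduced to content-unit labels.
models = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "C", "E"}, {"A", "B"}]
weights = build_content_model(models)
typical_size = round(sum(len(m) for m in models) / len(models))
print(ranked_units(weights))
print(score_summary(weights, {"A", "D"}, typical_size))
```

In this sketch a summary that expresses only a few highly weighted units scores high on quality but low on coverage, which is what makes the two scores independently interpretable. Coverage is left uncapped here, so it can exceed 1 when a summary expresses more units than `target_size`.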


  1. Personal communication with Perin.

  2. The guidelines at http://www1.ccls.columbia.edu/beck/DUC2006/2006-pyramid-guidelines.html were prepared for the 2006 Document Understanding Conference organized by NIST. Designers of approximately two dozen systems participated in the 2006 evaluation (Passonneau et al. 2006).

  3. It also produces consistent rankings of summarization systems given different sets of annotations, which is less relevant here (Passonneau 2010).

  4. The correlation of the comprehensive score from Passonneau et al. (2013) was 0.85, which has been corrected to 0.89.

  5. The peak results reported in Yang et al. (2016) rely on an earlier version of adw. The score correlation of 0.81 for Devel. 20 reported there is higher than the 0.78 we get here.

  6. Downloadable packages for PyrScore and peak will be available from the Columbia University Academic Commons, and The Pennsylvania State University Data Commons, where Passonneau has recently moved.


  1. Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. Journal of Technology, Learning, and Assessment, 4(3), 3–39.

  2. Bangert-Drowns, R.L., Hurley, M.M., & Wilkinson, B. (2004). The effects of school-based writing-to-learn interventions on academic achievement: A meta-analysis. Review of Educational Research, 74(1), 29–58.

  3. Beers, S.F., & Nagy, W.E. (2009). Syntactic complexity as a predictor of adolescent writing quality: Which measures? Which genre? Reading and Writing, 22, 185–200.

  4. Beers, S.F., & Nagy, W.E. (2011). Writing development in four genres from grades three to seven: syntactic complexity and genre differentiation. Reading and Writing, 24, 183–202.

  5. Beigman-Klebanov, B. (2015). Towards automated evaluation of writing along STEM-relevant dimensions. MARWiSE: Multidisciplinary Advances in Reading and Writing for Science Education Workshop. May 7-8, 2015, Columbia University.

  6. Beigman-Klebanov, B., Madnani, N., Burstein, J., & Somasundaran, S. (2014). Content importance models for scoring writing from sources. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 247–252.

  7. Berland, L.K., & McNeill, K.L. (2010). A learning progression for scientific argumentation: Understanding student work and designing supportive instructional contexts. Science Education, 94(5), 765–793.

  8. Brown, A.L., & Day, J.D. (1983). Macrorules for summarizing texts: The development of expertise. Journal of Verbal Learning and Verbal Behavior, 22, 1–14.

  9. Brown, A.L., Day, J.D., & Jones, R.S. (1983). The development of plans for summarizing texts. Child Development, 54, 968–979.

  10. Burstein, J., Elliot, N., & Molloy, H. (2016). Informing automated writing evaluation using the lens of genre: Two studies. CALICO Journal, 33(1).

  11. Burstein, J., Kukich, K., Wolff, S., Lu, C., & Chodorow, M. (1998). Enriching automated essay scoring using discourse marking. In Stede, M., Wanner, L., & Hovy, E. (Eds.) Workshop on Discourse Relations and Discourse Marking, pages 15–21. Association for Computational Linguistics.

  12. Butcher, K.R., & Kintsch, W. (2001). Support of content and rhetorical processes of writing: Effects on the writing process and the written product. Cognition and Instruction, 19(3), 277–322.

  13. Cicchetti, D.V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284–290.

  14. Corro, L.D., & Gemulla, R. (2013). ClausIE: Clause-based open information extraction. In Proceedings of the 22nd International Conference on World Wide Web (WWW ’13), pages 355–366.

  15. Day, J.D. (1986). Teaching summarization skills: Influences of student ability level and strategy difficulty. Cognition and Instruction, 3(3), 193–210.

  16. Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18(1), 7–24.

  17. Deane, P., Odendahl, N., Quinlan, T., Fowles, M., Welsh, C., & Bivens-Tatum, J. (2008). Cognitive models of writing: Writing proficiency as a complex integrated skill. Technical Report 2, ETS Research Report Series, Princeton, NJ.

  18. Fellbaum, C. (Ed.) (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

  19. Foltz, P.W., Laham, D., & Landauer, T.K. (1999). The Intelligent Essay Assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2).

  20. Garner, R. (1985). Text summarization deficiencies among older students: Awareness or production ability? American Educational Research Journal, 22(4), 549–560.

  21. Gerard, L.F., Ryoo, K., McElhaney, K.W., Liu, O.L., Rafferty, A.N., & Linn, M.C. (2016). Automated guidance for student inquiry. Journal of Educational Psychology, 108(1), 60–81.

  22. Gil, L., Bråten, I., Vidal-Abarca, E., & Strømsø, H.I. (2010). Understanding and integrating multiple science texts: Summary tasks are sometimes better than argument tasks. Reading Psychology, 31(1), 30–68.

  23. Gillespie, A., Graham, S., Kiuhara, S., & Hebert, M. (2014). High school teachers’ use of writing to support students’ learning: a national survey. Reading and Writing, 27(6), 1043–1072.

  24. Glymph, A. (2010). The nation’s report card: Reading 2009. Technical Report NCES 2010-458. Washington: National Center for Education Statistics (NCES).

  25. Glymph, A. (2013). The nation’s report card: Reading 2012. Technical Report NCES 2012-457. Washington: National Center for Education Statistics (NCES).

  26. Glymph, A., & Burg, S. (2013). The nation’s report card: A first look: 2013 mathematics and reading. Technical Report NCES 2014-451. Washington: National Center for Education Statistics (NCES).

  27. Graham, S., Capizzi, A., Harris, K., Hebert, M., & Morphy, P. (2014). Teaching writing to middle school students: a national survey. Reading and Writing, 27(6), 1015–1042.

  28. Graham, S., & Perin, D. (2007a). A meta-analysis of writing instruction for adolescent students. Journal of Educational Psychology, 99(3), 445–476.

  29. Graham, S., & Perin, D. (2007b). Writing next: Effective strategies to improve writing of adolescents in middle and high schools. New York: Technical report, Carnegie Corporation of New York.

  30. Guo, W., & Diab, M. (2012). Modeling sentences in the latent space. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), pages 864–872.

  31. Hand, B.M., Hohenshell, L., & Prain, V. (2004). Exploring students’ responses to conceptual questions when engaged with planned writing experiences: A study with year 10 science students. Journal of Research in Science Teaching, 41(2), 186–210.

  32. Hughes, S., Hastings, P., Magliano, J., Goldman, S., & Lawless, K. (2012). Automated approaches for detecting integration in student essays. In Automated approaches for detecting integration in student essays. Springer-Verlag.

  33. Johnson, R.E. (1970). Recall of prose as a function of the structural importance of the linguistic units. Journal of Verbal Learning and Verbal Behavior, 9(1), 12–20.

  34. Kellogg, R.T. (2008). Training writing skills: a cognitive development perspective. Journal of Writing Research, 1(1), 1–26.

  35. Kharkwal, G., & Muresan, S. (2014). Surprisal as a predictor of essay quality. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, pages 54–60.

  36. Kintsch, W., & van Dijk, T.A. (1978). Toward a model of text comprehension and production. Psychological Review, 85, 364–394.

  37. Klein, P.D., & Rose, M.A. (2010). Teaching argument and explanation to prepare junior students for writing to learn. Reading Research Quarterly, 45(4), 433–461.

  38. Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Beverly Hills: Sage Publications.

  39. Kuhn, H.W. (1955). The Hungarian Method for the assignment problem. Naval Research Logistics Quarterly, 2, 83–97.

  40. Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4), 389–405.

  41. Lin, C.-Y. (2004). ROUGE: a package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).

  42. Liu, O.L., Brew, C., Blackmore, J., Gerard, L.F., Madhok, J., & Linn, M.C. (2014). Automated scoring of constructed-response science items: Prospects and obstacles. Educational Measurement: Issues and Practice, 33(2), 19–28.

  43. Louis, A., & Nenkova, A. (2009). Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 306–314, Singapore. Association for Computational Linguistics.

  44. Louis, A., & Nenkova, A. (2013). Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2), 267–300.

  45. Ma, X., & Hovy, E.H. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). To Appear.

  46. Magliano, J.P., Trabasso, T., & Graesser, A.C. (1999). Strategic processing during comprehension. Journal of Educational Psychology, 91, 615–629.

  47. Mazzeo, C., Rab, S.Y., & Alssid, J.L. (2003). Building bridges to college and careers: Contextualized basic skills programs at community colleges. Technical report. Brooklyn: Workforce Strategy Center.

  48. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality.

  49. NCES (2012). The nation’s report card: Writing 2011.

  50. Nenkova, A., & Passonneau, R.J. (2004). Evaluating content selection in summarization: The pyramid method. In Dumais, S., Marcu, D., & Roukos, S. (Eds.) HLT-NAACL 2004: Main Proceedings, pages 145–152, Boston, Massachusetts, USA. Association for Computational Linguistics.

  51. Nenkova, A., Passonneau, R.J., & McKeown, K. (2007). The pyramid method: incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, 4(2).

  52. Norris, S.P., & Phillips, L.M. (2003). How literacy in its fundamental sense is central to scientific literacy. Science Education, 87(2), 224–240.

  53. Olinghouse, N.G., Graham, S., & Gillespie, A. (2015). The relationship of discourse and topic knowledge to fifth graders’ writing performance. Journal of Educational Psychology, 107(2), 391–406.

  54. Olinghouse, N.G., & Wilson, J. (2013). The relationship between vocabulary and writing quality in three genres. Reading and Writing, 26(1), 45–65.

  55. Owczarzak, K., Conroy, J.M., Dang, H.T., & Nenkova, A. (2012). An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, pages 1–9. Association for Computational Linguistics.

  56. Page, E.B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 47(5), 238–243.

  57. Page, E.B. (1968). The use of the computer in analyzing student essays. International Review of Education, 14, 210–225.

  58. Page, E.B. (1994). Computer grading of student prose, using modern concepts and software. The Journal of Experimental Education, 62(2), 127–142.

  59. Passonneau, R., & Carpenter, B. (2014). The benefits of a model of annotation. Transactions of the Association for Computational Linguistics, 2, 311–326.

  60. Passonneau, R.J. (2006). Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 831–836, Genoa, Italy.

  61. Passonneau, R.J. (2010). Formal and functional assessment of the pyramid method for summary content evaluation. Natural Language Engineering, 16, 107–131.

  62. Passonneau, R.J., Baker, C.F., Fellbaum, C., & Ide, N. (2012). The MASC word sense corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC ’12). European Language Resources Association (ELRA).

  63. Passonneau, R.J., Chen, E., Guo, W., & Perin, D. (2013). Automated pyramid scoring of summaries using distributional semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 143–147, Sofia, Bulgaria. Association for Computational Linguistics.

  64. Passonneau, R.J., Goodkind, A., & Levy, E. (2007). Annotation of children’s oral narrations: Modeling emergent narrative skills for computational applications. In Proceedings of the Twentieth Annual Meeting of the Florida Artificial Intelligence Research Society (FLAIRS-20), pages 253–258.

  65. Passonneau, R.J., McKeown, K., & Sigelman, S. (2006). Applying the pyramid method in the 2006 Document Understanding Conference. In Proceedings of the 2006 Workshop of the Document Understanding Conference (DUC).

  66. Passonneau, R.J., Nenkova, A., McKeown, K., & Sigelman, S. (2005). Applying the pyramid method in DUC 2005. In Proceedings of the 2005 Workshop of the Document Understanding Conference (DUC).

  67. Perin, D., Bork, R.H., Peverly, S.T., & Mason, L.H. (2013). A contextualized curricular supplement for developmental reading and writing. Journal of College Reading and Learning, 43(2), 8–38.

  68. Persky, H.R., Daane, M.C., & Jin, Y. (2003). The nation’s report card: Writing 2002. Technical Report NCES 2003-529. Washington: National Center for Education Statistics (NCES).

  69. Pilehvar, M.T., Jurgens, D., & Navigli, R. (2013). Align, disambiguate and walk: A unified approach for measuring semantic similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1341–1351, Sofia, Bulgaria. Association for Computational Linguistics.

  70. Proske, A., Narciss, S., & McNamara, D.S. (2012). Computer-based scaffolding to facilitate students’ development of expertise in academic writing. Journal of Research in Reading, 35(2), 136–152.

  71. Qazvinian, V., & Radev, D.R. (2012). A computational analysis of collective discourse. In Proceedings of the 2012 Conference on Collective Intelligence, Cambridge MA.

  72. Reiser, B.J., Berland, L.K., & Kenyon, L. (2012). Engaging students in the scientific practices of explanation and argumentation. Science and Children, 49(8), 8–13.

  73. Roscoe, R.D., Allen, L.K., Weston, J.L., Crossley, S.A., & McNamara, D.S. (2015a). The Writing Pal intelligent tutoring system: Usability testing and development. Computers and Composition, 34, 39–59.

  74. Roscoe, R.D., & McNamara, D.S. (2013). Writing Pal: Feasibility of an intelligent writing strategy tutor in the high school classroom. Journal of Educational Psychology, 105(4), 1–16.

  75. Roscoe, R.D., Snow, E.L., Allen, L.K., & McNamara, D.S. (2015b). Automated detection of essay revising patterns: applications for intelligent feedback in a writing tutor. Technology, Instruction, Cognition, and Learning, 10(1), 59–79.

  76. Rosé, C., & VanLehn, K. (2005). An evaluation of a hybrid language understanding approach for robust selection of tutoring goals. International Journal of Artificial Intelligence in Education, 15(4), 325.

  77. Rudner, L.M., Garcia, V., & Welch, C. (2006). An evaluation of IntelliMetric™ essay scoring system. The Journal of Technology, Learning, and Assessment, 4(4).

  78. Saggion, H., Torres-Moreno, J.-M., da Cunha, I., SanJuan, E., & Velázquez-Morales, P. (2010). Multilingual summarization evaluation without human models. In Proceedings of Coling 2010, pages 1059–1067.

  79. Sakai, S., Togasaki, M., & Yamazaki, K. (2003). A note on greedy algorithms for the maximum weighted independent set problem. Discrete Applied Mathematics, 126(2-3), 313–322.

  80. Salahu-Din, D., Persky, H.R., & Miller, J. (2008). The nation’s report card: Writing 2007. Technical Report NCES 2008-468. Washington: National Center for Education Statistics (NCES).

  81. Sampson, V., Enderle, P., Grooms, J., & Witte, S. (2013). Writing to learn by learning to write during the school science laboratory: Helping middle and high school students develop argumentative writing skills as they learn core ideas. Science Education, 97(5), 643–670.

  82. Shermis, M., & Burstein, J. (2013). Handbook of Automated Essay Evaluation: Current Applications and Future Directions. New York: Routledge.

  83. Slotta, J.D., & Linn, M.C. (2009). WISE Science. New York: Teachers College Press.

  84. Surowiecki, J. (2004). The Wisdom of Crowds. New York: Doubleday.

  85. Teufel, S., & van Halteren, H. (2004). Evaluating information content by factoid analysis: Human annotation and stability. In Lin, D., & Wu, D. (Eds.) Proceedings of EMNLP 2004, pages 419–426, Barcelona, Spain. Association for Computational Linguistics.

  86. Turner, A.A. (1987). The propositional analysis system. Technical Report 87-2, University of Colorado. Boulder: Department of Psychology and Institute of Cognitive Science.

  87. Turner, A.A., & Greene, E. (1978). The construction and use of a propositional analysis system. Technical Report JSAS Catalog of Selected Documents in Psychology, no. 1713. Washington, DC: American Psychological Association.

  88. van Dijk, T.A., & Kintsch, W. (1977). Cognitive psychology and discourse: Recalling and summarizing stories. In Dressler, W.U. (Ed.) Trends in text-linguistics, pages 61–80. De Gruyter, New York.

  89. van Dijk, T.A., & Kintsch, W. (1983). Strategies of Discourse Comprehension. New York: Academic Press.

  90. VanLehn, K., Jordan, P., & Rosé, C.P. (2002). The architecture of Why2-Atlas: a coach for qualitative physics essay writing. In Cerri, S.A., Gouarderes, G., & Paraguacu, F. (Eds.) Intelligent Tutoring Systems, 2002: 6th International Conference, pages 158–167, Berlin. Springer.

  91. Westby, C., Culatta, B., Lawrence, B., & Hall-Kenyon, K. (2010). Summarizing expository texts. Topics in Language Disorders, 30(4), 275–287.

  92. Wiemer-Hastings, P., Wiemer-Hastings, K., & Graesser, A.C. (1999). Improving an intelligent tutor’s comprehension of students with latent semantic analysis. In Lajoie, S.P., & Vivet, M. (Eds.) Artificial Intelligence in Education, pages 535–542. IOS Press, Amsterdam.

  93. Wiley, J., & Voss, J.F. (1996). The effects of playing historian on learning in history. Applied Cognitive Psychology, 10, 63–72.

  94. Yang, Q., Passonneau, R.J., & de Melo, G. (2016). PEAK: Pyramid evaluation via automated knowledge extraction. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI 2016). AAAI Press.

  95. Yore, L.D., Hand, B.M., & Florence, M.K. (2004). Scientists’ views of science, models of writing, and science writing practices. Journal of Research in Science Teaching, 41(4), 338–369.

  96. Zipf, G.K. (1949). Human Behaviour and the Principle of Least Effort. Cambridge: Addison-Wesley.



This paper is an extended version of an oral presentation made at an NSF-funded workshop held May 7-8, 2015 entitled MARWiSE: Multidisciplinary Advances in Reading and Writing for Science Education (Award IIS-1455533). The authors thank members of the workshop for their constructive feedback. We also thank Weiwei Guo for input regarding his Weighted Matrix Factorization method, and his suggestions for related work. Finally, we thank three anonymous reviewers for their constructive criticism.

Correspondence to Rebecca J. Passonneau.

Cite this article

Passonneau, R.J., Poddar, A., Gite, G. et al. Wise Crowd Content Assessment and Educational Rubrics. Int J Artif Intell Educ 28, 29–55 (2018). https://doi.org/10.1007/s40593-016-0128-6


  • Automated content analysis
  • Writing intervention
  • Wise-crowd content assessment
  • Writing rubrics