Development of reliable rubrics for educational intervention studies that address reading and writing skills is labor-intensive and could benefit from an automated approach. We compare a main ideas rubric used in a successful writing intervention study to a highly reliable wise-crowd content assessment method developed to evaluate machine-generated summaries. The ideas in the educational rubric were extracted from a source text that students were asked to summarize. The wise-crowd content assessment model is derived from summaries written by an independent group of proficient students who read the same source text and followed the same instructions to write their summaries. The resulting content model includes a ranking over the derived content units. All main ideas in the rubric appear prominently in the wise-crowd content model. We present two methods that automate the content assessment. Scores based on the wise-crowd content assessment, both manual and automated, have high correlations with the main ideas rubric. The automated content assessment methods have several advantages over related methods, including high correlations with corresponding manual scores, a need for only half a dozen model summaries instead of hundreds, and interpretable scores that independently assess content quality and coverage.
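The scoring idea behind the wise-crowd (pyramid-style) content model can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: content units are represented here as plain string identifiers, and the function names are hypothetical. Each content unit is weighted by how many wise-crowd summaries express it; a target summary's quality score normalizes its observed weight by the best score achievable with the same number of units, while its coverage score normalizes by the best score for an average-length model summary.

```python
from collections import Counter

def build_pyramid(model_summaries):
    """Weight each content unit by the number of wise-crowd (model)
    summaries that express it. Each summary is a set of unit ids."""
    weights = Counter()
    for units in model_summaries:
        weights.update(units)
    return weights

def pyramid_scores(target_units, weights, model_summaries):
    """Return (quality, coverage) for a target summary.

    quality  = observed weight / max weight attainable with the same
               number of units as the target expresses
    coverage = observed weight / max weight attainable with the average
               number of units in a model summary
    """
    observed = sum(weights.get(u, 0) for u in target_units)
    ranked = sorted(weights.values(), reverse=True)

    def max_weight(n):
        return sum(ranked[:n])

    quality = observed / max_weight(len(target_units)) if target_units else 0.0
    avg_n = round(sum(len(s) for s in model_summaries) / len(model_summaries))
    coverage = observed / max_weight(avg_n)
    return quality, coverage
```

For example, with model summaries {a,b,c}, {a,b}, {a,c,d}, a target expressing {a,b} captures the two heaviest units, so its quality is 1.0 while its coverage is lower (5/7), since an average-length model summary expresses three units. The actual method additionally requires annotating and aligning content units across summaries, which is what the two automated methods in the paper address.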
Personal communication with Perin.
The guidelines at http://www1.ccls.columbia.edu/beck/DUC2006/2006-pyramid-guidelines.html were prepared for the 2006 Document Understanding Conference organized by NIST. Designers of approximately two dozen systems participated in the 2006 evaluation (Passonneau et al. 2006).
It also produces consistent rankings of summarization systems given different sets of annotations, which is less relevant here (Passonneau 2010).
The correlation of the comprehensive score from Passonneau et al. (2013) was 0.85, which has been corrected to 0.89.
The PEAK results reported in Yang et al. (2016) rely on an earlier version of ADW. The score correlation of 0.81 for Devel. 20 reported there is higher than the 0.78 we get here.
Downloadable packages for PyrScore and PEAK will be available from the Columbia University Academic Commons, and the Pennsylvania State University Data Commons, where Passonneau has recently moved.
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater® V.2. Journal of Technology, Learning and Assessment, 4(3), 3–39.
Bangert-Drowns, R.L., Hurley, M.M., & Wilkinson, B. (2004). The effects of school-based writing-to-learn interventions on academic achievement: A meta-analysis. Review of Educational Research, 74(1), 29–58.
Beers, S.F., & Nagy, W.E. (2009). Syntactic complexity as a predictor of adolescent writing quality: Which measures? which genre? Reading and Writing, 22, 185–200.
Beers, S.F., & Nagy, W.E. (2011). Writing development in four genres from grades three to seven: syntactic complexity and genre differentiation. Reading and Writing, 24, 183–202.
Beigman-Klebanov, B. (2015). Towards automated evaluation of writing along STEM-relevant dimensions. MARWiSE: Multidisciplinary Advances in Reading and Writing for Science Education Workshop. May 7-8, 2015, Columbia University.
Beigman-Klebanov, B., Madnani, N., Burstein, J., & Somasundaran, S. (2014). Content importance models for scoring writing from sources. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 247–252.
Berland, L.K., & McNeill, K.L. (2010). A learning progression for scientific argumentation: Understanding student work and designing supportive instructional contexts. Science Education, 94(5), 765–793.
Brown, A.L., & Day, J.D. (1983). Macrorules for summarizing texts: The development of expertise. Journal of Verbal Learning and Verbal Behavior, 22, 1–14.
Brown, A.L., Day, J.D., & Jones, R.S. (1983). The development of plans for summarizing texts. Child Development, 54, 968–979.
Burstein, J., Elliot, N., & Molloy, H. (2016). Informing automated writing evaluation using the lens of genre: Two studies. CALICO Journal, 33(1).
Burstein, J., Kukich, K., Wolff, S., Lu, C., & Chodorow, M. (1998). Enriching automated essay scoring using discourse marking. In Stede, M., Wanner, L., & Hovy, E. (Eds.) Workshop on Discourse Relations and Discourse Marking, pages 15–21. Association for Computational Linguistics.
Butcher, K.R., & Kintsch, W. (2001). Support of content and rhetorical processes of writing: Effects on the writing process and the written product. Cognition and Instruction, 19(3), 277–322.
Cicchetti, D.V. (1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284–290.
Corro, L.D., & Gemulla, R. (2013). ClausIE: Clause-based open information extraction. In Proceedings of the 22nd International Conference on World Wide Web (WWW ’13), pages 355–366.
Day, J.D. (1986). Teaching summarization skills: Influences of student ability level and strategy difficulty. Cognition and Instruction, 3(3), 193–210.
Deane, P. (2013). On the relation between automated essay scoring and modern views of the writing construct. Assessing Writing, 18(1), 7–24.
Deane, P., Odendahl, N., Quinlan, T., Fowles, M., Welsh, C., & Bivens-Tatum, J. (2008). Cognitive models of writing: Writing proficiency as a complex integrated skill. Technical Report 2, ETS Research Report Series, Princeton, NJ.
Fellbaum, C. (Ed.) (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.
Foltz, P.W., Laham, D., & Landauer, T.K. (1999). The Intelligent Essay Assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2).
Garner, R. (1985). Text summarization deficiencies among older students: Awareness or production ability? American Educational Research Journal, 22(4), 549–560.
Gerard, L.F., Ryoo, K., McElhaney, K.W., Liu, O.L., Rafferty, A.N., & Linn, M.C. (2016). Automated guidance for student inquiry. Journal of Educational Psychology, 108(1), 60–81.
Gil, L., Bråten, I., Vidal-Abarca, E., & Strømsø, H.I. (2010). Understanding and integrating multiple science texts: Summary tasks are sometimes better than argument tasks. Reading Psychology, 31(1), 30–68.
Gillespie, A., Graham, S., Kiuhara, S., & Hebert, M. (2014). High school teachers’ use of writing to support students’ learning: a national survey. Reading and Writing, 27(6), 1043–1072.
Glymph, A. (2010). The nation’s report card: Reading 2009. Technical Report NCES 2010-458. Washington: National Center for Education Statistics (NCES).
Glymph, A. (2013). The nation’s report card: Reading 2012. Technical Report NCES 2012-457. Washington: National Center for Education Statistics (NCES).
Glymph, A., & Burg, S. (2013). The nation’s report card: A first look: 2013 mathematics and reading. Technical Report NCES 2014-451. Washington: National Center for Education Statistics (NCES).
Graham, S., Capizzi, A., Harris, K., Hebert, M., & Morphy, P. (2014). Teaching writing to middle school students: a national survey. Reading and Writing, 27(6), 1015–1042.
Graham, S., & Perin, D. (2007a). A meta-analysis of writing instruction for adolescent students. Journal of Educational Psychology, 99(3), 445–476.
Graham, S., & Perin, D. (2007b). Writing next: Effective strategies to improve writing of adolescents in middle and high schools. Technical report. New York: Carnegie Corporation of New York.
Guo, W., & Diab, M. (2012). Modeling sentences in the latent space. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), pages 864–872.
Hand, B.M., Hohenshell, L., & Prain, V. (2004). Exploring students’ responses to conceptual questions when engaged with planned writing experiences: A study with year 10 science students. Journal of Research in Science Teaching, 41(2), 186–210.
Hughes, S., Hastings, P., Magliano, J., Goldman, S., & Lawless, K. (2012). Automated approaches for detecting integration in student essays. In Intelligent Tutoring Systems. Springer-Verlag.
Johnson, R.E. (1970). Recall of prose as a function of the structural importance of the linguistic units. Journal of Verbal Learning and Verbal Behavior, 9(1), 12–20.
Kellogg, R.T. (2008). Training writing skills: a cognitive development perspective. Journal of Writing Research, 1(1), 1–26.
Kharkwal, G., & Muresan, S. (2014). Surprisal as a predictor of essay quality. In Proceedings of the Ninth Workshop on Innovative Use of NLP for Building Educational Applications, pages 54–60.
Kintsch, W., & van Dijk, T.A. (1978). Toward a model of text comprehension and production. Psychological Review, 85, 364–394.
Klein, P.D., & Rose, M.A. (2010). Teaching argument and explanation to prepare junior students for writing to learn. Reading Research Quarterly, 45(4), 433–461.
Krippendorff, K. (1980). Content Analysis: An Introduction to Its Methodology. Beverly Hills: Sage Publications.
Kuhn, H.W. (1955). The Hungarian Method for the assignment problem. Naval Research Logistics Quarterly, 2, 83–97.
Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer questions. Computers and the Humanities, 37(4), 389–405.
Lin, C.-Y. (2004). ROUGE: a package for automatic evaluation of summaries. In Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004).
Liu, O.L., Brew, C., Blackmore, J., Gerard, L.F., Madhok, J., & Linn, M.C. (2014). Automated scoring of constructed-response science items: Prospects and obstacles. Educational Measurement: Issues and Practice, 33(2), 19–28.
Louis, A., & Nenkova, A. (2009). Automatically evaluating content selection in summarization without human models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pages 306–314, Singapore. Association for Computational Linguistics.
Louis, A., & Nenkova, A. (2013). Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2), 267–300.
Ma, X., & Hovy, E.H. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). To Appear.
Magliano, J.P., Trabasso, T., & Graesser, A.C. (1999). Strategic processing during comprehension. Journal of Educational Psychology, 91, 615–629.
Mazzeo, C., Rab, S.Y., & Alssid, J.L. (2003). Building bridges to college and careers: Contextualized basic skills programs at community colleges. Technical report. Brooklyn: Workforce Strategy Center.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (NIPS 2013), pages 3111–3119.
NCES (2012). The nation’s report card: Writing 2011.
Nenkova, A., & Passonneau, R.J. (2004). Evaluating content selection in summarization: The pyramid method. In Dumais, S., Marcu, D., & Roukos, S. (Eds.) HLT-NAACL 2004: Main Proceedings, pages 145–152, Boston, Massachusetts, USA. Association for Computational Linguistics.
Nenkova, A., Passonneau, R.J., & McKeown, K. (2007). The pyramid method: incorporating human content selection variation in summarization evaluation. ACM Transactions on Speech and Language Processing, 4(2).
Norris, S.P., & Phillips, L.M. (2003). How literacy in its fundamental sense is central to scientific literacy. Science Education, 87(2), 224–240.
Olinghouse, N.G., Graham, S., & Gillespie, A. (2015). The relationship of discourse and topic knowledge to fifth graders’ writing performance. Journal of Educational Psychology, 107(2), 391–406.
Olinghouse, N.G., & Wilson, J. (2013). The relationship between vocabulary and writing quality in three genres. Reading and Writing, 26(1), 45–65.
Owczarzak, K., Conroy, J.M., Dang, H.T., & Nenkova, A. (2012). An assessment of the accuracy of automatic evaluation in summarization. In Proceedings of Workshop on Evaluation Metrics and System Comparison for Automatic Summarization, pages 1–9. Association for Computational Linguistics.
Page, E.B. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 47(5), 238–243.
Page, E.B. (1968). The use of the computer in analyzing student essays. International Review of Education, 14, 210–225.
Page, E.B. (1994). Computer grading of student prose, using modern concepts and software. The Journal of experimental education, 62(2), 127–142.
Passonneau, R., & Carpenter, B. (2014). The benefits of a model of annotation. Transactions of the Association for Computational Linguistics, 2, 311–326.
Passonneau, R.J. (2006). Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pages 831–836, Genoa, Italy.
Passonneau, R.J. (2010). Formal and functional assessment of the pyramid method for summary content evaluation. Natural Language Engineering, 16, 107–131.
Passonneau, R.J., Baker, C.F., Fellbaum, C., & Ide, N. (2012). The MASC word sense corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC ’12). European Language Resources Association (ELRA).
Passonneau, R.J., Chen, E., Guo, W., & Perin, D. (2013). Automated pyramid scoring of summaries using distributional semantics. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 143–147, Sofia, Bulgaria. Association for Computational Linguistics.
Passonneau, R.J., Goodkind, A., & Levy, E. (2007). Annotation of children’s oral narrations: Modeling emergent narrative skills for computational applications. In Proceedings of the Twentieth Annual Meeting of the Florida Artificial Intelligence Research Society (FLAIRS-20), pages 253–258.
Passonneau, R.J., McKeown, K., & Sigelman, S. (2006). Applying the pyramid method in the 2006 Document Understanding Conference. In Proceedings of the 2006 Workshop of the Document Understanding Conference (DUC).
Passonneau, R.J., Nenkova, A., McKeown, K., & Sigelman, S. (2005). Applying the pyramid method in DUC 2005. In Proceedings of the 2005 Workshop of the Document Understanding Conference (DUC).
Perin, D., Bork, R.H., Peverly, S.T., & Mason, L.H. (2013). A contextualized curricular supplement for developmental reading and writing. Journal of College Reading and Learning, 43(2), 8–38.
Persky, H.R., Daane, M.C., & Jin, Y. (2003). The nation’s report card: Writing 2002. Technical Report NCES 2003-529. Washington: National Center for Education Statistics (NCES).
Pilehvar, M.T., Jurgens, D., & Navigli, R. (2013). Align, disambiguate and walk: A unified approach for measuring semantic similarity. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1341–1351, Sofia, Bulgaria. Association for Computational Linguistics.
Proske, A., Narciss, S., & McNamara, D.S. (2012). Computer-based scaffolding to facilitate students’ development of expertise in academic writing. Journal of Research in Reading, 35(2), 136–152.
Qazvinian, V., & Radev, D.R. (2012). A computational analysis of collective discourse. In Proceedings of the 2012 Conference on Collective Intelligence, Cambridge MA.
Reiser, B.J., Berland, L.K., & Kenyon, L. (2012). Engaging students in the scientific practices of explanation and argumentation. Science and Children, 49(8), 8–13.
Roscoe, R.D., Allen, L.K., Weston, J.L., Crossley, S.A., & McNamara, D.S. (2015a). The Writing Pal intelligent tutoring system: Usability testing and development. Computers and Composition, 34, 39–59.
Roscoe, R.D., & McNamara, D.S. (2013). Writing Pal: Feasibility of an intelligent writing strategy tutor in the high school classroom. Journal of Educational Psychology, 105(4), 1–16.
Roscoe, R.D., Snow, E.L., Allen, L.K., & McNamara, D.S. (2015b). Automated detection of essay revising patterns: applications for intelligent feedback in a writing tutor. Technology, Instruction, Cognition, and Learning, 10(1), 59–79.
Rosé, C., & VanLehn, K. (2005). An evaluation of a hybrid language understanding approach for robust selection of tutoring goals. International Journal of Artificial Intelligence in Education, 15(4), 325.
Rudner, L.M., Garcia, V., & Welch, C. (2006). An evaluation of the IntelliMetric™ essay scoring system. The Journal of Technology, Learning and Assessment, 4(4).
Saggion, H., Torres-Moreno, J.-M., da Cunha, I., SanJuan, E., & Velázquez-Morales, P. (2010). Multilingual summarization evaluation without human models. In Proceedings of Coling 2010, pages 1059–1067.
Sakai, S., Togasaki, M., & Yamazaki, K. (2003). A note on greedy algorithms for the maximum weighted independent set problem. Discrete Applied Mathematics, 126(2-3), 313–322.
Salahu-Din, D., Persky, H.R., & Miller, J. (2008). The nation’s report card: Writing 2007. Technical Report NCES 2008-468. Washington: National Center for Education Statistics (NCES).
Sampson, V., Enderle, P., Grooms, J., & Witte, S. (2013). Writing to learn by learning to write during the school science laboratory: Helping middle and high school students develop argumentative writing skills as they learn core ideas. Science Education, 97(5), 643–670.
Shermis, M., & Burstein, J. (2013). Handbook of Automated Essay Evaluation: Current Applications and Future Directions. New York: Routledge.
Slotta, J.D., & Linn, M.C. (2009). WISE Science. New York: Teachers College Press.
Surowiecki, J. (2004). The Wisdom of Crowds. New York: Doubleday.
Teufel, S., & van Halteren, H. (2004). Evaluating information content by factoid analysis: Human annotation and stability. In Lin, D., & Wu, D. (Eds.) Proceedings of EMNLP 2004, pages 419–426, Barcelona, Spain. Association for Computational Linguistics.
Turner, A.A. (1987). The propositional analysis system. Technical Report 87-2, University of Colorado. Boulder: Department of Psychology and Institute of Cognitive Science.
Turner, A.A., & Greene, E. (1978). The construction and use of a propositional analysis system. Technical Report JSAS Catalog of Selected Documents in Psychology, no. 1713. Washington, DC: American Psychological Association.
van Dijk, T.A., & Kintsch, W. (1977). Cognitive psychology and discourse: Recalling and summarizing stories. In Dressler, W.U. (Ed.) Trends in text-linguistics, pages 61–80. De Gruyter, New York.
van Dijk, T.A., & Kintsch, W. (1983). Strategies of Discourse Comprehension. New York: Academic Press.
VanLehn, K., Jordan, P., & Rosé, C.P. (2002). The architecture of Why2-Atlas: a coach for qualitative physics essay writing. In Cerri, S.A., Gouarderes, G., & Paraguacu, F. (Eds.) Intelligent Tutoring Systems, 2002: 6th International Conference, pages 158–167, Berlin. Springer.
Westby, C., Culatta, B., Lawrence, B., & Hall-Kenyon, K. (2010). Summarizing expository texts. Topics in Language Disorders, 30(4), 275–287.
Wiemer-Hastings, P., Wiemer-Hastings, K., & Graesser, A.C. (1999). Improving an intelligent tutor’s comprehension of students with latent semantic analysis. In Lajoie, S.P., & Vivet, M. (Eds.) Artificial Intelligence in Education, pages 535–542. IOS Press, Amsterdam.
Wiley, J., & Voss, J.F. (1996). The effects of playing historian on learning in history. Applied Cognitive Psychology, 10, 63–72.
Yang, Q., Passonneau, R.J., & de Melo, G. (2016). PEAK: Pyramid evaluation via automated knowledge extraction. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI 2016). AAAI Press.
Yore, L.D., Hand, B.M., & Florence, M.K. (2004). Scientists’ views of science, models of writing, and science writing practices. Journal of Research in Science Teaching, 41(4), 338–369.
Zipf, G.K. (1949). Human Behavior and the Principle of Least Effort. Cambridge: Addison-Wesley.
This paper is an extended version of an oral presentation made at an NSF-funded workshop held May 7-8, 2015 entitled MARWiSE: Multidisciplinary Advances in Reading and Writing for Science Education (Award IIS-1455533). The authors thank members of the workshop for their constructive feedback. We also thank Weiwei Guo for input regarding his Weighted Matrix Factorization method, and his suggestions for related work. Finally, we thank three anonymous reviewers for their constructive criticism.
Passonneau, R.J., Poddar, A., Gite, G. et al. Wise Crowd Content Assessment and Educational Rubrics. Int J Artif Intell Educ 28, 29–55 (2018). https://doi.org/10.1007/s40593-016-0128-6
- Automated content analysis
- Writing intervention
- Wise-crowd content assessment
- Writing rubrics