Different Approaches to Assessing the Quality of Explanations Following a Multiple-Document Inquiry Activity in Science

  • Jennifer Wiley
  • Peter Hastings
  • Dylan Blaum
  • Allison J. Jaeger
  • Simon Hughes
  • Patricia Wallace
  • Thomas D. Griffin
  • M. Anne Britt


This article describes several approaches to assessing student understanding from the written explanations that students generate during a multiple-document inquiry activity on a scientific topic (global warming). The current work attempts to capture the causal structure of student explanations as a way to gauge the quality of the students’ mental models and understanding of the topic, combining approaches from Cognitive Science and Artificial Intelligence and applying them to Education. First, several attributes of the explanations are explored through hand coding and through existing technologies (LSA and Coh-Metrix). Then, we describe an approach for inferring the quality of the explanations using a novel, two-phase machine-learning approach that detects the causal relations and causal chains present within student essays. The results demonstrate the benefits of a machine-learning approach for detecting content, but also highlight the promise of hybrid methods that combine ML, LSA, and Coh-Metrix approaches for detecting student understanding. Opportunities to use automated approaches within Intelligent Tutoring Systems that provide feedback toward improving student explanations and understanding are discussed.
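The two-phase approach named in the abstract can be illustrated with a minimal sketch: phase 1 extracts cause–effect relations from individual sentences, and phase 2 links relations into causal chains when one relation's effect feeds the next relation's cause. The cue-phrase matcher and graph walk below are hypothetical stand-ins for the paper's trained classifiers, included only to show the pipeline shape, not the actual system.

```python
import re

# Hypothetical cue-phrase inventory standing in for the phase-1 learned model.
CAUSAL_CUES = r"\b(causes?|leads? to|results? in|increases?)\b"

def extract_relations(sentence):
    """Phase 1 (sketch): return (cause, effect) pairs signalled by a causal cue."""
    m = re.search(CAUSAL_CUES, sentence, flags=re.IGNORECASE)
    if not m:
        return []
    cause = sentence[:m.start()].strip(" .").lower()
    effect = sentence[m.end():].strip(" .").lower()
    return [(cause, effect)] if cause and effect else []

def build_chains(relations):
    """Phase 2 (sketch): chain relations whose effect matches a later cause."""
    graph = {}
    for cause, effect in relations:
        graph.setdefault(cause, []).append(effect)
    chains = []

    def walk(node, path):
        nexts = graph.get(node, [])
        if not nexts:
            chains.append(path)
            return
        for nxt in nexts:
            walk(nxt, path + [nxt])

    # Start chains from causes that never appear as an effect.
    starts = set(graph) - {e for es in graph.values() for e in es}
    for start in sorted(starts):
        walk(start, [start])
    return chains

essay = ["Burning fossil fuels increases carbon dioxide.",
         "Carbon dioxide leads to trapped heat.",
         "Trapped heat causes global warming."]
relations = [r for s in essay for r in extract_relations(s)]
chains = build_chains(relations)
```

On this toy essay the sketch recovers a single four-node chain ending in "global warming"; the length and completeness of such chains is the kind of structural signal the article uses as a proxy for mental-model quality.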


Automatic assessment · Mental models · Explanations · Causal structure · Causal relations · Machine learning · Natural language processing



Portions of this work were supported by the Institute of Education Sciences (R305B070460, R305F100007) and the National Science Foundation (1535299). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these organizations. Allison J. Jaeger is now at the Spatial Intelligence Learning Center (SILC), Temple University. The authors thank Karyn Higgs, Kristopher Kopp, Mike Mensink, Tegan Michl, Carlos Salas, Brent Steffens, and Andrew Taylor for their contributions on this project.



Copyright information

© International Artificial Intelligence in Education Society 2017

Authors and Affiliations

  • Jennifer Wiley (1)
  • Peter Hastings (2)
  • Dylan Blaum (3)
  • Allison J. Jaeger (1)
  • Simon Hughes (2)
  • Patricia Wallace (3)
  • Thomas D. Griffin (1)
  • M. Anne Britt (3)

  1. University of Illinois at Chicago, Chicago, USA
  2. DePaul University, Chicago, USA
  3. Northern Illinois University, DeKalb, USA
