Abstract
Machine learning algorithms that automatically score scientific explanations can be used to measure students’ conceptual understanding, identify gaps in their reasoning, and provide them with timely and individualized feedback. This paper presents the results of a study that uses Hebrew NLP to automatically score student explanations in Biology according to fine-grained analytic grading rubrics that were developed for formative assessment. The experimental results show that our algorithms achieve a high level of agreement with human experts, on par with previous work on automated assessment of scientific explanations in English, and that ~500 examples are typically enough to build reliable scoring models. The main contribution is twofold. First, we present a conceptual framework for constructing analytic grading rubrics for scientific explanations, which are composed of dichotomous categories that generalize across items. These categories are designed to support automated guidance, but can also be used to provide a composite score. Second, we apply this approach in a new context – Hebrew, which belongs to a group of languages known as Morphologically Rich. In languages of this group, among them also Arabic and Turkish, each input token may consist of multiple lexical and functional units, making them particularly challenging for NLP. This is the first study on automatic assessment of scientific explanations (and more generally, of open-ended questions) in Hebrew, and among the first to do so in Morphologically Rich Languages.
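The abstract's two uses of the analytic rubric (per-category guidance and a composite score) can be illustrated with a minimal sketch. This is not the authors' implementation; the category names and equal weights are hypothetical examples. The point is only that each rubric category yields an independent dichotomous (0/1) decision, and a composite score is derived by aggregating those decisions.

```python
# Illustrative sketch (hypothetical categories and weights, not the paper's
# actual rubric): combine dichotomous rubric-category decisions into a
# composite score for one student explanation.

# Each analytic-rubric category is scored as present (1) or absent (0)
# by its own classifier; weights here are all 1 for simplicity.
RUBRIC_WEIGHTS = {
    "mentions_mechanism": 1,
    "links_structure_to_function": 1,
    "uses_scientific_terminology": 1,
}

def composite_score(category_predictions: dict) -> int:
    """Sum the weighted binary category decisions into one composite score."""
    return sum(
        RUBRIC_WEIGHTS[category] * int(present)
        for category, present in category_predictions.items()
    )

# Example: per-category classifiers produced these binary labels
# for a single student response.
predictions = {
    "mentions_mechanism": True,
    "links_structure_to_function": False,
    "uses_scientific_terminology": True,
}
print(composite_score(predictions))  # -> 2
```

The per-category decisions can drive targeted formative feedback (e.g., flagging the missing structure-function link), while the summed value serves as the composite score.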
Acknowledgements
The authors thank Cipy Hofman for her contribution. The research of GA and MA was supported by the Willner Family Leadership Institute for the Weizmann Institute of Science and the Iancovici-Fallmann Memorial Fund, established by Ruth and Henry Yancovich. TN is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship.
Moriah Ariely and Tanya Nazaretsky contributed equally to the paper.
Appendix
Cite this article
Ariely, M., Nazaretsky, T. & Alexandron, G. Machine Learning and Hebrew NLP for Automated Assessment of Open-Ended Questions in Biology. Int J Artif Intell Educ 33, 1–34 (2023). https://doi.org/10.1007/s40593-021-00283-x