Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics

Abstract

In response to the call for promoting three-dimensional science learning (NRC, 2012), researchers argue for developing assessment items that go beyond rote memorization tasks to ones that require deeper understanding and the use of reasoning that can improve science literacy. Such assessment items are usually performance-based constructed responses, and technology is needed to ease the scoring burden they place on teachers. This study responds to this need by examining the use and accuracy of a machine learning text analysis protocol as an alternative to human scoring of constructed response items. The items we employed represent multiple dimensions of science learning as articulated in the 2012 NRC report. Using a sample of over 26,000 constructed responses from 6700 students in chemistry and physics, we trained human raters and compiled a robust training set to develop machine algorithmic models and cross-validate the machine scores. Results show that human raters yielded good (Cohen’s κ = .40–.75) to excellent (Cohen’s κ > .75) interrater reliability on the assessment items with varied numbers of dimensions. A comparison reveals that the machine scoring algorithms achieved scoring accuracy comparable to human raters on these same items. Results also show that responses with formal vocabulary (e.g., velocity) were likely to yield lower machine-human agreement, which may reflect the fact that fewer students employed formal phrases than the informal alternatives.
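The reliability bands cited above follow Fleiss (1981): Cohen’s κ below .40 is poor, .40–.75 good, and above .75 excellent. As a minimal sketch of how machine-human agreement could be computed and interpreted against these bands, the snippet below uses scikit-learn’s cohen_kappa_score on invented labels; it illustrates the metric only and is not the authors’ scoring pipeline.

```python
# Minimal sketch: agreement between human and machine scores via Cohen's kappa.
# Labels mirror the rubric categories in the appendices; the data are invented.
from sklearn.metrics import cohen_kappa_score

human_scores   = ["MDC", "Correct", "Incorrect", "Correct", "MDC", "Incorrect"]
machine_scores = ["MDC", "Correct", "Incorrect", "Incorrect", "MDC", "Incorrect"]

kappa = cohen_kappa_score(human_scores, machine_scores)

# Interpretation bands from Fleiss (1981), as used in the paper.
if kappa > 0.75:
    band = "excellent"
elif kappa >= 0.40:
    band = "good"
else:
    band = "poor"

print(f"Cohen's kappa = {kappa:.2f} ({band} agreement)")
```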

References

  • AACR. (2020). Retrieved September 4, 2020, from https://apps.beyondmultiplechoice.org

  • Balfour, S. P. (2013). Assessing writing in MOOCs: Automated Essay Scoring and Calibrated Peer Review™. Research & Practice in Assessment, 8, 40–48.

  • Cheuk, T., Osborne, J., Cunningham, K., Haudek, K., Santiago, M., Urban-Lurain, M., Merril, J., Wilson, C., Stuhlsatz, M., Donovan, B., Bracey, Z., & Gardner, A. (2019). Towards an equitable design framework of developing argumentation in science tasks and rubrics for machine learning. Paper presented at the annual meeting of the National Association for Research in Science Teaching (NARST), Baltimore, MD.

  • Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley. ISBN 978-0-471-26370-8.

  • Geiger, R. S., Yu, K., Yang, Y., Dai, M., Qiu, J., Tang, R., & Huang, J. (2020, January). Garbage in, garbage out? Do machine learning application papers in social computing report where human-labeled training data comes from? In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 325–336).

  • Ha, M., & Nehm, R. H. (2016). The impact of misspelled words on automated computer scoring: a case study of scientific explanations. Journal of Science Education and Technology, 25(3), 358–374.

  • Harris, C. J., Krajcik, J. S., Pellegrino, J. W., & DeBarger, A. H. (2019). Designing knowledge-in-use assessments to promote deeper learning. Educational Measurement: Issues and Practice, 38(2), 53–67. https://doi.org/10.1111/emip.12253.

  • Haudek, K., Santiago, M., Wilson, C., Stuhlsatz, M., Donovan, B., Bracey, Z., Gardner, A., Osborne, J., & Cheuk, T. (2019). Using automated analysis to assess middle school students’ competence with scientific argumentation. Paper presented at the annual meeting of the National Council on Measurement in Education (NCME), Toronto, ON.

  • Large, J., Lines, J., & Bagnall, A. (2019). A probabilistic classifier ensemble weighting scheme based on cross-validated accuracy estimates. Data Mining and Knowledge Discovery, 33(6), 1674–1709.

  • Lee, H. S., McNamara, D., Bracey, Z. B., Liu, O. L., Gerard, L., Sherin, B., Wilson, C., Pallant, A., Linn, M., Haudek, K., & Osborne, J. (2019a). Computerized text analysis: Assessment and research potentials for promoting learning.

  • Lee, H. S., Pallant, A., Pryputniewicz, S., Lord, T., Mulholland, M., & Liu, O. L. (2019b). Automated text scoring and real-time adjustable feedback: Supporting revision of scientific arguments involving uncertainty. Science Education, 103(3), 590–622.

  • Liu, O. L., Brew, C., Blackmore, J., & Gerard, L. (2014). Automated scoring of constructed response science items: Prospects and obstacles. Educational Measurement: Issues and Practice, 33(2), 19–28. https://doi.org/10.1111/emip.12028.

  • Lottridge, S., Wood, S., & Shaw, D. (2018). The effectiveness of machine score-ability ratings in predicting automated scoring performance. Applied Measurement in Education, 31(3), 215–232.

  • Mao, L., Liu, O. L., Roohr, K., Belur, V., Mulholland, M., Lee, H.-S., & Pallant, A. (2018). Validation of automated scoring for a formative assessment that employs scientific argumentation. Educational Assessment, 23(2), 121–138.

  • Mayfield, E., & Rosé, C. (2010, June). An interactive tool for supporting error analysis for text mining. In Proceedings of the NAACL HLT 2010 Demonstration Session (pp. 25–28).

  • Mayfield, E., & Rosé, C. P. (2013). Open source machine learning for text. In Handbook of automated essay evaluation: Current applications and new directions.

  • National Academies of Sciences, Engineering, and Medicine. (2019). Science and engineering for grades 6–12: Investigation and design at the center. National Academies Press.

  • National Research Council. (2012). A framework for K-12 science education: Practices, crosscutting concepts, and core ideas. National Academies Press.

  • National Research Council. (2014). Developing assessments for the next generation science standards. National Academies Press.

  • Nehm, R. H., & Haertig, H. (2012). Human vs. computer diagnosis of students’ natural selection knowledge: testing the efficacy of text analytic software. Journal of Science Education and Technology, 21(1), 56–73.

  • NGSS Lead States. (2013). Next generation science standards: For states, by states. Washington, DC: The National Academies Press.

  • Pellegrino, J. W. (2013). Proficiency in science: Assessment challenges and opportunities. Science, 340(6130), 320–323.

  • Zhai, X., Haudek, K., Shi, L., Nehm, R., & Urban-Lurain, M. (2020a). From substitution to redefinition: A framework of machine learning-based science assessment. Journal of Research in Science Teaching, 57(9), 1430–1459. https://doi.org/10.1002/tea.21658.

  • Zhai, X., Haudek, K., Stuhlsatz, M., & Wilson, C. (2020b). Evaluation of construct-irrelevant variance yielded by machine and human scoring of a science teacher PCK constructed response assessment. Studies in Educational Evaluation, 67, 1–12. https://doi.org/10.1016/j.stueduc.2020.100916.

  • Zhai, X., Yin, Y., Pellegrino, J., Haudek, K., & Shi, L. (2020c). Applying machine learning in science assessment: A systematic review. Studies in Science Education, 56(1), 111–151.

  • Zhu, M., Lee, H.-S., Wang, T., Liu, O. L., Belur, V., & Pallant, A. (2017). Investigating the impact of automated feedback on students’ scientific argumentation. International Journal of Science Education, 39(12), 1648–1668.

Acknowledgments

We are grateful to the team members: Dan Oleynik, Jacob Crosby, Taryn Stefanski, Cullen Hudson, Eleonora Baker, and Rachel Marias-Dezendorf.

Funding

This material is based upon work supported by the National Science Foundation under Grant No. OISE: 1545684.

Author information

Corresponding author

Correspondence to Sarah Maestrales.

Ethics declarations

Conflict of Interest

The authors have no conflict of interest in the publication of this paper.

Informed Consent

Informed consent was obtained from all individual participants involved in the study.

Ethical Approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1

Item 1: Experimental Design Text and Rubric
Question text
Meg designs an experiment to see which of three types of sneakers provides the most friction. She uses the equipment listed below:
1. Sneaker 1
2. Sneaker 2
3. Sneaker 3
4. Spring scale
She uses the setup illustrated below and pulls the spring scale to the left. Meg tests one type of sneaker on a gym floor, a second type of sneaker on a grass field, and a third type of sneaker on a cement sidewalk. Her teacher is not satisfied with the way Meg designed her experiment.
A. Describe one error in Meg's experiment.
Alignment to the NGSS (2013) Performance Expectations
DCI (grades 6–8): ETS1.A Defining and Delimiting Engineering Problems
CC: (N/A)
SEP (grades 3–5): Planning and Carrying Out Investigations
Student example and multi-dimensional rubric
Multi-dimensional correct
Example: "Meg's error is that she is testing three experiments in separate and different settings, allowing the experiments to have different outcomes. This stops her from knowing if her other shoes work on a gym floor or grass field or a cement sidewalk."
DCI: Student correctly identifies the error in the experimental setup.
SEP: Student explains that this is a failure to control for variables or that the results cannot be compared.

Correct
Example: "Meg should have tested the sneakers in the same location for each test."
DCI: Student correctly identifies an error in the experimental setup.
No SEP: Student does not explain that the error is a failure to control for relevant variables.

Incorrect
Example: "Meg should've used different types of sneakers, not the same."
Student provides an incorrect response or identifies an irrelevant error in the experimental setup. (A sketch of how these judgments combine into one score follows.)
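Read as a decision rule, the rubric above combines two binary judgments: a response is multi-dimensional correct only when both the DCI and SEP criteria are met, correct when only the DCI criterion is met, and incorrect otherwise. The sketch below makes that combination explicit; the function and flags are illustrative, not the study's scoring code.

```python
def assign_label(dci_met: bool, sep_met: bool) -> str:
    """Combine per-dimension judgments into one item score (Appendix 1 rubric)."""
    if dci_met and sep_met:
        return "Multi-dimensional correct"  # DCI and SEP both satisfied
    if dci_met:
        return "Correct"                    # DCI satisfied, SEP missing
    return "Incorrect"                      # DCI missing or irrelevant

print(assign_label(True, True))    # Multi-dimensional correct
print(assign_label(True, False))   # Correct
print(assign_label(False, False))  # Incorrect
```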

Appendix 2

Item 2: Relative Motion Text and Rubric
Question text
Suppose you are riding in a car along the highway at 55 miles per hour when a truck pulls up along the side of your car. This truck seems to stand still for a moment, and then it seems to be moving backward.
A. Tell how the truck can look as if it is standing still when it is really moving forward.
Alignment to the NGSS (2013) Performance Expectations
DCI (grades 6–8): PS2.A Forces and Motion
CC (grades 6–8): Scale and Proportion
SEP: (N/A)
Student example and multi-dimensional rubric
Multi-dimensional correct
Example: "The truck looks as if it is standing still as both your car and the truck are moving at 55 mph in the same direction."
DCI: Student relates the truck's speed to the speed of the observer.
CC: Student states that equal relative speeds would cause the truck to appear as though it is standing still.

Correct
Example: "It is going 55 miles per hour, which is as fast as the car is going."
DCI: Student relates the truck's speed to the speed of the observer.
No CC: Student does not discuss the visual phenomenon being caused by the relative speeds.

Incorrect
Example: "the truck looks like it is still because it is losing speed."
Student provides an incorrect or irrelevant explanation for the phenomenon, OR only restates the question.

Appendix 3

Item 3: Properties of Solutions Text and Rubrics
Question text
Maria has one glass of pure water and one glass of salt water, which look exactly alike. Explain what Maria could do, without tasting the water, to find out which glass contains the salt water.
Alignment to the NGSS (2013) Performance Expectations
DCI (grades 3–5, 6–8): PS1.A Structure and Properties of Matter
CC (grades 6–8): Cause and Effect
SEP (grades 6–8): Planning and Carrying Out Investigations
Student example and multi-dimensional rubric
Multi-dimensional correct
Example: "Maria could use two similar cups and weigh them both and the heavier one is saltwater."
SEP: Student response describes an experiment that controls for relevant variables.
DCI: The experiment isolates a measurement that will differentiate fresh water from salt water.
CC: Student indicates the expected result that will allow them to differentiate the fresh water and salt water.

Correct
Example: "Maria can weigh the cups that hold the water."
SEP: Student response describes an experiment that controls for relevant variables.
DCI: The experiment isolates a measurement that will differentiate fresh water from salt water.
No CC: Student does not indicate the expected result that will allow them to differentiate the fresh water and salt water.

Incorrect
Example: "Your body floats easier in salt water."
Student response does not describe an experiment that will differentiate fresh water from salt water.

Appendix 4

Item 4: States of Matter Text and Rubrics
Question text
Anita puts the same amount of water in two pots of the same size and type. She places one pot of water on the counter and one pot of water on a hot stove.
After ten minutes, Anita observes that there is less water in the pot on the hot stove than in the pot on the counter, as shown below.
A. Why is there less water in the pot on the hot stove?
B. Where did the water go?
Alignment to the NGSS (2013) Performance Expectations
DCI (grades 6–8): PS1.A Structure and Properties of Matter
CC (grades 6–8): Energy and Matter
SEP: (N/A)
Student example and multi-dimensional rubric
Multi-dimensional correct
Example: "The heat caused it to evaporate."
DCI: Student says the water evaporated, AND
CC: Student attributes this to the heat from the stove.

Correct
Example: "The water evaporated."
DCI: Student says the water evaporated, OR
CC: Student attributes this to the heat from the stove.

Incorrect
Example: "it dried up."
Student provides an incorrect or irrelevant explanation.

Appendix 5

Machine Errors and Certainty of Score for Student Responses to Item 2: Relative Motion
Each response was scored by up to three predictive models (1st, 2nd, 3rd); MDC = multi-dimensional correct.

Response: "You share the same velocity and thus from your relative position moving alongside you, it doesn't appear to move."
1st model: human MDC, machine Incorrect (probability 0.65)
2nd model: human MDC, machine Correct (probability 0.66)
3rd model: human MDC, machine Correct (probability 0.62)

Response: "The speed of the truck in relation to your car is the same, not changing the distance between the two and creating the illusion that there is no movement of the vehicles."
1st model: N/A
2nd model: human MDC, machine Incorrect (probability 0.80)
3rd model: human MDC, machine Incorrect (probability 0.83)

Response: "You and the truck are moving at very similar speeds, creating the illusion that it is still."
1st model: human MDC, machine Incorrect (probability 0.86)
2nd model: human MDC, machine Incorrect (probability 0.77)
3rd model: human MDC, machine Incorrect (probability 0.82)

Response: "its going 55"
1st model: human Correct, machine Incorrect (probability 0.94)
2nd model: human Correct, machine Incorrect (probability 0.93)
3rd model: human Correct, machine Incorrect (probability 0.95)

Response: "The truck can seem to be looking as if it were standing still when the car is moving at a slower velocity than 55 miles per hour."
1st model: N/A
2nd model: human Incorrect, machine MDC (probability 0.90)
3rd model: human Incorrect, machine MDC (probability 0.90)

Response: "If the truck slows down and you are still going at the same speed it would appear that it stopped"
1st model: human Incorrect, machine MDC (probability 0.72)
2nd model: human Incorrect, machine MDC (probability 0.85)
3rd model: human Incorrect, machine MDC (probability 0.86)
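The probabilities above suggest one practical use of certainty scores: responses on which the models disagree, or on which any model reports low certainty, could be routed back to human raters. The sketch below shows such a routing rule; the 0.70 threshold and the data layout are assumptions for illustration, not the study's procedure.

```python
# Sketch: route a response to human review when the three models disagree
# or any model's certainty falls below a chosen threshold (illustrative).
# Data mirror the first response in the table above.
predictions = [
    ("Incorrect", 0.65),  # 1st model's label and probability
    ("Correct",   0.66),  # 2nd model
    ("Correct",   0.62),  # 3rd model
]

CERTAINTY_THRESHOLD = 0.70  # assumed cutoff, not taken from the paper

labels = {label for label, _ in predictions}
lowest_certainty = min(prob for _, prob in predictions)

needs_human_review = len(labels) > 1 or lowest_certainty < CERTAINTY_THRESHOLD
print(f"Route to human rater: {needs_human_review}")  # True for this response
```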

About this article

Cite this article

Maestrales, S., Zhai, X., Touitou, I. et al. Using Machine Learning to Score Multi-Dimensional Assessments of Chemistry and Physics. J Sci Educ Technol 30, 239–254 (2021). https://doi.org/10.1007/s10956-020-09895-9

Keywords

  • Three-dimensional science learning
  • Machine learning
  • Automatic scoring