Journal of Science Education and Technology

Volume 21, Issue 1, pp 56–73

Human vs. Computer Diagnosis of Students’ Natural Selection Knowledge: Testing the Efficacy of Text Analytic Software

  • Ross H. Nehm
  • Hendrik Haertig


Our study examines the efficacy of Computer Assisted Scoring (CAS) of open-response text relative to expert human scoring within the complex domain of evolutionary biology. Specifically, we explored whether CAS can diagnose the explanatory elements (or Key Concepts, KCs) that comprise undergraduate students’ explanatory models of natural selection as faithfully as expert human scorers in a sample of >1,000 essays. We used SPSS Text Analysis (SPSSTA) 3.0 to perform our CAS and measured Kappa values (inter-rater reliability) of KC detection (i.e., computer–human rating correspondence). Our first analysis indicated that the text analysis functions (or extraction rules) developed and deployed in SPSSTA to extract individual KCs from three items differing in several surface features (e.g., taxon, trait, type of evolutionary change) produced “substantial” (Kappa 0.61–0.80) or “almost perfect” (Kappa 0.81–1.00) agreement. The second analysis explored human–computer correspondence for KC diversity (the number of different accurate knowledge elements) in the combined sample of all 827 essays. Here we found outstanding correspondence; extraction rules generated using one prompt type were broadly applicable to other evolutionary scenarios (e.g., bacterial resistance, cheetah running speed). This result is encouraging, as it suggests that the development of new item sets may not necessitate the development of new text analysis rules. Overall, our findings suggest that CAS tools such as SPSS Text Analysis may compensate for some of the intrinsic limitations of currently used multiple-choice Concept Inventories designed to measure student knowledge of natural selection.
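The human–computer agreement statistic used throughout the abstract is Cohen's Kappa, which corrects raw percent agreement for the agreement expected by chance from each rater's marginal label frequencies. As a minimal sketch (the rating data below are hypothetical, not the study's actual scores), Kappa for one Key Concept scored present/absent by a human and by the software can be computed as:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_observed - p_expected) / (1 - p_expected), where p_expected
    is the chance agreement implied by each rater's marginal frequencies.
    """
    assert len(rater_a) == len(rater_b) and len(rater_a) > 0
    n = len(rater_a)
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical data: human vs. computer detection (1 = KC present) in 10 essays.
human    = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
computer = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
print(round(cohens_kappa(human, computer), 2))  # → 0.78
```

A Kappa of 0.78 falls in the “substantial” (0.61–0.80) band of the Landis–Koch interpretation scale referenced in the abstract, even though raw agreement here is 90%: the chance correction is what distinguishes Kappa from simple percent agreement.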


Keywords: Computer scoring · Text analysis · Assessment · Knowledge measurement · Concept inventories · Evolution · Natural selection



We thank Judy Ridgway for hand scoring some of the responses; Minsu Ha for help with the figures; the editors and anonymous reviewers for very helpful feedback on the manuscript; and the National Science Foundation REESE program (0909999) for funding parts of this study.



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. School of Teaching and Learning, Columbus, USA
  2. Institute of Physics Education, University of Duisburg-Essen, Essen, Germany
