Description of Subject Population and Data Collection
We recruited 637 students from 25 two- and four-year colleges and universities for this study. Students were 18 years of age or older and enrolled in introductory and upper-level biology courses; henceforward, these students are referred to as “beginner” and “advanced,” respectively (beginner = 170, advanced = 426, unreported = 43). Course instructors volunteered to use the software in their courses; some had used previous versions of the software, while others were newly recruited contacts. In exchange for their help, we provided the software to the instructors and students at no charge, but instructors received no monetary payment. We directed instructors to give the pretest to their students no earlier than one week before using the Darwinian Snails Lab. Students were assigned the lab during the laboratory section of their biology courses and worked alone or with a partner. Within a week of completing the exercise, students were given the posttest. A participant profile sheet attached to each pretest asked subjects to report their gender; the majority of students completed it (female = 327, male = 248).
The Committee on the Use of Humans as Experimental Subjects, the institutional review board at the Massachusetts Institute of Technology in Cambridge, MA, approved this study before data collection.
The Darwinian Snails Lab in EvoBeaker
Each of the EvoBeaker labs includes a series of interactive simulations with which students design experiments and collect data. Students are provided with a workbook for each lab that directs them through different experiments and asks them to organize and interpret data they collect. We (Herron, Maruca, Meir, Perry, Stal) designed the Darwinian Snails Lab to teach the basic principles of natural selection and to correct the most commonly held misconceptions about natural selection. In this lab, students are presented with a re-creation of a New England rocky shore habitat. The simulated habitat includes populations of the native flat periwinkle snail and their predator, the nonnative European green crab, and is based on the work of Seeley (1986) and, for the final exercise, Trussell (1996). The snails vary in their shell thickness, which affects the efficiency of predation by the European green crab.
Students first read a short section about the simulated system and are then introduced to it by acting as European green crabs feeding on snails. Thicker shells require more effort from the students (i.e., more mouse clicks are needed to feed on a snail). As the students feed on snails, they observe changes in the average shell thickness of the snail population. The population of snails then reproduces, and students are shown how the traits of the remaining snails are inherited by their offspring.
After this initial exercise, students explore three basic requirements for natural selection on shell thickness: variation in thickness, heritability of thickness, and differential survival of individuals with different shell thicknesses. During this portion of the lab, students sequentially violate each requirement and predict what will occur when the predator is introduced. Students then quantify changes in the average shell thickness of the snail population.
A third section demonstrates the origin of variation through mutation (but not genetic recombination through sexual reproduction). Students introduce crabs into the system but prevent mutation from occurring in the snail population. They quantify the change in average shell thickness after several generations. Students then allow mutations to occur in the snails, examine individual offspring to see that mutations are random, and compare the average snail shell thickness in the new population after several generations of predation from crabs.
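For readers who want a concrete sense of the dynamics these exercises demonstrate, the sketch below simulates selection on shell thickness with and without mutation. It is a minimal Python illustration with invented parameter values and survival rules, not the EvoBeaker implementation.

```python
# Minimal sketch of selection on shell thickness, loosely inspired by the
# lab's mechanics. All parameter values are illustrative assumptions.
import random

POP_SIZE = 100      # assumed population size
GENERATIONS = 10    # assumed number of generations
MUTATION_SD = 0.05  # assumed std. dev. of random mutation effects

def run(mutation=True):
    # Requirement 1: shell thickness varies among individuals.
    pop = [random.uniform(0.5, 1.5) for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        # Requirement 3: differential survival -- thicker-shelled snails
        # are harder for crabs to eat, so survival rises with thickness.
        survivors = [t for t in pop if random.random() < t / 2.0]
        if not survivors:
            break
        # Requirement 2: heritability -- offspring inherit a parent's
        # thickness, with an optional random (undirected) mutation.
        pop = []
        for _ in range(POP_SIZE):
            t = random.choice(survivors)
            if mutation:
                t += random.gauss(0.0, MUTATION_SD)
            pop.append(max(t, 0.0))
    return sum(pop) / len(pop)

print("mean thickness, no mutation:   %.2f" % run(mutation=False))
print("mean thickness, with mutation: %.2f" % run(mutation=True))
```

In both runs the mean thickness rises across generations; removing any one requirement (e.g., sampling survivors uniformly, or severing inheritance) eliminates the response to selection, which is the point the lab's violation exercises make.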
Finally, students design their own experiments to determine whether two snail populations differ in average shell thickness because of natural selection. Students can set up common garden experiments and include crabs or crabs with banded claws (crabs unable to feed on snails). Through these experiments, students test their hypotheses about the factors driving average shell thickness in the system. Figure 1 shows an example screenshot from this final section of the lab. The full lab takes students 1.5–2 hours to complete, with about 1 hour devoted to the first two sections, 10–15 minutes to the section about mutations, and 30 minutes or more to the final open-ended section.
Instrument Design and Validation
The initial design of the test was a series of written open-response questions. We (Herron) asked several evolutionary biology instructors for feedback on the test and refined it based on their comments. We (Meir, Perry) then pilot-tested the exams with 20 Boston-area students to identify remaining problems with test items. We interviewed students after they took the exam, allowing them to explain their answers further, and used student responses from tests and interviews to develop distracters (incorrect options) for the multiple-choice test items.
The final version of the tests designed for this study included nine multiple-choice and seven open-response items about natural selection principles. Four of the multiple-choice items were taken from Settlage and Odum (1995). The remaining multiple-choice and open-response test items presented scenarios from a hypothetical situation and asked students to analyze or predict outcomes based on the information provided. The majority of the multiple-choice questions included distracters based on student responses and on the most common misconceptions about natural selection in the literature. The pre- and posttests were structured identically, but we changed the specific information in each item. We analyzed the internal consistency of the pretest multiple-choice questions using the Kuder–Richardson 20 method, which yielded a reliability coefficient of 0.68. Sample short- and long-response test questions are shown in Appendix A. Full copies of our tests are available by writing to SimBiotic Software® (www.simbio.com); we avoid posting them openly so that they remain useful to instructors.
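For reference, the Kuder–Richardson 20 coefficient can be computed from a matrix of dichotomous (0/1) item scores as in the generic sketch below; the data are invented, and this is not the scoring code used in the study.

```python
# Kuder-Richardson 20: internal consistency for dichotomous items.
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """responses: students x items matrix of 0/1 scores."""
    k = responses.shape[1]                   # number of items
    p = responses.mean(axis=0)               # proportion correct per item
    item_var = (p * (1 - p)).sum()           # sum of item variances
    total_var = responses.sum(axis=1).var()  # variance of total scores
    return (k / (k - 1)) * (1 - item_var / total_var)

# Example with fabricated scores for five students on four items:
scores = np.array([[1, 0, 1, 1],
                   [1, 1, 1, 0],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0]])
print(round(kr20(scores), 2))
```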
To quantify the presence or absence of misconceptions on the open-response test items, we (Abraham, Herron, Meir) first developed a rubric based on the list of misconceptions culled from the literature (Table 1). We then independently coded misconceptions and correct concepts in student responses on a subset of tests. Initial agreement among the authors was 85%; disagreements were discussed until all of the authors agreed on 100% of the coding. One author (Abraham, who did not participate in designing the lab or tests) then coded misconceptions in the open responses based on the revised rubric (N = 338 students). Coding for the presence of misconceptions was conservative: we assigned a misconception to a response only when the student stated it explicitly. Responses that indirectly suggested a misconception but did not state it clearly, and incorrect answers that did not link to a misconception, were both coded as unclassifiable.
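Percent agreement of the kind reported above can be computed directly from two coders' presence/absence codes; the codes below are hypothetical.

```python
# Percent agreement between two coders on presence/absence codes.
def percent_agreement(coder_a, coder_b):
    # Each list holds one code (1 = misconception present) per response.
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)

coder_a = [1, 0, 1, 1, 0, 1]
coder_b = [1, 0, 0, 1, 0, 1]
print(round(percent_agreement(coder_a, coder_b), 1))  # 83.3
```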
Data Analysis
We analyzed student performance on the multiple-choice and open-response items separately. We used the full dataset for the multiple-choice responses (N = 637) but a subset of those exams for the analysis of the open responses (N = 338). The subset, drawn from twelve institutions, was chosen to include a sufficient number of exams from both beginner and advanced students and to represent the diversity of institution types in the study. We first compared the average proportion of correct answers on the pre- and posttest multiple-choice items with a one-tailed paired-sample Wilcoxon signed-rank nonparametric test. We then used two-tailed Wilcoxon rank-sum nonparametric tests to compare average improvement in multiple-choice score between two pairs of student subgroups: female (n = 187) versus male (n = 145), and beginner (n = 128) versus advanced (n = 210). We calculated Cohen's d effect sizes for each comparison of pre- and posttest scores, where d = (x̄₁ − x̄₂)/s, x̄ = group mean, and s = pooled standard deviation.
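As an illustration, these comparisons map onto standard SciPy routines. The score arrays below are hypothetical stand-ins for the study's data, and the pooled-standard-deviation form of Cohen's d shown is one common variant (assuming roughly equal group sizes).

```python
# Paired and group comparisons of test scores, sketched with SciPy.
import numpy as np
from scipy import stats

pre = np.array([4, 5, 6, 3, 7, 5, 6, 4])   # hypothetical pretest scores
post = np.array([6, 6, 7, 5, 8, 6, 7, 6])  # hypothetical posttest scores

# One-tailed paired Wilcoxon signed-rank test (posttest > pretest).
w, p_paired = stats.wilcoxon(post, pre, alternative="greater")

# Two-tailed Wilcoxon rank-sum test on improvement between subgroups.
gain_beginner = np.array([2, 1, 2, 0])      # hypothetical score gains
gain_advanced = np.array([1, 1, 2, 2])
u, p_groups = stats.ranksums(gain_beginner, gain_advanced)

def cohens_d(x1, x2):
    # Pooled SD for two groups of (roughly) equal size.
    s = np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2)
    return (x1.mean() - x2.mean()) / s

print(p_paired, p_groups, cohens_d(post, pre))
```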
In one open-response test item, we asked students to describe what had occurred in the hypothetical situations in the pre- and posttests, to elicit descriptions of the process of natural selection (Question 15, Appendix A). We designated this question as a long-response question because a correct answer necessarily involved more than one sentence. We counted four correct concepts in each answer: (1) variation in traits, (2) heritability of traits, (3) differential survival to reproduction, and (4) change in the average trait value in the population over generations. We compared the number of correct concepts provided by students before and after instruction with a one-tailed paired-sample Wilcoxon signed-rank test, and then compared average improvement between genders and between academic levels with two-tailed Wilcoxon rank-sum tests. We used the statistical software package JMP 7.0.2 (SAS Institute 2008) for each of the preceding analyses.
To analyze student performance on the other open-response questions (short response), we scored each misconception as present or absent in the exam. Thus, a student who used a misconception a single time was scored the same as a student who used it multiple times. Students sometimes failed to provide answers for some of the open-response test items. Before analysis, we compared the frequency of incomplete responses to open-response test items between pre- and posttests; finding no difference, we included incomplete exams in our analysis. We also compared student use of misconceptions between public and private institutions and, finding no difference, dropped this factor from the analysis.
We compared the prevalence of the four most common misconceptions (MC1 = willful change, MC2 = directed variation, MC3 = intragenerational change, MC4 = population change; defined in Table 1) between the pre- and posttests with a series of McNemar's paired-sample chi-square tests. Other misconceptions did not occur frequently enough to analyze statistically.
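McNemar's test operates on the discordant pairs, that is, students whose presence/absence code changed between tests. A minimal sketch with invented counts:

```python
# McNemar's paired chi-square test on presence/absence codes.
from scipy import stats

# b = misconception present on pretest only (improved);
# c = misconception present on posttest only (worsened).
b, c = 40, 12

# McNemar's statistic with the standard continuity correction.
chi2 = (abs(b - c) - 1) ** 2 / (b + c)
p = stats.chi2.sf(chi2, df=1)  # one degree of freedom
print(chi2, p)
```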
We next compared improvement between male and female students and between beginner and advanced students, among those who exhibited misconceptions in either test, with a series of chi-square tests. For these comparisons, we defined improvement as a misconception present in the pretest open responses but absent in the posttest. We defined a lack of improvement as either the presence of a misconception in both the pre- and posttests or a misconception present in the posttest that was absent from the pretest. Thus, students who did not exhibit a given misconception in either test were excluded from this portion of the analysis.
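Each subgroup comparison can be framed as a 2 × 2 contingency table (group × improvement) and tested with a standard chi-square test of independence; the counts below are hypothetical.

```python
# Chi-square test comparing improvement frequencies between subgroups.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: beginner vs. advanced; columns: improved vs. did not improve.
table = np.array([[30, 20],
                  [70, 25]])
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p)
```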