Introduction

Learners have well-documented problems with understanding and learning key scientific concepts like energy (e.g., Opitz et al. 2017; Wernecke et al. 2018), genetics (e.g., Schmiemann et al. 2017; Venville et al. 2005), and evolution (e.g., Gregory 2009; Rector et al. 2013; Rosengren et al. 2012). A shared aspect of these scientific concepts is that the spatial and/or temporal dimensions of the associated processes and structures prevent their direct perception. Hence, they can only be understood on an imaginary level, like all concepts beyond humans' perceptual (especially visual) dimensions (Lakoff 1987; Lakoff and Johnson 1980). For instance, random mutations in DNA are important sources of variation in the key evolutionary process of natural selection (Heams 2014). However, these mutations are not visible to the naked human eye, although they can be visualized technologically (e.g., using DNA sequencing techniques). The consequent lack of opportunities for students to observe these phenomena in everyday situations may result in misunderstanding of the importance of random processes in evolution (Garvin-Doxas and Klymkowsky 2008). Furthermore, students frequently misunderstand the general abstract concepts, such as randomness and probability, that underlie biological processes (Garvin-Doxas and Klymkowsky 2008; Mead and Scott 2010). Thus, it may be essential to address these underlying abstract concepts to overcome problems in learning evolution (Tibell and Harms 2017). Appropriate visualizations such as simulations may help to overcome these limitations and make the concepts tangible.

Researchers involved in the EvoVis project (EvoVis: Challenging Threshold Concepts in Life Science - enhancing understanding of evolution by visualization) have developed interactive, web-based simulation software, called EvoSketch, which allows learners to explore random and probabilistic phenomena associated with the process of natural selection. The software generates a line (representing a reproducing organism) that is replicated by the user for 20 generations. After these 20 generations, the line will normally have shifted either to the right or to the left due to the combination of copying errors and a selection process.

Our general expectations were that abstracting the processes of randomness and probability in the context of evolution (via EvoSketch) and actively working with EvoSketch would help students to learn these concepts. As research has shown that simulations are seldom effective in improving knowledge without instructional support (e.g., Eckhardt et al. 2013; Wouters and van Oostendorp 2013), we also expected that additional instructional support would facilitate learners' self-directed simulation-based learning better than a learning opportunity without this support.

Background

Learning evolution and the notion of threshold concepts

Over the past decades, a large body of work on evolution education has identified several difficulties in learning its essential tenets and examined the diversity of students' alternative conceptions (e.g., Baalmann et al. 2004; Beggrow and Nehm 2012; Bishop and Anderson 1990; Gregory 2009; Kampourakis and Zogza 2008; Nehm and Schonfeld 2008; Opfer et al. 2012; Shtulman 2006; Spindler and Doherty 2009; Yates and Marek 2015). One problem is that many words used in science lessons, such as adaptation or fitness, also appear in other contexts or everyday language with slightly different meanings. This can confuse students and lead to misuse of scientific terminology (Rector et al. 2013; To et al. 2017). If instructors target these alternative conceptions and meanings to cause cognitive conflicts in students, the students are likely to experience conceptual change (e.g., Posner et al. 1982; Sinatra et al. 2008), that is, to replace or reorganize old conceptions with new (scientifically accurate) ones.

Current research has also highlighted learning difficulties with evolutionary concepts that are strongly related to underlying abstract concepts like randomness and probability, so-called threshold concepts (Mead and Scott 2010; Ross et al. 2010). Threshold concepts are described as conceptual gateways that, once passed, open up a new way of thinking; they are distinguished from "key" or "core" concepts in that they are more than mere building blocks toward understanding within a discipline (Meyer and Land 2003, 2006). Threshold concepts are proposed to have eight characteristics: transformative (occasioning a shift in perception and practice), probably irreversible (unlikely to be forgotten or unlearned), integrative (surfacing patterns and connections), often bounded within a discipline, troublesome (dealing with counter-intuitive or alien knowledge), reconstructive (reconfiguring learners' prior knowledge), discursive (extending language usage), and involving passage through liminal space (chaotic progress across conceptual terrains; Land 2011; Meyer and Land 2003, 2006; Taylor 2006).

Even though evolution is widely considered troublesome to learn and teach, evolution itself has not been suggested to be a threshold concept; rather, it consists of a web of interconnected threshold concepts such as temporal scale, spatial scale, probability, and randomness (Ross et al. 2010; Tibell and Harms 2017). Tibell and Harms (2017) developed a two-dimensional framework connecting evolutionary key concepts with these threshold concepts. They propose that a complete understanding of evolution through natural selection requires the development of knowledge concerning both evolutionary key concepts and threshold concepts, and the ability to navigate freely through this two-dimensional framework. Moreover, they claim that conceptual change theory can be connected to the notion of threshold concepts: understanding threshold concepts is a prerequisite for conceptual change concerning the understanding of particular evolutionary concepts, and hence for changing alternative conceptions into scientifically sophisticated ones (Tibell and Harms 2017).

In this study, we focused on the threshold concepts of randomness and probability, since research reveals that students particularly struggle with the importance and nature of randomness (Garvin-Doxas and Klymkowsky 2008; Robson and Burns 2011). The term randomness is often used in everyday language to convey that a phenomenon is purposeless as well as without order, predictability, or pattern, whereas scientists use the term to indicate unpredictability without implying purposelessness (Buiatti and Longo 2013; Mead and Scott 2010; Wagner 2012). In fact, the notion of randomness in evolution is rather specific, referring to events (e.g., mutations or genetic drift) that are independent of an organism's needs or of the directionality provided by the process of natural selection (Heams 2014; Mead and Scott 2010). Thus, mutations are called random because they are not directed toward an organism's adaptation, and it cannot be predicted precisely where and when a mutation will appear (Heams 2014). Although it is, of course, possible to predict the likelihood of a mutation occurring at a particular site in a specific sequence, this fits better into the concept of probability than of randomness. In fact, the term probability refers to the likelihood of a particular outcome in the long run (over multiple events), and is assigned a numerical value between zero and one (Feller 1968). The closer a probability value is to one, the more likely the outcome is. In evolution, natural selection itself can be described as a probabilistic process, if selection is defined in terms of individuals' probabilities of surviving and reproducing in a specific environment depending on their particular traits (Tibell and Harms 2017). Thus, evolution through natural selection depends on random genetic mutation leading to heritable variation on which the probabilistic process of selection can act (Andrews et al. 2012; Mix and Masel 2014). Therefore, a clear understanding of randomness and probability is essential for understanding evolution.
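This distinction can be made concrete with a minimal worked example using hypothetical numbers (not drawn from any of the cited studies): suppose the probability that a given site mutates in a single replication event is $\mu = 10^{-8}$. No single event can be predicted (randomness), but the likelihood of at least one mutation at that site across $n = 10^{6}$ replication events is well defined (probability):

$$P(\text{at least one mutation}) = 1 - (1 - \mu)^{n} \approx 1 - e^{-\mu n} \approx 0.01$$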

Computer simulations as tool to enhance understanding

Computer simulations can be effective tools for handling the intangible nature of scientific concepts such as mutations (Ainsworth and VanLabeke 2004; Plass et al. 2012). They also allow students to visualize processes occurring at spatial and temporal scales that are difficult or impossible to observe directly (Rutten et al. 2012). Simulations have several advantages over reading textbooks or attending lectures, because they provide opportunities to explore theoretical situations, interact with a simplified version of the focal process(es), and/or change the time-scales of events (van Berkum and de Jong 1991). However, research on simulation-based learning has revealed that learners may encounter difficulties during the learning process for two contrasting reasons (de Jong and van Joolingen 1998). One is that simulations can involve complex learning environments, which may overwhelm the learner due to the large amount of information that is conveyed and must be processed (Wouters and van Oostendorp 2013). Conversely, minimizing guidance (and thus reducing the amount of information) may reduce the effectiveness of simulation-based learning (Rutten et al. 2012). Therefore, instructional support may be needed to provide suitable learning environments and overcome students' learning difficulties (Kombartzky et al. 2010; Urhahne and Harms 2006). Several kinds of support may be provided in different phases of the learning process in efforts to enhance self-directed simulation-based learning (Zhang et al. 2004):

Interpretative support, given before the interaction, can provide scaffolding for learners to activate prior knowledge and generate appropriate hypotheses. One way to provide effective interpretative support is to offer accessible domain-specific background information (Reid et al. 2003). Research by Leutner (1993) and Lazonder et al. (2010) indicated that the timing of providing this background information is a critical aspect: students gained higher knowledge outcomes when the domain-specific background information was accessible both before and during the learning process. Providing worked examples can also have positive effects on learning outcomes (Spanjers et al. 2011; Yaman et al. 2008). Worked examples consist of a problem followed by a worked-out solution, normally presented to the learner in a step-by-step format (Renkl 2005). A study by Lee et al. (2004) found that students receiving worked examples scored higher on a common assessment, while students working with inquiry discovery scored lower.

Experimental support is provided during an interaction and can scaffold learners' process of scientific inquiry during simulation-based learning by helping them to design verifiable experiments, predict and observe the outcomes, and draw appropriate conclusions. Students often exhibit inefficient experimentation behaviors (e.g., varying too many variables at the same time; de Jong and van Joolingen 1998). Effective experimental support for knowledge acquisition may include gradual, cumulative introductions to handling a simulation and/or requests for learners to predict and describe the outcome (Urhahne and Harms 2006; Wang et al. 2017). Such experimental prompts are particularly effective for learners with low ability and inefficient discovery learning strategies (Chang 2017; Veenman and Elshout 1995).

Reflective support is provided after an interaction and may foster learners' integration of their discoveries. Such support scaffolds the integration of new information arising from learners' interaction with a simulation. It involves promoting reflective processes (sometimes also connected to metacognitive processes), which may be done through a reflective assignment tool or opportunities to discuss the results (de Jong and van Joolingen 1998; White and Frederiksen 1998). Indeed, studies by Eckhardt et al. (2013) and Zhang et al. (2004) concluded that prompting students to reflect upon and justify their experimental activities and outcomes raises their self-awareness and contributes to higher knowledge acquisition. Moreover, Chang and Linn (2013) showed that critiquing someone else's experiment can help students recognize poorly designed experiments so that they can create better experiments of their own, thereby enhancing knowledge acquisition.

Simulations to support students’ understanding of evolution and threshold concepts

Although the number of available online educational videos is increasing, they often lack explanations regarding underlying threshold concepts or, where these are mentioned, communicate them orally only (Bohlin et al. 2017). For evolution education, a few computer simulations are freely available, such as Evolve (Price and Vaughn 2010), Avida-ED (Pennock 2007, 2018), and the Evolution Readiness activities (Concord Consortium 2018). Research studies have indicated positive learning gains after using these simulations (Horwitz et al. 2013; Soderberg and Price 2003; Speth et al. 2009). Nevertheless, they were designed to focus on evolutionary (key) concepts without addressing particular underlying threshold concepts such as randomness and probability. For instance, the Evolution Readiness activities focus on the process of (natural) selection, variation within species (without referring to the origin of variation), and inheritance of various traits (Horwitz et al. 2013). The same applies to Evolve, which is designed to focus on the effects of selection, genetic drift, and migration in a population over time, without modeling mutations or their random nature (Soderberg and Price 2003). In contrast, Avida-ED includes random mutations occurring in the organisms' genomes, while students can also observe evolution in action (Speth et al. 2009). Still, the above-mentioned simulations do not explicitly address the underlying threshold concepts of randomness and probability. In addition, these simulations are rather time-consuming (e.g., it takes some time to handle the software properly). Avida-ED in particular seems to work well for lecturers at universities, where students can use the tool across several lab lessons, but this simulation software might be too complex and time-consuming to be used by teachers in ordinary school lessons. Therefore, there is a need for simulation software that is easy to handle and visualizes the notion of randomness in evolution.

EvoSketch

The simulation software

EvoSketch is interactive, web-based simulation software developed within the project; it is free of charge and available online in an English (EvoSketch English 2018) or a German version (used in this study; EvoSketch German 2018), and allows learners to explore random and probabilistic phenomena associated with the process of natural selection. The software (which can be used on various electronic devices, such as smartphones, tablets, laptops, and desktop computers) generates a line (representing a reproducing organism) that is replicated by the user for 20 generations.

Every generation consists of four replications of a parent line drawn with a mouse or finger, resulting in four offspring lines (Fig. 1). Since copying errors inevitably occur while drawing, each replication varies and drifts slightly to the right or left of the parent line. These shifts in the offspring lines represent the origin of variation, and hence random processes in evolution. After each generation has been completed by drawing four replications, one of the four offspring lines is selected (by the software) to continue the parent line, and thus represents the next reproducing "organism" in the simulation. The selected line is the one closest to a point (indicated by the red dot in Fig. 1) that indicates optimal fitness for the offspring in the surrounding environment. The organism represented by the selected line has the highest probability of surviving and reproducing, and there is selective pressure on the line to move towards the point (probabilistic processes). After 20 generations, the line will normally have shifted either to the right or to the left due to the combination of copying errors and selection.

Fig. 1

Screen displays of an EvoSketch simulation at the beginning (left) and during the 7th generation (right). The main box shows the offspring line (black) drawn (with a mouse or finger) from the parent line (grey). After the line is saved, it is visualized in one of the four offspring boxes (offspring 1–4). To the right of the main box, all offspring lines generated and saved so far are displayed. The red-framed boxes show the offspring selected as parent lines for successive generations. The red dot (only visible after pushing the "show point" button) serves as a selection factor. The offspring line that is nearest to the dot (indicating optimal fitness for the surrounding environment) is selected as the new parent line
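To make these mechanics concrete, the following R sketch abstracts the simulation loop described above (our illustrative abstraction, not the actual EvoSketch source code): each line is reduced to a single horizontal position, copying error is modeled as a small random shift, and the offspring closest to the optimum is selected as the next parent.

```r
set.seed(42)                       # make the illustrative run reproducible

optimum <- 5                       # position of the red dot (optimal fitness)
parent  <- 0                       # starting position of the parent line

for (generation in 1:20) {
  # four replications, each with a small, unbiased copying error
  offspring <- parent + rnorm(4, mean = 0, sd = 0.5)
  # selection: the offspring closest to the optimum becomes the new parent
  parent <- offspring[which.min(abs(offspring - optimum))]
}

parent                             # typically shifted towards the optimum
```

Because the copying errors are unbiased, each individual shift is as likely to go left as right (randomness); the consistent movement towards the optimum over 20 generations emerges solely from the selection step (probability).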

The idea behind the simulation

The idea for EvoSketch emerged from a video clip focusing on the visualization and importance of randomness for natural selection to occur (BBC and Open University 2011). Although this video clip was easy to understand, we wanted to create a hands-on activity (i.e., EvoSketch) for students to experience these changes on their own, since, according to the theory of embodied cognition (Gropengießer 2007; Lakoff 1987; Lakoff and Johnson 1980), repeated action within a specific environment (e.g., a simulation) helps to create an understanding of scientific concepts. We intended to provide users with an opportunity to realize on their own how even tiny mistakes (copying errors) can change the shape of a line across several generations. In addition, students have to draw four offspring lines per generation rather than a single line at a time. This incorporates the aspect of variation among offspring and serves as a basic framework for the process of natural selection.

Although realistic visualizations may facilitate the recognition of the visualized process in the real world (e.g., Höffler and Leutner 2007), they often also entail irrelevant details that distract learners from the relevant parts (Dwyer 1976). In contrast, nonrealistic or schematic visualizations may present the relevant aspects in a way that is easier for the learner to grasp (Scheiter et al. 2009). Moreover, decreasing the number of dispensable elements in the learning material may help to reduce extraneous cognitive load (Sweller 1994). Thus, we used a nonrealistic visualization approach for EvoSketch to focus on the notions of randomness and probability.

EvoSketch worksheets

In general, users are guided through an EvoSketch exercise by an accompanying worksheet (EvoSketch worksheet). This worksheet consists of two introductory texts, one explaining random processes (specifically, mutations) in evolution [English version: 132 words; German version: 128 words (used in this study)] and one explaining the probabilistic process of natural selection [English version: 107 words; German version: 98 words (used in this study)]. Each text is directly followed by a task asking learners to make predictions about the outcome of the simulation, run the simulation and/or observe the outcome, and explain the outcome (predict-observe-explain strategy; White and Gunstone 1992). The English version of the EvoSketch worksheet is available as Additional file 1.

Research aim

Our general expectations were that abstracting the processes of randomness and probability in the context of evolution (via EvoSketch) and actively working with EvoSketch would help students to learn these concepts. Further, additional instructional support may better facilitate learners' self-directed simulation-based learning. Since the EvoSketch software provides integrated experimental support (the EvoSketch worksheet tasks), we addressed the potential utility of additional interpretative and reflective instructional support in this study. To evaluate the effectiveness of EvoSketch for teaching and learning the roles and importance of randomness and probability in evolutionary contexts, we used knowledge test performance on three occasions, time-on-task, and learners' perceived cognitive load. We compared pre-, post-, and follow-up performance scores of students who learned with EvoSketch, with and without additional instructional support (i.e., interpretative or reflective), to those of students who engaged in text-based learning of the same topics.

Methods

Design and interventions

The main aim of this study was to assess the effectiveness of EvoSketch for fostering students' conceptual knowledge of randomness, probability, and evolution. An additional aim was to identify which type of instructional support (if any) most effectively promotes self-directed learning with EvoSketch. For these purposes, we used an experimental repeated measures design and assigned students to one of the following four kinds of self-directed learning interventions (i.e., with no additional support from an instructor): text-based, simulation-based, simulation-based with interpretative support, and simulation-based with reflective support. Students in all four groups received an overview of the topic of evolution by means of a short, standardized introductory text (cf. Neubrand et al. 2016) to reactivate prior knowledge. Students in each group then individually worked through the following worksheets and tasks:

Text-based intervention (hereafter, text)

Textbooks are still a central teaching resource in science education, and thus learning is often organized around text-based instruction (McDonald 2016). Learners in this intervention group worked with a worksheet (including the two introductory EvoSketch worksheet texts mentioned above) and a PowerPoint presentation on the roles of randomness (specifically, mutation) and probability (specifically, selection) in evolution. The presentation included only written text and pictures (i.e., no audio or video components). Afterwards, learners were asked to answer three questions regarding the information given in the presentation and two questions regarding evolution. English versions of these questions are available as Additional file 2.

Simulation-based intervention (hereafter, simulation)

Learners in this group were asked to follow the instructions of the EvoSketch worksheet (mentioned above in the section on the EvoSketch software). They started by reading the introductory text on randomness in evolution and worked through the first task. During this task, they also progressed through the EvoSketch simulations. They then read the second text on the role of probability in evolution and addressed the second task regarding selective pressure (indicated by the distance from the red point in their simulation). Learners in this group did not receive additional instructional support but had to work out on their own how the basic information on randomness and probability in evolution (i.e., the texts in the EvoSketch worksheet) and the EvoSketch simulations were connected to each other.

Simulation-based intervention with interpretative support (hereafter, sim-interpret)

This learning group was identical to the simulation intervention, except that learners were provided with interpretative support in the form of a worked example on the roles of randomness and probability in evolution before starting to work with the EvoSketch simulations. Clark et al. (2011) described a worked example as "a step-by-step demonstration of how to perform a task or how to solve a problem" (p. 190; see also Atkinson et al. 2000). Thus, worked examples can help novices (i.e., non-experts) to understand how a formulated problem can be solved by introducing the formulated problem, the relevant solution steps, and the final solution (Renkl 2005; see also Fig. 2). We used a German worked example created by Neubrand et al. (2016), with revised and supplementary sections added in an effort to increase the focus on randomness and probability and to establish helpful connections to the EvoSketch simulations. Concerning the threshold concepts (i.e., randomness and probability) and the EvoSketch simulations, our worked example starts by explaining the conditional factors of evolution through natural selection (i.e., origin of variation, individual variation, heredity, and differential reproduction and survival; see also Fig. 2), with explanations connected to the EvoSketch simulations (see Additional file 3 for an overview of two example pages). This was followed by a worked example on the evolution of the peppered moth. By reading the worked examples, learners thus received information on how to connect threshold concepts and evolutionary concepts, both in the context of the EvoSketch software and in a biological context, before beginning the EvoSketch simulations.

Fig. 2

The structure of a worked example (left) and the respective steps of the EvoSketch worked example used in the study (right; see also Additional file 3 for an overview of two example pages)

Simulation-based intervention with reflective support (hereafter, sim-reflect)

The last group of learners, the sim-reflect intervention group, also worked through the EvoSketch worksheet mentioned above (including the EvoSketch simulations). However, in contrast to the simulation and sim-interpret groups, learners received additional reflective support in the form of reflective questions after each task while working with the EvoSketch simulations (e.g., "Explain the role of randomness or random processes in the line's evolution."). These learners had to describe and interpret their own simulation outcomes with respect to the two threshold concepts in question.

Participants

The sample consisted of 14 classes from nine comprehensive schools (“Gemeinschaftsschulen”) in northern Germany. In total, 269 tenth grade students aged between 14 and 18 years (M = 15.6 years, SD = 0.6 years; 47.19% female) participated in the study. Students of each class were randomly assigned to one of the four intervention groups: text (n = 43), simulation (n = 70), sim-interpret (n = 79), and sim-reflect (n = 77). The study was conducted during regular science lessons between November 2016 and March 2017. All students were informed that participation was voluntary and that their results would not affect their final grades. Students had received no formal instruction on evolutionary theory before. Nevertheless, we assume that they had some fragmentary knowledge on topics related to aspects of evolutionary theory (e.g., genetics), although evolutionary theory is not specifically included in the German curriculum before the tenth grade (Secretariat of the Standing Conference of the Ministers of Education and Cultural Affairs of the Länder in the Federal Republic of Germany 2005).

Instruments

The instruments we used to study the effectiveness of the interventions and potentially influential variables are outlined below, while additional descriptions of the test instruments and item fit values are available as Additional file 4.

Randomness and probability test in the context of evolution (RaProEvo)

RaProEvo is a test instrument designed to measure students' conceptual knowledge of randomness and probability in evolutionary contexts (Fiedler et al. 2017). It comprises 21 items (16 multiple-choice, three open-response, and two matching items) that focus on five aspects in which randomness and probability play important roles: the origin of variation, accidental death, random phenomena, the process of natural selection, and the probability of events. Fiedler et al. (2017) reported that instrument validation was originally performed using expert rating, psychometric analyses of university students' responses using item response theory (Rasch modeling), and criterion-related validity measures, and that the test had satisfactory reliability. The items of the German version are scored dichotomously, and we used a reduced set of 19 items (excluding the two matching items). The internal consistency (reliability) measured by Kuder-Richardson 20 (KR-20) for the data presented in this study was moderate, ranging from 0.44 to 0.63.
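For readers unfamiliar with this reliability coefficient, KR-20 can be computed directly from a matrix of dichotomously scored responses using the standard formula KR-20 = [k/(k-1)] * (1 - sum(p_j * q_j) / var_total). The following R sketch uses simulated data for illustration; the matrix X is hypothetical and not part of our analyses.

```r
# KR-20 for dichotomous items: X is an n-by-k matrix of 0/1 item scores
kr20 <- function(X) {
  k  <- ncol(X)                 # number of items
  p  <- colMeans(X)             # proportion of correct answers per item
  q  <- 1 - p                   # proportion of incorrect answers per item
  vt <- var(rowSums(X))         # variance of students' total scores
  (k / (k - 1)) * (1 - sum(p * q) / vt)
}

# Illustration with random responses (269 students, 19 items, as in the
# reduced RaProEvo set); independent random answers yield a KR-20 near zero
set.seed(1)
X <- matrix(rbinom(269 * 19, size = 1, prob = 0.5), nrow = 269)
kr20(X)
```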

Conceptual inventory of natural selection (CINS)

The CINS is a diagnostic test designed to assess students' understanding of evolution through natural selection (Anderson et al. 2002). It consists of 20 multiple-choice questions that focus on common misconceptions pertaining to 10 key conceptual aspects of natural selection, variation, and speciation. The inventory is structured so that each of the 10 concepts is assessed once in items 1–10 (CINS-A) and once again in items 11–20 (CINS-B). The original test instrument was validated by independent content experts (i.e., face validity evidence), student interviews, and statistical analyses based on classical test theory, with satisfactory reliability (Anderson et al. 2002). A German translation of the CINS was prepared for a previous study with university students (Großschedl et al. 2014, 2018). The authors used the reverse translation method to generate a valid translation of the target instrument (e.g., Berry 1989; Su and Parham 2002). The results indicated that the translated test instrument generated reliable and valid inferences about university students' evolutionary knowledge. We used the translated CINS-A and CINS-B item sets in the pretests and posttests, respectively, to minimize the influence of the pretest on posttest scores and students' fatigue from reading the same items. All items are dichotomously scored, and KR-20 ranged from 0.23 to 0.40, suggesting that effects in this study may be somewhat misestimated due to lower-than-desired reliability.

General biological content knowledge test (GBCK)

The German GBCK test was designed to measure tenth grade students' content knowledge of biological topics included in curricula up to the tenth grade (such as genetics or plant and animal ecology), and consists of 19 dichotomously scored items (16 multiple-choice items, two matching items, and one open-response item; Neubrand et al. 2016). We used the GBCK to control for differences in students' prior knowledge of the subject and to test whether this knowledge was related to the learning of randomness, probability, and evolution. The results obtained with our students indicate that the test has an internal consistency (KR-20) of 0.36, lower than the level (0.51) reported by Neubrand et al. (2016) for other samples of tenth grade German students.

Students’ general language proficiency (C-test)

C-tests are designed to measure students’ general language proficiency (Eckes and Grotjahn 2006), which may affect their performance in other diagnostic test instruments (Härtig et al. 2015). Therefore, we assessed our students’ general German language proficiency using C-tests based on two German texts, each including 20 words with missing letters (Wockenfuß and Raatz 2006). Since learners’ ability to read items or texts and produce answers is highly relevant in a study such as this, the responses were screened for both orthographical and grammatical errors. The students’ answers were dichotomously scored, and KR-20 reliability of the test was found to be 0.78.

Perceived cognitive load (PCL)

Cognitive load can affect learning (Sweller 1994), but it can be reduced by providing instructional support for learning with simulations (Leutner 1993). Therefore, students' PCL during the intervention was assessed using an adapted 5-point rating scale instrument (Urhahne 2002) consisting of eight items that allow differentiation of participants' PCL (Cronbach's alpha = 0.87).

Self-reported test-taking effort (effort)

Scores obtained by takers of any test are likely to depend on the effort they expend while taking it (Wise and Kong 2005). Thus, students' self-reported test-taking effort was appraised using a single 10-point scale item (Organization for Economic Co-operation and Development [OECD] 2010) after they completed the pretests and again after the posttests.

Procedure

Prior to the intervention (day 1), every student took pretests consisting of the targeted randomness, probability, and evolutionary knowledge tests (RaProEvo and CINS-A) and instruments designed to capture information on the control variables: general biological content knowledge (GBCK), language proficiency (C-test), self-reported test-taking effort (effort), and demographic data (i.e., age, sex, and biology grade; see also Fig. 3). Roughly 2 weeks later (day 2: intervention day), every student of each intervention group worked alone through their own worksheet on a laptop. The laptops were all of the same type and were provided by the Leibniz Institute for Science and Mathematics Education (IPN) at Kiel University. Students of all intervention groups had 45 min to complete their worksheet tasks (intervention). On average, learners spent 30 min (SD = 7 min; range 13–52 min) completing their tasks. Immediately after completing their worksheet, students took posttests consisting of the knowledge tests (RaProEvo and CINS-B) and items asking about their effort and PCL during the learning process. Roughly eight school weeks later (day 3), students took follow-up tests consisting of the targeted knowledge tests (RaProEvo and CINS-B). The study was conducted by the first author, with support from a university student who set up and removed the laptops on the second day. All instruments and materials applied in this study were in German.

Fig. 3

Overview of the study procedure and the test instruments used

Statistical analysis

The unequal sample sizes of the different groups might create problems in terms of homogeneity of variance across groups. Therefore, we performed Levene's test to check whether our groups had roughly the same variance on the investigated variables. Depending on the test result, we compared groups based on either Hochberg's GT2 post hoc tests (if Levene's test p > 0.05) or Games-Howell post hoc tests (if Levene's test p < 0.05; Field 2018).
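As a sketch of this decision rule (using simulated data and hypothetical column names, not our actual dataset), the check can be expressed in R as follows:

```r
library(car)                        # provides leveneTest()

# simulated stand-in for the real data: one row per student
students <- data.frame(
  group        = factor(rep(c("text", "sim", "sim-interpret", "sim-reflect"),
                            times = c(43, 70, 79, 77))),
  time_on_task = rnorm(269, mean = 30, sd = 7)
)

lev <- leveneTest(time_on_task ~ group, data = students)
p   <- lev[["Pr(>F)"]][1]           # p-value of Levene's test

if (p > 0.05) {
  message("Variances equivalent: use Hochberg's GT2 post hoc tests")
} else {
  message("Variances inequivalent: use Games-Howell post hoc tests")
}
```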

We analyzed the CINS-B and RaProEvo responses with generalized linear mixed models (GLMMs) featuring a logistic link function, crossed random effects for participants and items, and an additional random effect for class (Baayen et al. 2008). The random effects for participants and items were included to account for differences in participants' general ability and items' general difficulty, respectively. The random effect for class controlled for possible discrepancies in average ability between classes. To uncover systematic effects of the experimental conditions on the development of students' knowledge, dummy-coded variables for intervention, assessment, and their interaction were incorporated as fixed effects in the models. The text group served as the reference category for intervention, while posttest and pretest, respectively, served as the reference categories for the CINS-B and RaProEvo assessments. This approach ensured simultaneous generalization of significant effects to new samples of both participants and items (Raaijmakers et al. 1999). As a measure of effect size (i.e., an expression of how much the respective method is better than the alternative one; Furukawa and Leucht 2011), we computed Cohen's d. Cohen's d expresses the magnitude of an effect relative to the standard deviation (Cohen 1988). For instance, an effect of d = 0.25 would mean that the difference is one quarter of a standard deviation. The greater the value of Cohen's d, the greater the effect. Cohen (1988) also suggested a rule of thumb for interpreting the results, with a small effect starting at 0.2, a medium effect at 0.5, and a large effect at 0.8. We used the lme4 package (Bates et al. 2011) for the statistical computing environment R 3.0.0 (R Core Team 2013) for all these statistical analyses.
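In lme4 syntax, the model described above takes roughly the following form (a sketch assuming a long-format data frame with hypothetical column names; the actual analysis scripts are not reproduced in this article):

```r
library(lme4)

# responses: one row per participant-item-assessment combination, with
#   correct      - dichotomous item score (0/1)
#   intervention - factor, reference category "text"
#   assessment   - factor; reference category posttest (CINS-B) or
#                  pretest (RaProEvo)
#   participant, item, class - grouping identifiers
model <- glmer(
  correct ~ intervention * assessment +  # fixed effects incl. interaction
    (1 | participant) +                  # crossed random effect: ability
    (1 | item) +                         # crossed random effect: difficulty
    (1 | class),                         # random effect for class
  data   = responses,
  family = binomial(link = "logit")      # logistic link function
)
summary(model)                           # fixed-effect estimates (b, SE, p)
```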

Results

Baseline equivalence

First, we conducted one-way analyses of variance (ANOVAs) to detect possible significant differences between intervention groups in pretest performance (i.e., CINS-A and RaProEvo scores) or the control variables (demographic variables as well as C-test and GBCK scores). Values of these variables for each of the groups are listed in Table 1. Levene's test of each variable showed that the four groups had statistically equivalent variances (Fs < 1.96, ps > 0.121), and ANOVA results indicated that the groups did not significantly differ in any of the relevant variables (Welch's Fs < 2.66, ps > 0.051). Thus, the random assignment of learners to the four intervention groups caused no apparent bias in terms of any of these variables.

Table 1 Control variables and pretest performance scores (means with standard deviations; minimum and maximum in parentheses)

Intervention effects on CINS-B and RaProEvo scores

CINS-B

As already stated, students' responses to the CINS-B items were analyzed with a generalized linear mixed model featuring a logistic link function, crossed random effects for participants and items, and a random effect for class. Dummy-coded variables for intervention (with text as the reference category) and assessment (with posttest as the reference category) were included as fixed effects. Moreover, a fixed effect for students' CINS-A pretest performance was included as a covariate. A main effect for CINS-A (b = 0.12, SE = 0.02, p < 0.001, d = 0.06) was detected, but no other significant fixed effects were found (Table 2). Thus, no significant differences were detected, at posttest or follow-up, between intervention groups in understanding of evolution through natural selection. The inclusion of the GBCK results, C-test scores, and time-on-task data as further covariates did not alter this pattern of results.

Table 2 Intervention effects on CINS-B and RaProEvo scores

RaProEvo

Students' RaProEvo performance was explored with a similar model, but with pretest as the reference category for assessment. No significant main effects of intervention were detected, indicating that there were no substantial differences between intervention groups in conceptual knowledge of randomness and probability in evolutionary contexts at the outset of the study (Table 2). Similarly, there was no significant general improvement in students' performance across assessments. However, a significant interaction revealed that students in the simulation group outperformed students in the text group at follow-up, b = 0.34, SE = 0.15, p = 0.024, d = 0.19. Incorporation of the GBCK results, C-test scores, and time-on-task data as covariates did not change this pattern of results.

Time-on-task

Differences in the time learners spent on tasks in their interventions may influence learners' knowledge acquisition. Levene's test indicated statistically inequivalent variances of the four groups regarding time-on-task, F(3, 265) = 3.93, p = 0.009. In addition, one-way ANOVA indicated a significant effect of intervention on time-on-task: Welch's F(3, 132.69) = 26.74, p < 0.001. Games-Howell post hoc tests revealed that learners in the text group (M = 30 min, SD = 6 min, n = 43) worked significantly longer than learners in the simulation group (M = 26 min, SD = 5 min, n = 70, p = 0.006, d = 0.74). Furthermore, learners in the simulation group spent significantly less time with the material than those in the sim-interpret (M = 33 min, SD = 7 min, n = 79, p < 0.001, d = 1.14) and sim-reflect (M = 34 min, SD = 7 min, n = 77, p < 0.001, d = 1.31) groups.

Perceived cognitive load (PCL)

Since cognitive load can influence learners’ knowledge acquisition (Sweller 1994), the students’ PCL was measured directly after each intervention. Levene’s test showed that the four groups had statistically equivalent variances on PCL, F(3, 265) = 0.66, p = 0.576, while one-way ANOVA indicated significant differences between intervention groups: Welch’s F(3, 134.55) = 5.40, p = 0.002. Hochberg’s GT2 post hoc tests showed that average PCL was higher in the text intervention group (M = 1.31, SD = 0.53, n = 43) than in the simulation (M = 0.92, SD = 0.63, n = 70, p = 0.004, d = 0.66) and sim-reflect intervention (M = 0.95, SD = 0.58, n = 77, p = 0.008, d = 0.64) groups. However, no significant differences in this respect between the other pairs of interventions were detected (in all remaining cases, p > 0.05; PCL sim-interpret: M = 1.08, SD = 0.60, n = 79).

Self-reported test-taking effort

Test performance may also depend on test-taking effort, as low effort is likely to result in test scores that underrepresent learners' true level of knowledge (Wise and Kong 2005). We applied repeated-measures ANOVA to investigate differences between intervention groups in self-reported test-taking effort. Levene's test showed that the four groups had statistically equivalent variances on pretest effort, F(3, 229) = 1.75, p = 0.158, but inequivalent variances on posttest effort, F(3, 229) = 4.08, p = 0.008. The results of the repeated-measures ANOVA showed a significant main effect of self-reported test-taking effort: F(1, 229) = 30.86, p < 0.001. Repeated contrasts also revealed that learners reported expending significantly more effort in the pretests (M = 7.03, SD = 1.84) than in the posttests (M = 6.19, SD = 2.07, n = 233; p < 0.001, d = 0.40). Nevertheless, no significant main effect of group and no interaction effect between group and effort were detected: F(3, 229) = 1.49, p = 0.218, and F(3, 229) = 0.46, p = 0.710, respectively.

In addition, Spearman’s correlation coefficients were calculated to assess relationships between effort (pre- and posttest) and test performance (RaProEvo and CINS-A/B scores). A significant positive association was detected between pretest effort and RaProEvo pretest performance (rs = 0.14, p = 0.031, n = 236). Significant positive relationships were also found between posttest effort and both RaProEvo and CINS-B posttest scores (rs = 0.18, p = 0.007, n = 240, and rs = 0.14, p = 0.028, n = 240, respectively).

Discussion

The main aim of this study was to assess the effectiveness of EvoSketch simulations for improving students' knowledge about randomness and probability in evolutionary contexts, as well as their evolutionary knowledge. Since instructional support reportedly improves the effectiveness of simulations, and the EvoSketch worksheet provides experimental support, we also examined and compared the effects of additional interpretative and reflective support (a worked example and reflective questions, respectively) on self-directed learning with EvoSketch simulations.

We found that overall mean posttest scores were lower than pretest scores, while mean follow-up test scores were higher. Concerning RaProEvo learning gains from the pretest to the follow-up test (not the posttest), our findings indicate that learners in the simulation intervention group (but not those in the simulation groups with additional self-directed support) acquired more knowledge than text-based learners. However, this positive effect was very small (Cohen's d of 0.19), which means that the difference between the simulation group and the text group was only about one-fifth of a standard deviation. Expressed as a number needed to treat (Furukawa and Leucht 2011), this would mean that if 100 participants worked with EvoSketch (without additional self-directed support), only six more of them would achieve a greater RaProEvo score compared to students who received only the text material. Concerning the CINS scores, we found no differences between the intervention groups. In contrast, we detected significant differences between intervention groups in both time spent on the material and PCL. Learners in the simulation groups with additional support (sim-interpret and sim-reflect) worked significantly longer on their tasks than learners in the simulation group (without additional support); still, these groups did not differ in PCL. Students in the text group spent an intermediate amount of time on their worksheets but reported a significantly higher PCL than students in the simulation and sim-reflect groups.

The capacity of humans' working memory is limited, and learning is likely to be hindered when tasks impose too much cognitive load (Chandler and Sweller 1991; Paas and Sweller 2014; Sweller 1988). Thus, too much (new) information that is not aligned with the learner's prior knowledge, as well as inadequately designed learning material, can result in a high load on working memory, which is detrimental to the learning process (de Jong 2010; Kirschner et al. 2006). The high PCL of the text group could have resulted from aspects of the intervention material. These students were not only supported with the two introductory texts of the EvoSketch worksheet, but also received a text in the form of a PowerPoint presentation, on which they had to answer three questions regarding topics covered in the text and two questions regarding evolution in a broader sense. These students had to understand the concepts of randomness and probability in evolution based on the information given in the text alone. In other words, they had no supporting simulation that visualized these concepts in connection with the evolutionary concept of variation or the relevance of random processes (i.e., mutations) for the probabilistic process of natural selection. Consequently, text-based learners had to build up a simulation of these processes in their minds. This may have caused them to perceive a higher cognitive load than students in the other intervention groups.

Based on the performance tests, our participating secondary school students had, on average, low (CINS) to medium (RaProEvo) test scores, which can be interpreted as low to medium prior knowledge of the focal topics. This is potentially problematic, as learners may be overwhelmed by the large amounts of abstract information conveyed in simulations (Rutten et al. 2012; Wouters and van Oostendorp 2013). The slight improvements in delayed knowledge acquisition in the simulation groups, relative to the text-based learners, may indicate that EvoSketch is probably not too abstract (nonrealistic) to foster learners' knowledge about randomness and probability, but it does not seem to foster broad evolutionary knowledge in just one school lesson.

The limited duration of the learning session (the intervention time was roughly 45 min) may have affected several of the performance results, particularly the CINS scores. A few studies indicate that learning evolutionary concepts in a very short time (i.e., one or two lessons or hours) can result in knowledge gains (e.g., Beardsley et al. 2012; Bohlin 2017; Lee et al. 2017; Yamanoi and Iwasaki 2015). However, evolution education research also shows that the theory of evolution presents severe problems to learners, which have not been effectively solved by the teaching strategies applied to date (e.g., Kampourakis and Zogza 2008; Rosengren et al. 2012). Introducing abstract, counter-intuitive concepts (i.e., randomness) on top of these problems (particularly in a brief intervention) may partly explain the lack of learning gains directly after the intervention and the generally weak between-intervention differences in students' learning.

Moreover, additional instructional support in either the interpretative or the reflective form did not lead to improvements in the performance (i.e., RaProEvo and CINS scores) of simulation-based learners relative to text group learners. One explanation might be that students perceived the additional material as similar to normal textbook work (i.e., reading the worked example or answering the reflective questions on paper), which might have lowered the effect of self-directed learning with a simulation. The results might have differed if the additional support had been integrated in the form of computer-based exercises.

Another factor in our lack of learning gains may be traced back to the expertise reversal effect (Kalyuga et al. 2003). This effect explains why some instructional support may be highly effective for learners with low knowledge while losing its effectiveness, and even resulting in negative learning consequences, for learners with high knowledge, and vice versa (e.g., Kalyuga et al. 2003; Scott and Schwartz 2007). For instance, the additional instructional support (e.g., the worked example) may have been trivial for students with high prior knowledge (high pretest scores; see also the maximum values in Table 1), who already understood how threshold concepts and evolutionary concepts are connected in the EvoSketch simulations, and might even have caused negative learning results for these students.

Finally, the large amount of additional information provided in these interventions could have overwhelmed the students, deterred learners with low interest, and, in turn, reduced their motivation (Amabile et al. 1994; Pintrich and Schrauben 1992). Participants did not receive any credit for their test performance, and their results did not influence their final grades. Thus, their inherent learning motivation was likely correlated with their motivation to address the large amount of material (e.g., the worked example, large numbers of test items), thereby introducing a substantial random behavioral response factor into the posttest results (e.g., Meijer 2003). Accordingly, correlation analyses indicated a significant positive correlation between self-reported test-taking effort and posttest scores. Moreover, students reported significantly lower effort in the posttests than in the pretests, but no differences were detected among the groups, which may explain the lack of learning gains directly after the intervention. Since we did not ask the students a third time about their self-reported test-taking effort, we cannot clarify the connection between test performance and effort in the follow-up testing.

Limitations

This study is the first experimental approach to (1) examine the relationships between threshold concepts (i.e., randomness and probability) and evolution knowledge, and (2) foster the understanding of these concepts by use of an abstract visualization (i.e., EvoSketch). Visualizing random and probabilistic processes through EvoSketch seems to have a very small positive effect on students' conceptual knowledge of randomness and probability in evolutionary contexts, although it is unclear whether this is due to greater understanding or rather an aspect of variability in the sample.

Nevertheless, our statistical analyses are limited by the lower-than-desired reliability of the knowledge test instruments (RaProEvo, CINS, and GBCK). The GBCK test's reliability was not expected to be high because it covers a large range of biological topics (Neubrand et al. 2016), but its internal consistency was unsatisfactorily low. The internal consistency of the CINS instrument was similarly low, possibly because the tenth grade students had not received formal instruction on evolutionary theory before the intervention and may have been overstrained by the complexity of the presented items. In contrast, the internal consistency of the RaProEvo instrument was higher, but still not satisfactory. In addition, some of the test instruments (i.e., RaProEvo and CINS) were originally developed and validated to measure post-secondary students' knowledge of the respective topics. However, at least the RaProEvo seems applicable for use with secondary school students, based on the range of item difficulty (see Additional file 4).

Another factor that may have affected the results is the limited duration of the learning session. We worked with tenth grade students in several participating schools, and the respective class teachers could decide on their own whether they wanted their class (or classes) to participate. Since the research was performed during regular school lessons (depending on the particular school, a lesson lasted between 45 and 60 min), we were unable to extend the research beyond the respective three days (i.e., five lessons, including the time for the tests). Consequently, the timeframe for the intervention was highly restricted. Nevertheless, studies have indicated that even short learning periods can improve students' knowledge acquisition (e.g., Bohlin 2017; Eckhardt et al. 2013; Yamanoi and Iwasaki 2015), so the timeframe should have been adequate for investigating the effectiveness of the EvoSketch simulations and the respective self-directed instructional support.

Finally, in this study we focused only on the effects of self-directed learning and did not incorporate teacher support. Teachers can powerfully influence students' learning (e.g., Hattie 2009), and students' performance is also likely to be influenced by teachers' professional knowledge (e.g., Mahler et al. 2017; Sadler et al. 2013). Our first step was to examine whether EvoSketch can be effective on its own (without any teacher support), but including teacher support in further studies (e.g., in the form of class discussions or one-on-one support) may be helpful for students' understanding of abstract threshold concepts, particularly directly after working with EvoSketch.

Implications and future research

Our findings may be useful for further research on how to visualize randomness and probability in evolution and for implementing EvoSketch in school sessions. Adequate knowledge of evolutionary concepts, and particularly related abstract concepts such as randomness and probability, is essential for students to critically address numerous issues associated with their environment and everyday life.

Thus, when using EvoSketch in the classroom, we recommend increasing the timeframe and spreading the intervention over several days or weeks to foster students' understanding of randomness and probability in the context of evolution. Moreover, working with EvoSketch on a variety of real evolutionary examples (e.g., resistance in bacteria, peppered moth evolution) may help novices to realize that evolution in different organisms involves the same processes and is based on the same evolutionary principles. Studies by Nehm and Ridgway (2011) and Kampourakis and Zogza (2009) indicate that novices' explanations are often based on concrete surface features (e.g., the running speed of cheetahs) and draw on multiple explanatory models, while experts' explanations are based directly on the main domain principles (e.g., natural selection). Therefore, using an underlying framework (e.g., EvoSketch) and explaining different real case studies based on it might help learners to develop a coherent understanding of the respective concepts across case studies.

Additionally, deep learning is often more strongly supported by small group learning than by individual learning (Dori and Belcher 2005; Springer et al. 1999). EvoSketch could be used in group settings, with each individual initially working on their own and subsequently discussing observations in the group. Such discussions could also be extended to class discussions with the teacher. Learners with little prior knowledge could receive additional instructional support through worked examples or reflective prompts.

This study was only a first step in examining how the understanding of abstract concepts may be fostered by the use of visualizations. In future studies, we intend to examine whether learners develop a better understanding when (1) using an abstracted example rather than a real example, and (2) receiving additional teacher support. Moreover, we also want to gather more qualitative data on how students actually use the EvoSketch simulations and how this shapes their experience, using eye-tracking methods, think-aloud protocols, and subsequent interviews. We hope to advance research in evolution education by focusing on the understanding of underlying abstract concepts such as randomness and probability.

Conclusions

We developed the simulation software EvoSketch to allow learners to explore, in a nonrealistic way, random and probabilistic phenomena associated with the process of natural selection. Although the simulation group showed higher knowledge gains in the follow-up test than the text comparison group, the effect size was very small. Moreover, the additional self-directed learning supports (i.e., the worked example and the reflective questions) did not seem to improve students' knowledge, and there was no immediate learning gain directly after the interventions. Our suggestion is that if students were able to work with EvoSketch for several rounds (i.e., going through the 20 generations more than once) and to play with the (new) knowledge they have gained (e.g., drawing deliberately inaccurate lines while the fitness value influences the selection process), they would probably gain a more intuitive understanding of the random and probabilistic processes. Additionally, incorporating teacher support may increase students' learning.