Background

The importance of evolution education

Evolution is the unifying principle of biology (Armstrong 1929, p. 135; Dobzhansky 1973, p. 125; Mayr 1982), and, more broadly, “an essential concept for anyone who considers science to be the best way to understand the natural world” (Fishman 2008, p. 1586). Moreover, evolutionary principles are routinely and effectively leveraged for practical applications in medicine, public health, agriculture, conservation biology, natural resource management, and environmental science (Catley and Novick 2009, p. 313; Hendry et al. 2011; Novick and Catley 2012). A lack of evolution understanding and acceptance prevents informed decision making regarding biological issues that may have personal ramifications (Nadelson and Hardy 2015). As such, competence in evolutionary biology is universally recommended as a core outcome for students of biology (Marocco 2000; NRC BIO2010 2010; AAAS Vision and Change 2011; Quinn et al. 2011; NRC 2012, 2013), and informed members of our society (AAAS 1989, 2001). Emphasizing the importance of evolution understanding for all members of society, Smith (2010) states that “omitting evolution from basic instruction of our citizenry would constitute the equivalent of educational malpractice” (p. 544). Thus, an understanding of evolution at all levels of the curriculum is foundational to the discipline of biology.

Despite the overwhelming agreement among scientists regarding the centrality of evolutionary concepts to all areas of biology (Pew Research Center 2015), many students of biology have a limited understanding of evolution, as demonstrated by difficulty in (1) correctly identifying the patterns, processes, and outcomes of evolution (Mayr 1982; Clough and Driver 1986; Good et al. 1992; Scharmann and Harris 1992; Cummins et al. 1994; Anderson et al. 2002; Smith 2010), and (2) correctly interpreting phylogenetic relationships represented graphically (Baum et al. 2005; Meir et al. 2007; Baum and Offner 2008; Naegle 2009). Students and non-scientists also frequently hold the misconception that evolutionary processes are direct rather than emergent (Chi 2005, p. 174) and view evolution as a static entity rather than a dynamic—and generally lengthy—process (Smith 2010, p. 543). A concurrent lack of familiarity with the timescale of evolution (i.e. “deep time”) likely exacerbates this disconnect (Metzger 2011). These principles and skills are part of a broader set of complex concepts that are widely considered difficult to teach, and even more difficult to teach well (Anderson 2007; Gregg et al. 2003). Furthermore, many science teachers—particularly high school science teachers—often lack the disciplinary knowledge and confidence to teach evolution effectively (Bishop and Anderson 1990; Glaze and Goldston 2015).

Acceptance of evolution

In addition to widespread gaps in understanding of evolutionary biology, low rates of acceptance of evolution have been reported in the general population (Miller et al. 2006; Gallup 2016), in pre-service educators (Romine et al. 2016), high school biology educators (Moore and Kraemer 2005; Moore and Cotner 2009; Glaze and Goldston 2015), university professors (Romine et al. 2016), and in various student populations (Rice et al. 2011; Romine et al. 2016). Although belief and acceptance may be closely related, recent investigators of evolution acceptance have separated these constructs, aligning acceptance more closely with “believing that” than with “believing in” (Smith et al. 2016, pp. 1291–1292): acceptance “…is more voluntary than belief, and involves a commitment to use what is accepted; belief is less voluntary, and need not be used as a basis for inference or action” (Smith et al. 2016, p. 1292). For example, a biology student may interpret phylogenetic data from an experiment using an evolutionary framework (belief that) while simultaneously holding religious or cultural views (belief in) that are kept distinct from the scientific process. Thus, it may be useful to investigate the extent to which students accept evolution, as distinct from the extent to which they believe in evolution.

Several specific variables have been identified as having significant associations with acceptance of evolution, although with somewhat varying consistency in the literature. Perhaps the two variables most commonly identified as having a significant association with evolution acceptance are religiosity (Mazur 2004; Nehm and Schonfeld 2007; Evans 2008; Moore et al. 2011; Heddy and Nadelson 2013; Yousuf et al. 2011; Barone et al. 2014; Carter and Wiles 2014; Rissler et al. 2014) and performance on biology or evolution knowledge assessments (Nadelson and Southerland 2010; Yousuf et al. 2011; Walter et al. 2013; Carter et al. 2015; Mead et al. 2017).

Other variables that have been identified as having a significant association with evolution acceptance include age (Evans 2000), gender, academic standing, college major, prior study in biology and/or philosophy (Ingram and Nelson 2006; Rutledge and Mitchell 2002), trust in science and scientists (Nadelson and Hardy 2015), attitudes toward science and technology, attitudes toward life (Miller et al. 2006), and high school biology experience (Moore and Cotner 2009). Studies investigating populations in Minnesota specifically have reported that approximately 25–30% of high school biology teachers believe that creationism has a valid scientific foundation (Moore and Kraemer 2005; Moore and Cotner 2009); 63% of high school biology teachers teach evolution and not creationism, while an additional 20% teach both evolution and creationism (Moore and Cotner 2009, p. 98). Moore and Cotner (2009) found that the nature of the biology education students experience in high school significantly impacts their later attitudes toward evolution in college: students taught evolutionary theory in high school exhibit a significantly higher degree of acceptance of evolution as compared to students who were taught both evolutionary theory and creationism, or only creationism (Moore and Cotner 2009; Rissler et al. 2014). Thus, the incorporation of creationism in high school biology instruction significantly increases the likelihood that students accept creationism and reject evolution when they arrive at college. Interestingly, students who were taught neither evolutionary theory nor creationism are more likely to accept the scientific validity of evolutionary theory and related concepts upon entering college as compared to peers whose high school biology classes included creationism (Moore and Cotner 2009, p. 97; Rissler et al. 2014), leading to the recommendation that “omission of evolution from high school biology courses may be preferable to a mixed approach that validates nonscientific explanations of diversity” (Moore and Cotner 2009, p. 99).

A recent multifactorial analysis (Dunk et al. 2017) reported that the strongest predictor of evolution acceptance (as measured by the MATE instrument; see “Measuring evolution acceptance” below) was students’ responses to a validated measure of understanding of the nature of science (“Understanding of Science”; Johnson and Peeples 1987). Additional predictive variables included religiosity (measured using three items from the “Evolutionary Attitudes and Literacy Survey-Short Form” (EALS-SF); Short and Hawley 2012), openness to experience (measured by the “Big Five Inventory”; John et al. 2008), religious denomination, number of biology courses previously taken, and knowledge of evolutionary biology terms (“Familiarity with Evolutionary Terms”; Barone et al. 2014). Together, these variables accounted for nearly a third of the variation in the measurement of acceptance of evolution, indicating that other variables not identified in the model contribute substantially to the measure of acceptance in that population. Thus, a wide variety of factors, likely including some as yet unidentified, contribute to evolution acceptance.

Measuring evolution acceptance

A number of instruments for measuring acceptance of evolution have been developed, including the Measure of Acceptance of the Theory of Evolution (MATE) (Rutledge and Warden 1999, 2000; Rutledge and Sadler 2007), the Inventory of Student Evolution Acceptance (I-SEA) (Nadelson and Southerland 2012), the Evolutionary Attitudes and Literacy Survey (EALS) (Short and Hawley 2012), and the Generalized Acceptance of EvolutioN Evaluation (GAENE) (Smith et al. 2016).

The I-SEA is a 24-item Likert-type scale questionnaire designed to capture students’ potentially differential acceptance of microevolution, macroevolution, and human evolution (Nadelson and Southerland 2012, p. 1657), with items evenly distributed across those three subcategories. Confirmatory factor analysis supports the contention that the I-SEA has the potential to consistently and reliably measure evolution acceptance overall and differentially across the three constructs (Nadelson and Southerland 2012). However, the instrument may be of more limited utility in differentiating acceptance of the microevolution and macroevolution constructs in populations with lower levels of evolution understanding, in which microevolution (i.e. variation within a species) and macroevolution (i.e. speciation) are likely to be conflated (Nadelson and Southerland 2012, pp. 1657, 1659).

The EALS was initially developed as a 104-item instrument measuring a wide array of factors related to acceptance of evolution, including political ideology, moral objections to evolution, religious identity and activity, distrust of the scientific enterprise, exposure to evolutionary theory, young earth creationist beliefs, attitudes toward life, intelligent design fallacies, scientific, genetic, and evolutionary literacy, relevance of evolutionary theory, social objections, and demographics (Hawley et al. 2011). Recognizing that 104 items may be cumbersome for researchers and educators to implement, Short and Hawley (2012) developed a short-form version (EALS-SF) consisting of 62 items that maintains the original instrument’s structure and validity.

The MATE is a 20-item Likert-type scale instrument designed to measure acceptance of fundamental evolutionary concepts (Rutledge and Sadler 2007). Although the authors of the MATE separate the 20 items into six evolution concepts, the MATE has generally been treated as a unidimensional measure of evolution acceptance. Reported reliability for the MATE is high (c.f. Romine et al. 2016, Table 1), and Rutledge and Sadler (2007) report high test–retest consistency. However, the instrument has been criticized on several fronts, including lacking a clear definition of “acceptance,” potential conflation of evolution acceptance with knowledge and/or religious beliefs, inadequate construct validation, and unresolved dimensionality (Wagler and Wagler 2013; Romine et al. 2016, p. 2; Smith et al. 2016, p. 1293). Because the MATE has been widely used, we selected it for our study to allow comparison of outcomes in our population with those of other populations investigated using the MATE.

Table 1 Fit and reliability for confirmatory factor analysis models

The GAENE (Smith et al. 2016) is a recently published instrument that has not yet been widely used. However, adoption of this rigorously developed instrument may yield an improved measure of evolution acceptance that does not conflate evolution understanding and evolution acceptance (Smith et al. 2016). GAENE Version 1.0 consisted of 16 Likert-type items; extensive psychometric testing and refinement resulted in GAENE Version 2.0, a 14-item instrument, and GAENE Version 2.1, a psychometrically superior 13-item instrument recommended for use in most settings. Smith et al. (2016) include a comparison of the characteristics of development for the EALS, I-SEA, MATE, and GAENE evolution acceptance instruments (p. 1290).

Still other instruments seek to measure evolution understanding, such as the Conceptual Inventory of Natural Selection (CINS), a 20-item multiple choice instrument targeting understanding of natural selection (Anderson et al. 2002), and the Measure of Understanding of Macroevolution (MUM), a 27-item instrument targeting understanding of macroevolution, with 26 multiple choice items and one free-response item (Nadelson and Southerland 2010).

Study objectives

In this study, we sought to investigate students’ level of acceptance of the theory of evolution, with a null hypothesis that there would be no change in students’ level of acceptance from the beginning of the term to the end of the term. Topics in evolutionary biology were the focus of instruction both early and late in the term, representing a “bookend” approach that reinforced evolution understanding as a coherent and unifying lens through which to view all biological knowledge. Early in the term, evolutionary biology topics included an investigation of the history of life on earth, with an emphasis on developing students’ sense of deep time (Metzger 2011) and an understanding of the evolutionary relatedness of all life on earth, including familiarity with visual representations of evolutionary relationships and interpretation of evolutionary relationships presented in phylogenetic trees. Later in the term, evolutionary biology topics included patterns and processes of evolution, incorporating concepts learned from the population genetics and molecular genetics modules earlier in the course.

Since evolution acceptance had not previously been measured at our institution, this study establishes a baseline against which future curricular interventions can be compared. Students in our program take a one-semester foundational biology course with lab in the context of a Health Sciences undergraduate degree program, whereas many of the populations in which evolution acceptance has been studied are biology majors in a two-semester introductory biology sequence, or are “non-majors” students. It was therefore of interest to investigate our students in comparison to students of other major designations and to assess whether our population’s level of acceptance aligned more closely with biology majors or non-majors at other institutions. A review of the published literature shows that some populations at other institutions experience little or no gain in measures of evolution acceptance (Romine et al. 2016, p. 3), while others demonstrate significant gains pre- to post-instruction (Smith 2010; Romine et al. 2016, p. 3). In many studies in which marked gains are reported, the instructional methods focused on intensive instruction in evolutionary topics (c.f. Wiles and Alters 2011). Our course design did not employ an explicit intervention to promote evolution acceptance, but understanding of key evolutionary principles is a primary course learning objective.

A further objective of our study was to determine which characteristics and performance variables serve as predictors for evolution acceptance in our study population. As evolution acceptance has been demonstrated to have significant relationships with a number of other student characteristics and performance variables, our study includes a consideration of variables for which data were available.

To investigate evolution acceptance, we utilized two independent measures, the MATE and the GAENE, and performed an analysis to determine the association between the scores obtained by each instrument. The MATE has previously been used as a measure of evolution acceptance in at least 25 studies, while the GAENE is a recently published instrument (Smith et al. 2016). We are aware of no other study that presents a comparison of scores obtained from these two evolution acceptance instruments in a single population.

As previous research (Romine et al. 2016) provided evidence that the MATE instrument may be more appropriately considered as a bidimensional instrument that captures two different constructs of evolution acceptance—Facts and Credibility—we also wished to perform psychometric analyses to determine the most appropriate way to treat the MATE scores (i.e. as a single score or as two separate scores).

Methods

Demographics and incoming performance metrics of study population

This study took place within the context of a health science undergraduate degree program (Bachelor of Science in Health Sciences, BSHS) at a small liberal arts university in the Midwest. Students entering the program were mostly traditional-aged college students. According to institutional data, approximately 75% of students in the program identified as female, and 27% identified as institutionally underrepresented minorities (URM), a designation which includes the categories American Indian, Asian, Black, and Hispanic. All participants consented to participate in this research in accordance with University of Minnesota IRB protocol #1008E87333.

The total number of students enrolled in the course was 127. A total of 105 students completed all three assessments (pre-MATE, post-MATE, post-GAENE) satisfactorily (participation rate = 105/127 = 82.7%). Of the students participating in the study, 85 (80%) identified as female, and 38 (36.5%) identified as URM. The average number of college credits completed prior to enrolling in the course was 31.33, the average college GPA was 3.06, and the average ACT Math score was 24.64.

Course context

Study subjects were students enrolled in two sections of a 5-credit first-year foundational biology course with lab. Instruction took place in an active learning classroom (Dori and Belcher 2005; Beichner et al. 2007; Walker et al. 2011) with a flipped pedagogy model, in which students were expected to engage with assigned material prior to classroom instruction and were held accountable for doing so. The physical classroom environment and curricular design facilitated regular implementation of a variety of teaching and learning activities and Classroom Assessment Techniques (CATs) (Angelo and Cross 1993). In preparation for classroom instruction and activities, students were assigned pre-instruction reading with corresponding preparation questions (i.e. study guide questions). Additionally, students completed a low-stakes pre-class quiz consisting of five questions related to the material in the assigned reading. Students were allowed two attempts on the pre-class quiz and were able to see which items they answered correctly or incorrectly immediately after submitting the quiz. Additional files posted on the course website included slides, links to online conceptual animations, practice questions, and other resources.

Schedule of course topics

Understanding of the centrality and importance of evolution in the biological sciences was a key learning objective in the course. As such, evolutionary topics were not relegated to one unit and then set aside for the remainder of the term. Rather, the semester began and ended with explicit instruction in evolutionary biology, referred to here as a “bookend” approach. The intervening instruction, while primarily addressing other topics, also incorporated connections to evolutionary biology as a unifying principle. For example, the unit focusing primarily on metabolism incorporated a consideration of the homologous relationship between the cytochrome proteins of mitochondria and the cytochrome proteins of the chloroplast. Thus, evolutionary principles were reiterated throughout the course as an organizing theme.

Early instruction emphasized deep time as a way of viewing the history of the earth and life on earth, along with evidence for evolution (e.g. fossil record, biogeography, anatomical homologies) and easily recognizable evolutionary processes, such as response to predation selection pressure, with which students likely had some previous exposure or knowledge. In addition to connecting with students’ prior knowledge, we extended students’ breadth of evolution understanding early in the course by including neutral evolutionary processes such as genetic drift, which are less familiar and accessible to students but increasingly prominent in our modern understanding of evolution at the molecular level (Kimura 1977; Bromham and Penny 2003). Later instruction in evolution topics included a more in-depth consideration of sources of genetic variation, molecular evolution, and population genetics. A molecular perspective on evolution is more accessible to students following instruction in topics such as DNA replication, meiosis, the genetic code, and gene expression, which were addressed between the bookends of evolution instruction in the course. Previous research has demonstrated that placing instruction in genetics before instruction in evolution improved students’ evolution understanding, but did not significantly impact evolution acceptance compared with the reverse ordering (Mead et al. 2017).

Calculation of overall course grade

As our study did not employ a separate measure of knowledge in evolution, we chose to use students’ final course grades (%) as a measure (albeit an imperfect one) of biology knowledge. A student’s final course grade comprised requirements in the following categories, weighted in the calculation of the final grade as indicated in parentheses: pre-class quizzes and in-class activities (15%), formal and informal writing assignments (20%), exams (40%), laboratory activities (20%), and reflection exercises (5%).
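As an illustration, the weighting corresponds to the following calculation (a minimal sketch; the function name is ours, and component scores are assumed to be category percentages):

```r
# Weighted final course grade from the five category scores (each in %).
final_grade <- function(quizzes_activities, writing, exams, labs, reflections) {
  0.15 * quizzes_activities +
    0.20 * writing +
    0.40 * exams +
    0.20 * labs +
    0.05 * reflections
}

final_grade(90, 85, 78, 88, 95)  # example student: 84.05%
```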

Additional measures of broad student knowledge—ACT Math score and cumulative college GPA at the start of term—were also included in our investigation.

Implementation of MATE and GAENE instruments

To assess students’ acceptance of evolution, we utilized two published instruments: the 20-item Measure of Acceptance of the Theory of Evolution (MATE) (Rutledge and Sadler 2007) and the Generalized Acceptance of EvolutioN Evaluation (GAENE), Version 2.1 (Smith et al. 2016). The GAENE Version 2.1 is a 13-item instrument, which we implemented with randomized item order and a 5-point Likert-type scale, per the authors’ recommendations (Smith et al. 2016).

The MATE instrument was implemented as a pre- and post-test measure to investigate the level of acceptance in our undergraduate health sciences students before and after instruction, while the GAENE instrument was implemented as a post-test only. In all cases, student responses were gathered via our online course management system. Students completed the assessments outside of class time and were awarded nominal completion points for submitting responses to the instruments. Our online assessment allowed students to enter a numeric character as a response to each Likert-scaled item; instances in which a student entered a non-numeric character or multiple numeric characters of different values were deemed ambiguous responses and were removed from the dataset prior to analysis. Instances in which a student entered the same numeric character multiple times (e.g. 11) were considered non-ambiguous errors of entry and were replaced with a single numeric character of that value. If an individual student had more than one ambiguous entry for an assessment, that individual was removed from the dataset.
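To make the screening rule concrete, the following is a hedged sketch of one way to implement it (function and object names are ours, not part of the course software):

```r
# Screen one vector of raw text responses: a run of a single repeated digit
# (e.g. "11") is treated as a non-ambiguous entry error and collapsed to one
# digit; any other non-numeric or mixed entry becomes NA (ambiguous).
clean_response <- function(x) {
  x <- trimws(as.character(x))
  ok <- grepl("^([1-5])\\1*$", x, perl = TRUE)  # "3", "33", "333", ...
  ifelse(ok, as.numeric(substr(x, 1, 1)), NA)
}

clean_response(c("4", "44", "4 5", "a"))  # returns 4, 4, NA, NA
```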

Building validity evidence

Dimensionality—confirmatory factor analysis

The GAENE and the MATE are both intended to be unidimensional measures of evolution acceptance. To contribute evidence for the valid use of these instruments, a confirmatory factor analysis (CFA) was performed to examine the dimensionality of both instruments based on responses from the current sample. A unidimensional model was fit for the GAENE. For both the pre- and post-measures of the MATE, a unidimensional and a bidimensional model were tested, with the bidimensional model examining whether items loaded onto Romine et al.’s (2016) proposed Facts and Credibility dimensions. The fit of the uni- and bidimensional models was then compared using the likelihood ratio test, which tests whether the addition of a second dimension significantly improves model fit. Items on both the GAENE and the MATE are five-category Likert-type items and were treated as ordered categorical variables rather than continuous variables in the CFA estimation (Flora and Flake 2017; Flora et al. 2012). Because the items are categorical, the association between the items and the underlying factor(s) is nonlinear. Consequently, all of the CFA models were estimated with a diagonally weighted least squares estimator, which makes no assumptions about the distribution of the item responses and uses the polychoric, rather than product-moment, correlation matrix (Li 2016; Rhemtulla et al. 2012). The full weight matrix, however, was used to compute robust standard errors and a mean- and variance-adjusted Chi square test statistic. The CFA models were run using the lavaan package (v. 0.6-1) in R (Rosseel 2012). The comparative fit index (CFI), root mean squared error of approximation (RMSEA), and standardized root mean squared residual (SRMR) were used to assess model fit. CFI evaluates incremental fit, assessing whether the tested model fits better than a null model that treats all items as completely unrelated to each other. Absolute fit—the degree to which the relationships between variables implied by the model are similar to the relationships actually found in the data—is measured by RMSEA and SRMR, with RMSEA including a penalty for greater model complexity. Simulation studies suggest acceptable model fit should have a CFI ≥ 0.95, RMSEA ≤ 0.06, and SRMR ≤ 0.08 (Hu and Bentler 1999). Additionally, hierarchical omega reliability (ωh) (McDonald 1999) was calculated to evaluate the proportion of total variance in item responses explained by the factor model.
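To make the estimation choices concrete, the following is a minimal lavaan sketch under stated assumptions: the item columns are named m1–m20 (hypothetical), the data frame mate_pre contains only those columns, and the split of items between the Facts and Credibility factors shown here is illustrative rather than the actual assignment from Romine et al. (2016).

```r
library(lavaan)

# Unidimensional model: all 20 MATE items load on a single acceptance factor.
uni_model <- 'acceptance =~ m1 + m2 + m3 + m4 + m5 + m6 + m7 + m8 + m9 + m10 +
                            m11 + m12 + m13 + m14 + m15 + m16 + m17 + m18 + m19 + m20'

# Bidimensional model: an illustrative split into Facts and Credibility
# factors (lavaan lets the two factors correlate by default).
bi_model <- 'facts       =~ m1 + m3 + m5 + m7 + m9 + m11 + m13 + m15 + m17 + m19
             credibility =~ m2 + m4 + m6 + m8 + m10 + m12 + m14 + m16 + m18 + m20'

# estimator = "WLSMV" gives diagonally weighted least squares estimation with
# robust standard errors and a mean- and variance-adjusted test statistic;
# declaring items as ordered invokes the polychoric correlation matrix.
fit_uni <- cfa(uni_model, data = mate_pre, ordered = names(mate_pre), estimator = "WLSMV")
fit_bi  <- cfa(bi_model,  data = mate_pre, ordered = names(mate_pre), estimator = "WLSMV")

fitMeasures(fit_uni, c("cfi.scaled", "rmsea.scaled", "srmr"))
lavTestLRT(fit_uni, fit_bi)  # scaled chi-square difference (likelihood ratio) test
```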

Item calibration and person scores—Rasch scaling

Rasch scaling tests whether data from an instrument fit a theoretical measurement model (Rasch 1960). The Rasch model assumes the instrument is unidimensional, which is why CFA is more useful for examining the dimensionality of instruments. CFA, however, is based on classical test theory and has weaker assumptions than the Rasch model, which is rooted in the item response theory framework (Smith et al. 2002). Thus, when data fit the Rasch model and its stronger assumptions, the model provides more appropriate person score and item calibration estimates, with a number of properties that are beneficial for making both norm-referenced and criterion-referenced score interpretations: (a) item locations and person scores are placed on the same scale (the logit scale). In the current study, a person’s score is a measure of their level of acceptance of evolution, whereby a person with a high level of acceptance will have a high score on the logit scale; each item is placed on the same logit scale, whereby an item reflective of a high level of evolution acceptance when endorsed also has a high score on the scale. (b) The common metric for person and item parameters allows for calculating the probability that a person with a given level of evolution acceptance will endorse an item at a given location, which is useful for making predictions about person responses and for evaluating whether the items on the instrument adequately cover the variability in respondents’ acceptance of evolution. (c) Despite the ordinal nature of the Likert-type items used in the instrument, when the data fit the Rasch model the resulting scaled scores are on an interval, linear scale, enabling the use of scores in parametric statistical tests. (d) Item location estimates are independent of the distribution of person scores, and the person score estimates are independent of the item location distribution, enabling greater generalizability of the person and item estimates. In contrast, summed scores for the raw item responses depend both on the sample of items and on the sample of persons, making it difficult to predict how persons would respond to a different set of items or how well the items would measure a different sample of persons. Readers interested in learning more about Rasch analysis can consult Wright and Masters (1982) and Bond and Fox (2015); for the use of Rasch in instrument development, see Smith et al. (2002) and Boone (2016).

For the current analysis, all Rasch models were run with the Rasch partial credit model (Masters 1982) using the mirt (v. 1.28) package in R (Chalmers 2012). The fit of the GAENE and MATE instrument data to the Rasch partial credit model was evaluated using outfit mean square, infit mean square, and marginal reliability for the item locations and person scores. Outfit measures how sensitive the item (person) estimates are to outliers, while infit measures the difference between the observed score patterns and the model-expected score patterns; poor infit is a greater threat to validity for the interpretation and use of scores than poor outfit (Linacre 2002). Outfit and infit are expected to be close to 1.0, with values between 0.5 and 1.5 considered acceptable.
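A minimal sketch of this model fitting in mirt, under the assumption that gaene is a data frame containing only the 13 GAENE item responses (object names are ours):

```r
library(mirt)

# Rasch partial credit model: one latent dimension with all slopes fixed to 1.
pcm <- mirt(gaene, model = 1, itemtype = "Rasch", verbose = FALSE)

itemfit(pcm, fit_stats = "infit")  # infit/outfit mean squares for each item
personfit(pcm)                     # infit/outfit statistics for each person
theta <- fscores(pcm)[, 1]         # person scores on the logit scale
marginal_rxx(pcm)                  # marginal reliability of the person scores
```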

Unlike classical test theory conceptions of reliability, such as Cronbach’s α (Cronbach 1951), which assume an instrument measures people equally well across the underlying construct’s entire spectrum, the Rasch and other item response theory models make no such assumption and instead estimate a reliability for each observed score (Bond and Fox 2015). The marginal reliability is the average of the reliability estimates across all observed scores. Given that Cronbach’s α is commonly reported by others using the MATE and GAENE, it was also calculated for each instrument to allow comparison of the reliability of the instruments in the present sample with previous administrations.
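For completeness, a brief sketch of the α calculation using the psych package (the data object name is ours):

```r
library(psych)

# Classical test theory reliability, reported for comparability with prior
# administrations of the MATE and GAENE.
alpha(mate_post)$total$raw_alpha
```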

To further examine Romine et al.’s (2016) Facts and Credibility subscales of the MATE, three separate unidimensional Rasch models were run on the pre-MATE responses: (1) all pre-MATE items, (2) Fact items only, and (3) Credibility items only. Using the item fit approach discussed in Smith (1996), we compared the fit of each item to the Rasch model when it was used with all pre-MATE items versus when it was used only as part of the separate Fact or Credibility dimension. If items tended to fit better in the model with all pre-MATE items, this was evidence that the pre-MATE is a unidimensional instrument; if items tended to fit better in the Fact or Credibility models, this was evidence that the pre-MATE is a bidimensional instrument. The process of running three separate unidimensional Rasch models and using the item fit approach to compare the models was repeated with the post-MATE responses. After evaluating item fit, the stacking procedure outlined by Wright (1996, 2003) was used to fit three more Rasch partial credit models (an all item model, a Fact item only model, and a Credibility item only model) using both the pre- and post-MATE responses simultaneously in order to estimate comparable pre- and post-MATE person scores. The person scores from these three simultaneously estimated models were used for all subsequent correlation and regression analyses. For the GAENE, a single Rasch partial credit model was run, from which the item fit was evaluated and the person scores were used in the correlation and regression analyses.
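The stacking procedure can be sketched as follows: each student contributes one row of pre-responses and one row of post-responses to a single calibration, so the items are calibrated once and the resulting pre- and post-person scores land on a common logit scale (a minimal sketch, assuming mate_pre and mate_post are data frames with identical item columns):

```r
# Wright's (1996, 2003) stacking: pre- and post-administrations enter one
# Rasch partial credit calibration as separate rows.
stacked <- rbind(mate_pre, mate_post)
pcm_stacked <- mirt(stacked, model = 1, itemtype = "Rasch", verbose = FALSE)

theta <- fscores(pcm_stacked)[, 1]            # common logit scale
theta_pre  <- theta[seq_len(nrow(mate_pre))]
theta_post <- theta[nrow(mate_pre) + seq_len(nrow(mate_post))]

t.test(theta_post, theta_pre, paired = TRUE)  # pre/post change (see below)
```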

Change in pre- and post-MATE responses

Changes in student responses to the MATE instrument from the pre- to post-administrations were investigated in three ways:

1. Change in the simultaneously estimated pre- and post-MATE Rasch-scaled scores was compared with a paired t-test.

2. Change in the raw summed scores, calculated as originally proposed by Rutledge and Sadler (2007), was compared using mean normalized change (c; Marx and Cummings 2007). Normalized change is the mean of each student's change in raw summed score from pre- to post-test, rather than the change in the mean raw summed score from pre- to post-test. In keeping with Marx and Cummings' (2007) recommendations, students who scored 100% on both the pre- and post-MATE were removed from the analysis of normalized change, as those students' performance was beyond the scope of the instrument's measurement (Marx and Cummings 2007, p. 87). A sketch of this calculation appears after this list.

3. At the item level, the association between raw ordinal responses for all 20 items of the pre- and post-MATE was assessed using Cramer's V, an effect size measure from the association-based family of effect sizes (Cramer 1946; Cohen 1988); see the sketch after this list. Values for Cramer's V range from 0 to 1, with larger values indicating a stronger association. Cohen's (1988) standard was used to interpret the strength of association: V values between 0.1 and 0.29 represent a small association, values between 0.3 and 0.49 a medium association, and values above 0.5 a large association.
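The following is a minimal sketch of the calculations in items 2 and 3, assuming pre_pct and post_pct are vectors of students' raw summed scores expressed as percentages (object and function names are ours):

```r
# Normalized change (Marx and Cummings 2007): gains are scaled by the room
# available to gain and losses by the room available to lose; students at
# the ceiling (or floor) on both administrations are excluded (NA).
norm_change <- function(pre, post) {
  ifelse(pre == post & (pre == 100 | pre == 0), NA,
         ifelse(post > pre, (post - pre) / (100 - pre),
                ifelse(post < pre, (post - pre) / pre, 0)))
}
mean(norm_change(pre_pct, post_pct), na.rm = TRUE)  # mean normalized change, c

# Cramer's V for the association between pre- and post-responses to one item.
cramers_v <- function(pre_item, post_item) {
  tab  <- table(pre_item, post_item)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(as.numeric(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
}
```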

Association between MATE and GAENE Rasch-scaled scores

To measure the degree of association between the MATE and GAENE Rasch-scaled scores, bivariate Pearson product-moment correlations were calculated between the GAENE Rasch scores and the pre- and post-MATE Rasch scores from each of the three MATE Rasch models (all items, Fact items only, Credibility items only) in which the pre- and post-MATE scores were estimated simultaneously. The correlation coefficients were then disattenuated of (i.e. corrected for) measurement error using the formula first presented by Spearman (1904). Estimates of reliability quantify the extent to which variance in Rasch scores on the evolution acceptance instruments was due to measurement error. Thus, attenuated (i.e. uncorrected) correlations reflect not only the association between students’ true evolution acceptance as measured by the MATE or GAENE, but also measurement error. By correcting the correlation for the score reliability of the two instruments, measurement error can be removed from the estimate of the association between the two instruments’ measurements of evolution acceptance. The disattenuated correlations also provide evidence for whether the MATE is a unidimensional or bidimensional instrument: if the Rasch scores from the Facts and Credibility dimensions are highly correlated with each other and with the scores from the all-items model, then having separate Fact and Credibility dimension scores does not provide any unique information about students’ acceptance of evolution beyond what a unidimensional MATE score provides. As with Cramer’s V, Cohen’s standard was used to determine the strength of the associations (Cohen 1988).
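Spearman's (1904) correction is compact enough to state directly (a sketch; the observed correlation of 0.75 below is purely illustrative, while 0.93 corresponds to the person marginal reliabilities reported in "Results"):

```r
# Disattenuation: observed correlation divided by the geometric mean of the
# two instruments' score reliabilities.
disattenuate <- function(r_xy, rel_x, rel_y) r_xy / sqrt(rel_x * rel_y)

disattenuate(0.75, 0.93, 0.93)  # illustrative: 0.75 corrected to ~0.81
```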

Multiple regression

The Rasch scores from the all item pre-MATE, the all item post-MATE, and the GAENE were each used as the outcome variable in separate multiple regression models to investigate variables possibly predictive of evolution acceptance: gender, URM status, college GPA, Math ACT, and overall course performance (course performance was omitted from the pre-MATE model because it was measured after the pre-MATE was administered). Two additional regression models were run to further investigate the association between the MATE and GAENE instruments while controlling for the other variables in the regression model. First, with the post-MATE as the outcome variable, the pre-MATE and GAENE Rasch scores were added to the initial model; second, with the GAENE as the outcome variable, the pre- and post-MATE Rasch scores were added to the initial model.
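A sketch of these specifications (variable and data frame names are ours; d holds one row per student):

```r
# Initial models: demographic and performance predictors only. Course grade
# is omitted from the pre-MATE model because it was measured afterward.
m_pre   <- lm(pre_mate  ~ gender + urm + gpa + act_math,                data = d)
m_post  <- lm(post_mate ~ gender + urm + gpa + act_math + course_grade, data = d)
m_gaene <- lm(gaene     ~ gender + urm + gpa + act_math + course_grade, data = d)

# Cross-instrument models: does each instrument explain unique variation
# in the other after controlling for the initial predictors?
m_post2  <- update(m_post,  . ~ . + pre_mate + gaene)
m_gaene2 <- update(m_gaene, . ~ . + pre_mate + post_mate)

summary(m_gaene2)$r.squared  # variation explained with MATE scores added
```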

Results

Validity evidence

Dimensionality

Confirmatory factor analysis was used to evaluate whether the MATE should be considered a unidimensional measure of evolution acceptance or a bidimensional instrument measuring separate Facts and Credibility dimensions for a sample of health science undergraduate students. A likelihood ratio test directly comparing the unidimensional and bidimensional models was significant for both the pre-MATE (χ2 = 7.32, df = 1, p = 0.01) and the post-MATE (χ2 = 29.04, df = 1, p < 0.01), indicating that the bidimensional model significantly improved model fit. The fit statistics (Table 1) for both the unidimensional and bidimensional models at pre- and post-administration, however, fall outside Hu and Bentler’s (1999) criteria that an acceptable model should have a CFI ≥ 0.95, RMSEA ≤ 0.06, and SRMR ≤ 0.08. This suggests that although the bidimensional model is the better fitting model, it is still a poor fit for the data. In contrast, all of the MATE models had high hierarchical omega reliability, with values > 0.95, indicating in a classical test theory sense that a large proportion of variation in the observed raw summed scores on the MATE was true variation as opposed to measurement error. Given that CFA is sample dependent, the high reliability yet poor model fit suggest the MATE raw summed scores were measured with precision, but that the summed score was a weak measure of the underlying construct for this sample of health science undergraduate students.

For the GAENE, the unidimensional model met the fit criteria on the CFI and SRMR for acceptable fit, with values of 0.98 (≥ 0.95) and 0.05 (≤ 0.08), respectively, but the RMSEA of 0.10 was above the criterion of ≤ 0.06. Hu and Bentler (1999) note, however, that when CFI > 0.96, models can still have acceptable fit when RMSEA and SRMR > 0.09. Therefore, the GAENE unidimensional model can be considered an adequate fit for the data from the sample. Taken in conjunction with the high hierarchical omega reliability (ωh = 0.96), these results are evidence that the raw summed score from the GAENE is a precise measure that can be interpreted as an adequate indication of evolution acceptance for a student in the sample. Although the GAENE and MATE models cannot be compared directly, the model fit from the CFA provides evidence that the GAENE is a better instrument than the MATE for measuring evolution acceptance among the health science undergraduate students in the sample.

Item calibration and person scores from Rasch scaling

Results from the Rasch analysis on the pre-MATE and post-MATE data provide ambiguous evidence for whether the MATE should be considered a unidimensional or bidimensional instrument. Data fit the Rasch model when the item and person outfit mean square and infit mean square are close to 1.0. Additionally, the item location and person score estimates on the shared logit scale are more precise as the marginal reliability (ρ) approaches 1.0. The person outfit, infit, and marginal reliability were 0.99, 1.02, and 0.93 for the pre-MATE with all items model (Table 2). These were closer to 1.0 than the corresponding values for the pre-MATE Facts (outfit = 0.90, infit = 0.93, ρ = 0.90) or Credibility models (outfit = 0.91, infit = 0.94, ρ = 0.86). The item marginal reliability was similar for the three pre-MATE models (All items: ρ = 0.87, Facts: ρ = 0.86, Credibility: ρ = 0.87). The item outfit and infit values for each item, displayed in Fig. 1, are more indicative of fit than the scale-level values (Linacre 2002); acceptable values are between 0.5 and 1.5 (see Additional file 1 for full item-level statistics). For the pre-MATE with all items model, items 2, 15, and 19 had outfit and/or infit > 1.5, indicating these items did not fit the Rasch model and that their inclusion degrades the quality of the instrument. For the pre-MATE Facts and Credibility models, only item 15 had an outfit > 1.5, meaning that this item increased the instrument’s measurement error because it was overly sensitive to outliers in the person responses.

Table 2 Fit and reliability for Rasch scales
Fig. 1 Item-level outfit mean square and infit mean square for Rasch scales. Item outfit and infit should be close to 1, designated in the figure by the black line; values between 0.5 and 1.5, designated by the grey lines, are considered acceptable

The post-MATE with all items model had person outfit, infit, and marginal reliability of 1.05, 1.08, and 0.93, respectively, which were equal or closer to 1.0 than the corresponding values for the post-MATE Facts (outfit = 0.95, infit = 0.91, ρ = 0.89) or Credibility models (outfit = 0.91, infit = 0.94, ρ = 0.86). The item marginal reliability was similar for the three post-MATE models (All items: ρ = 0.78, Facts: ρ = 0.80, Credibility: ρ = 0.77). As shown in Fig. 1, items 11, 17, and 19 had outfit > 1.5 for the post-MATE with all items model, indicating these items did not fit the Rasch model due to their sensitivity to person response outliers. Only item 11 had outfit > 1.5 for the post-MATE Facts and Credibility models.

For the last set of Rasch models on the MATE, the pre-MATE and post-MATE data were used simultaneously to estimate the item locations and person scores, primarily for the purpose of creating comparable pre- and post-MATE person scores. The all items model had person outfit, infit, and marginal reliability of 1.01, 1.05, and 0.93, respectively, which were closer to 1.0 than the corresponding values for the Facts (outfit = 0.91, infit = 0.92, ρ = 0.90) or Credibility models (outfit = 0.91, infit = 0.94, ρ = 0.86). The item marginal reliability was similar for the three MATE models with pre- and post-responses estimated simultaneously (All items: ρ = 0.91, Facts: ρ = 0.91, Credibility: ρ = 0.92). As shown in Fig. 1, items 2, 15, and 19 had outfit > 1.5 in the all items model, while items 11 and 15 had outfit > 1.5 in the Facts and Credibility models; these items did not fit the Rasch model as a result of oversensitivity to person response outliers. The results do not provide clear support for considering the MATE either a unidimensional or a bidimensional instrument: for the pre-MATE, post-MATE, and simultaneously estimated MATE, the models with all items produced equal or better person fit and reliability, but the Facts and Credibility models demonstrated better item fit.

The Rasch model using the pre- and post-MATE data simultaneously was used for comparison with the Rasch model fit to the GAENE data. The person outfit, infit, and marginal reliability for the GAENE were 0.94, 0.98, and 0.93, respectively. Despite containing seven fewer items, the GAENE’s person fit and reliability were better than those of the MATE Facts and Credibility models and similar to those of the all items MATE model. The item reliability for the GAENE (ρ = 0.86) was lower than for the MATE models (All items: ρ = 0.91, Facts: ρ = 0.90, Credibility: ρ = 0.92); however, this result is unsurprising given that a larger person sample leads to higher item reliability, and the MATE models used both pre- and post-responses and thus had twice the sample size of the GAENE (Bond and Fox 2015). Regarding item fit, while both the MATE with all items and the MATE Fact and Credibility models had multiple items with high outfit, all of the items on the GAENE demonstrated acceptable infit and outfit.

Results from the Rasch analysis suggest that the GAENE data fit the Rasch model, meaning the resulting item locations and person scores are placed on the same linear, interval-level scale, with the item locations independent of the person score distribution and, unlike the raw summed scores, the person scores independent of the item location distribution. Although the person fit was acceptable regardless of whether the MATE was estimated with all items or with the Facts and Credibility dimensions estimated separately, some items demonstrated high outfit, suggesting that the MATE is sensitive to person response outliers. Poor outfit, however, is less of a threat to validity for the interpretation and use of scores than poor infit (Linacre 2002), so as a whole the MATE can be considered an adequate fit for the Rasch model. Nonetheless, the Rasch analysis provides evidence that the GAENE more appropriately measures evolution acceptance than the MATE.

Changes in pre- and post-MATE scores

When the Rasch-scaled pre-MATE and post-MATE scores were estimated simultaneously for comparison, students demonstrated a significant change in scores (t(104) = 3.94, p < 0.01) from pre- (M = − 0.18, SD = 1.26) to post-assessment (M = 0.18, SD = 1.41). The effect size of the Rasch score change of 0.36 from pre- to post-MATE was d = 0.38, considered a small to medium effect size (Cohen 1988). Using the raw scores, the mean pre-MATE score was 78.68 (SD = 12.44) and mean post-MATE score was 81.72 (SD = 12.41), with a mean normalized change (c) of 14.21%. According to categories of acceptance developed by Rutledge (1996) and reported in Rutledge and Sadler (2007), MATE raw scores between 77 and 88 represent “High Acceptance”.

At the item level, a Cramer’s V association of 1.00 signifies that a student’s response to an item on the pre-MATE was a perfect indicator of the student’s response to that item on the post-MATE, whereas a Cramer’s V of 0.00 means a student’s pre-MATE response was unrelated to their post-MATE response. The Cramer’s V associations ranged from 0.31 to 0.59 (Fig. 2), suggesting a medium to large association (i.e. effect size) between pre- and post-MATE raw ordinal responses for all items, but also some change in response patterns between the pre- and post-administrations of the MATE.

Fig. 2 Cramer’s V association between pre- and post-MATE responses. Error bars represent the 95% confidence interval of the V statistic

Investigating the association between MATE and GAENE scores

GAENE scores

Students in our study obtained a mean Rasch-scaled score of − 0.01 (SD = 1.79) and a mean summed raw score of 51.70 (SD = 9.02) on the GAENE instrument. Unlike the authors of the MATE, the authors of the GAENE elected not to propose cutoff scores delineating which GAENE scores constitute low, moderate, and high acceptance (Smith et al. 2016, pp. 1309–1310).

Disattenuated correlations

Disattenuated correlations correcting for measurement error reveal significant, strong correlations between the evolution acceptance Rasch scores produced by the MATE and GAENE instruments (Table 3). Because the GAENE was administered only at the end of the semester, its scores are most appropriately compared to the post-MATE Rasch scores. Nonetheless, significant associations between the GAENE Rasch scores and both the pre-MATE and post-MATE held whether the comparison used the Rasch scores from all MATE items or the Rasch scores from the Facts and Credibility dimensions of the MATE.

Table 3 Disattenuated correlations for Rasch-scaled scores

The disattenuated correlation between the Facts Rasch score and the Credibility Rasch score was 0.95 for the pre-MATE and 0.92 for the post-MATE, indicating that after correcting for measurement error the two scores were largely redundant. Additionally, at both pre- and post-administration, the Facts and Credibility Rasch scores were perfectly correlated (r = 1.00) with the Rasch score from all MATE items, meaning that the Fact and Credibility scores provided no additional information beyond what was already provided by the unidimensional scores. Therefore, from a practical standpoint, using and reporting a unidimensional MATE score is more efficient than separate Fact and Credibility scores.

Multiple regression

Multiple regression models were run separately on the pre-MATE, post-MATE, and GAENE Rasch scores with gender, URM status, college GPA, course performance, and Math ACT as variables possibly predictive of evolution acceptance (Table 4). Overall, the variables explained little of the variation in evolution acceptance scores, with R2 values of 0.07, 0.12, and 0.11 for the pre-MATE, post-MATE, and GAENE, respectively. The R2 for the pre-MATE model was lower in part because course performance was not included in that model, given that course performance was measured after the pre-MATE and therefore could not be a predictor. The regression models with demographic and academic performance predictors identified only URM status as a significant predictor of the GAENE Rasch score (β = − 0.37, p = 0.03); no variables were significantly associated with the pre- or post-MATE Rasch scores. The association between URM status and the GAENE became non-significant (β = − 0.16, p = 0.09), however, after adding the pre- and post-MATE scores to the model. In contrast, both the pre-MATE (β = 0.31, p < 0.01) and post-MATE (β = 0.72, p < 0.01) scores were significant, indicating that the two time points each explain unique variation in GAENE scores and highlighting that there are differences between the pre- and post-MATE scores.

Table 4 Unstandardized and standardized coefficients (standard errors) for regression models

Although the demographic and academic performance variables explained only 11% of the variation in GAENE scores, the pre-MATE (β = 0.31, p < 0.01) and post-MATE (β = 0.72, p < 0.01) scores, when added to the model, explained an additional 62% of the variation in GAENE scores. Similarly, the demographic and academic performance variables explained only 12% of the variation in post-MATE scores, with the pre-MATE (β = 0.27, p < 0.01) and GAENE (β = 0.56, p < 0.01) scores explaining an additional 62% of the variation in post-MATE scores when added to the model (Table 4).

Discussion

Evolution acceptance in undergraduate health sciences majors

The students in our study reported a high level of evolution acceptance at the start of the semester: the average pre-test raw score on the MATE in our sample was 78%, just above the boundary between “Moderate Acceptance” (65–76) and “High Acceptance” (77–88) using the categories of acceptance developed by Rutledge (1996) and reported in Rutledge and Sadler (2007). This result is strikingly similar to the average raw MATE score of 77.17% obtained by Dunk et al. (2017, Table 5). The demographic composition of the sample in the study by Dunk et al. (2017) also closely resembled ours: “skewed young, white, and female with a high proportion of health majors”. By comparison, other studies have reported lower MATE raw scores in college biology majors (Rissler et al. 2014; Ingram and Nelson 2006), college non-biology majors (Rutledge and Sadler 2007; Deniz et al. 2008), gifted high school students (Wiles and Alters 2011), and biology teachers (Rutledge and Warden 1999). From this, we conclude that evolution acceptance in our population of health sciences students is relatively high when compared to a number of other university student populations, both biology majors and non-biology majors, although not all (Table 5).

Table 5 MATE scores reported in the literature for various populations

Change in evolution acceptance pre- to post-test

Although this study did not investigate the impact of a specific curriculum intervention, students in this study were enrolled in a foundational introductory biology course and experienced instruction in a wide variety of biology topics, including evolution, between the administration of the pre- and post-MATE. A significant increase in students’ reported level of evolution acceptance was found between the pre- and post-MATE Rasch scores. Other studies implementing a pre- and post-test design using the MATE have similarly reported significant gains pre- to post-test (Rissler et al. 2014; Ingram and Nelson 2006; Wiles and Alters 2011), while others have failed to find a significant difference following instruction (Walter et al. 2013). Rissler et al. (2014) reported significant gains in evolution acceptance, but only for the “least religious” students (p. 11). From our results, we conclude that the curriculum design and instruction implemented in our undergraduate introductory biology course are having an impact on student acceptance of evolution. We think this is notable for at least two reasons: (1) change was demonstrated after a single semester of instruction, as opposed to a two-semester sequence, and (2) students’ level of evolution acceptance was significantly and positively impacted despite the absence of an explicit emphasis or curriculum intervention designed to target evolution acceptance. This result appears to be consistent with other studies reporting increased evolution acceptance as a result of instruction in general biology and other courses in which evolution is a topic of study (c.f. Wiles and Alters 2011), but contrasts with courses in which topics in evolution are likely absent (e.g. anatomy and physiology) and no change in evolution acceptance is observed (c.f. Rissler et al. 2014, p. 10).

Validity evidence

The confirmatory factor analysis directly compared a unidimensional and a bidimensional model for the MATE, with the significant likelihood ratio test at both pre- and post-test providing evidence that the MATE is structurally a bidimensional instrument. The Rasch analysis provided additional, albeit limited, evidence for a bidimensional MATE structure, as the all item model had more misfitting items than the Fact and Credibility models at pre- and post-test and when the pre- and post-MATE data were used simultaneously. The person fit and reliability, however, favored the all items model, and the disattenuated correlations showed that having two scores for the MATE was redundant. Therefore, results from the present study suggest that while the MATE might more appropriately measure two dimensions of evolution acceptance, interpretation and use of a single unidimensional score is equally informative and more practically efficient. The evidence is, however, ambiguous enough to warrant further investigation. The ambiguous dimensionality could be due to measurement error, or the MATE might measure evolution acceptance differently under different circumstances or with different groups of people. One future avenue of research to address this quandary would be a differential item functioning analysis to investigate measurement invariance between various groups of respondents, such as students in natural and health science majors versus students in liberal arts majors, or people with high versus low religiosity.

The GAENE produced adequate fit and reliability in both the CFA and the Rasch analysis, providing converging evidence that the GAENE is a unidimensional measure of evolution acceptance. The MATE, in addition to its ambiguous dimensionality, had poor model fit in the CFA, and the Rasch analysis showed that some items were oversensitive to outlier person responses. Therefore, the psychometric evidence points to the GAENE being the superior measure of evolution acceptance.

Measures of student performance and evolution acceptance

Measures of student performance (overall course grade, ACT Math, college GPA at start of term) did not emerge as predictive of MATE and GAENE Rasch scores in regression modeling (see Table 4). While some authors have reported significant associations between knowledge of evolution and acceptance of evolution (Rutledge and Warden 2000; Nadelson and Southerland 2010; Walter et al. 2013; Carter and Wiles 2014), others have found no significant association between the two (Cavallo and McCall 2008; Sinatra et al. 2003). A proposed limitation of the MATE for measuring students’ level of acceptance is the possible conflation of knowledge and acceptance through the inclusion of knowledge items in an acceptance instrument (Smith et al. 2016, p. 1293); treating the MATE as a bidimensional instrument has been suggested as one way to address this issue. However, we found limited evidence for considering the MATE a bidimensional instrument, and no practical utility in reporting scores beyond a single unidimensional score.

While the average MATE raw score for our sample indicates “High Acceptance” according to the categories developed by Rutledge (1996) and reported in Rutledge and Sadler (2007), the GAENE has no developed cutscores for interpreting relative acceptance from raw scores (Smith et al. 2016, p. 1310). As we are aware of no other study reporting scores from both the MATE and the GAENE in the same sample, it will be of interest to see whether future work replicates the significant and strong correlation between MATE and GAENE evolution acceptance scores that we report here.

There are several additional limitations that may affect the results reported in our study. First, our investigation of evolution acceptance involves a relatively small number of students and is not intended to be representative of all undergraduate populations; our study sample is, however, a reasonable representation of the undergraduate population of health sciences majors at our institution, and thus is informative. In this study, we sought to investigate evolution acceptance in this population in conjunction with other variables that have been reported to co-vary with or impact evolution acceptance, including student demographic and performance variables. However, we did not include a formal measure of student knowledge in evolution, instead using more holistic measures of student knowledge (e.g. overall course performance, ACT Math, college GPA); as such, we are limited in the extent to which we can comment on the relationship between student knowledge in evolution and acceptance of evolution. Further, overall course performance is not a perfect measure or representation of a student’s knowledge in biology. While geographical location and context may impact students’ evolution acceptance (Berkman and Plutzer 2011; Belin and Kisida 2014), we did not address these as variables in our study. A final limitation is that we did not include a measure of religiosity, a variable that has repeatedly been reported to have a significant association with evolution acceptance (Smith 2010; Rissler et al. 2014; Dunk et al. 2017).

These identified limitations point to future directions for continued investigation to more thoroughly understand evolution acceptance in this population of undergraduate students, and more broadly. We are also quite interested in exploring, as others have done (c.f. Smith 2010), the impact of curricular modifications to determine whether differences in instructional approaches affect short- and/or long-term measures of evolution acceptance in our students. Further work should also address potential disparities in evolution acceptance between URM and non-URM students. While the present work identified URM status as a variable predictive of evolution acceptance (with URM students having lower acceptance than non-URM students), this variable was not predictive in all models, and thus its importance to evolution acceptance broadly, and in this population specifically, remains ambiguous.