Introduction

Concepts are fundamental to advancing and understanding biology, physics, and chemistry (Mayr, 1997). Ernst Mayr (1997), one of the twentieth century’s leading evolutionary biologists and philosophers of biology, pointed out the central role of concepts in the life and physical sciences. He stated that “[t]he two major contributors to a new theory in the life sciences are the discovery of new facts (observations) and the development of new concepts” (p. 62).

Concept mapping is an instructional strategy commonly used to enhance the science achievement of elementary and secondary education students, who are the focus of the present study. Science achievement refers to vocabulary development, understanding, and reading comprehension of scientific texts. Concept mapping is a versatile tool whose utility extends beyond science to disciplines such as mathematics, social science, humanities, and business management. However, our meta-analysis centers specifically on the use of concept mapping in the natural sciences, namely physics, chemistry, and biology.

A concept map is a graphical tool, an alternative to written text, for representing verbal information and for illustrating and organizing a set of relationships among concepts (Nesbit & Adesope, 2013; Novak & Cañas, 2008; Novak, 2010a). A fundamental characteristic of a concept map is its structural and relational nature, which establishes connections between concepts (Davies, 2011). A concept map embodies a “hierarchical tree” whose levels feature superior and subordinate components, encompassing primary, secondary, and tertiary ideas (Davies, 2011). The more general concepts are often placed at the top or center of a concept map (Davies, 2011; Novak, 1991, 2010a; Novak & Cañas, 2008). Concepts are usually enclosed in circles or boxes, and relationships between concepts are illustrated by connecting lines; these lines carry linking words such as “is,” “are,” and “includes,” or short phrases such as “leads to,” “results from,” and “is part of,” which link two concepts and show their relationship (Davies, 2011; Novak, 1991, 2010a; Novak & Cañas, 2008). Examples are sometimes included to clarify the meaning of a concept (Davies, 2011). In the 55 studies of our meta-analysis, we observed that concept maps were typically hierarchically structured, with the central concept positioned at the top or center, perhaps because they were designed for elementary, intermediate, and secondary education students, for whom this structure is more straightforward. However, we note that non-hierarchical types of concept maps (i.e., network concept maps) are also in use (Amadieu et al., 2009; Davies, 2011).

Mind mapping is a tool similar to concept mapping; mind maps are defined as visual, non-linear representations of ideas and their interconnectedness (Davies, 2011). Mind mapping and concept mapping differ in precision and formality. Whereas mind maps are less formal and less structured, concept maps are formal and generally more tightly structured (Davies, 2011).

Not surprisingly, concept maps have been extensively used in school settings as an instructional strategy to organize information into a graphic form, creating a visual representation of the text structure (Kwon & Cifuentes, 2007; Sturm & Rankin-Erickson, 2002). Elementary and secondary education students are more likely to understand and remember relationships among concepts when involved in the mapping process (Karpicke & Blunt, 2011; Novak & Cañas, 2008; Rewey et al., 1989). Thus, students can actively identify critical ideas and details in the new content (Carnine & Carnine, 2004; Marzetta et al., 2018). Concept maps have been used in several ways: (a) to present information to students, (b) for students to represent their knowledge in learning activities, and (c) to assess what students know (Nesbit & Adesope, 2013). Concept mapping may enhance organizational cues for retrieving concepts and information from memory (Kwon & Cifuentes, 2007). Concept maps can also be scaffolding tools to relate new learning material to students’ prior knowledge (Marzetta et al., 2018; Novak, 2010a).

Theories Explaining Concept Mapping Effects on Learning

Several theories can explain the effectiveness of concept maps. We will concentrate on the three learning and instructional theories that may offer a closer explanation of the learning outcomes associated with concept maps.

First, concept mapping was initially linked to meaningful learning, a concept formulated by Ausubel (1962) within the framework of his subsumption (assimilation) learning theory (Novak & Cañas, 2008; Novak, 1991, 2010a, 2010b). In the subsumption learning theory, meaningful learning is contrasted with rote learning (Ausubel, 1962). Rote learning occurs when new information is acquired by verbatim memorization and arbitrarily incorporated into prior knowledge (Novak & Gowin, 1984; Novak, 2010a). In contrast, meaningful verbal learning refers to anchoring new knowledge (or concepts) to relevant prior knowledge and existing cognitive structure (Ausubel, 1962, 1963; Novak & Gowin, 1984; Novak, 2010a). “Cognitive structure” refers to hierarchically organized concepts in any field of knowledge (e.g., biology, physics, chemistry) that represent the residue of information and ideas (i.e., schemas in learners’ long-term memory) (Ausubel, 1963; Novak & Gowin, 1984). Ausubel’s primary hypothesis was that learning is facilitated and retention is maintained over time when knowledge is meaningfully acquired and hierarchically organized (Ausubel, 1963; Kirschner & Hendrick, 2020; Novak, 2010a). The organized structure of concept maps aligns with Ausubel’s idea of organizing knowledge in a meaningful way. Requiring students to reflect on relationships, connections, and hierarchies between concepts enhances meaningful learning (Novak, 2010a).

Second, since concept maps contain graphical elements and sometimes symbols and pictures, Allan Paivio’s (1986) dual coding theory (DCT) is relevant to understanding concept mapping effectiveness. According to the DCT, information can be stored in two distinct but interconnected coding systems: a verbal system that deals with language and a non-verbal system that deals with information concerning nonverbal objects (i.e., images, symbols) (Clark & Paivio, 1991; Paivio, 1986). Concept maps involve the use of both verbal and visual elements. The verbal elements consist of labels, keywords, or short written text, while the visual elements include shapes, lines, and spatial organization. Thus, the two-modal representation (verbal and visual) of concept mapping aligns with the principles of DCT, which propose that information encoded using two modalities can be more efficiently integrated and retained than relying solely on a single, typically verbal system (Kirschner & Hendrick, 2020). DCT may provide a more precise mechanism of action (the potential of dual coding) for explaining how concept maps enhance learning outcomes compared to Ausubel’s (1962, 1963) subsumption learning theory. In situations where experimental methods demonstrate the effectiveness of instructional practices, it becomes essential to understand the underlying mechanisms of action (Bunge, 2013).

Third, cognitive load theory (CLT), a theory of instructional design developed by John Sweller and his colleagues (Sweller, 1994, 2020; Sweller et al., 2011), has been used to explain the functioning and usefulness of concept mapping. The starting point of CLT is that working memory has a limited capacity to process information. The limitations of working memory can be circumvented by organizing knowledge into schemas (mental frameworks) that allow students to work on complex material without exceeding their cognitive capacity (Kirschner & Hendrick, 2020). In other words, because working memory is limited, teachers should design learning experiences that reduce extraneous cognitive load, that is, the additional load imposed by the instructional design and the presentation of information (Chen et al., 2023; Sweller et al., 1998, 2011). Intrinsic cognitive load, in contrast, refers to the inherent complexity or difficulty of the material being learned (Sweller et al., 2011). Intrinsic load is determined by the number of novel elements, the element interactivity that has to be processed, and the expertise of the learners (Chen et al., 2023; Sweller et al., 1998, 2011). Thus, by visually simplifying unnecessary verbal information, concept maps can reduce the extraneous cognitive load. Furthermore, the graphical and relational structure of concept maps can provide a higher degree of organization than traditional text; this increased organization can assist students in structuring their learning materials into schemas or in activating and reorganizing relevant schemas, thus reducing the extraneous cognitive load (Amadieu et al., 2009; Davies, 2011; Rivet & Krajcik, 2007). By providing a visual representation of otherwise verbal information and logically organizing the information, concept maps can be beneficial for low-achieving students and students experiencing learning difficulties (Kirschner, 2002; Sweller, 1994).

In summary, both CLT and DCT are relevant to concept maps and offer potential mechanisms of action (Bunge, 2013). CLT's emphasis on minimizing extraneous cognitive load is relevant to concept mapping, as concept maps often provide a higher level of organization. DCT's emphasis on integrating verbal and visual elements to improve retention and comprehension aligns with concept mapping, as concept maps inherently involve this integration.

Previous Meta-analyses

Previous meta-analyses, such as those by Horton et al. (1993), Nesbit and Adesope (2006), and Schroeder et al. (2018), were more generic and did not specifically focus on science education at the elementary, intermediate, and secondary school levels. For example, in Horton et al.'s (1993) meta-analysis, six of the 19 studies included in the analysis were conducted at the college/post-secondary level. In Nesbit and Adesope's meta-analysis (2006, pp. 441–446), 38 out of the 55 studies that specified grade levels were conducted at the post-secondary level. The same meta-analysis also focused on a broader range of learning subjects, including statistics, humanities, law, and social studies (Nesbit & Adesope, 2006). Schroeder et al. (2018) categorized their studies into STEM (science, technology, engineering, and mathematics) or non-STEM fields. In addition, Nicoara et al. (2017) focused on the medical field.

Science instruction represents a specialized instructional domain distinguished by its unique curriculum (physics, chemistry, biology), content, and perhaps particular instructional challenges. A meta-analysis tailored to the science instruction context is essential to provide insights and guidance relevant to science educators. Furthermore, while broader meta-analyses provide valuable insights, they may not capture the specificities of science education at the elementary, intermediate, and secondary levels. Our search indicated no meta-analysis focusing exclusively on the effectiveness of concept maps in teaching science to elementary and secondary students (Grade 3 to Grade 12). Therefore, this study evaluates the impact of concept mapping on student achievement in science within elementary, intermediate, and secondary education settings. We aim to bridge existing research gaps that pertain to concept mapping effectiveness, particularly the emphasis on science education and the curriculum relevance of concept maps in schools. It is important to note that the complexity of the curriculum for elementary and secondary school students may differ from that of higher education or adult learning contexts.

Effects of Moderator Variables

The effectiveness of concept maps is examined through various moderators, reflecting the diverse and dynamic nature of educational contexts, learning subjects, and learners. This wide range of moderator variables can shed light on the impact of concept maps on learning outcomes and on specific aspects of their use. Among the crucial moderators we explore are the following:

Learning Subject

Physics, chemistry, and biology are distinct scientific disciplines, each with its concepts, principles, processes, and facts. Students may encounter different learning challenges in physics, chemistry, and biology due to differences in the nature of experiments, equipment and settings, physical laws, chemical reactions and equations, biological processes, or the complexity of the mathematical concepts involved. Ernst Mayr (1997, p. 63) suggested that concepts may be more critical to biology (e.g., natural selection, adaptation, evolution, and ecology) than physical sciences, where discovering new facts may be more critical. Nevertheless, instruction in physical sciences and biology varies from the process of discovering new knowledge in these fields. Furthermore, analyzing the effectiveness of concept mapping in each learning subject allows for a tailored examination of how an instructional tool aligns with the distinct content of these three instructional domains.

Level of Achievement

Characteristics of learners may be relevant to the effectiveness of concept maps (Nesbit & Adesope, 2013). However, little is known about the effectiveness of concept mapping for low-achieving or high-achieving students. A few studies have suggested that concept maps could be an effective tool for science instruction for low-achieving students (Guastello et al., 2000; Morfidi et al., 2017; Roberts & Joiner, 2007).

In our review, most studies included students with mixed abilities or did not clearly define their participants’ level of achievement. Several studies did not state whether low-achieving or high-achieving students were among their participants. Only three studies that fulfilled the selection criteria focused exclusively on “low achieving” students (Guastello et al., 2000), “poor readers” (Morfidi et al., 2017), or students with autism (Roberts & Joiner, 2007), and their control groups had low-achieving learning characteristics similar to those of the experimental groups. In addition, among the seven studies with experimental groups designated by ability, we identified two other studies (Haugwitz et al., 2010; Schmid & Telaro, 1990) that also included “low achieving” control groups as clearly defined subgroups and reported all the information needed to run a second, more focused meta-analysis. Two other studies (Brown, 2000; Udupa, 1992) of these seven included experimental groups clearly defined as low achieving, but their control groups were of mixed ability; therefore, these two studies were excluded from the second meta-analysis, which concentrated specifically on exceptional learners, that is, either low-achieving or high-achieving students. Our literature search identified four studies (Gulati, 2005; Ling & Boo, 2007; Schmid & Telaro, 1990; Snead, 2000) that included experimental and control “high-achieving” groups as clearly defined subgroups in their analysis; that is, their control groups had high-achieving learning characteristics similar to those of the experimental groups. Although one study (Brown, 2000) included an experimental group clearly defined as high achieving, its control group was of mixed ability; for this reason, this study was not included in the second meta-analysis.

Teaching Guidance

Concept mapping is an adaptable instructional tool that can be implemented in various ways, from highly structured and directed by the teacher or researcher to minimally guided and constructed by students. We distinguish here three levels of teaching guidance: (a) concept maps constructed by learners through independent work, that is, lowly directed; (b) concept maps given or closely directed by teachers/researchers, that is, highly directed; and (c) concept maps scaffolded by teachers/researchers following a mixed or interactive approach (see Chang et al., 2002). Exploring the impact of different levels of teaching guidance allows us to understand how this teaching aspect of concept mapping can influence learning outcomes.

Grade Level and Level of Schooling

The complexity of concept maps may vary depending on the grade level or level of schooling. Cognitive processing also varies with age and level of schooling (Demetriou et al., 2011). For younger students in elementary school, concept maps may emphasize more straightforward concepts and entail fewer relationships. For example, the complexity of the cognitive process is determined by the number of interacting concepts involved (Demetriou et al., 2011; Halford et al., 1998). At higher grade levels, the complexity and interactivity of the concept elements may increase, aligning with the advanced subject matter (Sweller et al., 2011) and resulting in the construction of more intricate concept maps. Understanding how concept maps impact students at different grades and levels of schooling can help tailor teaching strategies to their cognitive needs.

Type of Technology

Concept maps can be either hand-drawn or computer-generated. Hand-drawn concept maps may prompt students to add or develop ideas and require students to put in a significant amount of effort (Erdogan, 2009). On the other hand, computer-generated concept maps can be more easily changed and expanded. Computer-generated concept mapping software usually offers pre-made templates of various maps that can be easily altered and visually improved by adding colorful pictures, symbols, or clip art (Sturm & Rankin-Erickson, 2002). There is evidence from individual studies that both can benefit students by promoting the development of science vocabulary, reading comprehension, and writing (Flanagan & Bouck, 2015; Morfidi et al., 2017; Sturm & Rankin-Erickson, 2002).

Collaboration Level

Concept maps can be individually or collaboratively constructed (Kwon & Cifuentes, 2007, 2009). The collaboration level among students (individual construction of concept maps vs. collaborative work during the intervention phase vs. mixed or not defined approach) may impact the effectiveness of concept mapping. For example, when students individually construct concept maps, they can engage deeply with the content and independently organize their understanding. However, some students may struggle to work with concept mapping independently, mainly if they have limited prior experience. Collaborative concept mapping encourages peer interaction, perhaps enabling some students to rectify misconceptions. However, successful collaborative work necessitates effective communication and coordination among students. In certain instances, dominant group members may exert disproportionate influence over the construction or understanding of the concept map.

Economic Level of Countries

The economic level of countries, whether classified as low and middle-income or high-income, may influence the effectiveness of concept mapping in several ways. For example, high-income countries have better access to educational resources, including technology, teaching materials, and well-equipped classrooms. In contrast, low and middle-income countries may face resource constraints that can affect the implementation and effectiveness of concept mapping. Conversely, in low-income countries where teacher training resources may be limited compared to high-income countries, concept mapping may be considered a more significant innovation.

Publication Type

The publication type (journals vs. dissertations and conference papers) may impact the pooled effectiveness of concept mapping. While journal articles typically undergo a rigorous peer-review process ensuring the quality and reliability of the research findings, the inclusion of dissertations and conference papers, that is, the inclusion of “grey literature,” helps mitigate publication bias and the “file drawer problem” (Paez, 2017). Thus, we included grey literature to reflect the existing evidential base fully, and we assessed the impact of this variation through a moderator analysis.

Considering moderator variables, the research questions for the current meta-analysis are structured as follows, aiming to delve into the multifaceted aspects of concept mapping's effectiveness:

  1. What effect does concept mapping have on elementary and secondary education students’ science achievement (physics, chemistry, and biology)?

  2. What are the characteristics and conditions of concept mapping instruction that enhance science achievement?

  3. What effect does concept mapping have on the science achievement of low-achieving and high-achieving students?

Methods

We performed the meta-analysis following the PRISMA guidelines (Forero et al., 2019; Moher et al., 2009). We initially conducted a systematic review of the literature for studies published in English. Studies were selected from the Academic Search Complete, ERIC, PsycINFO, Web of Science, ProQuest Dissertation Database, and conference papers from the American Educational Research Association (AERA) and National Association for Research in Science Teaching (NARST) (see Fig. 1).

Fig. 1 A PRISMA flow diagram of the search methods

Inclusion Criteria

We used inclusion criteria similar to those of Nesbit and Adesope (2006) and Schroeder et al.’s (2018) meta-analyses. The criteria for including studies in this meta-analysis were the following:

  1. Published studies in English;

  2. Articles published in peer-reviewed journals and other sources such as dissertations, theses, and conference papers (e.g., AERA, NARST);

  3. Using concept maps as a method to measure students’ achievement in science (biology, physics/earth science, and chemistry);

  4. Participants in Grades 3–12;

  5. Inclusion of a control group;

  6. Reported sufficient data to allow an estimate of standardized mean difference effect size;

  7. Assigned participants to groups before differing treatments;

  8. Randomly assigned participants to groups or used a pretest or other prior variable correlated with the outcome to control for preexisting differences between the groups.

Studies were excluded from this meta-analysis when they reported a pretest effect size outside the range −0.40 < d < 0.40 (Nesbit & Adesope, 2006). Furthermore, when the same authors reported a study in more than one source (e.g., a dissertation and a journal article), we used the version published as a journal article.
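The pretest screening rule stated above can be expressed as a simple check. The sketch below is illustrative only: the function name and the sample values are hypothetical, and it assumes the bounds are exclusive, as the inequality in the text suggests.

```python
def passes_pretest_screen(pretest_d, bound=0.40):
    """Return True when a pretest effect size lies strictly inside (-bound, bound)."""
    return -bound < pretest_d < bound


# Hypothetical pretest effect sizes, for illustration only (not data from the review).
candidate_pretest_d = {"study_a": 0.12, "study_b": -0.55, "study_c": 0.40}

# Keep only the studies whose groups were comparable at pretest.
retained = [name for name, d in candidate_pretest_d.items() if passes_pretest_screen(d)]
```

Under the exclusive-bound assumption, a study with a pretest effect size of exactly 0.40 would be screened out.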

Literature Search

The literature search focused on studies published between 1980 and May 2020. To identify potentially relevant studies, we used the following combination of broad search terms in a Boolean search: (“concept map*” OR “knowledge map*” OR “node-link map*”) AND (“science”).

Coding Procedure

After identifying the studies that met the selection criteria, we established a coding scheme aligned with our research questions. Two researchers independently read each of the 55 studies retained in the inclusion phase (see Fig. 1) to (a) select group comparisons consistent with the research questions of the meta-analysis and (b) code each comparison according to a predefined coding form.

The coding form included the (a) year of study, (b) subject area (biology vs. physics and earth science vs. chemistry), (c) publication type (journal vs. dissertation and conference papers), (d) country of research, which was recoded according to the World Bank (2020) classification criteria for country economic level into high-income country level vs. middle- and low-income country level (Serajuddin & Hamadeh, 2020), (e) grade level of students and their relevant level of schooling, (f) level of achievement (low-achieving students vs. groups designated by ability vs. mixed levels of achievement), (g) duration of intervention (low duration: < 2 weeks vs. medium duration: 2–4 weeks vs. high duration: > 4 weeks; for studies reporting duration in minutes, one week was treated as equivalent to 120 min), (h) the type of technology used by students (pencil-and-paper vs. computer-generated or digitalized maps), (i) degree of teaching guidance (highly directed: maps given or closely directed by teacher/researcher vs. lowly directed: constructed by students vs. mixed-scaffolded or interactive approach), and (j) collaboration level among students (individual construction of concept maps vs. collaborative work during the intervention phase vs. mixed or not defined approach) (see Table 1). The first author coded these features of the studies and looked for comparison treatments, including random assignment of participants or control for pre-intervention differences between the experimental and control groups. The second author double-coded the same features, resulting in an interrater reliability of 0.96. Any disagreements were resolved by consensus.
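The duration coding rule (item g) lends itself to a short illustration. The following is a minimal sketch, assuming that the 2–4 week band is inclusive at both ends; the function and constant names are hypothetical and not taken from the coding form.

```python
MINUTES_PER_WEEK = 120  # conversion stated in the coding form


def duration_category(weeks=None, minutes=None):
    """Classify intervention duration as 'low', 'medium', or 'high'.

    Pass either a duration in weeks or, for studies reporting minutes,
    a duration in minutes (converted at 120 min per week).
    """
    if weeks is None:
        if minutes is None:
            raise ValueError("provide weeks or minutes")
        weeks = minutes / MINUTES_PER_WEEK
    if weeks < 2:
        return "low"
    if weeks <= 4:  # assumption: the 2-4 week band includes both endpoints
        return "medium"
    return "high"
```

For example, a study reporting 180 minutes of instruction corresponds to 1.5 weeks and would be coded as low duration.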

Table 1 Overview of the studies included in the meta-analysis

Statistical Methods

We avoided introducing statistically dependent data, as Lipsey and Wilson (2001) have suggested. Thus, we avoided double or multiple data insertion from studies with one control group when experimental groups underwent different intervention approaches. In this case, we selected only one of the two or more experimental groups, the most relevant group to our meta-analysis, to calculate the standardized mean difference effect size. As effect size, we used the standardized mean difference (SMD):

$$d_{i} = \frac{{X_{1i} - X_{2i} }}{{S_{pi} }}$$

where \(X_{1i}\) and \(X_{2i}\) are the means of the two groups under comparison in the ith study, and \(S_{pi}\) is the pooled standard deviation:

$$S_{pi} = \sqrt {\frac{{\left( {n_{1i} - 1} \right)S_{1i}^{2} + \left( {n_{2i} - 1} \right)S_{2i}^{2} }}{{n_{1i} + n_{2i} - 2}}}$$

Usually, the sample estimate of the SMD is termed Cohen’s d, but it tends to overestimate the absolute value of the effect in small samples (Thompson & Sharp, 1999). This bias can be corrected using Hedges’ g, suggested by Hedges and Olkin (1985), which generates an unbiased estimate using a correction factor called J. The correction leads to \(g_{i} = Jd_{i} = d_{i} - 3d_{i}/(4n_{i} - 9)\), where \(n_{i} = n_{1i} + n_{2i}\) is the total sample size of the ith study.

The estimated variance of d is given by:

$${\text{var}} (d_{i} ) = s_{i}^{2} = \left( {\frac{1}{{n_{1i} }} + \frac{1}{{n_{2i} }}} \right) + \frac{{d_{i}^{2} }}{{2\left( {n_{1i} + n_{2i} } \right)}}$$

and subsequently \({\text{var}}(g_{i}) = J^{2}{\text{var}}(d_{i})\). The interpretation of the results was based on Cohen’s suggestion: an effect size < 0.20 was considered trivial, 0.20–0.50 small, 0.50–0.80 moderate, and > 0.80 large (Cohen, 1988).
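As an illustration of the formulas above, the following minimal sketch computes Hedges' g and its variance from group summary statistics. The function names and the input values used in the test are hypothetical; the sketch assumes that the sample size in the correction factor is the total of the two groups, and the handling of values falling exactly on Cohen's category boundaries is also an assumption, since the stated ranges overlap at 0.50.

```python
import math


def hedges_g(mean1, mean2, sd1, sd2, n1, n2):
    """Return (g, var_g) for one comparison from group means, SDs, and sizes."""
    # Pooled standard deviation of the two groups.
    s_pooled = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))
    d = (mean1 - mean2) / s_pooled
    # Estimated variance of d.
    var_d = (1 / n1 + 1 / n2) + d ** 2 / (2 * (n1 + n2))
    # Small-sample correction factor J (assumes n_i = n1 + n2).
    j = 1 - 3 / (4 * (n1 + n2) - 9)
    return j * d, j ** 2 * var_d


def cohen_label(effect):
    """Label an effect size using the thresholds cited in the text."""
    size = abs(effect)
    if size < 0.20:
        return "trivial"
    if size < 0.50:
        return "small"
    if size <= 0.80:  # assumption: 0.50 and 0.80 both count as moderate
        return "moderate"
    return "large"
```

With two hypothetical groups of 30 students, means of 75 and 70, and a common standard deviation of 10, d equals 0.50 and the corrected g is slightly smaller, which shows the direction of the small-sample correction.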

We used the random-effects model for the meta-analysis as proposed by DerSimonian and Laird (1986). The method introduces an additive component of the between-studies variance (\(\tau^{2}\)), also known as between-study heterogeneity, and we hypothesize that the true effect varies between studies,

$$g_{i} \sim N\left( {g,s_{i}^{2} + \tau^{2} } \right).$$

Then, to obtain a pooled (weighted) estimate of g, we used:

$$\widehat{g} = \frac{{\sum\nolimits_{i = 1}^{k} {w_{i} g_{i} } }}{{\sum\nolimits_{i = 1}^{k} {w_{i} } }}$$

The weights are given by the inverse-variance estimate used in the meta-analysis, with \(w_{i} = \left( {\tau^{2} + s_{i}^{2} } \right)^{ - 1}\) (Normand, 1999; Petiti, 1994). Inferences about the overall effect are based on the normal approximation since:

$${\text{var}} \left( {\widehat{g}} \right) = \frac{1}{{\sum\nolimits_{i = 1}^{k} {w_{i} } }}$$
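The pooling step can be sketched as follows. The text does not spell out how the between-studies variance is estimated; the sketch assumes the standard DerSimonian and Laird moment estimator, and the function name and the values used in the test are hypothetical. It is an illustration of the approach, not a reproduction of the Stata routines used in the analysis.

```python
import math


def dersimonian_laird(effects, variances):
    """Random-effects pooling; returns (pooled, var_pooled, tau2, ci95)."""
    k = len(effects)
    # Fixed-effect (inverse-variance) weights, needed for the moment estimator.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * gi for wi, gi in zip(w, effects)) / sum_w
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, effects))
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    # Moment estimate of the between-studies variance, truncated at zero.
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    # Random-effects weights: inverse of within-study plus between-studies variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * gi for wi, gi in zip(w_star, effects)) / sum(w_star)
    var_pooled = 1.0 / sum(w_star)
    half_width = 1.96 * math.sqrt(var_pooled)  # normal approximation
    return pooled, var_pooled, tau2, (pooled - half_width, pooled + half_width)
```

Adding the between-studies variance to every study's variance flattens the weights, so small studies count relatively more than they would under a fixed-effect model.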

Heterogeneity is not easily evaluated using \(\tau^{2}\). Thus, we used the standard Cochran’s Q and the inconsistency index (\(I^{2}\)), which ranges between 0 and 100% and provides easy quantification of the variability between studies. The potential sources of heterogeneity were investigated using subgroup analysis or meta-regression analysis using study-level characteristics (Thompson & Sharp, 1999).
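For readers unfamiliar with these two statistics, a minimal sketch follows. It assumes the usual definitions: Q is the weighted sum of squared deviations from the fixed-effect mean, and the inconsistency index is the share of Q in excess of its degrees of freedom, truncated at zero. The function names and test values are hypothetical.

```python
def cochran_q(effects, variances):
    """Cochran's Q: inverse-variance weighted squared deviations from the fixed-effect mean."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * gi for wi, gi in zip(w, effects)) / sum(w)
    return sum(wi * (gi - mean) ** 2 for wi, gi in zip(w, effects))


def i_squared(q, k):
    """Inconsistency index in percent: 100 * (Q - df) / Q, with df = k - 1, floored at 0."""
    df = k - 1
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0
```

Under these definitions, a Q value no larger than its degrees of freedom gives an index of 0%, meaning the observed spread is no greater than sampling error alone would produce.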

Influential studies were identified by removing one study at a time from the meta-analysis and observing the change in the pooled estimate. Publication bias was assessed using the rank correlation method (Begg & Mazumdar, 1994), Egger’s regression method, and its random-effects analog (Egger et al., 1997). The nonparametric “trim and fill” method of Duval and Tweedie (2000) was also used to estimate the number of “missing studies.” The potential time trend of the combined effect over the years was evaluated by applying the standard cumulative meta-analysis approach (Lau et al., 1995) together with two formal tests for time trends: the “first vs. subsequent” comparison (Ioannidis & Trikalinos, 2005) and the Generalized Least Squares (GLS) regression-based test (Bagos & Nikolopoulos, 2009). The statistical package used in the analysis was Stata 13 (Stata Corporation, 2013), and statistical significance was set at p ≤ 0.05.
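The leave-one-out procedure used to identify influential studies can be sketched as follows. This is an illustrative sketch rather than the Stata routine used in the analysis: the helper names and data are hypothetical, and the pooling helper assumes the DerSimonian and Laird moment estimator for the between-studies variance.

```python
def pool_random_effects(effects, variances):
    """DerSimonian-Laird pooled estimate for one set of studies."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * gi for wi, gi in zip(w, effects)) / sum_w
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, effects))
    c = sum_w - sum(wi * wi for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]
    return sum(wi * gi for wi, gi in zip(w_star, effects)) / sum(w_star)


def leave_one_out(effects, variances):
    """Pooled estimates obtained by dropping each study in turn (input order)."""
    results = []
    for i in range(len(effects)):
        rest_g = effects[:i] + effects[i + 1:]
        rest_v = variances[:i] + variances[i + 1:]
        results.append(pool_random_effects(rest_g, rest_v))
    return results


# Hypothetical effect sizes and variances; the last study is an outlier.
example_g = [0.2, 0.5, 0.8, 3.0]
example_v = [0.04, 0.04, 0.04, 0.04]
full = pool_random_effects(example_g, example_v)
loo = leave_one_out(example_g, example_v)
# Index of the study whose removal shifts the pooled estimate the most.
most_influential = max(range(len(loo)), key=lambda i: abs(loo[i] - full))
```

In this hypothetical example, removing the outlying fourth study moves the pooled estimate furthest, so it is flagged as the most influential.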

Results

On May 12, 2020, we searched for studies published from 1980 until the date the search was conducted in the following databases: Academic Search Complete (ASC; 1245), ERIC (835), PsycINFO (970), and Web of Science (WoS; 2050); the number of studies returned is indicated in parentheses. In the first search (identification phase), we found 5100 studies, many of which addressed topics other than the effectiveness of concept mapping in science. Thus, after reviewing the records, 4803 studies were eliminated as irrelevant. After removing duplicates (n = 113) across databases, 191 articles were retained and further screened by their abstracts. On May 14, 2020, we searched for dissertation studies in the ProQuest Dissertations and Theses Database published from 1980 until the search date, and 429 dissertations were retrieved. After reviewing the records, 372 studies were eliminated as irrelevant, and 57 dissertations were further screened by their abstracts. On May 26, 2020, we searched (a) the titles of papers presented at the AERA conferences between 2010 and 2019 in their online paper repository and (b) the titles and abstracts of papers presented at the NARST program conferences between 2007 and 2019. We also conducted a Google Scholar search to find additional papers and grey literature using the same terms. Titles and abstracts were screened for eligibility and discarded if they did not meet the inclusion criteria. This search yielded 123 and 38 papers of potential relevance from online paper repositories and Google Scholar, respectively; thus, 161 papers were further screened by their abstracts. The search results from all data sources are presented in Fig. 1, employing the PRISMA template (Moher et al., 2009).

Of the 409 studies screened by abstract, 262 were irrelevant to concept mapping effectiveness in science (87 from databases, 21 from the ProQuest Dissertations and Theses database, and 154 from conferences, etc.). After meticulously reviewing the remaining 147 studies for their content and applying our inclusion criteria, 55 eligible articles and papers were included in the present meta-analysis. Among those 55 studies, 38 were journal articles, 15 were dissertations (including one Master’s thesis), and two were proceeding papers. Figure 1 illustrates the stages of the selection process of the studies included in the final analysis in a PRISMA Flow Diagram.

The 55 studies yielded 58 comparisons with non-dependent data (Lipsey & Wilson, 2001). Three of the 55 original studies (Hagemans et al., 2013; Redford et al., 2012; Spaulding, 1989) were larger research projects, each comprising two distinct studies with their own separate control groups. Based on the 58 comparisons in the meta-analysis, the total number of participants was 5,364. Of them, 2,903 participants were part of the experimental groups, and 2,461 were part of the control groups. The overall effect size based on the 58 comparisons was estimated at 0.776 (95% CI: 0.597, 0.956; p < 0.0001), suggesting a moderate to large effect of the intervention (see Fig. 2).

Fig. 2 Results from meta-analysis by subject and weighted mean effect sizes

The between-studies heterogeneity is high, with an I² equal to 89.1% (see Higgins et al., 2003). The methods for detecting publication bias suggest a substantial chance for this or other small-study-related bias (p = 0.035 for Begg’s test, p = 0.007 for Egger’s test, and p < 0.0001 for the random effects regression test; Fig. 3).
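For readers less familiar with the heterogeneity index, the sketch below shows the usual definition of I² in terms of Cochran's Q and its degrees of freedom. Q is not reported in the text, so the value used here is back-calculated from the reported I² and k = 58 purely for illustration; the function names are hypothetical.

```python
def i_squared(q, df):
    """Higgins' I^2 in percent: share of total variability beyond chance."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

def implied_q(i2_percent, df):
    """Cochran's Q implied by a reported I^2 (valid for 0 <= I^2 < 100)."""
    return df / (1.0 - i2_percent / 100.0)

k = 58       # comparisons in the overall analysis (from the text)
df = k - 1

# Q is NOT reported in the text; it is back-calculated here from I^2 = 89.1%.
q = implied_q(89.1, df)

print(round(q, 1))                 # about 522.9
print(round(i_squared(q, df), 1))  # 89.1
```

An I² of this size means that Q is roughly nine times its degrees of freedom, which is why the between-studies heterogeneity is described as high.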

Fig. 3 Funnel plot for detecting publication bias

The trim and fill method detected asymmetry in the funnel plot and estimated that an additional 16 studies, if appended to the dataset, would “correct” it. This addition would produce an attenuated but still highly significant estimate (g = 0.403, 95% CI: 0.198, 0.608). The influence analysis identified the study of Guastello et al. (2000) as the most influential one (i.e., the one with the largest effect sizes), followed by the studies of Morfidi et al. (2017) and Jena (2019). Interestingly, when the studies of Guastello et al. and Morfidi et al. were excluded from the analysis, the pooled estimate was reduced to 0.668 (95% CI: 0.519, 0.817), the overall heterogeneity was reduced to 84%, and the evidence of publication bias was no longer statistically significant (p = 0.182 for Begg’s test, p = 0.073 for Egger’s test, and p = 0.082 for the random effects regression test). The standard cumulative meta-analysis provides evidence for increased effect size over time (Fig. 4).
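To make the logic of Egger's test more concrete, the sketch below regresses the standardized effect (g / SE) on precision (1 / SE) and reads funnel-plot asymmetry off the intercept. It is a minimal illustration, not the authors' procedure: the standard errors and effects are hypothetical values constructed for the example, and the significance test on the intercept is omitted.

```python
def egger_intercept(effects, std_errors):
    """OLS of (effect / SE) on (1 / SE); returns (intercept, slope).

    The intercept estimates funnel-plot asymmetry; the slope estimates
    the underlying effect. The t-test on the intercept is omitted.
    """
    y = [g / s for g, s in zip(effects, std_errors)]
    x = [1.0 / s for s in std_errors]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical standard errors, NOT taken from the 58 comparisons.
ses = [0.10, 0.15, 0.20, 0.30, 0.40, 0.50]
true_effect = 0.5

# Case 1: every study reports the same effect, so the funnel is symmetric.
symmetric = [true_effect for _ in ses]
# Case 2: less precise (smaller) studies report systematically larger effects.
biased = [true_effect + 1.2 * s for s in ses]

sym_intercept, sym_slope = egger_intercept(symmetric, ses)
bias_intercept, bias_slope = egger_intercept(biased, ses)

print(round(sym_intercept, 6), round(sym_slope, 6))    # intercept near 0
print(round(bias_intercept, 6), round(bias_slope, 6))  # intercept near 1.2
```

In the constructed second case, the intercept recovers the size of the small-study effect that was built into the data, which is the pattern a significant Egger's test points to.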

Fig. 4 The standard cumulative meta-analysis shows increased effect size over time

Mean effect sizes varied from moderate to large based on the subject area. For biology (k = 31 comparisons from 30 studies), the mean effect size was moderate (g = 0.671; 95% CI: 0.492, 0.851; p < 0.0001), but the between-studies heterogeneity was high (I² = 80.8%). For chemistry (k = 7 studies), the mean effect size was smaller and not statistically significant (g = 0.590; 95% CI: -0.204, 1.385; p = 0.145), with very high between-studies heterogeneity (I² = 95.0%). For physics (k = 20 comparisons from 19 studies), the mean effect size was large (g = 1.040; 95% CI: 0.656, 1.423; p < 0.0001), but the between-studies heterogeneity was also high (I² = 92.2%). In subgroup analyses, the effect sizes obtained from biology, physics/earth science, and chemistry did not differ significantly (p = 0.220).
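The non-significant subgroup difference can be approximately reconstructed from the reported subgroup estimates alone. The sketch below is one plausible reading, not necessarily the procedure the authors used: it assumes normal-theory confidence intervals, derives inverse-variance weights from them, and computes a between-subgroup Q statistic with two degrees of freedom.

```python
import math

Z_975 = 1.959964  # two-sided 95% critical value of the standard normal

# Subgroup estimates and 95% CIs as reported in the text: (g, lower, upper).
subgroups = {
    "biology": (0.671, 0.492, 0.851),
    "chemistry": (0.590, -0.204, 1.385),
    "physics": (1.040, 0.656, 1.423),
}

def q_between(groups):
    """Between-subgroup Q statistic using inverse-variance weights."""
    weights, estimates = [], []
    for g, lower, upper in groups.values():
        se = (upper - lower) / (2 * Z_975)  # assumes a normal-theory CI
        weights.append(1.0 / se ** 2)
        estimates.append(g)
    pooled = sum(w * g for w, g in zip(weights, estimates)) / sum(weights)
    return sum(w * (g - pooled) ** 2 for w, g in zip(weights, estimates))

q = q_between(subgroups)
dof = len(subgroups) - 1

# For exactly 2 degrees of freedom the chi-square survival function is exp(-q / 2).
p_value = math.exp(-q / 2) if dof == 2 else float("nan")

print(round(q, 2), round(p_value, 2))
```

Under these assumptions the p-value comes out near 0.22, close to the reported p = 0.220. The wide chemistry interval gives that subgroup very little weight, so the comparison is driven mainly by biology and physics.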

The level of achievement (low achieving students vs. groups designated by ability vs. mixed levels of achievements) also yielded significant differences (see Table 2), but this was related to the influential studies of Guastello et al. (2000) and Morfidi et al. (2017), which were performed on low-achieving learners; should these two studies be excluded, the effect disappears.

Table 2 Significant moderators

The degree of teaching guidance (high guidance vs. low guidance vs. mixed approach) seems to be a significant factor with a higher effect size for high teaching guidance (see Table 2). Still, after removing the two influential studies, this result was marginal (p = 0.074).

Regarding the grade level, we did not find significant differences among grades, but the lack of significance could be attributed to the limited number of studies within each grade. To explore this further, we made two distinct categorizations of the schooling level. First, similar to Horton et al. (1993), we assessed the effectiveness of concept maps at three schooling levels: elementary (Grades 3–6), middle (Grades 7–9), and high school (Grades 10–12). There were 16 studies containing 1,015 students from Grades 3–6 (elementary), 22 studies with 2,501 students from Grades 7–9 (middle level), and 20 studies with 2,059 students from Grades 10–12 (high level). In the subgroup analysis, the three categories did not yield significant differences (g = 0.836, 0.821, 0.673, respectively, p = 0.76).

Second, like Nesbit and Adesope (2006) and Schroeder et al. (2018), we made an alternative categorization of the schooling level, distinguishing between the intermediate level (Grades 3–8; encompassing 31 studies and 2783 students) and the secondary level of schooling (Grades 9–12; encompassing 27 studies and 2792 students). The results showed a significant difference (g = 0.959 vs. 0.587, p = 0.048), suggesting that the effect is larger in intermediate grades (see Table 2). However, this significant difference may be attributable to the larger proportion of studies investigating chemistry in the secondary grades (7 compared to 0 studies) and the lower proportion of the studies investigating physics (4 compared to 16 studies). We note that in the overall analysis, chemistry had a smaller effect size compared to physics and biology (0.59 vs. 1.04 and 0.671), even though this learning subject difference was not significant.

Regarding the economic level of countries, the effect size from studies conducted in high-income countries is significantly smaller than that from studies in low- and middle-income countries (g = 0.626 vs. 1.053); this difference in effect sizes was not affected by the two aforementioned influential studies (see Table 2). Furthermore, journal publications had larger effect sizes than dissertations and conference papers, and this difference was statistically significant (g = 1.039 vs. 0.254, see Table 2). The year of publication is also related to increased effect size over time, but the proportion of journal publications also increased over time (i.e., in Web of Science, ERIC, and PsycINFO databases; see Nesbit & Adesope, 2013). Finally, various features of studies such as technology (pencil-and-paper vs. digitalized), learning environment (individual vs. collaborative vs. mixed or not clearly defined), and duration of intervention did not show significant differences (data not shown in Table 2).

Building a model with multiple variables, we ended up with publication type (journals vs. dissertations and conference papers), country of research by economic level (high income vs. middle- and low-income), and type of learners, resulting in a 46.63% reduction in between-studies variability. Regarding research question 3, we conducted a separate meta-analysis, including studies focusing on exceptional learners (low-achieving and high-achieving students). The mean effect size was large for low-achieving students (g = 2.019; k = 5) and moderate for high-achieving students (g = 0.610; k = 4), even though the between-groups difference did not reach statistical significance (p = 0.261), likely due to the small number of studies in this area (Fig. 5).

Fig. 5 Results from the second meta-analysis focusing exclusively on low-achieving and high-achieving learners

Discussion

This meta-analysis examined the effectiveness of concept maps in improving students’ science achievement in elementary and secondary education. Specifically, it focused on science school subjects and included 55 studies with 58 comparisons with experimental and quasi-experimental designs. Overall, the results of the present meta-analysis suggest that concept maps are an effective teaching tool to increase students’ science achievement (g = 0.776), even after two influential studies were excluded from the analysis (g = 0.668). In a small subset of their meta-analysis, Nesbit and Adesope (2006) found a similar effect size (g = 0.742, k = 10) when learners constructed concept maps rather than attended a lecture or discussion. Nonetheless, their findings revealed more modest weighted mean effect sizes for concept map construction under two specific learning conditions: (a) in physical science (g = 0.283, k = 6), and (b) in the domains of general science, biology, and statistics (g = 0.522, k = 9). Our findings are also consistent with the meta-analysis of Schroeder et al. (2018), which indicated that concept maps were superior to other instructional comparison conditions and effective across science, technology, engineering, and math (STEM) (g = 0.60, k = 118), although their meta-analysis was more generic and not specifically focused on science for students in elementary and secondary education.

In short, evidence from the present meta-analysis suggests that concept maps, compared to traditional instructional methods such as attending lectures, reading text passages, and participating in class discussions, are more effective for facilitating science education. This effectiveness can be explained from three theoretical perspectives. First, concept mapping promotes meaningful engagement and meaningful learning (Novak & Cañas, 2008; Novak & Gowin, 1984; Novak, 2010a). Second, due to their incorporation of both verbal and visual components, concept maps align with the principles of dual coding theory, leveraging the memory system’s capacity to store and organize information in a more interconnected and resilient manner (Kirschner & Hendrick, 2020). Third, concept maps serve as a learning tool, simplifying complex verbal information, omitting unnecessary verbal clutter, and presenting key concepts concisely, thus reducing extraneous cognitive load. Furthermore, concept maps’ graphical and organizational or relational format offers an additional layer of cognitive support. These visual representations provide a certain degree of organization, enabling students to structure their learning materials into coherent schemas. The organizational structure of concept maps can ultimately contribute to reducing the extraneous cognitive load (Amadieu et al., 2009; Kirschner & Hendrick, 2020; Schroeder et al., 2018). For example, biology instruction places a significant emphasis on understanding biological concepts and processes, such as photosynthesis, cellular respiration, homeostasis, and metabolism, and understanding relationships between different elements, such as species interactions within ecological webs. Concept maps can illustrate hierarchical, sequential, and cause-and-effect relationships. 
Taken together, the visual clarity and organizational capability of concept maps offer a plausible mechanism for reducing extraneous cognitive load and enhancing learning outcomes in science education.

Learning Subject

Concept mapping seems to be an effective teaching strategy to enhance the science learning of students across science disciplines: biology (g = 0.671, k = 31), physics (g = 1.040, k = 20), and chemistry (g = 0.590, k = 7). Further research is needed for chemistry because of the small number of existing studies (k = 7) and very high heterogeneity.

Evidence-based practices (EBPs) fulfill sound criteria and rigorous indicators (Cook & Cook, 2013). While concept mapping has shown strong evidence-based potential in biology and physics, the case for its effectiveness in chemistry is promising but less well established, since the pooled effect size from the chemistry studies was smaller and did not reach statistical significance.

There may be some intrinsic reasons why concept maps typically do a better job in physics than chemistry. Instruction in physics typically places a strong emphasis on theoretical concepts and their application to real-world phenomena. Concept maps can help students bridge the gap between abstract theories and physical phenomena. In contrast, chemistry instruction may have a more experimental focus, where students engage in laboratory work, potentially reducing the reliance on concept maps to bridge theory and practice. Furthermore, in chemistry instruction, other visual aids, such as chemical diagrams, charts, and models, may assume a more prominent role when compared to physics and biology.

Type of Learners

Our first meta-analysis focusing on the level of achievement (low-achieving students vs. groups designated by ability vs. mixed levels of achievements) yielded a statistically significant difference (g = 3.244 vs. 0.551 vs. 0.691), suggesting that the effect is larger in low-achieving students. However, this finding was predominantly influenced by the influential studies conducted by Guastello et al. (2000) and Morfidi et al. (2017), which focused on low-achieving learners; should these two studies be excluded, the effect disappears.

Our third research question focused on the effects that concept mapping has on the science achievement of low-achieving and high-achieving students. For this purpose, we conducted a secondary meta-analysis, specifically incorporating studies that exclusively centered on exceptional learners: low-achieving and high-achieving students. The mean effect size was large for low-achieving students (g = 2.019; k = 5) and moderate for high-achieving students (g = 0.610; k = 4). However, the difference between these two groups did not attain statistical significance, owing to the limited number of studies in the comparison. Overall, our findings are encouraging regarding the effectiveness of concept mapping for low-achieving students.

It has also been hypothesized that by using concept mapping rather than plain text, low-achieving students can better understand the graphic representations of relationships among concepts (Miller, 2016). In addition, the typical hierarchical organization of ideas in concept mapping could improve information organization by low-achieving students (Boyle, 1996; Morfidi et al., 2017).

Teaching Guidance

Our results indicate a statistically significantly higher effect size for concept mapping interventions with high levels of teaching guidance (g = 1.792) when compared to those with low teaching guidance (g = 0.470) or scaffolded/interactive approaches (g = 0.905). Nevertheless, it is worth noting that after excluding the two influential studies that specifically targeted low-achieving students, this finding became marginally significant (p = 0.074). In practical terms, this marginal finding suggests that students tend to benefit more when teachers exert high levels of teaching guidance in concept mapping. Conversely, when students are expected to construct concept maps independently, the impact on their learning outcomes tends to be lower. However, this difference may be influenced by the type of learners involved in concept mapping instruction. Low-achieving students may derive greater benefits from explicit instruction in mapping concepts, which includes elements such as clear goals, logical sequencing, guided practice, immediate corrective feedback, comprehension checks, and mastery learning (Hughes et al., 2017; Kirschner & Hendrick, 2020; Kirschner et al., 2022; Pullen & Hallahan, 2015). Additionally, concept maps that are less directed by teachers may result in a lower degree of structure, which does little to help learners organize the material effectively and may therefore fail to adequately reduce cognitive load (Amadieu et al., 2009).

Level of Schooling

Our results indicated that the impact of concept mapping is more pronounced at the intermediate level of schooling (Grades 3–8) than at the secondary level. However, this statistically significant difference may be influenced by the learning subject, that is, the higher proportion of studies focusing on chemistry within the secondary grades group and the relatively lower proportion of studies addressing physics.

Consistent with the above finding, Horton et al. (1993) found moderate to high mean effect sizes in elementary school and college and small effect sizes in middle school and high school. Also, Nesbit and Adesope (2006) found high effect sizes at the intermediate school level (Grades 4 to 8) and a negligible effect size at high school (Grades 9–12). Further research is needed to analyze how concept maps affect information processing and cognitive load at different levels of schooling.

Economic Level

Our results demonstrate a significantly higher effect size for concept mapping interventions when carried out in low- or middle-income countries compared to those conducted in high-income countries. Low- and middle-income countries often have limited educational resources and may face more significant challenges in delivering advanced instructional methods. In such contexts, concept mapping may be seen as an innovative approach that stands out as a more effective tool for enhancing learning outcomes.

Publication Type

We found a significantly higher effect size for concept mapping studies published in journals than for those in the grey literature (journals vs. dissertations and conference papers). In general, empirical evidence has shown consistent differences between studies published in journals and those in the grey literature; studies published in journals are more likely to contain statistically significant results and larger effect size estimates compared to unpublished studies (McAuley et al., 2000; Paez, 2017). As in other areas of intervention research, our meta-analysis shows that excluding grey literature may lead to an overrepresentation of studies with statistically significant findings, primarily because studies with significant outcomes are more prone to publication than those with null or negative results (Paez, 2017). This can result in inflated and less precise effect size estimates for concept mapping interventions compared to a meta-analysis incorporating grey literature.

In addition, we found statistically significant differences by year of publication, with more recent studies reporting larger effect sizes. Finally, technology (pencil-and-paper vs. computer-generated maps or digitalized), learning environment (individual vs. collaborative vs. mixed or not clearly defined), and duration (low vs. medium vs. high duration) did not yield statistically significant differences.

Implications for Research and Practice

The present meta-analysis indicates that concept mapping holds significant promise as an evidence-based instructional approach in biology and physics school instruction. However, while promising, the evidence supporting its effectiveness in chemistry instruction appears to be less robust, as indicated by the combined effect sizes from these studies, which tend to be smaller. Additional research is needed to establish its effectiveness in chemistry instruction due to the limited number of existing studies and the significant heterogeneity observed.

Moreover, further research is needed to explore the instructional aspects and learning conditions that influence the effectiveness of concept maps as a teaching tool in science school classrooms. Only five studies that referred to low-achieving students were included in the meta-analysis without providing information on whether students had learning disabilities (LD). Therefore, well-designed and rigorous empirical studies are required to examine whether concept mapping tools are effective for students across different performance levels. Specifically, research on the effectiveness of concept maps is needed to advance our knowledge about evidence-based practices for low-achieving students and students with LD. Similarly, the perception that concept maps are a student-friendly instructional method for students with LD needs further investigation.

Notably, Oliver (2009) pointed out the shortcomings of using concept maps as a sole learning strategy, without other techniques. Concept maps may not necessarily improve learning outcomes in the same way across different achievement levels. However, they may be effective if tailored to students’ cognitive needs and learning abilities. Teachers should consider aligning concept maps with instruction and other evidence-based practices.

Limitations

Although the current meta-analysis followed all available guidelines and best practices, it cannot avoid limitations. High heterogeneity is a significant concern of the present meta-analysis, denoting that the studies in the meta-analysis produced estimates that differ significantly from one another. Publication bias was also detected in this meta-analysis. The funnel plot showed evidence of publication bias, as there were influential studies with low-achieving students that reported highly positive effects of concept mapping instruction. When the studies of Guastello et al. and Morfidi et al. were excluded from the analysis, although the pooled estimate was reduced from 0.776 to 0.668, the evidence of publication bias was eliminated.