Introduction

Design thinking (DT) is attracting more and more attention and interest worldwide (Aris et al., 2022). DT was introduced by Rowe (1987) and was first applied in education in 2005 (Çeviker-Çınar et al., 2017). Today, DT has been widely applied in nearly all stages of education (Pande and Bharathi, 2020), from formal to informal educational contexts (Aris et al., 2022). DT is a process, a method (Rowe, 1987), or a “philosophy” (Çeviker-Çınar et al., 2017). In education, DT is a teaching method and a learning orientation that enables learners to generate creative ideas and impactful change and actively explore problem solutions (Beckman and Barry, 2007; Lor, 2017; Retna, 2016). DT can help solve many fundamental educational issues (Koh et al., 2015). However, previous studies did not reach consensus about DT’s effects on student learning. Moreover, eliciting DT is not always easy because of its complexity and open-endedness (Becker and Mentzer, 2015). Therefore, this study carried out a meta-analysis to examine the relationship between DT and student learning.

Conceptual framework

Design thinking

DT has various definitions. The most widely used definition in education is proposed by Razzouk and Shute (2012): “an analytic and creative process that engages a person in opportunities to experiment, create and prototype models, gather feedback, and redesign.” DT is a promising, practical method that can be applied to education (Brown, 2008; Rusmann and Ejsing-Duun, 2022). It is often integrated into the teaching process as an instructional method. DT consists of a set of logically organized stages or processes, each aimed at cultivating students’ key competencies. When students are engaged in DT instruction, they need to follow DT’s steps to move forward with their projects, thereby increasing their ability to perform better. DT also targets problems in real situations (Xu et al., 2024), which could increase students’ interest, motivation, and engagement (Grau and Rockett, 2022; Lin et al., 2020a). In sum, DT has become a dynamic, nonlinear, and spiraling process that can facilitate deep learning (Liu and Li, 2023) and eventually result in better student performance (Howard et al., 2021).

DT emphasizes learner-centeredness (Glen et al., 2014), which can help teachers and students cope with 21st century challenges and complex real-world problems (Gleason and Jaramillo Cherrez, 2021; Xu et al., 2024; Yande, 2023). For teachers, DT provides a framework for solving complex and emerging problems (Henriksen et al., 2020a); DT also provides good solution strategies and guidance for teachers to design innovative instruction and improve instruction. For students, DT can improve class participation and learning intention, create a favorable atmosphere and enjoyment, enhance peer interaction and creative confidence, deepen their discussion on projects, and eventually improve teachers’ instruction (Balakrishnan, 2022; Tu et al., 2018). Moreover, DT can also nurture the competencies necessary for students, such as communication, collaboration, teamwork, problem-solving skills, creativity, empathy, critical thinking, and metacognition (Abolhasani et al., 2021; Balakrishnan, 2022; Guaman-Quintanilla et al., 2023; Retna, 2016; Rusmann and Ejsing-Duun, 2022). In general, the value of DT in education is to help students grow, empower teachers’ development, and promote teaching change.

DT models

DT has gradually become the new normal, with students readily embracing the DT process and appreciating its merits (Retna, 2016). Meanwhile, a variety of DT models have been proposed for use in different domains. Simon (1969) proposed the first DT model, which entails a one-way linear process of three steps: analysis, synthesis, and evaluation. The most widely applied model in education, especially in school and university settings, is the Stanford model (Liu et al., 2024a), which has five stages: empathize, define, ideate, prototype, and test (EDIPT) (Plattner, 2009). IDEO (2013) defined five stages of DT for educators: discovery, interpretation, ideation, experimentation, and evolution. To apply DT in K-12 (Liu and Li, 2023), Carroll et al. (2010) extended the EDIPT model to six stages, i.e., understand, observe, point of view, ideate, prototype, and test. Brown’s DT model, which has also been widely used, has three stages: inspiration, ideation, and implementation (Brown, 2008). The Design Council’s DT model assists designers and non-designers in solving some of the most complex social, economic, and environmental problems. It has four stages: discover, define, develop, and implement (Design Council, 2015). The DT model selected should aim to meet both students’ needs and instructional goals (Brannon, 2022). It should be noted that the processes contained in different DT models vary and may therefore produce different results.

DT’s effects and research gaps

Recently, a growing number of studies have investigated the impacts of DT on students’ learning performance. However, there is no consensus on the effectiveness of DT. The results can be classified into three types: (a) DT can promote students’ learning significantly (Albay and Eisma, 2021; Bawaneh and Alnamshan, 2023; Chang and Tsai, 2021; Dawbin et al., 2021; Hsiao et al., 2017; Kuo et al., 2022; Ladachart et al., 2022; Lin et al., 2020a; Liu and Ko, 2021; Nazim and Mohammad, 2022; Padagas, 2021; Pratomo and Wardani, 2021; Simeon et al., 2022; Tsai, 2015; Ziadat and Sakarneh, 2021); (b) DT does not significantly enhance student learning (Khongprakob and Petsangsri, 2022; Kuo et al., 2022; Lin et al., 2020b; Yalçın and Erden, 2021); (c) there are negative correlations between DT and learning outcomes (Chou and Shih, 2022; Lake et al., 2021).

It can be seen that DT’s effectiveness is still questionable. DT is an emerging topic that needs in-depth investigation (Baker III and Moukhliss, 2020). Some research gaps need to be addressed urgently. First, specific guidance and references on DT instruction are lacking. In-service teachers are unfamiliar with DT (Bressler and Annetta, 2022; Liu et al., 2024a), which may reduce DT’s effects. Students may also experience confusion and frustration when participating in DT courses (Glen et al., 2015; Razali et al., 2022). Therefore, it is crucial to explore the conditions under which the DT approach is more appropriate for the classroom setting (Lor, 2017). For instance, what is the most effective class size, team size, duration, or DT model? Second, DT’s effects are questioned (Rao et al., 2022). Namely, systematic assessment of DT’s effectiveness is limited (Liedtka, 2015). There is no meta-analysis to deliver robust evidence on the effectiveness of DT in education. To summarize, with DT’s widespread introduction into education, performing a meta-analysis to reveal DT’s overall effects on student performance and possible influencing moderators is necessary and valuable.

Research purpose

Considering that there is no quantitative, comprehensive evidence on DT’s effects in education, we sought to answer the following questions:

RQ1. What are the research characteristics of the included empirical studies of DT on student learning (e.g., publication year, research design, class size, grade level, duration, subject, team size, DT model, and region/countries)?

RQ2. What is the overall effect of DT on student learning?

RQ3. What are the DT’s effects on student learning under the potential moderators (e.g., learning outcome, class size, grade level, duration, subject, team size, DT model, and region)?

Method

Compared to a mere literature review, meta-analysis can provide precise quantitative effects (Grant and Booth, 2009). Meta-analysis can integrate various empirical research results to calculate the overall effect value (Lipsey and Wilson, 2001). This research was conducted based on the process proposed by Field and Gillett (2010).

Literature searching

We mainly retrieved the documents from the Web of Science (Core Collection), Scopus, and Google Scholar. The topic words (“Design Thinking”) AND (“Learning Performance” OR “Learning Outcomes” OR “Academic Achievement” OR “Academic Performance”) were combined to search for the target documents. The search span was confined from January 2005 to June 2023. A total of 1204 articles were retrieved in the preliminary search, and 1059 articles remained after duplicates were removed.

Selecting criteria and process

We selected literature based on the following criteria:

(1) It must report the relationship between DT and student learning performance;

(2) It must be an empirical study (experimental, quasi-experimental, or correlational research);

(3) The research participants should receive intervention through DT teaching;

(4) It should provide necessary data for calculating effect sizes in targeted papers (e.g., sample size, mean, standard deviation, the value of t or p);

(5) It should be peer-reviewed and published in English.

After the initial screening of titles and abstracts and the removal of duplicates, 296 articles were selected. Full-text articles were then assessed for eligibility, and 84 articles that met the inclusion criteria remained. Finally, after the articles were read in full, 25 peer-reviewed studies were obtained. The literature searching and selection were conducted strictly according to the standard processes (Moher et al., 2009) (Fig. 1).

Fig. 1 Flow diagram.

Literature Quality and Bias Assessment

One database cannot include all the published literature, so searching multiple authoritative databases can control the literature search bias (Stang, 2010). Higgins et al. (2019) recommend searching at least two databases. So, we selected three databases to reduce the literature search bias (Kelley and Kelley, 2019).

Inaccurate inclusion criteria can result in literature selection bias (Sterne et al., 2016). To reduce this bias, we drew up strict selection criteria, e.g., study purpose and design, intervention of DT, and publication language (Liu et al., 2024b).

We assessed the literature quality based on the criteria of Downs and Black (1998), which have 27 questions and five categories. We found that all selected studies got majority points in more than four of the above categories (range 18 to 21), so they were high-quality (Carter et al., 2017).

Coding potential moderators

Moderators are possible factors that influence DT’s effects. The eight moderators were divided into background moderators and method moderators.

Background moderators

  • Learning outcome: DT’s learning outcomes are less examined. Examining DT’s effectiveness on different learning outcomes is necessary (Razzouk and Shute, 2012). It was coded into academic achievement, self-efficacy, learning motivation, problem-solving ability, creative thinking, and learning engagement.

  • Treatment duration: The DT process could take a long time to explore (Carroll et al., 2010), and it may moderate DT’s effect on learning. It was divided into <1, 1–3, and >3 months.

  • Class size: It is an important index of teaching effects (Retna, 2016). So, it may moderate DT’s effect on student learning. It was divided into 1–30, 31–50, 51–100, and >100.

  • Grade level: There should be a clear distinction regarding how DT is applied to different learning stages (Lor, 2017). It was divided into kindergarten, primary, junior high, high school, and university.

  • Subject: DT was not always useful across all subjects (Retna, 2016), and van de Grift and Kroeze (2016) found that it could enhance interdisciplinary education. Namely, the subject may moderate DT’s effects. It was divided into STEM, No-STEM, and multidiscipline.

  • Region: It refers to the area where the study was performed. The education system’s cultural context must also be considered when applying DT (Retna, 2016). So, the region is also considered a potential moderator. It was divided into Asia, America, Australia, Europe, and Africa.

Method moderators

  • DT model: It refers to DT’s specific processes or stages. The implementation of DT relies on specific models, and different models contain different operations. Therefore, the role of DT models should be considered. We coded the DT model into 9 types:

      • 3IE = Inspiration, Ideation, Implementation, and Evaluation;

      • UOPIPT = Understand, Observe, Perspective, Imagination, Prototype, and Test;

      • EDIPT = Empathize, Define, Ideate, Prototype, and Test;

      • EDEIPT = Empathize, Define, Elaborate, Ideate, Prototype, and Test;

      • OSIP = Observation, Synthesis, Ideation, and Prototype;

      • PAS = Preparation, Assimilation, Strategic control;

      • 2UPPI = User focus (User as an information source and User as a codeveloper), Problem framing, Prototype, and Iteration;

      • CTC = Copy, Tinker, and Create;

      • LAUNCH = Look, listen and learn, Ask, Understand, Navigate ideas, Create, and Highlight and fix.

  • Team size: This variable refers to the number of team members. DT pedagogy emphasizes the use of student teams (Beckman and Barry, 2007), and team size is one of the causes of conflicts around teamwork (Aflatoony et al., 2018). So, the team size may moderate DT’s effect. It was divided into 1–4, 5–7, and >=8.

Data analysis

CMA 3.0 was used to analyze the effect sizes and moderators’ effects. In order to overcome the differences in different studies, the Pearson correlation coefficient r was selected as the effect size (Borenstein et al., 2005). Since the paper sample sizes varied widely, the authors employed the Fisher Z-transformation based on the weighted study sample sizes to calculate the ultimate r and 95% confidence intervals (Lei et al., 2020).
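The Fisher Z-transformation step can be illustrated with a minimal Python sketch. It is not the CMA procedure itself: it assumes simple fixed-effect weights of n − 3 per study, whereas the analysis reported here used a random-effects model, and the function names and the (r, n) pairs are hypothetical illustration values, not data from the included studies.

```python
import math

def pool_correlations(studies, z_crit=1.96):
    """Pool correlations via Fisher's z (fixed-effect sketch).

    studies: list of (r, n) pairs; each study is weighted by n - 3,
    the inverse of the sampling variance of Fisher's z.
    Returns the pooled r and its 95% confidence interval.
    """
    weights = [n - 3 for _, n in studies]
    z_values = [math.atanh(r) for r, _ in studies]  # r -> Fisher's z
    total_weight = sum(weights)
    z_bar = sum(w * z for w, z in zip(weights, z_values)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    lower, upper = z_bar - z_crit * se, z_bar + z_crit * se
    # Back-transform from the z scale to the r scale.
    return math.tanh(z_bar), (math.tanh(lower), math.tanh(upper))

# Hypothetical (r, n) pairs, for illustration only.
example = [(0.30, 40), (0.50, 60), (0.45, 100)]
pooled, (ci_low, ci_high) = pool_correlations(example)
print(f"pooled r = {pooled:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

Averaging on the z scale and back-transforming keeps the confidence interval inside (−1, 1) and makes it asymmetric around r, which is why the transformation is preferred over averaging raw correlations.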

Results

Publication Bias

We used the funnel plot, classic fail-safe N, and trim-and-fill method to examine publication bias. If there is no publication bias in the data, the points of the funnel plot are spread symmetrically. First, the funnel plot showed that the samples in this study were not evenly distributed (Fig. 2). Second, the fail-safe N (Nfs), which CMA can calculate, estimates how many unpublished null-result studies would be needed to make the overall effect non-significant. Here, Nfs = 9179 was far larger than the tolerance level of 220 (5K + 10, K = 42). Last, the trim-and-fill method imputes potentially missing studies to restore the symmetry of the funnel plot (Duval and Tweedie, 2000). This method found just five missing values on the right of the funnel plot (Fig. 3). In sum, it can be concluded that the data included were free from publication bias.
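The fail-safe comparison reported above is simple arithmetic and can be checked with a short Python sketch. The function names are hypothetical; the values K = 42 and Nfs = 9179 are taken from the text.

```python
def failsafe_tolerance(k):
    """Tolerance level 5K + 10 for the classic fail-safe N."""
    return 5 * k + 10

def exceeds_tolerance(nfs, k):
    """True when the fail-safe N is larger than the 5K + 10 tolerance level."""
    return nfs > failsafe_tolerance(k)

tolerance = failsafe_tolerance(42)
print(tolerance)                      # 220
print(exceeds_tolerance(9179, 42))    # True
```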

Fig. 2 Funnel plot.

Fig. 3 Funnel plot after trim-and-fill.

Actually, literature selection may cause publication bias. To minimize this bias, we strictly developed the selection criteria, e.g., study purpose and design, intervention of DT, necessary data, and peer-review. Especially, we limited the language of publication to English. This may exclude some potential literature published in other languages; it is one limitation of the current research and could be addressed in the future.

Homogeneity test and sensitivity analysis

The values of Q and I² can be used to determine whether heterogeneity exists. The result was Q = 554.908 (p < 0.001) (Table 1), which was significant. Moreover, I² = 92.611% exceeded the 75% benchmark of Higgins et al. (2003), which meant the heterogeneity was high. Thus, the random-effects model was selected (Borenstein et al., 2009; Wilson et al., 2020), and moderator analyses were also necessary.
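The reported heterogeneity index can be reproduced from Q with the standard formula I² = (Q − df) / Q. The sketch below assumes df = K − 1 = 41, i.e., that Q was computed over the 42 effect sizes; the function name is hypothetical.

```python
def i_squared(q, df):
    """Percentage of total variation due to heterogeneity: 100 * (Q - df) / Q."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

# Q from the text; df = 42 - 1 is an assumption about how Q was computed.
i2 = i_squared(554.908, 41)
print(round(i2, 3))  # 92.611
```

That the result matches the reported 92.611% supports the assumption that the 42 effect sizes, not the 25 studies, were the unit of analysis.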

Table 1 Overall effect size of DT and the homogeneity.

To confirm the robustness of this research, we used the one-study-removal method to examine the sensitivity. The result suggested that each overall effect size fell within a reasonable range (from 0.418 to 0.467). Thus, this study is robust.
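The one-study-removal idea can be sketched in Python as follows. This is an illustration only: it assumes a fixed-effect Fisher z pooling with weights n − 3 rather than the random-effects model used in CMA, and the (r, n) pairs and function names are hypothetical.

```python
import math

def pooled_r(studies):
    """Fixed-effect pooled correlation via Fisher's z, weighting by n - 3."""
    weights = [n - 3 for _, n in studies]
    z_bar = sum(w * math.atanh(r) for w, (r, _) in zip(weights, studies)) / sum(weights)
    return math.tanh(z_bar)

def leave_one_out(studies):
    """Re-estimate the pooled effect once per study, omitting that study."""
    return [pooled_r(studies[:i] + studies[i + 1:]) for i in range(len(studies))]

# Hypothetical (r, n) pairs, for illustration only.
example = [(0.30, 40), (0.50, 60), (0.45, 100), (0.60, 35)]
estimates = leave_one_out(example)
print(f"range: {min(estimates):.3f} to {max(estimates):.3f}")
```

A narrow range of re-estimated effects, like the 0.418 to 0.467 reported above, indicates that no single study drives the overall result.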

General characteristics of the included 25 studies

To answer RQ1, reveal the current state of empirical research on DT, and provide complementary evidence for the subsequent meta-analysis, a descriptive analysis of the included literature was conducted. The literature included was published between 2015 and 2023: 1 in 2015 (4.00%), 1 in 2017 (4.00%), 3 in 2020 (12.00%), 8 in 2021 (32.00%), 6 in 2022 (24.00%), and 6 in 2023 (24.00%). The result indicated a growing interest in empirical research on the use of DT for teaching and learning in education. In terms of study design, only 2 were correlational studies (Lin et al., 2020a; Roth et al., 2020), while the other 23 were experimental studies (including pre-experiment, quasi-experiment, and true-experiment). Descriptive results are as follows:

(1) Grade level: kindergarten (N = 1, 4.00%), primary school (N = 3, 12.00%), junior high school (N = 2, 8.00%), high school (N = 9, 36.00%), and university (N = 10, 40.00%).

(2) Class size: 0–30 (N = 9, 36.00%), 31–50 (N = 10, 40.00%), and >=51 (N = 6, 24.00%).

(3) Duration: 0–1 month (N = 8, 32.00%), 1–3 months (N = 7, 28.00%), and >=3 months (N = 10, 40.00%).

(4) Subject: STEM (N = 16, 64.00%), No-STEM (N = 6, 24.00%), and multidiscipline (N = 3, 12.00%).

(5) DT model: EDIPT (N = 14, 56.00%), 3IE (N = 1, 4.00%), UOPIPT (N = 1, 4.00%), LAUNCH (N = 1, 4.00%), OSIP (N = 1, 4.00%), PAS (N = 1, 4.00%), 2UPPI (N = 1, 4.00%), EDEIPT (N = 1, 4.00%), CTC (N = 1, 4.00%), and Unknown (N = 3, 12.00%) (Fig. 4).

Fig. 4 DT models.

(7) Team size: 0–4 (N = 7, 53.85%) and 5–7 (N = 6, 46.15%).

(8) Region: Asia (N = 21, 84.00%), America (N = 1, 4.00%), Australia (N = 1, 4.00%), Europe (N = 1, 4.00%), and Africa (N = 1, 4.00%) (Fig. 5).

Fig. 5 Region.

(9) Countries: China (N = 12, 48.00%), Thailand (N = 2, 8.00%), Australia (N = 1, 4.00%), Austria (N = 1, 4.00%), Philippines (N = 2, 8.00%), Saudi Arabia (N = 2, 8.00%), Nigeria (N = 1, 4.00%), America (N = 1, 4.00%), Indonesia (N = 1, 4.00%), Jordan (N = 1, 4.00%), and Turkey (N = 1, 4.00%).

In general, the results revealed that most research used EDIPT (N = 14) as a DT model and focused primarily on the learning of STEM subjects (N = 16, 64.00%) by high school (N = 9, 36.00%) and university students (N = 10, 40.00%) in Asia (N = 21, 84.00%).

Overall effect size

When r = 0.1, there is a small effect size; r = 0.3 is a medium effect size; and r = 0.5 is a large effect size (Cohen, 2013). The overall effect size of DT was upper-medium (r = 0.436, 95% CI [0.342, 0.525], p < 0.001) (Table 1). Moreover, each study’s effect sizes were also provided (Fig. 6). The red diamond represents the overall effect size and its CI in the forest plot. Favours A meant the result was in favor of regular instruction, while Favours B meant the result was in support of DT instruction.

Fig. 6 Forest plot.
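The conventional benchmarks for r cited above can be expressed as a small Python helper. This sketch uses only the three cut-offs 0.1, 0.3, and 0.5; the "negligible" label for values below 0.1 and the function name are assumptions, and finer labels such as "upper-medium" are not encoded.

```python
def cohen_label(r):
    """Classify a correlation by the conventional 0.1 / 0.3 / 0.5 benchmarks."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "negligible"  # label for r below 0.1 is an assumption

print(cohen_label(0.436))  # medium
```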

Moderator analysis

Learning outcome

The order of effect sizes from large to small was learning engagement (r = 0.740), learning motivation (r = 0.608), academic achievement (r = 0.450), problem-solving ability (r = 0.447), creative thinking (r = 0.329), and self-efficacy (r = 0.230) (Table 2). The between-groups effect (p < 0.01) indicated that the learning outcome had a moderating effect.

Table 2 Effects of DT on different learning outcomes, class sizes, treatment durations, and grade levels.

Class size

The order of effect sizes from large to small was <=30 (r = 0.609), 31–50 (r = 0.422), and >=51 (r = 0.389) (Table 2). The result of between-group effects was Q = 0.856 (p > 0.05), indicating that the class size had no moderating effect.
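As a check on the reported significance, the p value for the between-group Q can be recovered from the chi-square distribution. The sketch below assumes df = 2 (three class-size groups minus one), for which the chi-square survival function has the closed form exp(−Q/2); the function name is hypothetical.

```python
import math

def chi2_sf_df2(q):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-q / 2.0)

# Q from the text; df = 2 is an assumption based on the three class-size groups.
p_value = chi2_sf_df2(0.856)
print(round(p_value, 3))  # 0.652
```

A p value of about 0.65 is well above 0.05, consistent with the conclusion that class size had no moderating effect.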

Treatment duration

The result showed that the effect size of >=3 months (r = 0.535) was the largest, followed by <=1 month (r = 0.456), and that of 1–3 months (r = 0.245) was the smallest (Table 2). The between-groups effect (p < 0.001) indicated that the treatment duration had a moderating effect.

Grade level

The order of effect sizes from large to small was high school (r = 0.538), university (r = 0.463), junior high school (r = 0.443, p > 0.05), primary school (r = 0.222), and kindergarten (r = 0.174) (Table 2). The between-groups effect (p < 0.01) indicated that the grade level had a moderating effect.

Subject

The order of effect sizes from large to small was multidiscipline (r = 0.604), No-STEM (r = 0.470), and STEM (r = 0.393) (Table 3). The between-groups effect indicated that the subject had no moderating effect.

Table 3 Effects of DT on different subjects, DT models, team sizes and regions.

DT model

The order of effect sizes from large to small was OSIP (r = 0.766), EDIPT (r = 0.522), 2UPPI (r = 0.346), PAS (r = 0.301), UOPIPT (r = 0.297), 3IE (r = 0.222), CTC (r = 0.191, p > 0.05), EDEIPT (r = 0.174), and LAUNCH (r = 0.066, p > 0.05) (Table 3). The Q test of the between-groups effect was significant (p < 0.001), indicating that the DT model had a moderating effect.

Team size

The order of effect sizes from large to small was 0–4 (r = 0.477) and 5–7 (r = 0.441) (Table 3). The between-groups effect (p > 0.05) indicated that the team size had no moderating effect.

Region

The order of effect sizes from large to small was Africa (r = 0.690), Asia (r = 0.435), Australia (r = 0.355), Europe (r = 0.346), and America (r = 0.066, p > 0.05) (Table 3). The between-groups effect (Q = 50.576, p < 0.001) indicated that the region had a moderating effect.

Discussions and implications

This meta-analysis investigates DT’s effect on student learning with 42 validated effect sizes from 25 independent empirical articles. This research reveals that DT has an upper-medium effect on student learning. DT is the missing link between the theoretical discoveries of social transformation pedagogy and the practical application of the skills needed for the future (Scheer et al., 2012). The DT process entails a set of logical stages that point to students’ key competencies. DT instruction can increase students’ involvement, establish a positive learning climate, and promote interaction and communication between teachers and students (Tu et al., 2018). Moreover, DT relies on teamwork and hands-on activities, which are beneficial for student learning (Holstermann et al., 2010; Oje, 2021; Sung et al., 2017; Swanson et al., 2019). Admittedly, connecting DT with course content may be a challenge (Hennessey and Mueller, 2020). Overall, if educators organize DT instruction appropriately, it will be effective in improving student learning.

Learning outcome

It has a moderating effect. Specifically, DT can promote learners’ creative thinking, learning engagement, motivation, problem-solving ability, self-efficacy, and academic achievement. Notably, the effects on learning motivation, engagement, and academic achievement are large. The DT process entails a set of logical stages that point to students’ key competencies. DT is a dynamic, nonlinear, and spiraling process that can facilitate deep learning (Liu and Li, 2023), interest, motivation, creativity, and engagement, and eventually improve student learning (Howard et al., 2021; Rao et al., 2022). However, there are significant differences in the impacts of DT on student learning outcomes. DT models consist of a set of stages, and some models are complex and challenging. So, DT’s effect on self-efficacy is smaller than its effect on other types of learning outcomes. In sum, DT still has great potential to enhance various learning outcomes.

Class size

It has no moderating effect. Specifically, <=30 (r = 0.609) has a large effect, while 31–50 (r = 0.422) and >=51 (r = 0.389) have upper-medium effects. Although the differences are not significant, the result suggests that the smaller the class size, the better DT’s effects. DT is a guided, student-oriented process where learners need close supervision, guidance, and feedback (Retna, 2016). When the class size is large (>=51), it is hard for teachers to provide prompt guidance and feedback. Moreover, large class sizes challenge teachers’ effective classroom management and interactions (Blatchford et al., 2009). Of course, >=51 is a broad category. So, DT’s effects on larger class sizes (e.g., 51–80, etc.) need more exploration. Based on the result, we recommend that educators keep the class size below 51 students. Moreover, if conditions permit, more teachers could be involved in one class (e.g., two teachers) (Retna, 2016).

Treatment duration

It has a moderating effect. Specifically, the effect of >=3 months (r = 0.535) is large, <=1 month (r = 0.456) has an upper-medium effect, and 1–3 months (r = 0.245) has an upper-small effect. Generally, the effect of 1–3 months is best (Yu et al., 2023), but in our result it is the smallest. The novelty effect may result in a larger effect at <=1 month than at 1–3 months. The decrease in the 1–3 months’ effect may be due to the novelty effect wearing off as students slowly familiarize themselves with DT and face learning challenges. Guaman-Quintanilla et al. (2023) noted that it is challenging to experience the entire process of DT within a limited time. Namely, time constraints are a challenge for students and educators (McLaughlin et al., 2023; Retna, 2016; Razali et al., 2022). Educators need longer durations to conduct DT instruction and engage students in DT (Razali et al., 2022). Indeed, DT is a long-term journey to develop students’ abilities and skills, so enough time should be allocated. In short, though DT is effective for all these durations, <=1 month or >=3 months is more effective. Future research could further examine DT’s effect at 1–3 months.

Grade level

It has a moderating effect. Specifically, high school (r = 0.538) has the best effect; university (r = 0.463) has an upper-medium effect; primary school (r = 0.222) and kindergarten (r = 0.174) have small effects; and junior high school (r = 0.443, p > 0.05) has an insignificant effect. DT has been used at all stages of education and is generally effective. In this research, DT shows greater potential for high school and university students than for primary school and kindergarten students. DT is a task- and activity-oriented learning process that relies on team communication and collaboration, so DT studies at different stages might yield different results due to cognitive-developmental differences (Mentzer et al., 2015). Given the complexity of DT, more DT instruction could be applied to university and secondary school students. Moreover, researchers should carry out more studies at diverse grade levels, especially in kindergarten (k = 2) and junior high school (k = 4).

Subject

It has no moderating effect, but the effect of multidiscipline is better than that of STEM and No-STEM. This suggests that DT can foster multidisciplinary learning, consistent with previous studies (Chang and Tsai, 2021; de Figueiredo, 2021; van de Grift and Kroeze, 2016). DT has typical interdisciplinary features (Lugmayr et al., 2014) and can promote new solutions, innovation, and collaboration opportunities for complex problems in multidisciplinary areas (Cook and Bush, 2018; Gleason and Jaramillo Cherrez, 2021). At the same time, DT can be integrated into STEM or No-STEM subjects to promote learning and teaching (Hsiao et al., 2023). DT is taught as a concept rather than affiliated with a specific discipline (Lor, 2017). We recommend integrating DT into existing courses rather than adding add-on activities (Sandars and Goh, 2020), especially for multidisciplinary learning (Hsiao et al., 2023). Different disciplines or subjects have their own suitable design processes (Sung and Kelley, 2019), so the result provides only a broad subject division for reference. Future research could explore DT’s effects on more detailed subjects. Besides, most DT applications were in STEM subjects (k = 32), with fewer in No-STEM and multidiscipline. So, DT’s effects on the latter two categories should be viewed cautiously and deserve more research attention.

DT model

It has a moderating effect, indicating that different DT models could generate heterogeneity. Specifically, OSIP (r = 0.766) and EDIPT (r = 0.522) have large effects; 2UPPI (r = 0.346) and PAS (r = 0.301) have lower-medium effects; UOPIPT (r = 0.297), 3IE (r = 0.222), and EDEIPT (r = 0.174) have small effects; and CTC (r = 0.191, p > 0.05) and LAUNCH (r = 0.066, p > 0.05) have no significant effects. Before DT can be effectively implemented to solve complicated problems, it is essential to have a solid grasp and comprehension of the different stages of the DT process (Dam and Teo, 2019). Different DT models involve different steps or stages, which may affect the processes of cognition and learning. For instance, EDIPT is easier for middle school students (Sarooghi et al., 2019). Based on the result of this meta-analysis, we recommend that educators adopt the EDIPT and OSIP models in DT instruction. Importantly, educators should not rely too heavily on pre-determined procedural DT processes, which may hinder the creative potential of DT (Wells, 2013). Educators can rationalize the DT model based on their actual situations (Li and Zhan, 2022). It is also necessary to mention that, with the exception of EDIPT, the numbers of effect sizes included for the other DT models are small, so their results should be treated cautiously and more exploration is needed.

Team size

It has no moderating effect. Team sizes of 0–4 (r = 0.477) and 5–7 (r = 0.441) have upper-medium effects. Teamwork and team collaboration are great challenges for many students. DT could enhance students’ teamwork (Guaman-Quintanilla et al., 2022). Success in DT requires teamwork, and larger teams can enrich the diversity of perspectives and increase the likelihood of solutions (Sung et al., 2017). Moreover, the composition of teams is also important (Apedoe et al., 2012). Generally speaking, heterogeneous ability groups may be appropriate in DT (Lou et al., 1996), i.e., groups with both low-ability and high-ability students, and both male and female students (Yu and Yu, 2023). From the result of this research, 2–7 members in one group are beneficial. A larger number of teams may limit the teachers’ ability to guide and facilitate the learning of each team and of individual students (Apedoe et al., 2012). We recommend having <=7 members in one group. Specifically, when the class size is large, 5–7 is better; otherwise, 2–4 will be better. However, the result covers broad team-size ranges and is for reference only. So, future research could explore which specific composition of teams (from 2 to 7 or above) in DT instruction is better.

Region

It has a moderating effect. Specifically, Africa (r = 0.690) has a large effect; Asia (r = 0.435), Australia (r = 0.355), and Europe (r = 0.346) have upper-medium effects; and America has an insignificant effect. This may be due to differences in cultural and educational systems in different regions. Different from individualistic cultures (e.g., America, Australia, Austria), most Asian countries are collectivist (e.g., China, Thailand, Indonesia, etc.), and students in these countries tend to value team goals more than individual goals (De Mooij and Hofstede, 2010). So, DT has an upper-medium effect on Asian students. Since the study distribution between different regions was highly uneven, this result should be treated judiciously. For instance, except for Asia, the numbers of studies in other regions are small, e.g., Australia (N = 1), Europe (N = 1), Africa (N = 1), and America (N = 1), so these regions need more attention. In general, DT positively impacts student learning in diverse regions, and DT is recommended to enhance Asian students’ learning.

Implications for future practice and work

This meta-analysis makes an evidence-based analysis of DT’s effects on student learning, and we provide some meaningful suggestions for future practice and research. These are also major contributions to the existing literature.

First, though DT’s effects on different types of learning outcomes are significantly different, it is still an effective teaching method to improve student learning. Educators can apply DT to enhance student academic performance, creative thinking, learning engagement, motivation, and problem-solving ability. Due to the limited amount of learning engagement and self-efficacy, their effects should be treated cautiously.

Second, a smaller class size is associated with a larger DT effect. Educators should keep the class size below 51. Future research could focus more on exploring DT’s effects in larger classes (e.g., 51–80).

Third, treatment duration is a critical factor. Durations of no more than 1 month or at least 3 months are recommended. In particular, DT’s effect is smallest when the duration is 1–3 months, and this finding needs further research.

Fourth, grade level is a key factor. DT could be applied with university and high school students, whereas DT’s effect on junior high school students is insignificant. Researchers could carry out more studies at the kindergarten (k = 2) and junior high school (k = 4) levels.

Fifth, DT can be used in STEM, No-STEM, or multidisciplinary subjects. Meanwhile, future research could further explore No-STEM, multidisciplinary, and more specific subjects.

Sixth, the DT model is also a critical factor that should be considered. Based on the results of this study, we recommend that educators adopt the EDIPT model. Importantly, the effects of models other than EDIPT need more exploration.

Seventh, in terms of team size, we suggest having no more than 7 members in one group. Specifically, when the class size is large, 5–7 members is better; otherwise, 2–4 is preferred. However, this result covers a wide range of team sizes. Future research could explore which specific team size and composition (from 2 to 7 members or above) is better for DT instruction.

Eighth, the regional analysis suggests that DT is most widely used in Asia and is most strongly recommended for supporting Asian students’ learning. However, the number of effect sizes in other regions is very small. Thus, those results should be viewed with caution, and future researchers could do more to test DT’s effects in America, Africa, Australia, and Europe.

Conclusions, limitations and future research

Conclusions

This meta-analysis examines DT’s effects in education based on 25 empirical studies. We find that DT has an upper-medium positive effect on students’ learning. Specifically, DT can improve learners’ creative thinking, learning engagement, motivation, problem-solving ability, self-efficacy, and academic achievement. Among these outcomes, DT has larger effects on learning motivation, engagement, and academic achievement. Furthermore, the learning outcome, grade level, treatment duration, DT model, and region moderate DT’s effects on student learning; that is, these factors influence DT’s effectiveness.

DT is gaining traction worldwide (Aris et al., 2022), and it has profoundly changed many educators’ thinking about how to teach in ways that support learning (Hubbard and Datnow, 2020). Teachers are vital in DT instruction; they should be facilitators and navigators, not lecturers (Henriksen et al., 2020b; Retna, 2016; Rusmann and Ejsing-Duun, 2022). In sum, DT can potentially promote learning at different grade levels, but the effectiveness of DT in education depends upon the goals (Panke, 2019). It is critical to help teachers see the value of DT in classrooms (Carroll et al., 2010) and to conduct DT instruction with guidance and rules. This paper provides evidence-based findings for educators and researchers.

Limitations, research gaps, and future directions

This study has several limitations that should be addressed in future work. First, the literature is distributed unevenly by region, grade level, learning outcome, and DT model, so future studies could focus on kindergarten (k = 2), junior high school (k = 4), America (k = 1), Australia (k = 1), Africa (k = 2), Europe (k = 2), learning engagement (k = 1), self-efficacy (k = 3), and DT models other than EDIPT. Second, all the literature included in this meta-analysis was published in English. Future work could include studies in other languages. Third, the heterogeneity is considerable, and some potential moderators may have been overlooked. Future work could explore more factors that influence DT’s effectiveness, e.g., learning environments. Fourth, the number of included studies is small; future research could conduct more experimental studies to explore DT’s effects on student learning. Last, a meta-analysis may not capture the full status and findings of DT in education. Future researchers could conduct a systematic literature review to cover the aspects neglected in the current research.