Introduction

There is a constant need to adapt to the demands of a changing world, in which individuals are encouraged to use their knowledge to explore, create, and reconstruct solutions to challenging problems. This can be achieved by cultivating STEM (science, technology, engineering, mathematics) literacy, in which science education plays a substantial role. Many researchers have found that informal education, especially STEM-focused Project-Based Learning (PjBL), is becoming one of the prominent pedagogical methods employed to improve the quality of science education and enable students to become STEM-literate individuals (Balemen & Keskin, 2018; Chen & Yang, 2019; Ramdhayani et al., 2019; Sivia et al., 2019).

Informal PjBL is a widely employed model that has been shown to instill STEM skills in students (Habig & Gupta, 2021; Mateos-Nunez et al., 2020; Nadelson et al., 2021; Vela et al., 2020). An informal setting typically refers to an out-of-school experience, such as a camp, zoo, museum, summer program, laboratory-oriented setting, research environment, after-school program, or makerspace, whereas a formal setting refers to standardized, structured classroom education. Regardless of whether the setting is formal or informal, the main framework of PjBL remains the same: it is a systematic pedagogical model that helps students experience and engage in complex, real-world tasks, resulting in an idea, a product, or a prototype (Johnson et al., 2013). A PjBL model has two critical parts: a question that drives the learning activities in line with the objectives, and a solution in the form of students’ artifacts (Shpeizer, 2019). Fundamental features of PjBL include inquiry, investigation, peer discussion, creation, revision, reflection, sharing of findings, and design thinking (Larmer & Mergendoller, 2010). In the present study, PjBL is viewed as a student-centered learning model (Kokotsaki et al., 2016), with projects as its critical component (Mergendoller & Thomas, 2005). An informal PjBL model was chosen for this research because meta-analytical studies of PjBL have reported varied effectiveness [i.e., d = 0.71 (Chen & Yang, 2019), d = 0.99 (Ayaz & Söylemez, 2015), and d = 1.06 (Balemen & Keskin, 2018)]. Moreover, little is known about the moderator variables that might affect the learning gains associated with informal PjBL.

Therefore, despite these recent meta-analytical studies on the significance of PjBL compared to the traditional classroom setting (Ayaz & Söylemez, 2015; Balemen & Keskin, 2018; Chen & Yang, 2019; Ramdhayani et al., 2019), there is a dearth of meta-analytical literature focusing on informal PjBL. Moreover, most of these review studies have not specified whether PjBL was employed in a formal or an informal setting. The novelty of this study lies in the fact that it exclusively encompasses 12 years of literature employing STEM-focused informal PjBL models for school-level students, along with a moderator analysis. The moderator analysis illustrates the impact of diverse potential variables that are likely to influence the effectiveness of informal PjBL (i.e., subject area, teaching method, assessment method, grade, location, course duration, and group size).

Review of Literature and Conceptual Framework

Comprehending the impact of potential moderator variables is essential for tailoring pedagogies in the best possible manner (Chen & Yang, 2019). Many individual studies have demonstrated the value of informal PjBL in the teaching of diverse STEM subjects. Most often, it has been employed in teaching the various science disciplines, such as physics (Awad, 2021; Mateos-Nunez et al., 2020), biology (Bokor et al., 2014; Covert et al., 2019), chemistry (Nadelson et al., 2021), mathematics (Calabrese & Capraro, 2021; Kwon et al., 2021), engineering (Hirsch et al., 2017; Innes et al., 2012), and technology (Smit et al., 2021). Many studies have also shown the success of integrated informal PjBL models, in which informal PjBL is combined with other pedagogies, such as (1) problem-based learning (Awad, 2021; Hirsch et al., 2017; Kwon et al., 2021; Nugent et al., 2010), (2) inquiry-based learning (Covert et al., 2019), (3) game-based learning (Newton et al., 2020), and (4) forensic discovery-based learning models (Todd & Zvoch, 2019). However, learning is influenced not only by the subjects taught or the learning models employed, but also by many other moderator variables, such as the assessment method, students’ grade, location, course duration, and students’ group size.

Indeed, it is widely acknowledged that assessment influences students’ learning (Cilliers et al., 2012), and the relationship between assessment and student learning can be complex and unpredictable (Al-Kadri et al., 2012). Although assessment is typically used to evaluate a student’s knowledge, skills, and abilities, the type of assessment employed can also directly influence students’ learning, including their reaction to the assessment, their involvement, and their motivation (Marriott & Lau, 2008). In this regard, Gao et al. (2020) proposed future directions for developing improved assessment methods for STEM education. Their findings indicate that although many programs aim to improve students’ interdisciplinary skills, their assessments do not align with their objectives. They therefore recommend that assessments be built around the set learning objectives, that connections across disciplines be operationalized and assessed to provide targeted student feedback, and that the development of practical assessment methods and guidelines be prioritized. Typical assessment methods include direct and indirect assessments. Direct assessment methods require students to demonstrate their mastery of skills and knowledge through actual work, usually measured via professional licensure exams, standardized exams, knowledge-based pre- and post-tests, student posters, models, speeches, presentations, etc. In indirect assessment, students are assessed indirectly via focus group interviews, self-report surveys of perceived learning, exit surveys, mentor/volunteer feedback, etc. (Martell & Calderon, 2005). Nowadays, many educators prefer to evaluate student progress using the cutting-edge method of course-embedded assessment (Gerretson & Golson, 2005). Embedded assessments, often known as “authentic assessments,” include internal, mentor-created course-level assessment plans comprising individual student design logs, worksheets, homework, assignments, class tests, etc. The strength of this approach lies in the fact that it is developed by mentors who are experts in the course field, which offers them fine-grained insights and helps them make instructional adjustments in the classroom according to students’ needs (Kim et al., 2021). Correspondingly, a careful mix of embedded, direct, and indirect assessments in STEM education, a practice viewed as a triangulation of assessments and one of great significance (Ghrayeb et al., 2011), has rarely been discussed or investigated in the literature.

Likewise, the time devoted to instruction has also been identified as a moderator variable of PjBL models (Chen & Yang, 2019). Different instructional times have been reported to be successful for informal PjBL. For example, studies have productively employed and recommended informal PjBL instruction lasting one week (Bokor et al., 2014; Nugent et al., 2010; Smit et al., 2021), two weeks (Biçer & Lee, 2019; Calabrese & Capraro, 2021; Todd & Zvoch, 2019), and more than a month (Awad, 2021; Covert et al., 2019; Stevens et al., 2016; Yin et al., 2020). At the same time, some educators and parents believe that projects require a greater amount of instructional time and thus block course-content acquisition (Miller, 2018). A clearer understanding of the impact of instructional time on student learning in informal PjBL would therefore provide evidence for how much time should be devoted to it.

This review article is therefore motivated by Mergendoller and Thomas’s (2005) review of PjBL, which points to the demand for more evidence-based research to determine the effectiveness of PjBL, especially when compared with traditional classroom settings. According to them, future research must investigate the diverse potential moderator variables of PjBL, including subject area, course duration, appropriate assessment tools, grade, and location, because designing, executing, and supervising PjBL are all related to students’ learning gains. The conceptual framework of this review is based on Chen and Yang (2019), a 20-year meta-analytical study of PjBL in STEM and non-STEM disciplines, in which the overall effectiveness of PjBL was first established and then followed by a moderator analysis (subject area, school location, hours of instruction, information technology support, educational stage, and students’ group size). Indeed, the significance of the various moderator variables impacting PjBL has been inconsistently reported in the literature (Ayaz & Söylemez, 2015; Balemen & Keskin, 2018; Chen & Yang, 2019), with no distinction between formal and informal PjBL. Hence, the current study strives to investigate the effectiveness of informal PjBL, since the informal (out-of-school) PjBL approach should be viewed as a process that must be tailored to the needs of diverse learners. In addition, the study performs a moderator analysis to identify the sub-variables that might influence the effectiveness of informal PjBL. The research questions guiding this meta-analytical article are as follows:

  1. What is the measure of the effectiveness of informal PjBL in developing students’ learning gains (compared to the traditional classroom setting)?

  2. Is the effectiveness of informal PjBL influenced by moderator variables such as teaching model, educational level, group size, subject area, study location, assessment model, and instruction time?

Methods

Search Strategy

The relevant research articles were gathered via common academic databases: Scopus, Web of Science, and the Education Resources Information Center (ERIC). Specific keywords and operators (AND, OR, *) were employed to locate the research papers, i.e., ("STEM" OR “science” OR “technology” OR “engineering” OR “mathematics”) AND ("project" OR "research" OR "inquiry" OR "problem-based" OR "virtual" OR "game-based" OR "blended" OR "flip*") AND ("out of school" OR "informal" OR "outreach" OR "summer" OR "winter" OR "internship" OR "workshop" OR "camp" OR "zoo" OR "museum" OR "park"). A systematic literature review technique aligned with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, an evidence-based protocol for reporting in systematic reviews and meta-analyses, was adopted (Page et al., 2021) (refer to Fig. 1). A total of 755 papers were identified using the keyword search: Scopus (n = 330), ERIC (n = 250), and Web of Science (n = 175). Subsequently, 326 and 221 articles were discarded due to duplication and the exclusion criteria, respectively (refer to the Inclusion and Exclusion Criteria section).
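For transparency, the three keyword groups combine into a single Boolean string of the following shape (an illustrative assembly only; the exact field syntax, e.g., Scopus’s TITLE-ABS-KEY wrapper, differs across the three databases and is omitted here):

```python
# Illustrative assembly of the Boolean search string from the three keyword
# groups quoted above; database-specific field tags would wrap this string.
stem = '("STEM" OR "science" OR "technology" OR "engineering" OR "mathematics")'
pjbl = ('("project" OR "research" OR "inquiry" OR "problem-based" OR "virtual" '
        'OR "game-based" OR "blended" OR "flip*")')
informal = ('("out of school" OR "informal" OR "outreach" OR "summer" OR '
            '"winter" OR "internship" OR "workshop" OR "camp" OR "zoo" OR '
            '"museum" OR "park")')
query = f"{stem} AND {pjbl} AND {informal}"
print(query)
```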

Inclusion and Exclusion Criteria

The articles were initially screened for title aptness, abstract relevance, contextual relatedness, and retrievability (refer to Fig. 1). Articles were then reviewed according to inclusion and exclusion criteria established within the scope of the study. The inclusion criteria for article selection were: studies employing the informal PjBL model at the school level (e.g., primary, middle, and high school); informal PjBL models for STEM education; articles published during the last 12 years (2010–March 2022); peer-reviewed articles; and articles written in English.

Fig. 1 PRISMA 2020 flow diagram revealing the inclusion–exclusion criteria for studies included in the meta-analysis (Page et al., 2021)

Correspondingly, the initial exclusion criteria were: education levels pertaining to kindergarten, undergraduate, or postgraduate study; teaching models for non-STEM subjects (e.g., commerce, business, social sciences, language, etc.); qualitative research; and review papers. The final exclusion criteria were based on study specifications, i.e., non-empirical studies; no pre-/post-test or control/experiment design; study data not aligning with the Comprehensive Meta-Analysis (CMA) format (e.g., not including the mean, standard deviation, Cohen’s d, t-value, p-value, etc.); and insufficient data for calculating the effect size (ES) (i.e., p-/t-value = 0). In line with the inclusion and exclusion criteria, 26 peer-reviewed empirical articles were finalized for the meta-analysis.

Table 1 shows the descriptive features of the 26 studies shortlisted for the meta-analysis, and Fig. 2 depicts the number of studies reviewed by year of publication (2010–2022). Most of the studies were published in 2019 (n = 7), followed by 2021 (n = 6) and 2020 (n = 4). Regarding the country of publication, most of the studies were from the U.S.A. (n = 13), followed by one study each from Israel, Georgia, Spain, Taiwan, Switzerland, Turkey, and Malaysia. Regarding the educational level, most of the studies focused on middle schools (n = 11), followed by high schools (n = 6) and primary schools (n = 3); some studies involved a combination of school levels, such as middle and high schools (n = 4) and middle and primary schools (n = 2). After the articles were shortlisted, the meta-analysis was performed and the results were interpreted (refer to the following sections).

Table 1 Descriptive features of the 26 shortlisted studies for the meta-analysis
Fig. 2 Distribution of studies included in the meta-analysis by year of publication

Meta-Analysis and Interpretation

The Comprehensive Meta-Analysis (CMA) software package (version 3.3.070) was employed for the meta-analysis. The DerSimonian and Laird method was used to determine the individual and overall effectiveness [in terms of effect sizes (ESs)] of the studies, with 95% confidence intervals (CI) (DerSimonian & Laird, 2015). The raw empirical data were extracted from the shortlisted studies in the form of pre-/post- or control/treatment means, SDs, t-values, p-values, etc., to calculate the ESs (in terms of Cohen’s d). ESs are presented in forest plot diagrams illustrating the dispersal of the ESs of all the shortlisted studies. According to Cohen (2013), Cohen’s d values of ≤ 0.2, ~ 0.5, and ≥ 0.8 are regarded as low, moderate, and high effectiveness, respectively. A random-effects model was employed to calculate the mean ES, as this model assumes that each study tends to have a different “true” effect (Borenstein et al., 2009). To test whether a set of ESs is homogeneous, the Q statistic and its degrees of freedom (df) are analyzed. A statistically significant QT (overall variance) value suggests that the differences between the studies are not due to sampling error alone. In that case, further grouping of the studies is required by setting up potential moderator variables (grouping factors) to evaluate their influence on informal PjBL.
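As a minimal sketch of the two computations this paragraph relies on (not the CMA software’s internal code), Cohen’s d can be obtained from summary statistics and the study-level ESs pooled with the DerSimonian–Laird random-effects estimator; all inputs below are hypothetical placeholders rather than data from the 26 studies:

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    sd_pooled = math.sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / sd_pooled

def dersimonian_laird(effects, variances):
    """Random-effects pooled ES with a 95% CI (DerSimonian & Laird)."""
    w = [1 / v for v in variances]                      # fixed-effect weights
    d_fe = sum(wi * di for wi, di in zip(w, effects)) / sum(w)
    q = sum(wi * (di - d_fe) ** 2 for wi, di in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c)       # between-study variance
    w_re = [1 / (v + tau2) for v in variances]          # random-effects weights
    d_re = sum(wi * di for wi, di in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return d_re, (d_re - 1.96 * se, d_re + 1.96 * se)

# Hypothetical usage: pool three placeholder study ESs and their variances.
d_pooled, ci = dersimonian_laird([0.30, 0.15, 0.45], [0.02, 0.03, 0.05])
```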

Setting Up Moderator Variables

To determine whether the effectiveness of informal PjBL is impacted by influential variables, this study defined potential moderator variables. The seven moderator variables investigated are as follows: teaching model, assessment method, educational level, instructional time, subject area, study location, and student group size. The study features were initially coded and categorized by two authors who are experts in the field.

When screening the studies based on the assessment methods employed, the categories generated were: (a) indirect assessment, (b) a combination of direct and indirect assessment, and (c) triangulation. Surprisingly, no published study was found to have employed direct assessment alone in an informal PjBL setting. To classify the studies according to the assessment methods used, the assessment tools were coded using devised keywords. For direct assessment, the keywords were: pre- and post-tests, student posters, student presentations, working models, student designs, standard examinations, summative assessment, portfolio, rubrics, etc. For indirect assessment, the keywords were: pre- and post-tests (affective), surveys, focus group interviews, interviews, student self-report surveys, exit surveys, teacher/mentor/volunteer feedback, etc. For course-embedded assessment, the keywords were: assignments, homework, worksheets, design logs, essays, course-embedded class tests, etc. For categorizing studies based on teaching method, two categories were devised: (1) informal PjBL and (2) informal integrated PjBL. The informal PjBL category comprises studies that exclusively employed the PjBL model, while the integrated PjBL category comprises studies that executed the PjBL model in integration with other teaching models (e.g., problem-based, inquiry-based, game-based, etc.). Similarly, the articles were categorized by educational level into (1) high school, (2) middle and high school, (3) middle school, (4) primary and middle school, and (5) primary school (Table 2). Likewise, studies were grouped by students’ group size into (1) fewer than 3 students per group, (2) 3–4 students, (3) more than 5 students, and (4) not specified. For subject area, the studies were classified into (1) science, (2) engineering, (3) mathematics, (4) science and technology, and (5) STEM. Grouping by study location involved the following categories: (1) Asia, (2) Europe, (3) Eastern U.S.A., (4) Southern U.S.A., (5) Southwestern U.S.A., and (6) Western U.S.A. Finally, the classification by instruction time comprised (1) instructional delivery within a week, (2) 1–2 weeks, (3) 3 weeks, (4) a month or more, and (5) not specified.
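The coding scheme above can be restated compactly as follows (the category labels are taken directly from the text; the dictionary itself is only an illustrative coding aid, not the authors’ coding instrument):

```python
# Illustrative restatement of the seven moderator variables and their
# categories, as they might be applied when coding the 26 studies.
moderator_categories = {
    "teaching_method":   ["informal PjBL", "informal integrated PjBL"],
    "assessment_method": ["indirect", "direct & indirect", "triangulation"],
    "educational_level": ["high", "middle & high", "middle",
                          "primary & middle", "primary"],
    "group_size":        ["<3", "3-4", ">5", "not specified"],
    "subject_area":      ["science", "engineering", "mathematics",
                          "science & technology", "STEM"],
    "study_location":    ["Asia", "Europe", "Eastern U.S.A.",
                          "Southern U.S.A.", "Southwestern U.S.A.",
                          "Western U.S.A."],
    "instruction_time":  ["<1 week", "1-2 weeks", "3 weeks",
                          ">=1 month", "not specified"],
}
```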

Table 2 Results of the moderator analysis

Results

Before addressing the proposed research questions, the reliability of the meta-analysis was verified by performing tests for heterogeneity, sensitivity, and publication bias. The findings are summarized below:

Test for Heterogeneity

To assess the heterogeneity of the studies shortlisted for the meta-analysis, Cochran’s Q statistic and the I² statistic were employed (Cochran, 1954). A significant Q statistic demonstrates that the ESs come from different populations (heterogeneity), signifying the use of the random-effects model. The Q statistic tests the null hypothesis that the true ES is the same for all studies; the null hypothesis is rejected when the Q value significantly exceeds its degrees of freedom (df). In this study, since the Q value is 97.59 and df is 25, the null hypothesis is rejected. Correspondingly, Cochran’s Q statistic was employed to investigate whether the overall effectiveness of informal PjBL is affected by the moderator variables. For this, the between-class variance component QB was computed (QB = QT − QW, where QT and QW are the overall and within-class variance components, respectively). A significant QB value indicates that the moderator variable influences the overall effectiveness of the studies. Similarly, the I² statistic is the proportion of the observed variance in ESs that reflects true heterogeneity rather than sampling error. I² values of < 20%, 20–50%, 50–75%, and ≥ 75% depict low, moderate, high, and very high heterogeneity, respectively. Our study estimated an I² value of 74.38%, with a significant p-value (< 0.001), indicating the need for moderator analysis.
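As a consistency check (not a new analysis), plugging the reported Q and df into the standard I² definition reproduces the heterogeneity figure quoted above, and the QB decomposition from the text can be restated the same way:

```python
# I-squared from the reported Q and df: the share of observed variance in ESs
# attributable to true heterogeneity rather than sampling error.
q_total, df = 97.59, 25
i_squared = max(0.0, (q_total - df) / q_total) * 100
print(round(i_squared, 2))  # 74.38 -> "high" heterogeneity (50-75% band)

def q_between(q_total, q_within_components):
    """Between-class component: Q_B = Q_T - Q_W (sum of within-class parts)."""
    return q_total - sum(q_within_components)
```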

Test for Sensitivity

Sensitivity analysis was used to determine the existence of any ESs/studies that unduly influence the central tendency and variability of the data. The ESs of the studies were analyzed using the “one study removed” procedure in the CMA software (Borenstein et al., 2009). The results showed that, across the single-study removals, the highest mean in the random-effects model was d = 0.260 (n = 26, SE = 0.051) and the lowest mean was d = 0.213 (n = 26, SE = 0.039). These two weighted average ESs lie within the confidence interval of the whole dataset [n = 26, d = 0.248, 95% CI {0.151; 0.346}, p < 0.001], suggesting that no single study had an undue influence on the pooled average ES. Finally, a test for publication bias was conducted to ensure the soundness of the meta-analysis.
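A minimal sketch of the “one study removed” procedure, assuming any pooling routine such as the DerSimonian–Laird sketch shown earlier (the data here are hypothetical placeholders):

```python
def leave_one_out(effects, variances, pool):
    """Re-pool with each study excluded in turn; return the extreme means."""
    pooled = []
    for i in range(len(effects)):
        es = effects[:i] + effects[i + 1:]
        vs = variances[:i] + variances[i + 1:]
        pooled.append(pool(es, vs))
    return min(pooled), max(pooled)  # compare against the full-sample 95% CI

# Example with a simple inverse-variance pool standing in for the
# random-effects routine:
iv_pool = lambda es, vs: (sum(e / v for e, v in zip(es, vs))
                          / sum(1 / v for v in vs))
lo, hi = leave_one_out([0.30, 0.15, 0.45], [0.02, 0.03, 0.05], iv_pool)
```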

Test for Publication Bias

The existence of publication bias was estimated using a funnel plot (Borenstein et al., 2009), the trim-and-fill model (Duval & Tweedie, 2000), and the classic fail-safe N (Rosenthal, 1979). The funnel plot demonstrates the relationship between ESs and standard errors (SE). An asymmetrical funnel plot (Fig. 3) revealed some publication bias due to small-scale studies. To further estimate the influence of any significant publication bias, Rosenthal’s classic fail-safe N method was employed (Rosenthal, 1979). Publication bias is not significant if the fail-safe number of missing studies is higher than the tolerance level of 5n + 10, where n is the number of included studies (n = 26). The classic fail-safe number (609) is greater than the tolerance level of 140 [i.e., 5(26) + 10], demonstrating that publication bias is insignificant in this study and its results are reliable.
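The fail-safe N decision rule described here reduces to a one-line check (figures as reported above):

```python
# Rosenthal's rule: bias is considered negligible when the fail-safe number
# exceeds the tolerance of 5n + 10, with n the number of included studies.
n_studies, fail_safe_n = 26, 609
tolerance = 5 * n_studies + 10             # 5(26) + 10 = 140
bias_negligible = fail_safe_n > tolerance  # True: 609 > 140
```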

Fig. 3 Funnel plot for publication bias

Meta-Analysis and Results

This section of the results addresses the proposed research questions (RQs). For RQ1 and RQ2, the effectiveness of studies employing informal PjBL was estimated using Cohen’s d (Cohen, 2013), the “standardized difference in means” employed here to compute each study’s effectiveness (effect size; ES). A positive ES favors the informal PjBL model (treatment intervention) over the traditional classroom setting (control intervention), while a negative ES favors the control over the treatment intervention. Furthermore, Cohen’s d values of ≤ 0.2, 0.2–0.5, 0.5–0.8, and ≥ 0.8 are regarded as low, medium, high, and very high effects, respectively.
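For reference, the “standardized difference in means” takes the textbook form (a standard formulation with a pooled standard deviation, not reproduced from any single reviewed study):

$$
d = \frac{\bar{X}_{\text{treatment}} - \bar{X}_{\text{control}}}{SD_{\text{pooled}}},
\qquad
SD_{\text{pooled}} = \sqrt{\frac{(n_T - 1)\,SD_T^{2} + (n_C - 1)\,SD_C^{2}}{n_T + n_C - 2}}
$$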

Results Regarding Effectiveness

Compared to the control setting, informal PjBL shows moderate effectiveness, with an overall ES of 0.248 [n = 26, d = 0.248, 95% CI {0.151; 0.346}, p < 0.001]. These results illustrate the significance of informal PjBL for students’ academic gains relative to the traditional classroom setting. Figure 4 depicts a forest plot of the distribution of the ESs of the individual shortlisted studies: the squares demonstrate the ESs of the different studies, the diamond (at the bottom) presents the overall ES, and the lines through the squares and diamond show the confidence intervals. The ESs range from low (d = 0.011) to high (d = 1.071). The distribution of true ESs was also calculated using CMA’s prediction interval routine (Fig. 5): the true ESs in 95% of all comparable populations fall within the prediction interval of −0.19 to 0.69.
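The prediction interval quoted above is conventionally computed from the pooled estimate, its standard error, and the between-study variance (a standard random-effects formulation, e.g., Borenstein et al., 2009, rather than CMA’s internal code):

$$
\hat{\mu} \;\pm\; t^{0.975}_{k-2}\,\sqrt{\hat{\tau}^{2} + SE(\hat{\mu})^{2}}
$$

where \(\hat{\mu}\) is the pooled ES, \(\hat{\tau}^{2}\) the estimated between-study variance, \(SE(\hat{\mu})\) the standard error of the pooled ES, and \(k\) the number of studies.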

Fig. 4 Forest plot depicting the effectiveness of informal PjBL on students’ learning gains

Fig. 5 The distribution of true effects

Furthermore, in addition to analyzing the effectiveness of informal PjBL on students’ gains, this study investigates possible moderator variables that might impact informal PjBL (QT = 97.587, df = 25, p < 0.001). The study therefore investigates the impact of the following moderator variables: teaching method, assessment method, grade, location, course duration, group size, and subject area. Table 2 depicts the findings of the moderator analysis, illustrating that the teaching method, assessment model, student group size, subject area, and instructional time do influence the effectiveness of informal PjBL, whereas educational level and study location were shown to have no impact. A comprehensive dissection of the individual moderator results is described below.

Results Regarding Teaching Models

The QB value was significant (p < 0.05; refer to Table 2) for the teaching models, depicting that informal PjBL alone is more effective than integrated PjBL in improving students’ gains (for STEM education). The overall ES was significantly greater for the informal PjBL model [n = 16, d = 0.277, 95% CI {0.223; 0.331}, p < 0.001] than for informal integrated PjBL models [n = 10, d = 0.132, 95% CI {0.045; 0.219}, p = 0.087] (refer to Fig. S1). Indeed, since the p-value is not significant for the integrated PjBL model, more studies are required to explore the impact of integrated PjBL on students’ learning.

Results Regarding Assessment Methods

The QB statistic was significant (p < 0.05) for the assessment methods, indicating a significant difference between the individual studies (Table 2). In other words, the findings indicate that informal PjBL had a significantly greater impact when employing the triangulation assessment method [n = 15, d = 0.334, 95% CI {0.240; 0.367}, p < 0.001] than indirect assessment alone [n = 6, d = 0.140, 95% CI {−0.092; 0.327}, p = 0.272] or a combination of direct and indirect assessment methods [n = 5, d = 0.178, 95% CI {0.097; 0.258}, p = 0.119] (refer to Fig. S2). As the p-values are not significant for the indirect and combined direct/indirect assessment models, no conclusion can be drawn about their impact on students’ gains during informal PjBL for STEM education. Therefore, more studies employing various assessment models need to be conducted to verify the impact of informal PjBL more accurately.

Results Regarding the Educational Level and Study Location

The QB value was not significant for educational level or study location (the associated p-values exceed the critical value), depicting no significant differences between the individual studies (Table 2). In other words, the effects of informal PjBL conducted with primary, middle, or high school students hailing from different locations (countries) did not differ, and these variables do not influence the effectiveness of informal PjBL (refer to Figs. S3 and S4).

Results Regarding Student Group Size

The QB value was significant for group size, indicating that it does influence informal PjBL (p < 0.05) (Table 2). The results indicate that PjBL had a significantly greater impact when students worked in small groups of 3–4 members [n = 13, d = 0.237, 95% CI {0.194; 0.352}, p < 0.001] rather than in groups of fewer than 3 [n = 2, d = 0.142, 95% CI {0.036; 0.247}, p = 0.510] or more than 5 [n = 5, d = 0.169, 95% CI {0.070; 0.267}, p = 0.776] (refer to Fig. S5). As the p-values are not significant for the “fewer than 3” and “more than 5” categories, no conclusion can be drawn about their influence on students’ gains during informal PjBL for STEM education.

Results Regarding the Subject Area

The QB statistic was significant (p < 0.05) for the subject area, indicating a significant difference between the individual studies (Table 2). The findings illustrate that PjBL had a significantly greater impact when students were taught the general science subjects, i.e., physics, chemistry, and biology [n = 10, d = 0.321, 95% CI {0.251; 0.392}, p < 0.001], rather than engineering [n = 4, d = 0.147, 95% CI {0.049; 0.254}, p = 0.003], mathematics [n = 2, d = 0.204, 95% CI {0.008; 0.415}, p = 0.059], science and technology [n = 4, d = 0.260, 95% CI {0.124; 0.396}, p < 0.001], or STEM [n = 6, d = 0.149, 95% CI {0.047; 0.252}, p = 0.004] (refer to Fig. S6).

Results Regarding Instruction Time

Table 2 also shows that QB was significant (p < 0.05) for instruction time, indicating a significant difference between the individual studies. The findings illustrate that PjBL had a significantly greater impact when the course duration was within a week [n = 7, d = 0.284, 95% CI {0.209; 0.359}, p < 0.001] (refer to Fig. S7). Although the p-value for the “a month or more” category was statistically significant, its overall ES indicated a low impact (d = 0.175) on students’ gains. For the other categories (1–2 weeks and 3 weeks), the p-values associated with the ESs are not statistically significant; therefore, more studies employing 2 and 3 weeks of instruction need to be conducted to verify the results.

Discussion

Undoubtedly, the informal PjBL model requires students and teachers to continuously reflect on, evaluate, and update themselves and their approaches, as it is a learning process that cannot be entirely predetermined (Chounta et al., 2017). Hence, this review incorporates a meta-analysis of 12 years of evidence-based empirical research comparing the effects of informal PjBL (treatment intervention) and traditional classroom settings (control intervention) on students’ gains. The review also investigated the effectiveness of informal PjBL by examining several moderator variables that might influence its implementation. The findings reported moderate effectiveness of the informal PjBL model in STEM education (d = 0.248), revealing a positive effect on students’ learning gains compared to the traditional classroom setting. All the studies showed a positive ES, favoring informal PjBL. Of the 26 studies, 10 showed a statistically significant p-value (< 0.05) (Bokor et al., 2014; Innes et al., 2012; Kwon et al., 2021; Mateos-Nunez et al., 2020; Mohd Shahali et al., 2019; Moreno et al., 2016; Nadelson et al., 2021; Tekbıyık et al., 2022; Yin et al., 2020; Zhou et al., 2017). The distribution of ES values ranged from low [d = 0.011 (Newton et al., 2020)] to high [d = 1.071 (Moreno et al., 2016)]. The measure of effectiveness shown by this study (d = 0.248) is lower than that of previous meta-analytical studies of PjBL models [i.e., d = 0.71 (Chen & Yang, 2019), d = 0.99 (Ayaz & Söylemez, 2015), and d = 1.06 (Balemen & Keskin, 2018)]. This disparity might arise because this study is exclusively based on informal PjBL models, while the prior studies did not distinguish between formal and informal PjBL. In addition, the aforementioned studies included articles from primary to tertiary educational levels, while this study was limited to the school level (grades 1–12). Moreover, this study is specific to STEM education, while other studies (Balemen & Keskin, 2018; Chen & Yang, 2019) included articles on other subjects (such as the social sciences). In addition, the findings of this meta-analysis (after investigating the Q statistics) suggested that students’ gains in informal PjBL for STEM education are significantly influenced by several moderator variables (QB = 9.08, df = 25, p < 0.001). The moderator analysis revealed that the mean weighted ES of informal PjBL was influenced by the teaching model, assessment method, students’ group size, subject area, and course duration, but not by the educational level of the participants or the study location. The results of the individual moderator analyses are discussed below.

Correspondingly, the moderator analysis of the teaching model revealed moderately high effectiveness for the informal PjBL model (d = 0.277) in improving students’ gains, compared to the informal integrated PjBL models (d = 0.132). A potential reason could be that employing a combination of pedagogies makes it more challenging for students to acquire the content. Adopting a particular teaching model according to students’ requirements and learning objectives has always been recommended (Bielefeldt, 2013).

Likewise, the moderator analysis of the assessment method was performed because developing a valid assessment model for informal STEM learning has been quite challenging (Gao et al., 2020). The findings indicate that informal PjBL had a significantly greater impact on students’ gains (compared to a traditional classroom setting) when employing the triangulation assessment method (d = 0.334) rather than indirect assessment alone (d = 0.140) or a combination of direct and indirect assessment methods (d = 0.178). An adequate combination of assessment tools based on the triangulation method should be considered, since it is crucial for instructors to carefully interlink learning objectives, pedagogical models, and assessment methods (Tuunila & Pulkkinen, 2015). Such careful alignment of assessment methods with learning objectives increases instructors’ opportunities to help students learn, attain knowledge, and practice skills more efficiently (Ghrayeb et al., 2011).

While PjBL has been practiced in Asia (e.g., Awad, 2021; Mohd Shahali et al., 2019; Lu et al., 2021), there has been a dearth of literature examining its effectiveness there compared to Western contexts, and more cross-national research is required to help researchers and educators better comprehend the effectiveness of PjBL in diverse cultural contexts. In this study, the QB statistic was not statistically significant for study location, meaning it does not influence the effectiveness of informal PjBL. Similarly, the educational level at which informal PjBL is most effectively implemented has been widely investigated (Ayaz & Söylemez, 2015; Balemen & Keskin, 2018; Chen & Yang, 2019); PjBL has been widely employed in high school, and more work is required to showcase its effectiveness at the primary level (Han et al., 2015). In our moderator analysis, the QB statistic was likewise not statistically significant for educational level, meaning it does not influence the effectiveness of informal PjBL. These findings (regarding study location and educational level) partially align with the meta-analytical review by Chen and Yang (2019), though the contexts differ in terms of the educational levels targeted (their study included articles from primary to tertiary levels, while this study incorporated only school-level education). In addition, their study categorized articles by location into Europe, North America, and western and eastern Asia, whereas our study used the categories of Asia, Europe, eastern U.S.A., southern U.S.A., southwestern U.S.A., and western U.S.A.

The QB value was significant for group size, indicating that it does influence the informal PjBL model (p < 0.05). The results indicate that informal PjBL had a significantly greater impact when students were grouped in small groups of 3–4 members (d = 0.237) rather than groups of fewer than 3 (d = 0.142) or more than 5 (d = 0.169). Some studies have shown that a small group size of 3–4 is suitable for a class size of 18–32 students (Calabrese & Capraro, 2021; Tekbıyık et al., 2022; Todd & Zvoch, 2019), while others did not specify the batch/class size (Bokor et al., 2014; Mateos-Nunez et al., 2020; Newton et al., 2020). These findings accord with Bertucci et al. (2010) and Kooloos et al. (2011), who recommend grouping students in groups of 3–4 and fewer than 5 members, respectively, for better learning gains, student participation, and satisfaction in PjBL. A potential justification could be that in a very small group there might be less peer discussion, while in very large groups there might be conflicts of interest due to the diversity of ideas and perceptions. Interestingly, the findings of Apedoe et al. (2012) are also noteworthy and partially align with this study’s findings. Although their study reported the effectiveness of pairing and of grouping in 3–4, they contended that these findings should be seen in conjunction with the class type (e.g., basic vs. advanced) and the type of knowledge (basic vs. advanced). For basic knowledge (i.e., requiring no transfer of gains), the size of the group is probably not significant; however, for advanced knowledge (i.e., requiring close transfer of gains), students in basic classrooms benefit most from groups of 3–4, while students in advanced classrooms are better served by working in pairs (Apedoe et al., 2012). Notably, however, none of the shortlisted articles incorporated the concept or significance of role assignment in groups (Schellens et al., 2007).

Next, the moderator analysis of the subject area computed a significant QB (p < 0.05). The results indicate that informal PjBL had a significantly greater impact when students were taught general science subjects (d = 0.321) rather than engineering, mathematics, or technology. A possible reason is that PjBL incorporates investigations, experiments, modeling, and interpretation, and is thus best employed in the general science disciplines. These findings are in line with the study by Ayaz and Söylemez (2015), which reported the highest effectiveness for “general science” compared to the biology, physics, and chemistry disciplines. Dissecting the various science disciplines, the findings of Balemen and Keskin (2018) are noteworthy: they reported the highest impact of PjBL models when employed for biology (d = 1.147), compared to physics and chemistry, though they did not compute the effectiveness of PjBL for technology and engineering subjects.

Finally, the moderator analysis of instructional time was performed to determine how much time should be devoted to informal PjBL for the effective acquisition of students’ gains. The findings depicted a significant QB value (p < 0.05), suggesting that informal PjBL has a higher impact when the course duration is within a week (d = 0.284) rather than 1–2 weeks, 3 weeks, or a month or more. These findings are partially in accordance with the meta-analytical findings of Ayaz and Söylemez (2015), which recommended 1 ≤ h ≤ 20 as the most effective course duration for PjBL models. Although much informal education spans only a day, none of the articles shortlisted for this meta-analysis employed a one-day workshop or single-session program. Conclusively, all seven moderator analyses performed in this study have proven critical to understanding the impact of the various moderator variables on the effectiveness of informal PjBL for STEM education (with respect to the traditional classroom setting).

The study findings should be seen in light of some limitations. This study focused solely on peer-reviewed articles and discarded data from theses, books, conference proceedings, etc., and the review incorporated only empirical literature aligning with the CMA format. Nonetheless, our study raises important opportunities for future research. For example, more research is needed on longitudinal informal PjBL interventions, because some student variables might take time to be expressed and assessed. Future research could also include a meta-analysis incorporating undergraduate participants, and informal PjBL for non-STEM subjects (e.g., humanities, languages) could also be explored.

Conclusion

The findings of the study demonstrated a positive impact of informal PjBL on students’ learning gains (d = 0.248) compared to traditional classroom settings. Although previous meta-analytical studies have confirmed the effectiveness of PjBL models (Ayaz & Söylemez, 2015; Balemen & Keskin, 2018; Chen & Yang, 2019), none has focused explicitly on the informal PjBL model executed for school students in STEM education. Since designing and executing the PjBL model in an informal setting, considering all the influential factors, can be challenging, this study investigated the potential moderator variables that might impact the effectiveness of informal PjBL models. The moderator analysis revealed that informal PjBL was influenced by the teaching model, assessment method, students’ group size, subject area, and course duration, but not by the educational level of the participants or the study location. The paper thereby concludes that the “informal PjBL” model could be most effective for “general science” instruction when executed within “a week”, with students dispersed in small groups of “3–4 members” and assessed via “triangulation assessment”. We believe that this research will help academicians and researchers comprehend the moderator variables affecting informal PjBL and assist them in designing and implementing effective informal PjBL strategies at the school level for STEM education.