1 Introduction

Artificial Intelligence (AI) in Education is an emerging and booming field (Zawacki-Richter et al., 2019), and one of its disruptive and transformative applications is Intelligent Tutoring Systems (ITSs). ITSs are computer programs “designed to incorporate techniques from the artificial intelligence (AI) community in order to provide (intelligent) tutors which know what they teach, who they teach, and how to teach it” (Nwana, 1990, p. 252). ITSs can determine the learning path, select and recommend learning content to students, provide scaffolding, help engage students in dialogue, and simulate one-to-one tutoring, among others (Zawacki-Richter et al., 2019). They can also provide customized experiences for different students, teachers and tutors (Churi et al., 2022). Thus, ITSs have enormous potential to support teaching and learning, especially in large-scale distance teaching institutions where human one-to-one tutoring is very difficult (Luckin et al., 2016).

To examine the performance and effectiveness of ITSs, several types of studies have been conducted. One type of study focused on evaluating the technical abilities of ITSs and answered questions such as what ITSs can do; other studies focused on evaluating the effectiveness of ITSs as an intervention to improve teaching and learning in a real educational context (Colby, 2017). The second type of study is particularly important to ITS developers and educational practitioners. From the perspective of ITS developers, ITSs are not just scientists’ cool ideas, but also something that should influence the authentic practice of education (Koedinger & Aleven, 2016). Educational practitioners are especially interested in how ITSs can help improve education. Therefore, it is important to conduct field experiments in real educational contexts (i.e., social experiments) to evaluate the effectiveness of ITSs (Corbett et al., 2001).

A social experiment is a research method in which one or more alternative treatments are used as interventions into normal social processes and compared (Riecken & Boruch, 1974). This method summarizes the available information about how randomized experiments can be used in planning and evaluating ameliorative programs (Riecken & Boruch, 1974). Applying social experiments to investigate the effectiveness of ITS is crucial, as social experiments are rooted in real and natural educational environments involving the practice of end users (e.g., students and teachers). A social experiment carefully considers and controls the potential confounding factors that may affect the observed effectiveness (Riecken & Boruch, 1974). Therefore, it allows causal relationships between the proposed intervention and its effects to be discovered, and provides reliable evidence to confirm the effectiveness of an intervention (Rolston, 2016).

Despite the volume of literature highlighting the importance of considering social experiment, as a method, to evaluate ITSs, the literature is still fragmented about the practices to do so and the potential challenges. Prior Systematic Literature Reviews (SLRs) on ITS in the literature focused on evaluating the technical performance of ITSs or examining the educational effectiveness of ITSs without considering the critical dimensions of social experiment (e.g., the timespan and sample size of the conducted studies). No research, to the best of our knowledge, focused on conducting an SLR on ITS from the social experiment perspective. Therefore, to address this gap, this present study aims to systematically review social experiment research investigating the effectiveness of ITSs in real and natural educational contexts.

Different from the previous review studies on ITS, this present study focused on the social experiment perspective when reviewing ITS research and only included and examined studies that focused on the educational effectiveness of ITS in real educational contexts. Conducting such a literature review can help to summarize the outcomes, features and challenges of ITS research using social experiment methods, and hence inform and guide the relevant practices in this context. For educational practitioners, the key takeaway of such SLRs is that they can provide field evidence on how ITSs work in real educational contexts. For researchers, they can indicate the critical challenges and factors that might influence the observed effectiveness of ITS in education. In addition, a summary of prior studies and of the factors that influence the success of experimental implementation can guide and inform practitioners who implement and assess ITSs in the future. Therefore, this study contributes to the literature theoretically and practically. From a theoretical perspective, it enriches the ongoing debate on ITS and social experiment by explaining the current inconsistent results regarding the effectiveness of ITSs on learning and teaching reflected in existing literature. From a practical perspective, this study can help different stakeholders (e.g., ITS researchers and developers, and educational practitioners) learn how to implement ITSs that could effectively work in real and natural educational contexts, keeping in mind several factors (e.g., sample size or interventions).

In the following sections, related literature is reviewed, the detailed systematic review process is reported, and the findings are analyzed and discussed with their implications.

2 Literature review

2.1 The features of ITSs and their applications in education

ITSs can customize instructional activities and strategies based on students’ characteristics and needs (Keleş et al., 2009). To provide the desired features, ITSs need to have several components, namely: (1) an expert module, which contains the knowledge for students to learn (Ma et al., 2014); (2) a student diagnosis module, which collects and updates information about students’ knowledge, skills, behaviors, responses, learning styles, etc. (Ma et al., 2014); (3) an instructional module, which focuses on the strategies and methods of teaching and delivering customized learning content (Carter, 2014); and (4) a user interface, which enables the interaction between users and the system (Burns & Capps, 1988).

ITS has been applied in many subject areas to transform teaching and learning. For example, ITSs were used in computer science education to teach students programming skills, followed by medical education and math education (Mousavinasab et al., 2018). In medical education, ITSs were used to help students learn anatomy, physiology, and diagnosis related knowledge and skills. In mathematics, ITSs were used to facilitate learning numbers, spaces, patterns and structures (Mousavinasab et al., 2018).

To investigate whether ITS has a significant impact on teaching and learning, the effectiveness of ITSs must be evaluated in real and natural educational contexts with proper experimental design, reasonable duration and sufficient sample size. This type of evaluation usually uses field trials or experiments (Koedinger & Aleven, 2016), also known as “social experiments” in social science, as its research method. The major purpose is to evaluate the effectiveness of ITS as an intervention to improve learning and teaching and to answer research questions such as whether ITS works effectively in a real educational context (Koedinger & Aleven, 2016).

2.2 Social experiment and its features

Social experiment is a research method used in social science, which is defined as a random assignment of participants to two groups to examine the effects caused by social policies (Social experiment, 2008). A social experiment method is a pragmatic trial, with a lot in common with field experiments (Forget, 2019). This method investigates how randomized experiments might be used in planning and evaluating ameliorative social programs (Riecken & Boruch, 1974). In social experiments, one or more treatments are used as interventions and compared (Riecken & Boruch, 1974) to evaluate the effectiveness of the intervention and answer questions like whether the intervention works in the real world (Forget, 2019).

Social experiments have a set of features. Their context is usually set in nonstationary environments in the real world (Fienberg et al., 1985). Since social experiment studies occur in a natural environment, the results can be affected by more “distracting” factors from social, political and economic perspectives. To control the effect of these factors, rigorous experimental design, matching techniques to formulate comparable groups, and advanced analysis techniques are often adopted (Rolston, 2016). Participants should ideally be randomly drawn from a specified population, and random assignment should ensure that differences in the average behavior of the two groups can be attributed to the treatment. However, in reality, there is less choice beyond basic eligibility, and blinding is usually impossible due to various limitations in the real world (Forget, 2019). The intervention implementation is usually flexible according to the situation in the real world (Forget, 2019). A comparator is essential (Forget, 2019); observations or measurements are used to investigate how some relevant aspects of participants’ behaviors differ from those drawn from the same population without treatment. The outcome measures and data collection are directly relevant to stakeholders, such as participants and communities. Social experiments should have evaluative conclusions about the effectiveness of the intervention (Greenberg & Shroder, 2004).

2.3 Examining the effectiveness of ITSs using social experiment methods

Using social experiments to investigate the effectiveness of ITSs is crucial and necessary, since ITSs were not supposed to be effective in principle only, but also to serve as tools that can be integrated into a full curriculum and enhance it (Corbett et al., 2001). Attention therefore must be paid to the social contexts of schools, training centers or companies where ITSs are used and evaluated (Corbett et al., 2001). When a social experiment is properly implemented, the analyst can ensure that a given intervention has led to a given result (Riecken & Boruch, 1974). A social experiment examines the intervention in real contexts with stakeholders involved. It also considers the potential impact of multiple contextual factors or other confounding factors and then uses rigorous study design, appropriate group matching techniques to formulate comparable control groups, and advanced analytical techniques to control the effects of these confounding factors so that the effect of the proposed intervention can be more accurately detected (Rolston, 2016). Such an advantage makes social experiments a strong method of discovering causality (Rolston, 2016).

On the other hand, there exist several crucial challenges related to applying social experiments with ITSs, which can influence the success of experiment implementation, thus affecting the obtained results. These challenges include getting the cooperation of schools to conduct the needed study, handling hardware issues on site, and integrating ITS into the existing social contexts of schools and instructional practices (Koedinger & Aleven, 2016). However, this summary of the challenges was drawn from only a few studies and is not comprehensive. To support successful application and evaluation of ITS, it is also necessary to comprehensively understand and summarize the challenges reflected in prior related studies.

2.4 Related SLRs focusing on ITS

Several SLRs have been conducted on ITS from different aspects (see Table 1). Some of these reviews focused on ITSs used for domain-specific learning. For example, Neagu et al. (2020) reviewed studies focused on the efficacy of ITSs in improving psychomotor training. Crow et al. (2018) reported key information about existing ITSs used for programming education. Feng et al. (2021) and Alabdulhadi and Faisal (2021) reviewed ITS studies supporting STEM-related learning, while Atun (2020) reviewed ITS studies aimed at improving reading comprehension.

Table 1 The key characteristics of prior literature reviews on ITSs

Other SLRs focused on the evaluation of technological features of ITS. For instance, Paladines and Ramirez (2020) reviewed ITSs incorporating natural dialogue systems. Soofi and Ahmed (2019) reviewed studies that focused on domains, techniques, delivery methods and validation methods of ITS. Cuéllar-Rojas et al. (2021) conducted a systematic review focusing on educational evaluation mediated by ITS. Finally, Mousavinasab et al. (2018) reviewed the overall characteristics, applications, and evaluation method of ITS.

In sum, related SLRs mostly focused on ITS for learning in a specific subject domain, the technical features of ITS, or the overall review of ITS. Yet, none of the aforementioned reviews focused on the use of social experiment methods in evaluating ITS effectiveness in education. The studies examining ITS using social experiments are important, and a summary of these studies can map the overall landscape of the application and evaluation of ITS in real educational contexts and guide future studies. To cover this gap, this study conducts an SLR to synthesize research that adopted social experiment methods to explore the effectiveness of ITS as an intervention in teaching or learning in real and natural educational contexts. Guided by the key features of social experiments pointed out by Riecken and Boruch (1974), Greenberg and Shroder (2004), Forget (2019) and Rolston (2016), this review aims to answer the following research questions (RQs):

  • RQ1. What is the trend of ITSs with social experiment research in terms of publication year and the countries where they were applied?

  • RQ2. What types of ITS have been utilized and evaluated using social experiment method?

  • RQ3. What are the characteristics of ITS research using social experiment method in terms of study contexts, sample size, time span, and study design?

  • RQ4. What are the impacts of ITSs through social experiment assessment?

  • RQ5. What are the challenges of applying social experiment method to assess the effectiveness of ITS?

3 Methodology

This study followed the recommendation of Kitchenham and Charters (2007) on how to conduct a systematic literature review, which covers three stages, namely: (1) planning the review, which refers to the need for the review and the stated research questions; (2) conducting the review, which refers to the search process of the papers to be included in the review, as well as the data extraction method; and (3) reporting the review, which describes the way of presenting the results. Each of the three stages is detailed in the following subsections. Additionally, the literature screening followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) proposed by Moher et al. (2015).

3.1 Planning the review

A search for studies was conducted in the following databases, which are popular in the field of educational technology: Web of Science, Scopus, IEEE Xplore and ERIC. To deal with the complex topic, the combination of search strings presented in Table 2 was used. Specifically, search terms for ITS were partially adapted from previous reviews (Li & Wong, 2021; Mousavinasab et al., 2018). The asterisk was used as a wildcard to broaden the search. The search string was formulated as: (Intelligent* OR adaptive OR customized) AND (learning OR instruction OR education OR tutoring OR mentoring) AND (system OR software OR application) AND (experiment* OR trial OR evaluat* OR “social experiment”).
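As a minimal sketch of how a Boolean search string of this form can be assembled, the Python snippet below joins each group of synonyms with OR and then joins the groups with AND. The names `TERM_GROUPS` and `build_query` are illustrative assumptions rather than part of the review protocol, and each database may need its own field tags or wildcard rules.

```python
# Illustrative only: the term groups are copied from the search string above.
TERM_GROUPS = [
    ["Intelligent*", "adaptive", "customized"],
    ["learning", "instruction", "education", "tutoring", "mentoring"],
    ["system", "software", "application"],
    ["experiment*", "trial", "evaluat*", '"social experiment"'],
]


def build_query(groups):
    """Join synonyms with OR inside parentheses, then join groups with AND."""
    return " AND ".join("(" + " OR ".join(terms) + ")" for terms in groups)


query = build_query(TERM_GROUPS)
print(query)
```

Keeping the term groups in a list makes it straightforward to add or remove a synonym and regenerate the same string for every database.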

Table 2 Search terms

The obtained papers were then filtered according to the inclusion and exclusion criteria presented in Table 3. To ensure the quality of the obtained results, only peer-reviewed empirical studies published in journals or proceedings were included (Harris et al., 2014). The time frame was set as 2011–2022, as 2011 was considered the year in which AI that assists people began to boom. For instance, IBM’s Watson won the television quiz show Jeopardy! and Apple’s Siri was released. Thus, it would be important to see how this impacted ITSs, which integrate AI to assist teaching and learning. In addition, following the best evidence criterion for ensuring good external validity proposed by Slavin (1986), this present study only included experiments with a time span of 8 weeks or more. Regarding the sample size, we followed Guha’s (2008) suggestion about social experiments and included only studies with a sample size of 100 participants or more. The final search was conducted on October 14th, 2022, which yielded 21,294 studies from the specified databases and 28 studies identified by going through the references of the obtained articles.
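The quantitative thresholds described above can be sketched as a simple screening filter. The snippet below is a hedged illustration in Python: the `Study` record, its field names, and the example entries are hypothetical and are not taken from the reviewed papers; only the thresholds (2011–2022, at least 8 weeks, at least 100 participants, peer-reviewed) come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the inclusion criteria described above.
YEAR_RANGE = (2011, 2022)
MIN_WEEKS = 8
MIN_PARTICIPANTS = 100


@dataclass
class Study:
    """Hypothetical record of a candidate study; field names are assumptions."""
    title: str
    year: int
    weeks: float
    participants: int
    peer_reviewed: bool


def meets_criteria(study: Study) -> bool:
    """Return True when a study satisfies every stated threshold."""
    return (
        study.peer_reviewed
        and YEAR_RANGE[0] <= study.year <= YEAR_RANGE[1]
        and study.weeks >= MIN_WEEKS
        and study.participants >= MIN_PARTICIPANTS
    )


# Made-up examples for illustration; these are not studies from the review.
candidates = [
    Study("Example A", 2015, 12, 240, True),
    Study("Example B", 2009, 20, 500, True),  # outside the time frame
    Study("Example C", 2018, 4, 150, True),   # shorter than 8 weeks
    Study("Example D", 2020, 10, 60, True),   # fewer than 100 participants
]
included = [s.title for s in candidates if meets_criteria(s)]
print(included)
```

A filter like this only covers the criteria that can be checked mechanically; judgments such as whether a study is empirical or set in a real educational context still require screening by human reviewers.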

Table 3 Inclusion and exclusion criteria

Finally, two authors analyzed the retrieved papers by titles, abstracts, and, if necessary, by full text, based on the pre-defined inclusion and exclusion criteria (see Table 3). Figure 1 presents the flow diagram of the study selection process. At the end of this process, 40 studies were identified as being relevant to the purpose of this present SLR.

Fig. 1 The study selection process

3.2 Conducting the review

This stage includes the data extraction process. A coding scheme, as shown in Table 4, was developed based on the major components of social experiments indicated in the literature review section to answer the aforementioned research questions. To reduce the opportunity for bias, an electronic data extraction form based on the coding scheme was designed (Kitchenham & Charters, 2007).

Table 4 Description of the coding scheme

3.3 Reporting the review

In this stage, the extracted data, based on the coding scheme, were compared and discussed to answer the research questions.

4 Findings

4.1 RQ1. What is the trend of ITSs with social experiment research in terms of publication year and the countries where they were applied?

A total of 40 articles (see Appendix) were finally reviewed and coded based on the coding scheme. Figure 2 shows that the number of published studies on ITS with the social experiment method peaked in 2013, 2017 and 2020. Specifically, 2020 was the year with the highest number of publications (9 studies).

Fig. 2 Frequency of publications on ITS studies that applied the social experiment method

Studies that examined ITS using social experiment methods in the set time span (length ≥ 8 weeks) and sample size (N ≥ 100) have been carried out in several countries, as shown in Fig. 3. Specifically, 60% (n = 24) of these experiments were carried out in the USA, followed by the Netherlands (10%, n = 4) and China (8%, n = 3).
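The percentages above follow directly from the counts and the 40 reviewed studies. As a small worked check, assuming only those reported figures (the helper name `share` is illustrative):

```python
TOTAL_STUDIES = 40  # number of reviewed studies reported above


def share(count, total=TOTAL_STUDIES):
    """Percentage of `total` represented by `count`."""
    return 100 * count / total


# Country counts as reported in the text.
country_counts = {"USA": 24, "Netherlands": 4, "China": 3}
for country, n in country_counts.items():
    print(f"{country}: n = {n}, {share(n):.1f}%")
```

China's 3 of 40 studies correspond to 7.5%, which the text reports rounded to 8%.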

Fig. 3 Regional distribution of the reviewed ITS studies

4.2 RQ2. What types of ITS have been utilized and evaluated using social experiment method?

As described in Table 5, six categories of ITSs have emerged, which are presented from the largest to the smallest in terms of the number of studies involved: (1) recommendation and tutoring (13, 32.5%), (2) personalized support (12, 25%), (3) exercise and assessment (7, 17.5%), (4) personalization (6, 15%), (5) adaptive conversation (3, 7.5%), and (6) game-based learning (1, 2.5%). Specifically, among all the individual ITSs that emerged, Cognitive Tutors were the most frequently applied and evaluated (7, 17.5%). In addition, two other ITSs that may suggest trends in ITS development are worth mentioning. First, 100 Nano tutors were embedded in video lessons to guide students’ understanding of narrowly defined skills (Goel & Joyner, 2017). Second, complex tutors integrating multiple tutors have been developed and used. For example, SKOPE-IT combined the ITSs AutoTutor and ALEKS to process more complex tasks (Nye et al., 2018).

Table 5 Categories of the applied and evaluated ITSs

4.3 RQ3. What are the characteristics of ITS research using social experiment method in terms of study contexts, sample size, time span, study design and benchmarking used for evaluation?

Table 6 shows that the reviewed studies have been conducted across most educational levels, including kindergarten, primary schools, secondary schools, higher education, and adult learning. Specifically, secondary education is where most of the ITS studies (16, 40%) have been conducted, followed by higher education (14, 35%) and primary education (13, 32.5%). However, adult learning (1, 2.5%) and kindergarten education (1, 2.5%) were the least investigated using social experiments, calling for further research in these contexts. Finally, 5% of these studies were implemented across several educational levels. For example, Zhang and Jia (2017) implemented their experiments in both primary schools and secondary schools, and Wetzel et al. (2017) conducted their experiments in secondary schools and higher education.

Table 6 Educational levels where the studies were conducted

Additionally, ITSs have been applied to support learning in multiple subject areas, such as language, math, science, computer science, medicine, history, economics, geometry, and engineering. Math (33%), language (24%), and science (17%) were the primary subject areas where ITSs were applied to support learning and teaching. In contrast, engineering (5%), history (2%), and economics (2%) were the areas where relatively fewer studies were conducted (see Fig. 4).

Fig. 4 Distribution of subjects where ITS studies were conducted

Table 7 shows that 62.5% of the reviewed studies had between 100 and 500 participants. Notably, 7 studies (17.5%) had more than 1000 individual participants. It was also found that 5 studies (Pane et al., 2014; Wijekumar et al., 2013; Wijekumar et al., 2014; Wijekumar et al., 2020; Zhang & Jia, 2017) conducted large-scale experiments and involved whole schools as participants.

Table 7 Sample size of the reviewed ITS studies

According to Table 8, the time span of ITS experiments varied from 8 weeks to 5 years. For 55% of the studies, the time span was more than one year, while about 45% of the ITS social experiments had a time span of less than one year.

Table 8 Time span of the conducted ITS experiments

For the characteristics of experimental design (Table 9), 22.5% of the studies did not apply random assignment. In addition, five types of experimental design were used, namely (1) Randomized Controlled Trial; (2) Quasi-experiment; (3) Natural experiment; (4) Randomized Alternative-Treatment Design; and (5) Longitudinal Study. Quasi-experiment (45%) was the most frequently used experimental design, followed by Randomized Controlled Trial (37.5%). In contrast, longitudinal study (10%), natural experiment (5%), and Randomized Alternative-Treatment Design (2.5%) were rarely used. Finally, only 15% of the studies (n = 6) used the matching technique to formulate comparable control groups (Cung et al., 2019; Hickey et al., 2020; Mousavi et al., 2021; Pane et al., 2014; Spichtig et al., 2019; Troussas et al., 2021).

Table 9 Study design and assignment method used in the reviewed ITS experiments

To evaluate the effectiveness of ITSs, the included studies used different benchmarks for comparison, which can be classified into three types (see Table 10). First, business-as-usual was used as a benchmark for comparison with experimental groups where ITSs were used. This category included conditions where human tutors were used or no additional tutoring was provided. This is the largest group among the included studies (62.5%, n = 25). This result indicates that most ITS-related social experiment studies treated the ITS as a whole when comparing it with a condition where no ITS was used. Because viewing an ITS as a whole can create a black-box effect, it is difficult to understand which sub-components of the ITS were less effective. Second, ITSs carrying alternative treatments were used as a benchmark (32.5%, n = 13), for example, an ITS that enables personalization based on other student characteristics, or one that provides no personalized content or feedback. Benchmarks of this type are ITSs with alternative features, which can help overcome the black-box effect and examine which part(s) of an ITS really work. The last category (5%, n = 2) is the blended mode, in which a combination of an ITS and another form of instruction, such as a human tutor, is used as a benchmark. This method directs researchers to compare pure machine-enabled intelligence with the hybrid intelligence based on the blended mode of ITS and human tutors.

Table 10 The criteria utilized in benchmarking and evaluating ITS

4.4 RQ4. What are the impacts of ITSs through social experiment assessment?

As shown in Table 11, learning performance was the most investigated outcome (36, 90%) used to measure the effectiveness of ITS. Other outcomes included teachers’ perceptions (5, 12.5%) and students’ help-seeking (4, 10%), engagement (3, 7.5%) and interest (2, 5%). The impact of ITS on each of the aforementioned outcomes is discussed in the following subsections.

Table 11 Outcomes measured in the reviewed studies

4.4.1 The impact of ITS on learning performance

The results regarding the overall impact of ITSs on learning performance are mixed (see Table 12). 62.5% (25) of the studies reported positive effects. Specifically, researchers reported a set of situations where the positive effects of ITS on learning performance were identified. These situations included ITS explaining how to proceed in learning and why an answer is correct (Kegel & Bus, 2012); using a novel teaching strategy (Troussas et al., 2021; Wijekumar et al., 2012, 2013, 2014, 2020); having an open learner model (Long & Aleven, 2017); being designed based on cognitive science (Watkins et al., 2020); providing adaptation according to the combination of learning style and knowledge level (Alshammari & Qtaish, 2019); conducting data analysis while controlling for confounding factors (Baker et al., 2020; Wijekumar et al., 2012, 2013, 2014); targeting students with specific features, such as white male students (Huang et al., 2013) and low- and middle-achieving students (Bartelet et al., 2016); and being blended with human instructor-led instruction (Pane et al., 2014), among others.

Table 12 The effects of ITS on learning performance and associated studies

37% of the studies reported no significant effect on learning performance. Specifically, ITSs that provided the following features did not have any significant impact on learning performance: multiple templates of problem formats (Jiang et al., 2020), verbal explanations (Lee et al., 2013), tutoring-enhanced interactive solutions (Nye et al., 2018), a combination of outer loop feedback and inner loop feedback (Tacoma et al., 2020), character animation technology (Ward et al., 2013), and communication via spoken dialog and analysis while controlling the effects of covariates (Wijekumar et al., 2020).

12.5% of the studies reported negative effects caused by ITS. For example, ITS was found to have a negative effect on learning growth for higher achieving students (Bartelet et al., 2016) and to increase the error rates related to glossary learning (Roll et al., 2011), and ITS with multiple templates of problem format reduced student efficiency (Jiang et al., 2020). Researchers also warned that ITS can have potential negative effects on students’ problem solving when it does not explain how to proceed during learning (Kegel & Bus, 2012) and that ITS’s instructional scaffolds can reduce students’ active processing (Butcher & Aleven, 2013).

4.4.2 The impact of ITS on students’ help-seeking, engagement, learning interest, attitude, and confidence

As shown in Table 13, 10% of the studies reported the effect of ITSs on help-seeking. Two of them reported positive effects and two indicated no effect. Specifically, it is reported that ITS can improve students’ help-seeking skills (Roll et al., 2011) and reduce the assistance that students need from teachers (Craig et al., 2013). In contrast, Lee et al. (2013) and Jiang et al. (2020) indicated that ITS did not influence students’ hint use or requests. Thus, mixed effects of ITS on help-seeking were found.

Table 13 ITS’s impact on help-seeking, engagement, and learning interests

Regarding the effects of ITSs on learning engagement, the results are also mixed (see Table 13). It was found that students with lower skill levels spent more time on practice tasks with ITS (Bartelet et al., 2016) and that ITS increased students’ situational awareness (Capone et al., 2022). However, Craig et al. (2013) found that there was no significant influence of ITS on learning task involvement. For learning interest, Bernacki and Walkington (2018) and Chang et al. (2016) both reported that ITS improved students’ learning interests (see Table 13). For attitude and confidence in math learning, Pane et al. (2014) reported that ITS did not have any significant effect.

4.4.3 Teachers’ perceptions on adopting ITS in teaching and learning

Through the use of ITSs, teachers gained better perceptions of their work, including positive perceptions of student experiences with ITSs (Baker et al., 2020). Teachers felt that students were more enthused and engaged in learning (Feng et al., 2014; Ward et al., 2011, 2013). They also perceived that ITSs reduced their workload (Craig et al., 2013; Feng et al., 2014). With more time saved, teachers focused more on problematic areas identified by the learning reports generated by ITSs, and their work focus shifted from checking the correctness of each problem to explaining and elaborating on the mistakes that students made (Feng et al., 2014).

4.5 RQ5. What are the challenges of applying social experiment method to assess the effectiveness of ITS?

It is reported that the central challenge lies in improving the effectiveness of ITS on learning (Zhang & Jia, 2017). This challenge may stem from a set of other associated challenges reported by several studies.

First, students’ limited task involvement. This limited task involvement takes different forms, such as low completion rates in the assignments (Jiang et al., 2020). Particularly for young students, learning with ITS involves regulatory skills that might be too demanding (Kegel & Bus, 2012). Cung et al. (2019), Nye et al. (2018), Roll et al. (2011) and Wijekumar et al. (2012) also mentioned high attrition issues, such as participants’ withdrawal or absenteeism. Moreover, insufficient involvement due to small sample sizes was noted. Huang et al. (2013) and Bernacki and Walkington (2018) reported small sample size issues, which may cause sample bias and therefore favor the control group (Craig et al., 2013), making it difficult to detect the effects of treatments (Bartelet et al., 2016) or achieve good generalizability of the conclusions (del Olmo-Muñoz et al., 2022).

Second, handling students’ individual differences. Students’ individual differences can lead to a moderate effect size of the proposed ITS intervention (Kegel & Bus, 2012) or a ceiling effect (Bartelet et al., 2016) that limits the ability to detect potentially significant positive effects of ITS interventions. Students’ backgrounds and personal characteristics can vary from one person to another and from one semester to another (Bernacki & Walkington, 2018; Cung et al., 2019). Thus, dealing with these individual differences can be challenging.

Third, limited resources and competencies. For example, studies reported electricity outages and a lack of computer labs and computers (Wijekumar et al., 2013), high-quality video equipment (Roll et al., 2011) or learning systems (Mousavi et al., 2021). In addition, there can be limited data access (Hickey et al., 2020; Butcher & Aleven, 2013; Jiang et al., 2020) and insufficient resources for one-on-one tutoring (Ward et al., 2011), for applying randomized assignment (Treceño-Fernández et al., 2020; Hickey et al., 2020), and for keeping the intervention dosage (Nye et al., 2018), grading scheme, and number of benchmarks (Cung et al., 2019) consistent during the study. Additionally, in some studies, the participants did not have the necessary skills (e.g., keyboarding; Baker et al., 2020) to manage and use ITSs. A lack of available time for an experiment is also an issue (Bartelet et al., 2016; Goel & Joyner, 2017; Wetzel et al., 2017) that prevents studies from obtaining long-term evidence (Kegel & Bus, 2012).

Fourth, methodology-related challenges. Social experiments can suffer from a black-box effect, which prevents researchers from separating and measuring the specific effects of different ITS components (Long & Aleven, 2017). In addition, a social experiment requires a long time span to capture the effect of interest. Thus, as time passes, addressing the implementation dip and maintaining implementation fidelity can be challenging (Cung et al., 2019). There also remain challenges in balancing the cost of designing and developing an ITS against the benefits it brings (Bernacki & Walkington, 2018), as well as in balancing the ability of ITSs to encourage student engagement with the need to provide optimal challenge levels (Nye et al., 2018).

Finally, recent studies such as Capone et al. (2022) reported challenges related to the COVID-19 pandemic, including how instructors could adapt to an ITS-mediated remote teaching mode in a short time and how students and instructors could overcome the sense of disorientation caused by the pandemic emergency.
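
The small-sample and attrition challenges reported in this section can be made concrete with a rough sample-size calculation. The following is a minimal sketch rather than a procedure used by any of the reviewed studies: it assumes a two-group comparison of means, a two-sided test, and the standard normal approximation n = 2(z_{1-α/2} + z_{power})² / d², where d is the standardized mean difference. The function names and default values are illustrative assumptions.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants needed per group for a two-sided, two-group comparison.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2,
    where d (effect_size) is the standardized mean difference.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)


def n_to_enroll(n_needed: int, attrition_rate: float) -> int:
    """Inflate the enrollment target so n_needed participants remain after attrition."""
    if not 0 <= attrition_rate < 1:
        raise ValueError("attrition_rate must be in [0, 1)")
    return math.ceil(n_needed / (1 - attrition_rate))


if __name__ == "__main__":
    needed = n_per_group(0.5)          # moderate standardized effect (assumed value)
    print(needed, n_to_enroll(needed, 0.2))
```

Under these assumptions, a moderate effect (d = 0.5) needs about 63 participants per group, and a small effect (d = 0.2) needs several hundred, which illustrates why small or shrinking samples make treatment effects hard to detect. Exact t-based calculations give slightly larger numbers.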

5 Discussions and implications

ITSs have been further enhanced with AI-related technologies (Mousavinasab et al., 2018), which has made them an important tool for enabling personalized learning and transforming teaching methods, curriculum forms and learning environments. This review systematically analyzed and synthesized the studies conducted during 2011–2022 that examined the effectiveness of ITSs using the social experiment method. The findings indicated that the number of studies has increased slightly since 2011, reflecting a growing interest among researchers and practitioners in using social experiments to evaluate ITSs. Regarding study context, there is a regional “intelligent” divide in the application of ITSs, as the studies were unevenly distributed geographically and concentrated heavily in the USA. The most apparent difference in ITS application across countries lies in the number of studies conducted. Most of the studies were carried out in the USA, resulting in diversified ITS applications in this country, including using ITSs in after-school programs to enhance learning interaction in math (Craig et al., 2013), generating practice tasks for math learning (Jiang et al., 2020), and providing tutoring dialogs throughout the learning process (Nye et al., 2018). In contrast, other countries showed considerably fewer forms of ITS application because fewer studies were conducted there. This result is somewhat inconsistent with Nye (2015), who indicated that the AIED community is increasingly recognizing the importance of designing technologies for a global audience and that the digital divide is narrowing. This may be because conducting large-scale, long-running social experiments grounded in social reality is even more complex and needs a driving force from social policies (Forget, 2019).

This geographical distribution can also be explained by the statistical data on national financial investments in AI provided by OECD.AI (2022). The countries where most of the ITS-related social experiments were conducted also have the most AI-related financial investment. Based on the OECD data and the findings of the present study, it seems that sufficient financial investment is crucial for conducting ITS-related application and social experiment studies. Thus, how to mobilize and share resources and mitigate this ITS-related “intelligent divide” is a challenge for related stakeholders (e.g., policy makers, researchers and practitioners in this field) to address.

The featured functions of the merged ITSs primarily focused on recommendation and tutoring, followed by personalized support, exercise and assessment, personalization, adaptive conversation, and game-based learning. Among the various ITSs, Cognitive Tutors were used most extensively and were therefore the most influential. In addition, complex ITSs that combine multiple ITSs to handle complex tasks were developed and examined. For example, Goel and Joyner (2017) used 100 “Nano tutors” (each processing simple tasks) and coordinated them to help students learn AI skills. SKOPE-IT, which combined two ITSs, namely Auto Tutor and ALEKS, was applied to support math-related learning (Nye et al., 2018). This finding responds to and supports the suggestion by Padadines and Ramírez (2020) that ITS solutions should be more reusable and take advantage of existing ITSs as building blocks to save time and costs. These new merging forms of ITSs may indicate a new trend in the principles of designing and developing ITSs in the future.

Regarding the characteristics of the studies in terms of educational context, fewer studies were conducted in adult learning and kindergartens. This may be because kindergarteners lack the technology or regulatory skills needed to participate in ITS-supported learning (Casas et al., 2011). Adults may have fewer opportunities for formal, intensive and regular learning than K-12 students. For the focused subjects, consistent with Padadines and Ramírez (2020), the current study found that math, language, and science were the primary subjects in which ITSs were examined using the social experiment method. On the other hand, history, economics, and engineering were less investigated using ITSs and the social experiment method. However, Ma et al. (2014) found that studies using ITSs in humanities and social science had a significantly higher weighted mean effect size than those that used ITSs in math, computer science, physics, literacy, chemistry, and language. This inconsistency points to challenging areas of ITS application and to potential opportunities for applying and evaluating ITSs in humanities and social science subjects, calling for further investigation in this regard. For experimental design, quasi-experiments and randomized controlled trials are the dominant designs. In contrast, longitudinal studies and the Randomized Alternative-Treatment Design were rarely used. Several studies also did not apply randomization in assigning participants, which undermines the conclusion that the observed difference across groups can be attributed to the treatment (Social experiment, 2008). Consistent with the finding of Padadines and Ramírez (2020) about the lack of rigorous evaluation, the current study found that most of the studies did not apply matching techniques to form control groups in a solid way, suggesting that researchers should pay more attention to rigorous study design in the future so that the conclusions can be more valid and generalizable.
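
To illustrate the two design practices discussed above, the following minimal sketch shows (a) random assignment of participants to treatment and control groups and (b) a simple matching procedure for forming a comparable control group when randomization is not feasible. It is an illustration under stated assumptions, not the procedure of any reviewed study: the function names are hypothetical, matching uses a single covariate (a pretest score) with greedy 1:1 nearest-neighbor selection without replacement, and the optional `caliper` threshold is an assumed parameter. Real studies typically match on several covariates or on an estimated propensity score.

```python
import random


def randomize(participant_ids, seed=0):
    """Randomly split participants into treatment and control groups of (nearly) equal size."""
    ids = list(participant_ids)
    random.Random(seed).shuffle(ids)  # seeded so the assignment is reproducible
    half = len(ids) // 2
    return ids[:half], ids[half:]


def match_controls(treated, pool, caliper=None):
    """Greedy 1:1 nearest-neighbor matching on one pretest score, without replacement.

    treated, pool: dicts mapping participant id -> pretest score.
    caliper: optional maximum allowed score difference for a match.
    Returns a dict mapping each matched treated id -> control id.
    """
    available = dict(pool)
    pairs = {}
    for t_id, t_score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        # Closest remaining control candidate on the pretest score.
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if caliper is not None and abs(available[c_id] - t_score) > caliper:
            continue  # no sufficiently similar control; leave this participant unmatched
        pairs[t_id] = c_id
        del available[c_id]
    return pairs


if __name__ == "__main__":
    treatment, control = randomize(range(10), seed=42)
    print(sorted(treatment), sorted(control))
    print(match_controls({"t1": 50, "t2": 80}, {"c1": 52, "c2": 79, "c3": 20}))
```

Randomization balances both observed and unobserved characteristics in expectation, whereas matching can only balance the covariates that are measured, which is why the former is preferred when it is feasible.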

To measure the effectiveness of ITSs, most studies used business-as-usual as the benchmark for comparison. Some of them used ITSs as vehicles that carry different types of educational interventions. ITSs that carried alternative treatments were used as benchmarks for evaluating the focused instructional treatments. This method usually focuses on evaluating the sub-components of an ITS, which helps overcome the black-box effect-related disadvantages of the social experiment method (Peck, 2017) and helps improve ITSs in a more specific way. The blended form combining ITSs and human tutors was also used as a benchmark. This echoes the conclusion of a meta-analysis conducted by the U.S. Department of Education (2010), in which the blended form was better than a purely human or purely online method. It is possible that solutions that combine human and machine are more effective than solutions that rely on ITSs alone.

For the impacts of ITSs, most of the reviewed studies reported positive results. Learning performance is the primary measured outcome, consistent with the findings of Ma et al. (2014) and Mousavinasab et al. (2018). The effectiveness of ITSs on students’ help seeking, learning interest, engagement, attitude and confidence is less examined, calling for future investigation in this regard. Based on this finding, we suggest shifting the focus of the application and evaluation of ITSs from outcome-related cognitive constructs (e.g., learning performance) to process-related constructs (e.g., engagement and interest) and non-cognitive constructs (e.g., attitudes and confidence), and from the level of individual learning to the broader social context where learning occurs. The mixed results regarding the effectiveness of ITSs on learning performance and engagement are crucial issues to address. Ma et al. (2014) pointed out that moderator factors (e.g., ITS characteristics), contextual factors (e.g., educational setting) and research design can affect the actual effects of ITSs on the outcome variables. However, most of the studies in the current review did not measure these factors or fully control their potential effects in the analysis, which may have contributed to mixed or conflicting results. We therefore suggest that future studies examine the influence of ITSs from a broader social science perspective and consider moderating and mediating factors when designing an experiment and conducting data analysis.

Our findings also identified a set of challenges in applying social experiments to evaluate ITSs, including students’ limited task involvement, individual differences, limited resources, methodological and contextual challenges related to the social experiment itself, and the adaptation of teachers to this new ITS-based form of learning and curriculum. Failing to consider and address such challenges in advance may be one cause of the mixed results regarding the effectiveness of ITSs. Future work can start by addressing these challenges. Moreover, close collaboration among subject matter experts from different disciplines (e.g., social science experts and statisticians) is needed to address these challenges and ensure success throughout the design, development, application, experimentation and evaluation stages of examining the effectiveness of ITSs.

Our findings further imply that researchers in this field should consider how to increase students’ task involvement when applying ITSs, how to handle students’ individual differences and contextual features, and how to use emerging methods to design experiments and analyze data so that the effects caused by different ITS components can be measured accurately. Developers may draw inspiration from the merging ITSs and their features for future development. Researchers, practitioners, and policy makers should be aware of the digital divide across countries and regions, and share experiences and resources to advance the application of ITSs globally.

6 Conclusions and limitations

This systematic review depicted a complicated landscape of the primary studies during 2011–2022 that examined ITSs with the social experiment method. It contributes to the literature by identifying the latest trends and challenges, as well as potential factors that can explain the mixed results regarding the effectiveness of ITSs in real and natural educational contexts. Overall, our findings confirmed that ITSs can be very powerful in supporting teaching and learning. However, through the lens of social experiment, they also imply that technology itself cannot guarantee the success of ITS application. The complicated contextual and social factors in real educational fields can influence the observed effectiveness of ITSs. For study methods, this study suggests applying randomization in participant assignment, using matching techniques to form comparable control groups, and conducting more rigorous analysis to control the effects of confounding factors. In addition, more attention should be paid to non-cognitive and process-related outcomes.

This study has some limitations that should be acknowledged. For instance, the results are limited by the search keywords used and by the selected electronic databases and time span. Despite these limitations, this study provides solid grounds for investigating the use of social experiment methods to assess the effectiveness of ITSs. Future work could focus on the challenges and research directions reported in the present study to provide more insights into this research topic.