1 Introduction

Many children fail basic reading and math standards,Footnote 1 and the number of such students has greatly increased during the covid-19 pandemic. One-on-one human tutoring can be highly effective (Nickow et al., 2020), in part because it enables students to receive personalized, differentiated instruction, but it is often prohibitively expensive. Educational software aims to provide some of this personalized instruction at scale, but can still be costly and slow to build.

Reinforcement learning (RL) could reduce the cost of developing effective learning technology by automating the process of specifying how best to support a student through their learning journey. RL algorithms learn from data to choose an intervention (such as a hint), given the current context (such as an estimate of a student’s knowledge), to maximize the expected value of some desirable outcome, such as test scores. Preliminary work on using RL to improve educational software has produced encouraging gains in learning outcomes (Mandel et al., 2014; Chi et al., 2011; Park et al., 2019; Bassen et al., 2020) or student persistence (Mandel et al., 2014; Bassen et al., 2020). However, such systems have been limited to selecting among practice items, and not all experiments using RL to enhance educational technology have yielded positive outcomes (see the review by Doroudi et al. (2019)). It is unknown whether reinforcement learning could be used to automatically tune and optimize broader types of learning systems, such as the pedagogical feedback provided in a narrative environment. We also seek to do so in a way that is interpretable and robust, two important aspects of AI for societally impactful applications that are receiving increasing attention broadly but have not yet been considered as much in the AI for education space.

To address this, we created narrative-based educational software with adaptive pedagogical support for math concept learning for students roughly ages 9-12, and used reinforcement learning to adaptively (machine) learn which responses to provide to support student learning. Recent advances in explainability methods for deep neural networks (e.g. Lundberg and Lee (2017); Sundararajan et al. (2017)) have made it possible to use advanced tools for modeling without sacrificing interpretability. We used these methods to help understand if and how the system learns to differentiate in order to optimize desired outcomes. An additional key consideration is whether the learned pedagogical support would generalize to a different student community, as not all schools may be able to support online adaptive RL systems. We tested whether a distilled version of the decision policies learned in the first study could be used with a different student population that was more geographically diverse and had a lower household income distribution. In both studies, students with the lowest pretest scores improved when using our RL-powered narrative AI system, and improved more than students using a baseline system. This highlights the potential for reinforcement learning to tune educational software parameters to enhance effectiveness, in a way that is interpretable, transfers to other populations, and can help those most in need of support.

2 Related work: reinforcement learning for student learning

Reinforcement learning has seen impressive successes in areas like robotics (Levine et al., 2016) and game playing (Silver et al., 2018). The goal of a reinforcement learning algorithm is to compute a strategy (referred to as a “policy”) that specifies the intervention (such as a pedagogical activity) to choose in a particular context (e.g., a learner’s knowledge state and frustration level), in a way that is expected to maximize desired outcomes (e.g., test scores, engagement, retention). A key challenge is that the algorithm does not have prior knowledge of the statistical parameters governing the process by which contexts evolve and outcomes occur. Instead, the algorithm must learn a strategy with high expected outcomes from experience, by analyzing the actual decisions made and their observed outcomes.

In the context of education, there have been some promising results that reinforcement learning can improve word acquisition of preschoolers interacting with a social robot (Park et al., 2019), the persistence of learners during a fractions game (Mandel et al., 2014), the performance of college students learning introductory physics (Chi et al., 2011), undergraduates learning discrete mathematics (Zhou et al., 2019), and the outcomes and efficiency of working adults learning linear algebra (Bassen et al., 2020). However, in other settings, there has been little benefit over a reasonable control condition (Rowe & Lester, 2015; Doroudi et al., 2019). More broadly, work on intelligent tutoring systems and computer-assisted learning suggests that personalized feedback and support in educational software can be an effective way to support student learning (Corbett, 2001; Beal et al., 2010; VanLehn, 2011), but most prior work has focused on software designed to be used in the classroom where there are additional mechanisms to keep students’ attention.

We hypothesize that reinforcement learning may be particularly beneficial when learning happens outside the classroom, when motivation and engagement are particularly critical, or in less traditional curricula that move toward forms of instruction other than lecture and practice. The learning sciences offer less guidance about how best to support students in these settings. Yet such educational settings are likely to be increasingly important in the future, both because of the immediate challenges of the covid-19 pandemic and its aftermath, and because of the types of skills needed for success in the 21st century. Reinforcement learning may inform data-driven instruction for such settings, and we focus our attention on learners outside the classroom in this work.

As another contrast between our focus and prior related work, in the context of education, it is both important and of interest to understand what the algorithm learns to do: what personalized decisions are made for different contexts and individuals, and who is most helped by the algorithm. Such issues have been historically largely unstudied in the reinforcement learning research community, with some notable exceptions (e.g. Shen et al. (2016); Zhou et al. (2022)), but are an important part of our current work.

3 Interface design

Learning science principles can often be too broad to inform the specific design decisions needed to create engaging, effective educational software. For example, a narrative-based, basic chatbot-supportedFootnote 2 educational interface can lead to significant learning and engagement gains over a no-narrative, no-chatbot variant (Ruan et al., 2020), but doing so well is subtle. Here the effective chat tutoring system actually used humans acting as chatbots, in a Wizard-of-Oz-style study. In contrast, a different narrative-based system with standard step-by-step hints (which are common in intelligent tutoring systems) provided no benefit over the no-narrative, no-hint control condition (Ruan et al., 2020).

RL has the potential to be particularly helpful in such situations where personalization may be key. In this work, we used a previously developed informal online learning environment that teaches students about the concept of volume (Ruan et al., 2020).Footnote 3 Learning tasks in this system are embedded in a narrative storyline. In response to student input, a companion AI tutor selects among four common pedagogical strategies: direct hints, generic encouragement, guided prompts that scaffold the student (e.g., "Have you heard of a unit cube?"), or passive positive acknowledgment (an emoticon smiley face). Figure 1 shows a screenshot of the software used.

Fig. 1

Tutoring AI Guide Interface: A child solves a math problem while interacting with the AI-driven tutoring guide. The child can click on the “helpful?” button if they consider the AI tutor’s response to be helpful. The child can also click on “I want to stop playing” to quit the activity at any time

4 Approach

4.1 Feature space

Due to past success with RL systems for adult learning (Chi et al., 2011; Bassen et al., 2020), we use a small set of features, specifically an eight-dimensional state space, described in detail below:

  • Grade: The elementary school grade a child is in, ranging from 3–5.

  • Pre-score: The score a child receives for the pre-test, ranging from 0–8.

  • Step: The step of the task a child is in, ranging from 1–6. (This is automatically defined by the task interface).

  • Failed attempts: The number of failed attempts made by the child in the current step. It is a non-negative integer. There is a single correct answer to each step.

  • NLP positive score: A score that reflects the positive sentiment in the last message typed by the child. It is a float ranging from 0–1, computed with an automatic sentiment analysis tool from NLTK (Bird et al., 2009).

  • NLP negative score: A score that reflects the negative sentiment in the last message typed by the child. It is a float ranging from 0–1, computed with the same NLTK sentiment analysis tool.

  • NLP help score: A score that reflects the extent to which the child asks for help in the message sent. It is a float ranging from 0–1 and calculated as the semantic similarity between the child’s message and “help”.

  • Anxiety score: The score of the math anxiety test (Carey et al., 2017) that the child takes prior to beginning the activity.

The observation vectors were normalized element-wise before being used for training and prediction. Grade, pre-score, and anxiety score are static variables. Other variables are affected by the actions the policy takes and change as the child is solving each step of the task.
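To make the state construction concrete, the following is a minimal sketch (not the production code) of how such an eight-dimensional, element-wise normalized observation could be assembled; NLTK's VADER sentiment analyzer is a plausible choice for the sentiment scores, while `help_similarity` and the normalization ranges (e.g., a cap on failed attempts) are illustrative assumptions.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # requires nltk.download("vader_lexicon")

# Nominal ranges used for element-wise normalization (the failed-attempt cap is assumed).
RANGES = {
    "grade": (3, 5), "pre_score": (0, 8), "step": (1, 6), "failed_attempts": (0, 10),
    "nlp_pos": (0, 1), "nlp_neg": (0, 1), "nlp_help": (0, 1), "anxiety": (9, 45),
}

_sia = SentimentIntensityAnalyzer()

def help_similarity(message: str) -> float:
    """Hypothetical stand-in for the semantic similarity between the message and 'help'."""
    return 1.0 if "help" in message.lower() else 0.0

def build_state(grade, pre_score, step, failed_attempts, message, anxiety):
    """Return the 8-dimensional normalized observation for the RL policy."""
    sentiment = _sia.polarity_scores(message)  # returns 'pos', 'neg', 'neu', 'compound'
    raw = {
        "grade": grade, "pre_score": pre_score, "step": step,
        "failed_attempts": failed_attempts,
        "nlp_pos": sentiment["pos"], "nlp_neg": sentiment["neg"],
        "nlp_help": help_similarity(message), "anxiety": anxiety,
    }
    # Normalize each feature to [0, 1] using its nominal range, clamping overflows.
    return [min(1.0, max(0.0, (raw[k] - lo) / (hi - lo))) for k, (lo, hi) in RANGES.items()]
```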

4.2 RL policy learning

4.2.1 The simulation phase

RL algorithms were run on a simulator before any real-world experiments to get an initial estimate of performance and to test the algorithm’s potential. The simulator models children with various characteristics and their interactions with the math problem and the interactive teaching support (the actions selected by the RL policy). Note that this simulator used simple, hand-designed models of student learning and was not intended as a high-fidelity replica of student learning: rather, we used it to explore how quickly a reinforcement learning agent might learn an effective policy in such an environment (since we expected our later experiment to involve hundreds, not millions, of students), and to tune the hyperparameters of our setup.

These early simulations informed our choice of a small function model for use in our later experiments. For example, we explored multiple policy architectures and converged on two hidden layers, since in our simulations the parameters of such a small instructional model could be learned within a couple of hundred simulated students. In this way, these experiments served as a rough sanity check prior to the first experiment. The code for these simulations is available here: https://github.com/StanfordAI4HI/SmartPrimer_Gym.

4.2.2 Online learning phase

Throughout the math-learning activity, children have access to an AI guide on a side panel that provides encouragement, hints, and companionship. The goal is for the AI guide to provide additional engagement with the math activity and provide adaptive support that facilitates learning gains. The AI guide takes on the persona of the monster that children select in the fantasy-based narrative. Before entering the math learning activity, children are brought through a short tutorial in which they communicate with the AI guide, which introduces itself and asks about the children. This tutorial serves to familiarize the children with the AI guide interface and build social rapport between the AI guide and the children. We provide a workflow in Fig. 2.

The RL decision policy takes in a vector describing features of the learner state and outputs a particular support type (of the 4 options) to provide. The RL algorithm aims to learn an automated decision policy to maximize the expected reward function, which should capture the key desired outcomes. We specify the reward when a student j finishes as:

$$\begin{aligned} R_j = \sum _{i=1}^{8} \left[ \max (0,\, post_{ij} - pre_{ij}) \right] - \lambda \, n_{hj} + \beta \, n_{uj} + \mathbbm {1}(quit_j), \end{aligned}$$

where the first term is the sum over items of the j-th student’s clipped learning gain from pretest to post-test on item i of the assessment, the second term is a small penalty on the number of hints \(n_{hj}\) given by the system to the student (since too many hints may reduce learning), the third term provides a small bonus for the number of times \(n_{uj}\) that child j marked an AI guide reply as helpful,Footnote 4 and the last term \(\mathbbm {1}(quit_j)=-8\) is a penalty if the learner quits before completing the task. Note that we chose to use a clipped learning gain (pair-wise increases from pretest to post-test per problem). Problems were matched across the two tests to be similar, with different specific numerical quantities (i.e., problem 3 on test 1 was similar to problem 3 on test 2). We expected it to be highly unlikely for the policy and practice with the math software to cause negative learning gains, and clipping the signal at 0 means that if a student answered a problem correctly initially but not on the post-test (which could occur for many reasons, including a student not focusing on the post-test), this did not impact the resulting reward signal. Bassen et al. (2020) previously reported that using such a clipped signal improved the stability and efficiency of reinforcement learning in their learning task. We set the hyperparameters to \(\lambda =0.013\) and \(\beta =0.1\).Footnote 5
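As a concrete illustration, a minimal sketch of this reward computation is given below; the function and argument names are ours, but the constants follow the values stated above.

```python
LAMBDA, BETA, QUIT_PENALTY = 0.013, 0.1, -8.0

def reward(pre_items, post_items, n_hints, n_helpful, quit_early):
    """Reward for one student, following the formula above.

    pre_items, post_items: per-item scores on the 8-item pre- and post-test.
    n_hints: hints given by the system; n_helpful: replies marked helpful.
    quit_early: True if the learner quit before completing the task.
    """
    clipped_gain = sum(max(0.0, post - pre) for pre, post in zip(pre_items, post_items))
    r = clipped_gain - LAMBDA * n_hints + BETA * n_helpful
    if quit_early:
        r += QUIT_PENALTY
    return r
```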

The proximal policy optimization (PPO) algorithm (Schulman et al., 2017) was used to learn the decision policy that optimizes the expected reward.Footnote 6 The policy is stochastic. The PPO clipping hyper-parameter used in the online study was \(\epsilon =0.2\). Both the policy neural network and the value function neural network had two hidden layers with 16 nodes each and a tanh activation function. We used an Adam optimizer with a learning rate of 0.0025 for both. The RL policy was implemented with the RLGraph package (Schaarschmidt et al., 2019). This optimization method was chosen as it has shown potential in similar settings, for example in Bassen et al. (2020).
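For readers less familiar with PPO, the sketch below illustrates the policy and value networks with the architecture described above and PPO's clipped surrogate loss; it is written in PyTorch for clarity and is not the RLGraph implementation actually deployed.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim):
    # Two hidden layers of 16 units with tanh activations, as described above.
    return nn.Sequential(
        nn.Linear(in_dim, 16), nn.Tanh(),
        nn.Linear(16, 16), nn.Tanh(),
        nn.Linear(16, out_dim),
    )

policy_net = mlp(8, 4)  # 8 state features -> logits over the 4 support types
value_net = mlp(8, 1)   # state-value estimate used for advantages
policy_opt = torch.optim.Adam(policy_net.parameters(), lr=0.0025)
value_opt = torch.optim.Adam(value_net.parameters(), lr=0.0025)

def ppo_policy_loss(states, actions, advantages, old_log_probs, eps=0.2):
    """Clipped surrogate objective of PPO (Schulman et al., 2017)."""
    dist = torch.distributions.Categorical(logits=policy_net(states))
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```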

Fig. 2

The interaction between user and RL AI guide. The RL AI guide selects one of four actions and replies to the user once a message is received. The reward function is updated both during the interaction and after the child completes the post-quiz. Rewards 1–4 correspond to the reward functions described in Sect. 4.2 (Reward Function). The RL AI guide performs an update after every five children

4.2.3 Offline reinforcement learning

We also performed offline reinforcement learning to extract another policy for use in a subsequent experiment. We did this for multiple reasons. First, as described later, during online reinforcement learning, the policy had not yet converged by the end of study 1, and we wanted to compare a static learned policy to a control, where the differences might be clearer. Second, we were curious whether we might extract a higher-performing decision policy using offline learning. Third, in most experimental sciences, research is hoped to provide findings that generalize beyond the specific research setting. Such generalizability is also of key interest in machine learning. Therefore an important open issue is whether automated pedagogical strategies obtained using reinforcement learning in one setting will transfer to similar settings.

We used offline reinforcement learning policy evaluation to select among potential new automated instructional policies using the data gathered during online reinforcement learning (in study 1, as we will shortly describe). We considered two sets of algorithms for training potential instructional policies. The first is behavior cloning (Pomerleau, 1990; Sammut et al., 1992), a popular method for leveraging offline data to train an automated policy. Behavior cloning trains a model to imitate the probability distribution over actions output by the online policy.

Recall that during our online RL experiment, PPO was used to update the deployed RL policy at regular intervals. This meant that only a few students experienced the same identical policy in the RL condition. Behavior cloning can therefore be used to output a single RL policy that essentially distills an aggregate policy over the entire online RL experiment. Intuitively, though PPO does not have cumulative regret guarantees in our setting, our procedure is similar at a high level to theoretical results showing that an algorithm achieving a particular cumulative regret can be used to output a single decision policy with small simple regret, by constructing a new decision policy that is an average over all the policies deployed by the algorithm up to a certain point. More precisely, behavior cloning minimizes the following loss:

$$\begin{aligned} {\mathcal {L}}_{\text {BC}}(\theta , {\mathcal {D}}) = \mathbb {E}_{(s, a, s') \sim {\mathcal {D}}}[D_{\text {KL}}(\pi _\theta (s) || p(a|s))] \end{aligned}$$

which, in our setting, will create a single stochastic policy. Note that this policy may differ from any of the decision policies deployed during online RL.
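A minimal sketch of this behavior cloning objective, assuming the logged action distributions of the deployed policies are available at each visited state, is shown below (PyTorch; variable names are illustrative).

```python
import torch

def bc_loss(policy_logits, behavior_probs):
    """Mean D_KL(pi_theta(s) || p(a|s)) over the logged states.

    policy_logits:  [N, 4] logits of the cloned policy at the logged states.
    behavior_probs: [N, 4] action probabilities of the deployed (behavior) policy.
    """
    log_pi = torch.log_softmax(policy_logits, dim=-1)
    pi = log_pi.exp()
    kl = (pi * (log_pi - torch.log(behavior_probs + 1e-8))).sum(dim=-1)
    return kl.mean()
```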

The second style of algorithm we explored was offline policy gradient on the estimated performance of the trained instructional policy. This method has been used in several other offline RL optimization papers (see e.g. Metelli et al. (2018); Liu et al. (2020)). Here we used a weighted importance sampling (WIS) estimator to estimate the value of the policy,

$$\begin{aligned} {\mathcal {L}}_{\text {WIS}}(\theta , {\mathcal {D}})&= \frac{1}{\sum _{i=1}^{|{\mathcal {D}}|} \big ( \prod _{t=1}^L \frac{\pi _\theta (a_t|s_t)}{p(a_t | s_t)} \big )} \sum _{i=1}^{|{\mathcal {D}}|} \Big ( \prod _{t=1}^L \frac{\pi _\theta (a_t|s_t)}{p(a_t | s_t)} \Big ) R_i \nonumber \\&+ \quad \eta \cdot \frac{1}{\sum _{i=1}^{|{\mathcal {D}}|} \big ( \prod _{t=1}^L \frac{\pi _\theta (a_t|s_t)}{p(a_t | s_t)} \big )^2} \end{aligned}$$
(1)

where \(R_i\) is the total reward for student i. This is called policy optimization via importance sampling (POIS). We also explored whether adding an effective sample size (ESS) penalty with hyperparameter \(\eta\) would help; the ESS term regularizes the difference between the learned policy \(\pi _\theta\) and the behavior policy p.
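A minimal sketch of Eq. (1), assuming per-step log importance ratios are available for each logged student trajectory, is shown below; it returns the objective that POIS maximizes by gradient ascent.

```python
import torch

def wis_objective(log_ratios, returns, eta=0.0):
    """Weighted importance sampling objective of Eq. (1).

    log_ratios: [N, L] per-step log(pi_theta(a_t|s_t) / p(a_t|s_t)) for N students.
    returns:    [N] total reward R_i per student.
    eta:        weight on the effective-sample-size regularization term.
    """
    w = torch.exp(log_ratios.sum(dim=1))        # per-trajectory importance weights
    wis_value = (w * returns).sum() / w.sum()   # weighted IS estimate of policy value
    ess_term = eta / w.pow(2).sum()             # regularizer, as in the second term of Eq. (1)
    return wis_value + ess_term
```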

We considered multiple hyperparameters for each of the two algorithmic procedures (see Table 1), yielding 108 hyperparameter combinations for learning our policy. We used an algorithm evaluation procedure in which we partitioned the collected dataset into a train and a validation set by randomly allocating 50% of students into one group and the rest into the other, and repeated this split 10 times. We used these splits to choose the best model architecture, hyperparameters, and learning objective, similar to what has been proposed in Nie et al. (2022). We trained each candidate on the training split and used weighted importance sampling (WIS) to evaluate the performance of the resulting policy on the validation set. We applied the same learning procedure across all 10 splits and computed the average performance. We chose the algorithm with the highest average performance on the validation set and then applied it to train a policy on the entire dataset.
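The selection procedure can be summarized by the following sketch, where `train_fn` and `wis_evaluate` are hypothetical helpers wrapping the training algorithms above and the WIS estimator.

```python
import random

def select_and_train(students, candidate_configs, train_fn, wis_evaluate, n_splits=10, seed=0):
    """Pick the configuration with the highest mean WIS validation score, then retrain on all data."""
    rng = random.Random(seed)
    scores = {i: [] for i in range(len(candidate_configs))}
    for _ in range(n_splits):
        shuffled = students[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        train, valid = shuffled[:half], shuffled[half:]
        for i, cfg in enumerate(candidate_configs):
            policy = train_fn(cfg, train)                   # train on the training split
            scores[i].append(wis_evaluate(policy, valid))   # off-policy evaluation on the validation split
    best = max(scores, key=lambda i: sum(scores[i]) / len(scores[i]))
    return train_fn(candidate_configs[best], students)      # retrain the winner on the full dataset
```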

Table 1 Hyperparameters considered during offline batch reinforcement learning. The policy network dimension describes the network structure: e.g. [4] is one layer of 4 hidden nodes, [16,16] is two hidden layers, each with 16 nodes

In our evaluation, the behavior cloned policy was estimated to outperform the online policy in the majority of splits. Also, a small 1-layer fully connected neural network with 4-dimensional hidden state and Gaussian error linear unit (Hendrycks & Gimpel, 2016) activation function outperformed other model architectures.

Therefore we used the distilled, behavior cloned policy in our second experiment.

5 Experimental setups

As a control condition, the interface included the mathematics task but had no narration and no adaptive support; similar to a mastery-style approach, students had to successfully complete one subpart before advancing.

While this may initially seem like a weak control condition, a past study on teaching an elementary school mathematics task (Ruan et al., 2020) found that a similar control condition performed comparably to a condition with a narrative storyline, and slightly better than a condition with a narrative storyline and step-wise hints (which are common in tutoring software).

In study 1, we examined the speed and effectiveness of using reinforcement learning to adapt the type of AI guide feedback given to learners. Due to covid-19 pandemic restrictions, all experiments were completed online. Subjects were randomly assigned to each condition, but with an unequal allocation: more students were assigned to the RL condition than to the control condition. In total, 269 elementary school students used the reinforcement learning-narrative educational software (RL condition), and 70 students were in the control condition.

Subjects completed an 8-item assessment and a math anxiety survey (Carey et al., 2017), then used the volume education software, and then completed another assessment (identical up to numerical values, and cross-randomized across students) and Giggle Gauge, an engagement measure designed for studies with children (Dietz et al., 2020). More specifically, Giggle Gauge is a seven-item self-report measure of engagement designed to be developmentally appropriate for children.

In study 2 we were interested in whether the distilled, behavior cloned policy learned from the online RL process (Sect. 4.2) would transfer to a new population of subjects. We conducted study 2 with a new set of subjects (37 participants used for analysis): subjects were randomized either into the same control condition as study 1 or into a condition using the single distilled RL policy.

In study 2, we recruited a broader population, more similar to that of the U.S.A. overall. For the original study, 113 participants out of 203 provided home zip codes; for the follow-up study, 16 participants out of 30 provided home zip codes. For those who did not provide a home zip code, we used their school zip code. Using these zip codes, we obtained the median housing price and mean annual household income from the fifth American Community Survey (in 2020), accessible through an API provided by the United States Census Bureau. Figure 3 shows the difference between the student groups in study 1 and study 2. We conducted two-sided Kolmogorov-Smirnov tests between the student populations of the two studies on these variables. For both mean annual household income (\(Pr(F(x)=G(x)) = 0.02 < 0.05\)) and median housing price (\(Pr(F(x)=G(x)) = 0.0005 < 0.01\)), we found a significant difference between the two populations. In addition, subjects were more geographically and racially diverse (see Appendix). Moreover, study 1 was conducted when many more U.S.A. children attended school remotely. Thus, study 2 offers a chance to examine the generalizability of learned RL policies.
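For reference, these population comparisons can be reproduced with standard tools; a minimal sketch (with hypothetical array names for the per-zip-code covariates) is shown below.

```python
from scipy.stats import ks_2samp

# income_study1, income_study2, housing_study1, housing_study2 are hypothetical arrays
# of mean annual household income and median housing price per subject's zip code.
income_stat, income_p = ks_2samp(income_study1, income_study2)    # two-sided by default
housing_stat, housing_p = ks_2samp(housing_study1, housing_study2)
print(f"income: p={income_p:.4f}, housing: p={housing_p:.4f}")
```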

Table 2 Mean (std. dev) results of children in both studies
Fig. 3

Distribution of household income (left) and median housing price (right) in the zip codes provided by subjects in study 1 and study 2. There were significant differences between the subject pools of the two studies

6 Results

Aggregate summaries are shown in Table 2. Some subjects completed the pretest or post-test twice due to a limitation in the system; we excluded these subjects from the results presented. There was no significant difference in the amount of improvement (post-test minus pretest score) between the RL narrative condition and the control condition (study 1: Wilcoxon rank test \(W = 9632.5\), \(p = 0.2\); study 2: Wilcoxon rank test \(W = 185.5\), \(p = 0.281\)).
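A minimal sketch of this comparison is shown below, using SciPy's two-sided Mann-Whitney U (Wilcoxon rank-sum) test as a stand-in for the rank test reported; `pre_rl`, `post_rl`, `pre_ctrl`, and `post_ctrl` are hypothetical per-student score arrays.

```python
import numpy as np
from scipy.stats import mannwhitneyu

improvement_rl = np.asarray(post_rl) - np.asarray(pre_rl)
improvement_ctrl = np.asarray(post_ctrl) - np.asarray(pre_ctrl)
stat, p = mannwhitneyu(improvement_rl, improvement_ctrl, alternative="two-sided")
print(f"W = {stat:.1f}, p = {p:.3f}")
```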

However, encouragingly, in both studies there was a trend for subjects with a low initial pretest score (0-2) to have a much larger improvement between the pretest and post-test in the RL narrative condition (Fig. 4, top row). The average improvement for these students was 2.02 in study 1 (N=41) and 2.29 in study 2 (N=7), out of a total score range of 0-8. There was a significant difference in the change in scores between the RL condition and the control condition in study 2 for those with low pretest scores (0-2) (Wilcoxon rank test \(W=2, p=0.013\)), though this difference does not persist after correcting for multiple-hypothesis testing, and all other differences across studies and pretest groups were not statistically significant under the same test.

Engagement scores range from 1 to 4, and subjects with low initial pretest scores (0-2) also trended toward much higher engagement in the RL AI guide condition (study 1 mean engagement score 3.29, N=40; study 2 mean engagement score 3.28, N=7) than in the control condition (study 1 mean engagement score 2.7, N=14; study 2 mean engagement score 2.7, N=5). Prior work suggests interpreting scores below 3.0 as low engagement and 3.0-3.6 as moderate engagement (Dietz et al., 2020).

Fig. 4

Top row: post-test minus pretest (y-axis). Bottom row: normalized learning gain (NLG) \(\frac{Post test - Pretest}{MaxScore - Pretest}\) (y-axis). Scores are clustered by subjects with low (0-2), medium (3-5), and high (6-8) initial pretest scores. Error bars show standard errors. Note the NLG (bottom row) calculations exclude students who scored 100% on the pretest, since the NLG is not well defined

The assessment used may be subject to ceiling effects, as a number of students received the maximum score (8) on either the pretest or the post-test. Though the pretest scores did not significantly differ between the two conditions in either study, the control condition’s pretest scores were slightly higher, so ceiling effects may have impacted the control condition more.

To address this, we also repeated our analysis using normalized learning gains (NLG), \(\frac{Post test - Pretest}{Maximum\ score - Pretest}\), which represent the fraction of improvement made by subjects relative to the possible improvement. Note this excludes any subjects who scored the maximum on the pretest, since the NLG is not well defined for such students. There was no significant difference between the RL narrative condition and the control condition for NLG in either study (study 1: W = 4394.5, p = 0.6978; study 2: W = 104.5, p = 0.3819).
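A minimal sketch of the NLG computation, which also excludes students at the pretest ceiling, is shown below.

```python
import numpy as np

def normalized_learning_gain(pre, post, max_score=8):
    """NLG = (post - pre) / (max_score - pre); undefined when pre == max_score."""
    pre, post = np.asarray(pre, dtype=float), np.asarray(post, dtype=float)
    mask = pre < max_score  # exclude students who scored the maximum on the pretest
    return (post[mask] - pre[mask]) / (max_score - pre[mask])
```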

As with post-test minus pretest, we observe larger normalized learning gains for the RL narrative condition than the control condition for initially lower-performing students, in both studies (Fig. 4, bottom row). The NLG performance for students with medium pretest scores is similar in both conditions, as was also seen for such subjects’ post-test minus pretest scores. The pattern for the highest-performing students is slightly different than for the post-test minus pretest scores but should be taken lightly: as stated, the NLG analysis ignores all students with maximum pretest scores. Note that an NLG of 75% for the initially high-performing student group corresponds to at most a \(2 \times 0.75=1.5\) post-test minus pretest improvement (since 2 is the largest possible gain, achieved if the student scored 6 on the pretest, and it is lower if the student scored 7), whereas a 30% improvement for the initially low-performing student group is at least a gain of \(6 \times 0.3=1.8\) on the post-test minus pretest (since \(MaxScore - Pretest \ge 6\) for such subjects).

Together these analyses encouragingly suggest that the RL narrative condition trends to provide a bigger benefit to initially lower-performing students than the control condition. We now provide some additional analyses into the RL process and the potential mechanisms underlying this difference.

6.1 RL online learning

In study 1, the RL agent updated the AI guide pedagogical policy as subjects participated, but across the 28 policy updates (one after every 10 subjects), we observed significant variability, and the performance had not converged.

We hypothesize this may be due to several factors. Likely most importantly, we saw significant variation in the pretest scores of subjects over time. This may be in part because we performed rolling recruitment, adding additional recruitment sources during the study, which likely caused some shift in the distribution of the underlying students. In addition, the natural variation across third- to fifth-graders and student background skills means that across small sets (such as the 10 trajectories used each round for PPO), it is quite possible to have a substantial difference in the pretest scores of those subjects. If any of the students are already at or near the ceiling of the pretest scores, there will be almost no room for improvement for the RL policy. Indeed there may be some natural regression to the mean, which means that an RL policy that looked promising in prior rounds for related states may now look worse (depending on the particular generalization). Even without this potentially shifting population, ten trajectories (subjects) is a small number to average over when performing policy updates, so the gradient may be quite noisy. This suggests that performing stratification and trying to ensure a stable distribution of initial start states over participants might lead to faster convergence and better results.

However, despite this, throughout training, subjects in the AI guide condition consistently matched or exceeded the average performance of those in the control condition.

6.2 Investigating other explanations for the benefit to low pretest subjects

A natural question is what is the mechanism behind the improved performance of subjects in the RL narrative condition over those in the control condition, for subjects with initially low pretest scores, and whether this could be due to factors beyond the RL-narration itself.

One potential hypothesis is that there were additional differences between the two conditions. Indeed, on average, subjects spent longer on the task in the RL narrative condition than in the control condition. As Fig. 5 shows,Footnote 7 this was consistent for students across all three groups of pretest performance, and the difference in time spent between the two conditions was largely similar for all three groups. However, only the students in the low pretest group seemed to receive a significant benefit from the RL condition. It therefore seems unlikely that time on task is the primary reason for improved performance in the RL narrative condition.

Fig. 5

Time on task (sec) (y-axis) by low (0-2), medium (3-5), and high (6-8) initial pretest scores. Error bars show standard errors. Students whose time on task exceeded 90 min (8 students) were excluded from the analysis since it was likely such students might have taken significant breaks

The study was conducted remotely, and a prescreening call was done with a guardian of each participating child to discuss the study, emphasize that the child should do the task without assistance, and verify that the child would be participating. However, it is still possible that guardians helped the children in some cases. It seems unlikely that, for children with low pretest scores, guardians helped more if the child was in the RL condition than if they were in the control condition. Indeed, the control condition offered less support and fewer hints than the RL narrative condition, so the opposite seems more likely to be true. One potential exception is that the RL narrative condition involved a storyline, and, while unlikely, depending on the subject’s reading skills, it is possible that a guardian would have helped the subject understand the text.

An interesting piece of evidence that it was the combination of the narrative and the RL text interaction that led to student gains is that students with low and medium pretest scores interacted more (sent more messages to the AI tutor) than students with high pretest scores in study 1. In particular, in study 1, the maximum number of messages sent by a student was 20, with a long tail. The median number of messages sent in both the low and medium pretest score groups in study 1 was 4, while the high pretest score group had a median of 2 messages. We conducted three two-sided two-sample Wilcoxon rank tests on the number of messages sent by students in the RL condition (between the low and medium pretest groups, the low and high pretest groups, and the medium and high pretest groups). There was no significant difference between the low and medium groups (W=1441.5, p=0.92), and there were significant differences between the low and high groups (W=1348.5, p=0.0015 < 0.0167 = 0.05/3, correcting for the 3 tests done here) and the medium and high groups (W=3167.5, p=\(8.16 \times 10^{-6}\) < 0.0167 = 0.05/3, correcting for the 3 tests done here). This helps explain why the high pretest score students may not have benefited as much from the system: they likely did not need as much support and did not interact much with the RL text-based agent. However, this analysis provides only part of the insight into a potential mechanism, since the patterns of messages sent by the low and medium pretest groups were similar, and yet the performance gains (over the control condition) were larger for the low pretest students.
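The three pairwise comparisons with a Bonferroni-corrected threshold can be sketched as follows (again using SciPy's rank-sum test as a stand-in, with hypothetical per-group message-count arrays).

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

groups = {"low": msgs_low, "medium": msgs_medium, "high": msgs_high}  # messages per student
alpha = 0.05 / 3  # Bonferroni correction for the three pairwise tests
for (name_a, a), (name_b, b) in combinations(groups.items(), 2):
    stat, p = mannwhitneyu(a, b, alternative="two-sided")
    print(f"{name_a} vs {name_b}: W={stat:.1f}, p={p:.2g}, significant={p < alpha}")
```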

6.3 Integrated gradient analysis of policy on feature space

A natural question is whether the benefits to subjects with low pretest scores may derive from the personalization capacity of the RL instructional policy. Indeed, a key benefit of using RL to select activities is its potential to differentiate instruction if doing so is estimated to improve outcomes. Therefore it is of interest to evaluate what differentiation, if any, is done by the RL AI guide policy. However, most popular RL algorithms, including PPO, which we use here, use complex function approximators that are hard to interpret. Therefore we use a method from explainable machine learning, integrated gradients (Sundararajan et al., 2017), to decompose the multi-decision output of the RL policy used in study 2 into a linear additive sum of attributions for each input context feature.
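A minimal sketch of the integrated gradients computation for the distilled policy is shown below; `policy_net` is assumed to map an 8-dimensional normalized state to logits over the 4 actions, and an all-zeros baseline is used purely for illustration.

```python
import torch

def integrated_gradients(policy_net, state, action, baseline=None, steps=50):
    """Attribute the logit of `action` to each state feature (Sundararajan et al., 2017)."""
    if baseline is None:
        baseline = torch.zeros_like(state)
    # Interpolate between the baseline and the actual state.
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)
    points = (baseline + alphas * (state - baseline)).requires_grad_(True)
    logits = policy_net(points)[:, action]
    grads = torch.autograd.grad(logits.sum(), points)[0]
    # Riemann approximation of the path integral of gradients.
    return (state - baseline) * grads.mean(dim=0)
```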

Table 3 shows the feature importances computed for the policy selected from offline RL and deployed in the RL condition. Recall there are three primary categories of features used to select pedagogical strategies: static features of the learner, features about the stage of the learning activity, and features about the learner’s interaction and performance during learning.

Table 3 Feature importance calculated by the integrated gradient method. Numbers represent how, on average, each feature (at its original value) contributes positively or negatively to the RL policy’s probability of choosing an action for the current student

This analysis identified the student’s pretest score and math anxiety score as the most influential contextual features for the AI guide’s chosen response. Other student features had little to no effect. Figure 6 shows the probability of assigning actions to students under our distilled policy.

Fig. 6

The y-axis shows the probability of choosing the first action for each group of subjects, based on their pretest scores (Bottom (0-2), Top (6-8)) and math anxiety level (Low (9-13), corresponding to the bottom 25th percentile, and High (22-45), corresponding to the top 25th percentile). Error bars show 95% CIs

Students with higher pretest scores were more likely to receive direct hints: such students may require less of the productive struggle needed to learn new mathematics. Students with lower pretest scores may need more engaged practice, but those with high math anxiety may also perceive math as more effortful (Choe et al., 2019). Increasing the use of guided prompts may help support such students, as we observe in the policy instructional selections for low-performing, higher math anxiety students. These observed interactions between the multiple features describing student and context, and pedagogy choices, could inform expert analysis and support future hypothesis generation for learning sciences.

7 Discussion

Our work offers cautious optimism about the potential role of reinforcement learning in optimizing pedagogical instructional policies. The personalized narrative AI guide may benefit students with the lowest pretest performance, without harming the performance of other learners. Indeed, the average gain in scores for subjects with low (0-2) pretest scores was over 2 in both studies in the RL condition, which means the mean scores for such students at least doubled, on an assessment with 8 total points. Our results do not provide a definitive mechanism for this result, though the engagement scores suggest that the control condition was not engaging for subjects with low pretest scores. For such students, the RL narrative AI guide condition yielded higher engagement, similar to that of students with higher pretest scores. This is likely due to the RL AI guide, not the narrative, since prior work found that narrative alone, with hints, yielded no benefit over no narrative and no AI guide in a volume learning task (Ruan et al., 2020).

Our encouraging result is consistent with limited prior work suggesting that personalized computer-assisted learning software may sometimes be similarly or only slightly more effective on average, but may particularly benefit students who start with lower scores or take longer to complete problems (e.g. Shen et al. (2016); de Barros and Ganimian (2021)). Since the RL algorithm we used aims to maximize expected (test) outcomes, if differentiation within the available pedagogical supports can increase the outcomes of some subgroups (without harming the outcomes of other subgroups), the algorithm should learn from data to provide such personalization. Our analysis did not find a significant benefit of RL over the control condition at the population level, though it is possible an effect would be observed with a larger sample size, or with different state feature representations, network architectures, or RL algorithms.

Across studies 1 and 2, the comparison between the narrative RL condition and the control condition appears largely stable (Fig. 4), with a trend for the RL condition benefiting those with low pretest scores. This suggests that an RL decision policy learned on one population can sometimes benefit other populations.

8 Conclusion

Our work was conducted with around 400 students, which is typically fewer than the number of third- to fifth-graders in a single school district, suggesting the feasibility of using this approach to quickly optimize digital learning environments. By combining reinforcement learning with explainable AI, this approach can provide new insights into the interaction of context and student learning that may prompt new research in the learning sciences, and it has high potential to help quickly identify and scale effective learning practices.