
1 Introduction

Intelligent tutoring systems (ITSs) are part of everyday life for millions of students worldwide. ITSs promote accessible learning experiences that can narrow the educational achievement gap [24] and that, in some cases, can be as effective as human tutoring [13]. In their effort to create effective learning systems, ITS designers are confronted with a plethora of design decisions, ranging from specifying general instructional design principles [12] to creating individual learning and practice materials. Designers rely on their domain expertise and consider the effects of different design choices, but in many cases it is difficult to predict which exact choice will benefit students the most [18], and often thousands of design decisions have to be made on a case-by-case basis (e.g., which exact hint is most effective for this specific question). In this context, the promise of data-driven design approaches is that they can leverage system usage data to evaluate the effects of different design choices inside the ITS on student learning and can improve learning outcomes by refining the ITS automatically over time.

This work describes an online tutoring system that embraces a data-driven design approach by using large-scale student data to learn which of several candidate assistance actions to provide to students after they answer a practice question incorrectly. We report results from a study, analysing data from over 190,000 students in a Biology course, that evaluates the impact of individual assistance actions and assistance policies on different measures of learning outcomes. We discuss the rationales behind our methodology and provide insights for the design of future learning systems. The main contributions of this work include:

  • Quantifying effects of assistance. We evaluate the effects of over 7,000 individual assistance actions on a variety of student learning outcome measures (e.g., practice completion). We study the relationships among the different measures and design an assistance policy training algorithm that, for each question, decides on the most suitable training objective to optimize the student’s success at the current question as well as their overall session performance.

  • Offline policy optimization. We compute statistically significant estimates on the effects of multi-armed bandit policies trained to optimize different learning outcome measures. Studying assistance actions selected by these policies, we find that there is no single best assistance type (e.g., hint, vocabulary).

  • Live A/B evaluation. We evaluate the assistance policy trained using our algorithm in comparison to a randomized assistance policy in live use with over 20,000 students. The system’s ability to learn to teach better using data from prior students improves learning outcomes of future students significantly.

2 Related Work

2.1 Evaluating Treatment Effects Inside ITSs

Initially, the effects of ITSs on student learning were evaluated at the system level by comparing a group of students that used the ITS to a control group on a post-test [13]. Later research focused on the effects of individual instructional design choices [12] and conducted experiments with students interacting with different configurations of the same learning system (e.g., [16, 17]). With the ever-increasing popularity of online ITSs, large-scale student log data has become available, which enables investigating the effects of increasingly fine-grained system design choices, down to the choice of individual practice questions and hints.

As part of this development, ASSISTments introduced AXIS [27], the E-TRIALS TestBed [19] and the TeacherASSIST system [20] to allow educators and researchers to create and evaluate the effectiveness of different problem sets and on-demand assistance materials. In the context of massive open online courses (MOOCs), DynamicProblem [28] was introduced as a proof-of-concept system that supports bandit algorithms [14] to collect feedback from students regarding the helpfulness of individual assistance materials. Relatedly, the MOOClet framework [23] allows instructors to specify multiple versions of educational resources and to evaluate them in A/B tests using randomization and bandit algorithms. The UpGrade system [8] was introduced as a flexible A/B testing framework designed for easy integration into various learning systems.

This work describes a fielded online tutoring system at CK12.org that learns to provide effective assistance actions (e.g., choose one of multiple available hints) to support students after they answer practice questions incorrectly. We use offline evaluation techniques [15] to leverage log data capturing over 4,800,000 assistance requests from over 190,000 students in a Biology course. The unprecedented scale of this data enables us to compute statistically significant estimates on the effects of individual assistance actions and assistance policies on different measures of learning outcomes. We further evaluate the effectiveness of the learned assistance policy in live use with over 20,000 students.

2.2 Data-Driven Assistance Policies

Here, we provide a concise overview of related research that uses data-driven techniques to support students during the problem solving process via bandit and reinforcement learning (RL) algorithms. For a comprehensive review on RL in the education domain we refer to a survey by Doroudi et al. [6].

Barnes and Stamper [3] induced a Markov decision process (MDP) based on hundreds of student solution paths and used RL to generate new hints inside a logic ITS. Chi et al. [4] modeled a physics tutor via an MDP with 16 states and learned an RL policy to improve student learning outcomes by deciding whether to ask the student to reflect on a problem or to tell them additional information. Georgila et al. [9] used Least-Squares Policy Iteration to learn a feedback policy for an interpersonal skill training system using data describing over 500 features from 72 participants. Ju et al. [10] identified critical pedagogical decisions based on Q-value and reward function estimates derived from logs of 1,148 students inside a probability ITS. Relatedly, Ausin et al. [1, 2] explored Gaussian Process- and inverse RL-based approaches to address the credit assignment problem inside a logic ITS. A recent series of works [7, 25, 26] used a random policy to collect data from 500 students in an operational command course and explored offline RL techniques to learn adaptive scaffolding policies based on the ICAP framework.

A recent study by Prihar et al. [21] compared a multi-armed bandit algorithm based on Thompson Sampling to a random assistance policy with respect to their ability to increase students’ success on the next question. In a two-month-long experiment with 2,923 questions they found the bandit algorithm to be only slightly more effective than the random policy and argued that this is due to sample-size limitations (on average 6.5 samples per action). In contrast, this work accurately estimates the impact of individual actions on different measures of learning outcomes by leveraging hundreds of samples per action (Table 1). Further, in contrast to Prihar et al. [21], we quantify treatment effects by automatically providing assistance in response to incorrect student responses, which avoids the self-selection effects that arise when assistance is only shown upon student request.

3 CK-12 FlexBook 2.0 System

The CK-12 Foundation is a non-profit organization that provides millions of students worldwide with access to free educational resources. CK-12’s FlexBook 2.0 system is a web-based ITS that offers a large variety of courses for different subjects and grade levels. Each course consists of a sequence of concepts. Each concept has a Lesson section with learning materials and an Adaptive Practice (AP) section where students can develop and test their understanding (Fig. 1).

Fig. 1. Example views from the concept Human Chromosomes. [Left] In the Lesson section the student interacts with multi-modal learning materials. [Right] During Adaptive Practice the student develops and tests their understanding by answering practice questions. In the shown example the system displays a paragraph with an illustration to assist the student before they reattempt the question after an initial incorrect response.

The AP section features item response theory (IRT)-driven question sequencing and tries to select practice questions matching the student’s ability level (Goldilocks principle [12]). After the system selects a question, the student can request a hint before submitting a first response. If the first response is incorrect, the system provides immediate feedback by displaying one assistance action (e.g., a hint or vocabulary) and the student reattempts the question. Afterwards, the system uses the student’s first response to update the student’s ability estimate and selects the next practice question. This process repeats until the student completes the AP session successfully by achieving 10 correct responses or until the question pool is exhausted in which case the student can try again.

This paper centers on the question of how we can employ data-driven techniques to learn an assistance policy that selects the most effective assistance action as feedback for each individual question. We focus on CK-12’s Biology for High School course, which is used by hundreds of thousands of students each year and whose content has been developed and refined for over ten years. The course covers hundreds of concepts and features over 12,000 questions corresponding to five categories: multiple-choice, select-all-that-apply, fill-in-the-blank, short-answer and true-false. The AP system associates each question with a set of potential assistance actions. An exception is true-false questions, which students only attempt once. The average non-true-false question is associated with 4.8 different actions. Each action falls into one of six categories: hint, paragraph (a short text from the lesson), vocabulary (keyword definitions), remove distractor (removes a multiple-choice/select-all-that-apply response option), first letter (shows the first letter of the fill-in-the-blank/short-answer solution) and no assistance (as a baseline).

4 Methodology

4.1 Formal Problem Statement

We denote the set of practice questions inside the system as \(Q = \{q_1, \dots , q_k\}\). Each question \(q \in Q\) is associated with a set of \(n_q\) assistance actions \(A_q = \{a_{q,1}, \dots , a_{q,n_q}\}\) that the system can use to support students after their first incorrect response. In this work, we approach the problem of learning one effective assistance policy for the entire practice system by learning one question-specific multi-armed bandit policy \(\pi _q\) for each practice question \(q \in Q\). During deployment, \(\pi _q\) responds to each assistance query for question q by selecting one assistance action \(a_q \in A_q\) and in return receives a real-valued reward which is assumed to be sampled from an action-specific and time-invariant distribution \(R_{a_q}\) with mean \(\mu _{a_q}\). The optimal question-specific assistance policy \(\pi _q^*\) maximizes the expected reward by always selecting action \(a_q^* = \mathrm {arg\,max}_{a_q \in A_q} \mu _{a_q}\).
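To make the bandit formulation concrete, the following minimal sketch derives a greedy question-specific policy from logged rewards by selecting the action with the highest empirical mean; the action names and reward values are purely illustrative and not taken from the system.

```python
import numpy as np

def greedy_action(rewards_by_action: dict) -> str:
    """Return a_q* = argmax over a_q in A_q of the empirical mean reward.

    rewards_by_action maps each assistance action of one question q to the
    list of rewards observed when that action was shown.
    """
    return max(rewards_by_action, key=lambda a: np.mean(rewards_by_action[a]))

# Toy example for a single question q with fabricated reward samples.
logged_rewards = {
    "hint_1": [0.58, 0.46, 0.64],
    "paragraph": [0.70, 0.52, 0.66],
    "no_assistance": [0.40, 0.34, 0.46],
}
print(greedy_action(logged_rewards))  # -> "paragraph"
```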

For us, multi-armed bandits are a framework that enables our system to automatically make design decisions by learning from the observed behavior of earlier students. It is difficult for experts to predict the most effective design ahead of time [18] and the bandit framework enables the system to estimate the effects of potential design choices using student data to refine the ITS automatically over time. Section 6 discusses benefits and limitations of our bandit formulation.

4.2 Data Collection

This work focuses on an online high-school Biology course that has been in continuous refinement for over ten years. Because of this, its content base features multiple assistance actions for individual questions. This raises the question of what type of assistance action is most effective for a particular question (e.g., should one provide hints or keyword definitions?). Even if the domain experts decide on a specific type of assistance, it is still unclear what action from the reduced action set is most effective (e.g., which exact hint should one show?).

To address these questions, we conduct an experiment to quantify the impact of individual assistance actions on different measures of learning outcomes. Starting from Aug 23rd, 2022, a randomized assistance policy was deployed. Each time this policy is queried to provide assistance for a question \(q \in Q\), it chooses one action uniformly at random (i.e., each with equal probability) from the set \(A_q\). An overview of the data collected up to Jan 11th, 2023, is provided by Table 1.
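The exploration policy itself amounts to plain uniform sampling over a question's action set; a minimal sketch:

```python
import random

def randomized_assistance_policy(actions_for_question: list):
    """Uniform exploration: every action in A_q is selected with equal probability."""
    return random.choice(actions_for_question)
```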

4.3 Measures of Learning Outcomes

One key question in this work is how to define a reward function that takes as input information about a student practice session and that outputs a reward value that quantifies the degree to which the assistance provided by the system led to successful learning. This reward function is central as it serves as objective during policy training and thus directly affects the experience of future students.

Table 1. Data collection overview. The Overall column shows statistics on the raw data collected for all content in the Biology course. The Offline/Online Evaluation columns show statistics on the data that went into the offline/online evaluation experiments.

The designers of the practice system want to promote growth in student knowledge as well as student engagement. Unfortunately, student knowledge and engagement are both unobservable variables and the system is limited in that it can only access data that describes the student’s observable interactions with the website interface. Because of this, we compiled a list specifying different measures of learning outcomes that can be computed from the observed log data:

  • Reattempt correct: Binary indicator (\(\{0, 1\}\)) of whether the student is correct on the reattempt directly after the assistance action.

  • Student ability: 3-parameter item response theory (IRT)-based ability estimate using all first attempt responses, computed at the end of the session. IRT is a logistic model that explains response correctness by fitting student- (ability) and question-specific (difficulty, discrimination, guessing) parameters [5]; the standard model form is shown after this list.

  • Session success: Binary indicator (\(\{0, 1\}\)) of whether the student achieves 10 correct responses in the practice session.

  • Future correct rate: Proportion of student’s correct responses on first attempts on other practice questions following the assistance action.

  • Next question correct: Binary indicator (\(\{0, 1\}\)) of whether the student is correct on the next question following the assistance action (used in [21]).

  • Future response time: The student’s average response time in seconds on questions following the assistance action (individual question response times are capped at 60 s, the 95th percentile, to mitigate outliers).

  • Student confidence: Three-level indicator (\(\{1, 2, 3\}\)) of the student’s self-reported confidence level at the end of the practice session.
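For reference, the standard three-parameter logistic (3PL) model underlying the student ability measure predicts the probability that a student with ability \(\theta \) answers question q correctly on the first attempt as

$$\begin{aligned} P(\mathrm {correct} \mid \theta ) = c_q + (1 - c_q) \cdot \frac{1}{1 + e^{-a_q(\theta - b_q)}}, \end{aligned}$$

where \(a_q\), \(b_q\) and \(c_q\) denote the question’s discrimination, difficulty and guessing parameters, respectively.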

In the experiments we study the relationships between these individual outcome measures (Sect. 5.1), which leads us to define our final reward function R as

$$\begin{aligned} R(s, q) = 0.4 \cdot \mathrm {reattempt\_correct}(s, q) + 0.6 \cdot \mathrm {student\_ability}(s). \end{aligned}$$
(1)

Here, s represents information about a student’s entire practice session and q indicates the question for which the student received assistance. The reward value is computed as a weighted sum that considers the student’s success at reattempting question q as well as their overall practice session performance.
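As a minimal sketch (field names are illustrative), assuming the reattempt-correctness indicator and the end-of-session ability estimate have already been extracted from the session log, the reward in Eq. 1 can be computed as:

```python
def reward(reattempt_correct: int, student_ability: float,
           w_reattempt: float = 0.4, w_ability: float = 0.6) -> float:
    """Weighted reward R(s, q) from Eq. 1.

    reattempt_correct: 1 if the student answered question q correctly on the
        reattempt directly after the assistance action, else 0.
    student_ability: IRT-based ability estimate for session s.
    """
    return w_reattempt * reattempt_correct + w_ability * student_ability

# Example: correct reattempt and a final ability estimate of 0.3.
print(reward(1, 0.3))  # 0.4 * 1 + 0.6 * 0.3 = 0.58
```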

4.4 Offline Policy Optimization and Evaluation

Preprocessing. Before policy optimization and evaluation we perform the following preprocessing steps: (i) To exclude early dropouts, we only consider practice sessions in which students respond to at least five different questions. (ii) To avoid memorization effects, we only consider each student’s first practice attempt for each concept. (iii) To avoid confounding, we estimate the effects of individual assistance actions using only practice sessions in which the student did not request a hint before their first attempt. (iv) To achieve high confidence in our effect estimates, we focus on practice questions with at least 100 samples per assistance action. As a result, we consider a set of 1,336 unique questions from 166 concepts associated with 7,707 assistance actions and draw from over 3,200,000 assistance queries occurring in over 1,000,000 different practice sessions (Table 1).
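A pandas sketch of filters (i)-(iv), assuming the assistance-query log is available as a DataFrame with one row per assistance query; all column names are hypothetical:

```python
import pandas as pd

def preprocess(log: pd.DataFrame, min_questions: int = 5,
               min_samples_per_action: int = 100) -> pd.DataFrame:
    """Apply preprocessing steps (i)-(iv) to the assistance-query log.

    Assumed columns: question_id, action_id, num_questions_in_session,
    attempt_index_for_concept, hint_requested_before_first_attempt.
    """
    df = log.copy()
    # (i) drop early-dropout sessions with fewer than five answered questions
    df = df[df["num_questions_in_session"] >= min_questions]
    # (ii) keep only each student's first practice attempt for each concept
    df = df[df["attempt_index_for_concept"] == 1]
    # (iii) drop queries where a hint was requested before the first attempt
    df = df[~df["hint_requested_before_first_attempt"]]
    # (iv) keep questions with at least 100 samples for every assistance action
    counts = df.groupby(["question_id", "action_id"]).size()
    ok = counts.groupby("question_id").min() >= min_samples_per_action
    return df[df["question_id"].isin(ok[ok].index)]
```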

Optimization. To train and evaluate the effects of different assistance policies without conducting repeated live experiments we rely on offline policy optimization [15] and leverage log data collected by the randomized exploration policy. First, we estimate the effectiveness of individual assistance actions by computing the mean value for each learning outcome measure across all relevant practice sessions. From there, our experiments study various multi-armed bandit policies trained to optimize different outcome measures. In preliminary experiments, we found that when using measures with high variance as training objectives (i.e., student ability and session success), the conventional policy optimization approach–that for each question selects the assistance action estimated to be optimal–struggles to reliably identify actions that perform well in the evaluation on separate test data. For the average question we found optimizing policies for reattempt correctness–a measure with focus on a single question and thus lower variance–to be the most effective way to also boost student ability and session success due to its positive correlations to the other measures (Fig. 3 left).

Still, for a sizeable number of questions the conventional approach yielded better policies when directly optimizing for the measure of interest (Sect. 5.2). These tended to be questions with more available data or with larger differences in the effects of individual assistance actions. This motivated the design of a training algorithm that, for each question, automatically decides whether we have sufficient data to optimize the measure of interest (e.g., reward) directly or whether we should fall back to the low-variance reattempt correctness measure. We first use the training data to identify the two actions that optimize the measure of interest and reattempt correctness, respectively. We then conduct a one-sided Welch's t-test to decide whether the former has a significantly larger effect on the measure of interest than the latter and, if not, select the low-variance reattempt correctness measure as the question-specific training objective.
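A sketch of this per-question objective selection, assuming that for each candidate action we have per-query arrays of the measure of interest and of reattempt correctness; the significance threshold shown is illustrative (the actual threshold is tuned per outcome measure, Sect. 4.4):

```python
import numpy as np
from scipy.stats import ttest_ind

def choose_training_objective(measure_by_action: dict, reattempt_by_action: dict,
                              alpha: float = 0.05) -> str:
    """Decide per question whether to optimize the measure of interest directly.

    measure_by_action[a]: per-query values of the measure of interest (e.g., reward)
        observed when action a was shown.
    reattempt_by_action[a]: corresponding binary reattempt-correctness values.
    Returns "measure_of_interest" or "reattempt_correct".
    """
    # Best action according to each candidate objective (on training data).
    a_measure = max(measure_by_action, key=lambda a: np.mean(measure_by_action[a]))
    a_reattempt = max(reattempt_by_action, key=lambda a: np.mean(reattempt_by_action[a]))
    # One-sided Welch t-test: does a_measure beat a_reattempt on the measure of interest?
    _, p = ttest_ind(measure_by_action[a_measure], measure_by_action[a_reattempt],
                     equal_var=False, alternative="greater")
    return "measure_of_interest" if p < alpha else "reattempt_correct"
```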

Evaluation. In the offline evaluation experiments we report mean performance estimates derived from 5-fold cross-validation repeated 20 times. In each fold, 80% of practice sessions are used for policy training and the remaining 20% are used for testing. This process yields a statistically unbiased estimate of the bandit policy’s performance as it simulates a series of interactions with different students inside the system and avoids the overfitting effects of approaches based on sampling with replacement [15]. For the significance test we determine a suitable significance threshold for each individual outcome measure by evaluating \(p \in \{0.01, 0.02, \dots , 0.1\}\) via cross-validation. The final policy used in the live A/B evaluation is trained using data from all practice sessions and optimizes our reward function (Eq. 1).
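For a single question, the offline evaluation loop can be sketched as follows: on each training split the greedy action is identified, and its value is estimated from the held-out queries in which the random logging policy happened to select that action (the data layout is illustrative; sklearn's RepeatedKFold reproduces the 20 times repeated 5-fold setup):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

def offline_value_estimate(actions: np.ndarray, outcomes: np.ndarray,
                           n_splits: int = 5, n_repeats: int = 20,
                           seed: int = 0) -> float:
    """Cross-validated value estimate of the greedy policy for one question.

    actions[i]: assistance action logged by the random policy for query i.
    outcomes[i]: observed outcome measure (e.g., reward) for query i.
    """
    rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    fold_values = []
    for train_idx, test_idx in rkf.split(actions):
        a_tr, o_tr = actions[train_idx], outcomes[train_idx]
        # Pick the action with the highest mean outcome on the training split.
        best = max(set(a_tr), key=lambda a: o_tr[a_tr == a].mean())
        # Unbiased estimate: mean outcome of that action on held-out queries only.
        held_out = outcomes[test_idx][actions[test_idx] == best]
        if len(held_out) > 0:
            fold_values.append(held_out.mean())
    return float(np.mean(fold_values))
```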

5 Results

5.1 Assistance Action Evaluation

We estimate the effects of individual assistance actions on different measures of learning outcomes by leveraging the student log data collected by the randomized assistance policy (Sect. 4.2). One example of the results of this evaluation process is provided by Fig. 2. It shows the question text, the set of available assistance actions and estimates on how each action affects different outcome measures. We can see how the paragraph that provides detailed information leads to the highest reattempt correctness rate. In comparison, hint 1 leads to a lower reattempt correctness, but conveys insights that improve overall session performance as captured by the final student ability score. We can also identify actions that are not helpful. For example, hint 2 and vocabulary both lead to worse outcomes than showing no assistance. Overall, these estimates are very compelling for the content creators, as they allow them to reflect on how the individual resources they designed affect the student experience in different ways.

Fig. 2. Example of assistance action evaluation for one individual question.

To study the relationships between the different learning outcome measures we analyse average within-question correlations across the 1,336 questions (Fig. 3). We focus on within-question correlation instead of total correlation to be more robust towards effects caused by systematic differences between individual questions (e.g., difficulty). We observe that reattempt correctness is most correlated with the IRT-based student ability estimates (\(r = 0.27\)) and that it is mostly uncorrelated with next question correctness (\(r = 0.04\)). This shows that while assistance actions can improve students’ overall session performance, due to differences between individual questions, it is not enough to focus only on the next question. Matching our intuition, ability estimates correlate with session success (\(r = 0.35\)), future correctness (\(r = 0.64\)) and next question correctness rates (\(r = 0.36\)). This is because these measures all consider first attempt response correctness. Student response time has a low positive correlation with student ability (\(r = 0.23\)) and self-reported student confidence shows very low correlations with the other considered measures.
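A sketch of the within-question correlation computation: for each question, the Pearson correlation between two outcome measures is computed over that question's assistance queries, and the per-question coefficients are then averaged (column names are illustrative):

```python
import pandas as pd

def avg_within_question_corr(df: pd.DataFrame, m1: str, m2: str) -> float:
    """Average within-question Pearson correlation between measures m1 and m2.

    df is expected to hold one row per assistance query with columns
    'question_id', m1 and m2.
    """
    per_question = df.groupby("question_id").apply(lambda g: g[m1].corr(g[m2]))
    return float(per_question.mean())  # NaNs (e.g., constant measures) are skipped
```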

Before moving on to training assistance policies, we quantify the degree to which we can differentiate the effects of assistance actions for individual questions based on the available log data via analysis of variance (ANOVA). Compared to the bandit problem, which tries to identify the single most effective action, ANOVA focuses on the simpler question of whether there are statistically significant differences in mean effects between individual actions. At a significance level of 0.05, ANOVA rejects the null hypothesis for reattempt correctness for \(83.2\%\) (\(n = 1,111\)) of questions, for student ability for \(13.3\%\) (\(n = 178\)) and for session success rate for \(9.6\%\) (\(n = 128\)). We can explain this by studying the sample variance and the effect size gaps between the most and least effective assistance action for each outcome measure. By only focusing on the current question, reattempt correctness exhibits, on average across the 1,336 questions, a better ratio between action effect gaps and sample variance (\(\delta = 0.230\), \(\sigma ^2 = 0.229\)) compared to the student ability (\(\delta = 0.302\), \(\sigma ^2 = 3.665\)) and session completion rate (\(\delta = 0.042\), \(\sigma ^2 = 0.085\)) measures which describe overall session performance.
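The per-question ANOVA can be reproduced with a one-way F-test over outcome values grouped by assistance action, e.g. via scipy (a minimal sketch; the input layout is illustrative):

```python
from scipy.stats import f_oneway

def anova_rejects(outcomes_by_action: dict, alpha: float = 0.05) -> bool:
    """One-way ANOVA over one question's assistance actions.

    outcomes_by_action[a]: list of outcome values (e.g., reattempt correctness)
        observed when action a was shown for this question.
    Returns True if the null hypothesis of equal mean action effects is rejected.
    """
    groups = [vals for vals in outcomes_by_action.values() if len(vals) > 1]
    _, p_value = f_oneway(*groups)
    return p_value < alpha
```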

Fig. 3. [Left] Average within-question correlations between individual measures of learning outcomes across 1,336 questions. [Right] Pareto front visualizing the estimated average performance of policies optimized to increase the final student ability estimates (x-axis) and reattempt correctness rate (y-axis) across 178 questions. Each bandit policy is marked with a number that indicates how it weights the two objectives.

Table 2. Offline evaluation of various policies across the 178 questions for which ANOVA indicated significant differences (\(p < 0.05\)) in mean action effects on student ability. The first two rows show no assistance and randomized policies as baselines. The following four rows are bandit policies optimized directly for different outcome measures and the reward function. We report mean values and \(95\%\) confidence intervals.

5.2 Offline Policy Evaluation

While ANOVA finds significant differences in mean action effects on reattempt correctness for most questions, it only detects differences on student ability and session completion for a smaller subset of questions. For our offline policy evaluation process this suggests that it is difficult to reliably identify the optimal assistance actions for the latter two measures even when having access to hundreds of samples per action. Indeed, in preliminary experiments we found that action effect rankings based on training data often deviate from rankings based on separate test data. For the average question we found training assistance policies based on reattempt correctness estimates to be the most effective way to boost all three outcome measures. This is due to its lower variance and the fact that improvements in reattempt correctness are positively correlated with improvements in student ability and session completion rates (Fig. 3 left).

Table 3. Offline evaluation of various policies across 1,336 questions. The first two rows show no assistance and randomized policies as baselines. The following four rows are bandit policies optimized with our algorithm for different learning outcome measures and the reward function. We report mean values and \(95\%\) confidence intervals.
Table 4. Types of assistance actions selected by the multi-armed bandit policy learned using our reward function for all 1,336 questions. The individual columns show how the policy focuses on different types of assistance actions for different types of questions.

Still, for 178 (\(13.3\%\)) of the 1,336 questions ANOVA detected significant differences in action effects on student ability, which is a core measure of interest. To study the relationship between reattempt correctness and student ability for these 178 questions, we train bandit policies for different objectives. Here, analogous to the reward function (Eq. 1), we assign each policy a weight \(w_1 \in \{0, 0.1, \dots , 1.0\}\) and compute its reward values by linearly weighting reattempt correctness with \(w_1\) and student ability with \(1 - w_1\). We visualize the Pareto front defined by the resulting policies (Fig. 3 right) and observe performance estimates that range in reattempt correctness rates from \(60.6\%\) to \(67.2\%\) and in student ability from 0.250 to 0.332. All learned policies outperform the random policy significantly. In collaboration with domain experts we select \(w_1 = 0.4\) for the reward function used to train the assistance policy for live evaluation, as it improves both measures substantially. Table 2 provides detailed performance statistics for policies trained to optimize different outcome measures across the 178 questions.

To train an assistance policy for all 1,336 questions, we designed an algorithm that for each question decides whether we have sufficient data to optimize the measure of interest (e.g., reward) directly or whether we should use the low-variance reattempt correctness measure (Sect. 4.4). Table 3 shows average performance metrics across the 1,336 questions for a policy that always selects the no assistance action, the random policy, and four policies trained using our algorithm to optimize reattempt correctness rates, student ability, successful session completion rates, and the reward function. The algorithm resolves the variance issue and the trained policies enhance the student experience in different ways.

Lastly, we study which types of assistance actions the final policy offers for which types of questions to maximize the reward objective. Table 4 shows, for each question type, the proportion of questions for which the policy finds a certain assistance type to be most effective. We find that the policy utilizes a diverse blend of different assistance types for each type of question and that paragraph actions are selected most frequently overall. Because of this, we compare the effects of a policy that always selects paragraph actions to the trained reward policy in an additional experiment. Across the 1,175 questions with paragraphs, we find that the reward policy outperforms the paragraph policy in all outcome measures (reward: 0.336/0.299, reattempt correctness: \(67.3\%\)/\(61.5\%\), student ability: 0.112/0.089, session success: \(84.0\%\)/\(83.7\%\)). Thus, the data-driven approach benefits from selecting effective teaching actions on a question-by-question basis.

Table 5. Live policy evaluation. We randomly assign student practice sessions to the randomized policy (\(n = 31,527\)) and the learned bandit policy (\(n = 30,937\)) condition, track various outcome measures and report mean values and \(95\%\) confidence intervals.

5.3 Online Policy Evaluation

To evaluate the policy optimized using our training algorithm, we compare its ability to provide students with effective assistance actions to that of the randomized assistance policy. For this, a nine-day A/B evaluation (Apr 5th to Apr 13th, 2023) was conducted in which practice sessions for the 166 studied concepts were randomly assigned to the bandit and the randomized policy condition. During this period we collected log data describing over 62,000 sessions from over 20,000 different students (Table 1). Even though the learned assistance policy implemented only 1,336 question-specific bandit policies, it was able to provide learned actions for 87,721 (\(74.9\%\)) of the 117,180 queries and only needed to default to random action selection in 29,459 (\(25.1\%\)) cases. This is because the majority of incorrect responses occur on a smaller number of questions.

Table 5 reports average performance for the two policies. The trained assistance policy outperforms the randomized policy significantly in all outcome measures, achieving on average a \(9.8\%\) improvement in reattempt correctness rate and a 0.147 higher student ability estimate. The session success rate improvement from \(67.6\%\) to \(71.2\%\) corresponds to an \(11.1\%\) reduction in sessions in which students did not achieve the practice target. We note that in contrast to the offline evaluation (Sect. 5.2), where we estimate effects at the level of individual assistance queries, here we compute metrics at the session level.

6 Discussion

The results show how the offline evaluation approach can leverage large-scale student log data to quantify the impact of individual assistance actions (e.g., hints and keyword definitions) for each question on different measures of student learning outcomes (e.g., reattempt correctness, practice completion). This allows ITS designers to monitor and reflect on fine-grained design decisions inside the system (e.g., which assistance action for which question) and enables a data-driven design process in which the designers can specify a reward function to train an assistance policy that promotes the desired student learning experience. The live use evaluation confirms that this process provides the system with the ability to learn to teach better automatically over time, by showing how the actions selected by the learned multi-armed bandit policies lead to significant improvements in learning outcomes compared to a randomized assistance policy.

By studying the assistance actions selected by our optimized policy (Table 4) we observe that there is no single type of assistance that is always most effective. This emphasizes the importance of algorithms that can identify the most effective teaching action for each individual practice question based on observational data. Interestingly, the policy blends more informative (e.g., paragraphs) with less informative assistance actions (e.g., hints) and decides for some questions to provide no additional help at all. This indicates a trade-off between giving and withholding information during the learning process, a phenomenon that has been described as the assistance dilemma in prior research [11].

Our methodology combines multi-armed bandit and offline policy evaluation techniques [15] with large-scale student log data to compute high-confidence estimates on the effects of individual assistance actions. One inherent property of our multi-armed bandit formulation is that it focuses on selecting the teaching action that is most effective for the average student: it neither conditions assistance on the individual student nor captures synergies that could occur when certain combinations of assistance actions are shown to a student in the same practice session. While a reinforcement learning approach could address both of these shortcomings, the volume of training data required would increase dramatically, and it would be much harder to compute statistically significant estimates on the effects of individual policies before deployment. We will explore the potential of personalized assistance policies via contextual bandit and reinforcement learning algorithms [7, 22] in future work. Another future direction is the integration of online bandit algorithms [14] into the current system to keep enhancing the assistance policies by adaptively sampling individual actions based on evolving effect estimates in live deployment. Adaptive sampling is of particular interest to us as the pool of questions and assistance actions is continuously refined.

7 Conclusion

In this paper we discussed a large-scale online tutoring system that uses student log data to learn which of several candidate assistance actions (e.g., hints and paragraphs) to provide to students when they answer a particular practice question incorrectly. We used offline policy evaluation to leverage data from over 1,000,000 student practice sessions to evaluate the effects of individual assistance actions and multi-armed bandit policies on various measures of learning outcomes. We studied the relationships among outcome measures and designed an algorithm to train an assistance policy that optimizes the student’s success at answering the current question as well as their overall practice session performance. In a live evaluation with over 20,000 students we compared the trained assistance policy to a randomized assistance policy, finding that the system’s ability to learn to select more effective teaching actions automatically over time enables significant improvements in the learning outcomes of future students. The trained policy now supports thousands of students practicing Biology each day.