In 1960, a book was published by the name of Dynamic Programming and Markov Processes, along with an article by the name of “Machine-Aided Learning”. The former established itself as one of the foundational early texts on Markov decision processes (MDPs), the model that underpins reinforcement learning (RL). The latter is a virtually unknown two-page vision paper suggesting that computers could help individualize the sequence of instruction for each student. Both were written by Ronald Howard, one of the pioneers of decision processes, now considered the “father of decision analysis.” These two lines of work are not unrelated: in 1962, Howard’s doctoral student Richard Smallwood wrote his dissertation, A Decision Structure for Teaching Machines, on how to use decision processes to adapt instruction in a computerized teaching machine. This is perhaps the first example of using reinforcement learning (broadly conceived) for instructional sequencing (i.e., determining how to adaptively sequence various instructional activities to help students learn). Instructional sequencing was thus one of the earliest applications of reinforcement learning.

Over 50 years later, researchers continue to tackle the problem of instructional sequencing with the tools of reinforcement learning in a variety of educational settings (perhaps not always realizing that this problem was first formulated and studied decades ago), and excitement for this area of research is perhaps as alive now as ever. Many researchers have been drawn to this area because (1) it is well known that the way in which instruction is sequenced can make a difference in how well students learn (Ritter et al. 2007), and (2) reinforcement learning provides the mathematical machinery to formally optimize the sequence of instruction (Atkinson 1972a). But over the past 50 years, how successful has RL been in discovering useful adaptive instructional policies? More importantly, looking to the future, how might RL best impact instructional sequencing?

In this paper, we seek to address these questions by reviewing the variety of attempts to apply reinforcement learning to the task of instructional sequencing in different settings. We first narrate a brief history of RL applied to instructional sequencing. We identify three waves of research in this area, with the most recent wave pointing to where the field seems to be heading in the future. Second, to assess how successful using RL for instructional sequencing has been in helping students learn, we review all of the empirical research comparing RL-induced instructional policies to baseline instructional policies. We find that over half of the studies found significant effects in favor of RL-induced policies. Moreover, we identify five clusters of studies that vary in the way RL has been used for instructional sequencing and have had varying levels of success in impacting student learning.

We find that reinforcement learning has been most successful in cases where it has been constrained with ideas and theories from cognitive psychology and the learning sciences, which suggest combining theory-driven and data-driven approaches, as opposed to purely relying on black-box data-driven algorithms. However, given that our theories and models are limited, we also find that it has been useful to complement this approach with running more robust offline analyses that do not rely heavily on the assumptions of one particular model.

Recent advances in reinforcement learning and educational technology, such as deep RL (Mnih et al. 2015) and big data, seem to be resulting in growing interest in applying RL to instructional sequencing. Our hope is that this review will productively inform both researchers who are new to the field and researchers who are continuing to explore ways to impact instructional design with the tools of reinforcement learning.

Reinforcement Learning: Towards a “Theory of Instruction”

In 1972, the psychologist Richard Atkinson wrote a paper titled “Ingredients for a Theory of Instruction” (Atkinson 1972b), in which he claims a theory of instruction requires the following four “ingredients”:

  1. “A model of the learning process.

  2. Specification of admissible instructional actions.

  3. Specification of instructional objectives.

  4. A measurement scale that permits costs to be assigned to each of the instructional actions and payoffs to the achievement of instructional objectives.”

Atkinson further describes how these ingredients for a theory of instruction map onto the definition of a Markov decision process (MDP). Formally, a finite-horizon MDP (Howard 1960a) is defined as a five-tuple (S, A, T, R, H), where

  • S is a set of states

  • A is a set of actions

  • T is a transition function where \(T(s^{\prime }|s, a)\) denotes the probability of transitioning from state s to state \(s^{\prime }\) after taking action a

  • R is a reward function where R(s, a) specifies the reward (or the probability distribution over rewards) when action a is taken in state s, and

  • H is the horizon, or the number of time steps where actions are taken.

In reinforcement learning (RL), the goal is for an agent to learn a policy π—a mapping from states to actions or probability distributions over actions—that accrues high reward (Sutton and Barto 1998). The policy specifies, for each state, what action the agent should take. There exist various methods for planning in an MDP, such as value iteration (Bellman 1957) or policy iteration (Howard 1960a), which yield the optimal policy for the given MDP. However, RL refers to the task of learning a policy when the parameters of the MDP (the transition function and possibly the reward function) are not known ahead of time.
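As a concrete illustration (not drawn from any system reviewed here), the backward-induction form of value iteration for a finite-horizon MDP can be sketched in a few lines; the array layout and function name below are our own:

```python
import numpy as np

def finite_horizon_value_iteration(T, R, H):
    """Optimal policies for a finite-horizon MDP via backward induction.

    T: shape (A, S, S); T[a, s, s'] = probability of moving s -> s' under action a
    R: shape (S, A); expected immediate reward for taking action a in state s
    H: horizon (number of decision steps)
    Returns a list of H policies, one per time step, each an array mapping
    state index -> action index.
    """
    A, S, _ = T.shape
    V = np.zeros(S)                  # value of states after the final step
    policies = []
    for _ in range(H):               # sweep backwards from the last step
        # Q[s, a] = immediate reward + expected value of the successor state
        Q = R + np.einsum("asn,n->sa", T, V)
        policies.insert(0, Q.argmax(axis=1))
        V = Q.max(axis=1)
    return policies
```

Infinite-horizon value iteration and policy iteration follow the same pattern, iterating until the value function converges rather than for a fixed number of steps.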

As Atkinson explained, in the context of instruction, the transition function maps onto a model of the learning process, where the MDP states are the states that the student can be in (such as cognitive states). The set of actions are instructional activities that can change the student’s cognitive state. These activities could be problems, problem steps, flashcards, videos, worked examples, game levels in the context of an educational game, etc. Finally, the reward function can be factorized into a cost function for each instructional action (e.g., based on how long each action takes) and a reward based on the cognitive state of the student (e.g., a reward for each skill a student has learned).

We note that this review specifically focuses on applications of reinforcement learning to the sequencing of instructional activities. Reinforcement learning and decision processes have been used in other ways in educational technology that we do not consider here. For example, Barnes and Stamper (2008) have used MDPs to model students’ problem solving processes and automatically generate hints for students. Similarly, Rafferty et al. (2015, 2016b) modeled student problem solving as an MDP and used problem solving trajectories to infer the MDP so they could ultimately give feedback to the students about misconceptions they might have. In these papers, the actions of the MDP are problem solving steps taken by students in the course of solving a problem, whereas in our paper, we focus on studies where the actions are instructional activities taken by an RL agent to optimize a student’s learning over the course of many activities.

As we show below, the natural formulation of the instructional process as a decision process and a problem that can be tackled by reinforcement learning drew many researchers, including psychologists like Atkinson, to this problem. In theory, RL could formalize that which was previously an art: instruction. How well it can do so in practice is the subject of investigation of this paper.

Examples of RL for Instructional Sequencing

In order to situate the rest of this paper, it is worth giving some concrete examples of how the techniques of decision processes and RL could be applied to instructional sequencing. We will begin with one of the simplest possible MDPs that could be used in the context of instructional sequencing, and then consider a series of successive refinements to be able to model more authentic phenomena, ending with the model considered by Atkinson (1972b). While there are many more ways of applying RL to instructional sequencing, this section will give us a sense of one concrete way in which it has been done, as well as introduce several of the design decisions that need to be made in modeling how people learn and using such models to induce instructional policies. In the review of empirical studies below, we will discuss a much broader variety of ways in which various researchers have used RL to implement instructional sequencing.

The first model we will consider is a simple MDP that assumes for any given fact, concept, or skill to be learned (which we will refer to as a knowledge component or KC), the student can be in one of two states: the “correct” state or the “incorrect” state. Whenever the student answers a question correctly, the student will transition to the correct state for the associated KC, and whenever the student answers a question incorrectly, the student will transition to the incorrect state for that KC. The complete state can be described with a binary vector of all the individual KC states. The set of actions is the set of items that we can have students practice, where each item is associated with a given KC. For each item, there is a 2-by-2 transition matrix that specifies the probability of its associated KC transitioning from one state to another. (For simplicity, we assume that all items for the same KC have the same probability of transitioning to the correct state.) Suppose our goal is to have the student reach the correct state for as many KCs as possible. Then we can specify a reward function that gives a reward of one whenever the student transitions from the incorrect state to the correct state, a reward of negative one whenever the student transitions from the correct state to the incorrect state, and a reward of zero otherwise. In this case, the optimal instructional policy is trivial: always give an item for the KC that has the highest probability of transitioning to the correct state among all KCs in the incorrect state.
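Under this simple model, the optimal policy reduces to a one-line selection rule. As a sketch (the dictionary-based representation is ours, purely for illustration):

```python
def choose_next_kc(state, p_learn):
    """Pick the KC to practice next under the simple correct/incorrect MDP.

    state: dict mapping KC name -> True if that KC is in the "correct" state
    p_learn: dict mapping KC name -> probability of transitioning
             incorrect -> correct when an item for that KC is given
    Returns the KC whose item the optimal policy would present, or None
    if every KC is already in the correct state.
    """
    incorrect = [kc for kc, correct in state.items() if not correct]
    if not incorrect:
        return None
    # Optimal under the +1/-1/0 reward: target the incorrect KC that is
    # most likely to flip to the correct state.
    return max(incorrect, key=lambda kc: p_learn[kc])
```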

Of course to use this policy in practice, we need to learn the parameters of the MDP using prior data. Given the assumptions we made, the only parameters in this model are the transition probabilities for each KC. In this case, the maximum likelihood transition probabilityFootnote 1 for each KC can be inferred simply by computing how many times students transitioned from the incorrect state to the correct state divided by the number of time steps where the students received an item in the incorrect state.
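This counting procedure can be sketched as follows, assuming (hypothetically) that each student's per-KC state sequence has been logged:

```python
def max_likelihood_p_learn(trajectories):
    """Maximum-likelihood estimate of the incorrect -> correct transition
    probability for one KC.

    trajectories: list of per-student state sequences for this KC, where
    each sequence gives the KC state (True = correct) at successive
    practice opportunities.
    """
    transitions = 0    # times a student moved from incorrect to correct
    opportunities = 0  # practice opportunities that began in the incorrect state
    for seq in trajectories:
        for before, after in zip(seq, seq[1:]):
            if not before:
                opportunities += 1
                if after:
                    transitions += 1
    return transitions / opportunities if opportunities else 0.0
```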

However, notice that the MDP presented above is likely not very useful, because it assumes our goal is just to have students answer questions correctly. A student may be able to answer questions correctly without displaying proper understanding, for example by guessing or by answering correctly for slightly wrong reasons. In reality, we may assume that students’ answers are only noisy signals of their underlying knowledge states. To model the fact that we cannot know a student’s true cognitive state, we would need to use a partially observable Markov decision process (POMDP) (Sondik 1971). In a POMDP, the underlying state is inaccessible to the agent, but there is some observation function (O) which maps states to probability distributions of observations. In our example, the observation at each time step is whether the student answers a question correctly or incorrectly, and the probability of answering a question correctly or incorrectly depends on which state the student is in for the current KC that is being taught. Again, we can assume there are two states for each KC, but we will call the states the “learned” state and the “unlearned” state, as they represent whether the student has learned the KC. If we ignore the reward function, this POMDP is equivalent to the Bayesian knowledge tracing model (Corbett and Anderson 1995), which has been used to implement cognitive mastery learning in intelligent tutoring systems (Corbett 2000). Typically BKT is not considered in the RL framework, because a reward function is not explicitly specified, although using BKT for mastery learning does implicitly follow a reward function. One possible reward function for cognitive mastery learning would be that each time our estimated probability that the student has learned a particular KC exceeds 0.95, then we receive a reward of one, and otherwise we receive a reward of zero. 
Such a model would then keep giving items for a given KC until we are 95% confident that the student has learned that KC, before moving on. Notice that the optimal policy under this reward function (i.e., cognitive mastery learning) can be very different from the optimal policy under other reasonable reward functions (e.g., get a reward of one for each KC that is actually in the learned state, which we cannot directly observe).
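The standard BKT belief update (Corbett and Anderson 1995) that such a mastery policy relies on can be sketched as follows; the parameter names are ours, with p_guess and p_slip denoting the probabilities of answering correctly in the unlearned state and incorrectly in the learned state, respectively:

```python
def bkt_update(p_know, correct, p_guess, p_slip, p_learn):
    """One Bayesian knowledge tracing update for a single KC.

    p_know: prior probability the student is in the learned state
    correct: whether the student answered this item correctly
    Returns the posterior probability of the learned state, after
    conditioning on the observation and applying the learning transition.
    """
    if correct:
        likelihood_known = p_know * (1 - p_slip)
        likelihood_unknown = (1 - p_know) * p_guess
    else:
        likelihood_known = p_know * p_slip
        likelihood_unknown = (1 - p_know) * (1 - p_guess)
    posterior = likelihood_known / (likelihood_known + likelihood_unknown)
    # Learning can occur on the practice opportunity itself.
    return posterior + (1 - posterior) * p_learn

def mastered(p_know, threshold=0.95):
    """Mastery rule implied by the reward function described above."""
    return p_know >= threshold
```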

The parameters of a POMDP like the BKT model are slightly more difficult to infer, because we do not actually know when students are in each state, unlike in the MDP case. However, there are a number of algorithms that could be used to estimate POMDP parameters including expectation maximization (Welch 2003), spectral learning approaches (Hsu et al. 2012; Falakmasir et al. 2013), or simply performing a brute-force grid search over the entire space of parameters (Baker et al. 2010).

We consider one final modification to the model above, namely that which was used by Atkinson (1972b) for teaching German vocabulary words. Note that the BKT model does not account for forgetting. Atkinson (1972b) proposed a POMDP with three states for each word to be learned (or KC, in the general case): an unlearned state, a temporarily learned state, and a permanently learned state. The model allows for some probability of transitioning from either the unlearned or temporarily learned states to the permanently learned state, but one can also transition from the temporarily learned state back to the unlearned state (i.e., forgetting). Moreover, this model assumes that a student will always answer an item correctly unless the student is in the unlearned state, in which case the student will always answer items incorrectly. The reward function in this case gives a reward of one for each word that is permanently learned at the end (as measured via a delayed posttest, where it is assumed that any temporarily learned word will be forgotten). The optimal policy in this case can be difficult to compute because one needs to reason about words that are forgotten over time. Therefore, Atkinson (1972b) used a myopic policy that chooses the best next action as though only one more action will be taken. In this case, the best action is to choose the word that has the highest probability of transitioning to the permanently learned state.
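A minimal sketch of this myopic rule, assuming (as an illustrative simplification, not Atkinson's exact parameterization) a single pair of transition probabilities shared across all words and a belief state maintained per word:

```python
def myopic_word_choice(beliefs, p_up, p_tp):
    """Myopic policy for a three-state (unlearned/temporary/permanent) model.

    beliefs: dict word -> (p_unlearned, p_temporary, p_permanent), the
             current belief over the three knowledge states for that word
    p_up: probability one presentation moves a word unlearned -> permanent
    p_tp: probability one presentation moves a word temporary -> permanent

    Chooses the word whose presentation has the highest probability of
    landing it in the permanently learned state.
    """
    def gain(word):
        p_u, p_t, _ = beliefs[word]
        return p_u * p_up + p_t * p_tp
    return max(beliefs, key=gain)
```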

Design Considerations in Reinforcement Learning

Before continuing, it is worthwhile to describe several different settings that are considered in reinforcement learning, and the design considerations that researchers need to make in considering how to apply RL. RL methods are often divided into model-based and model-free approaches. Model-based RL methods learn the model (transition function and reward function) first and then use MDP planning methods to induce a policy. Model-free methods use data to learn a good policy directly without learning a model first. Most of the studies we review in this paper have used model-based RL. All of the examples described above are model-based—a model is fit to data first and then a policy (either the optimal policy or a myopic policy) is derived using MDP/POMDP planning.

There are two different ways in which RL can be used. In online RL, the policy is learned and improved as the agent interacts with the environment. In offline RL, a policy is learned from data collected in the past and is then deployed in an actual environment. For instance, in the examples we presented above, the models were fit to previously collected data and then used for offline, model-based RL. While online RL can identify a good policy more quickly and efficiently, it can be more difficult to use in practice, as one must determine and fix the algorithms used before collecting any data.

In online RL, the agent must decide whether to use the current best policy in order to accrue high reward or to take actions about which it is uncertain in the hope of finding a better policy in the future. This is known as the exploration vs. exploitation trade-off. Exploration refers to trying new actions to gather data from less known areas of the state and action space, while exploitation refers to using the best policy the agent has identified so far. This trade-off is rarely tackled in the studies we consider below that have applied RL to instructional sequencing, with a few exceptions (Lindsey et al. 2013; Clement et al. 2015; Segal et al. 2018).

As discussed in our examples, since the cognitive state of a student usually cannot be observed, it is common to use a partially observable Markov decision process rather than a (fully observable) MDP. Planning, let alone reinforcement learning, in POMDPs is, in general, intractable, which is why researchers often use approximate methods for planning, such as myopic planning. However, some models of learning (such as the BKT model discussed above) are very restricted POMDPs, making it possible to find an optimal policy.

In model-based RL, our model is generally incorrect, not only because there is not enough data to fit the parameters correctly, but also because the form of the model could be incorrect. As we will see, researchers have proposed various models for student learning, which make rather different assumptions. When the assumptions of the model are not met, we could learn a policy that is not as good as it seems. To mitigate this issue, researchers have considered various methods of off-policy policy evaluation, or evaluating a policy offline using data from one or more other policies. Off-policy policy evaluation is important in the context of instructional sequencing, because it would be useful to know how much an instructional policy will help students before testing it on actual students. Ultimately, a policy must be tested on actual students in order to know how well it will do, but blindly testing policies in the real world could be costly and potentially a waste of student time.
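One of the simplest off-policy policy evaluation methods is ordinary importance sampling, which reweights logged trajectories by how likely the evaluated policy would have been to take the logged actions. A minimal sketch, assuming the logging policy's action probabilities were recorded:

```python
def importance_sampling_ope(trajectories, target_policy):
    """Ordinary importance-sampling estimate of a target policy's value,
    using trajectories collected under other (logging) policies.

    trajectories: list of episodes; each episode is a list of
        (state, action, behavior_prob, reward) tuples, where behavior_prob
        is the probability the logging policy assigned to the logged action.
    target_policy: function (state, action) -> probability the evaluated
        policy would take that action in that state.
    """
    estimates = []
    for episode in trajectories:
        weight = 1.0  # cumulative likelihood ratio for this episode
        ret = 0.0     # total (undiscounted) reward of this episode
        for state, action, behavior_prob, reward in episode:
            weight *= target_policy(state, action) / behavior_prob
            ret += reward
        estimates.append(weight * ret)
    return sum(estimates) / len(estimates)
```

The estimator is unbiased when the logged probabilities are correct, but its variance grows quickly with the horizon, which is one reason more robust offline evaluation methods are of interest.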

From the intelligent tutoring systems literature, we can distinguish between two broad forms of instructional sequencing in terms of the granularity of the instructional activities: task-loop (or outer loop) adaptivity and step-loop (or inner loop) adaptivity (VanLehn 2006, 2016; Aleven et al. 2016a). In task-loop adaptivity, the RL agent must select distinct tasks or instructional activities. In step-loop adaptivity, the RL agent must choose the exact nature of each step (e.g., how much instructional scaffolding to provide) for a fixed instructional task. For example, an RL agent operating in the step loop might have to decide for all the steps in a problem whether to show the student the solution to the next step or whether to ask the student to solve the next step (Chi et al. 2009). Almost all of the papers we include in this review operate in the task loop. While step-loop adaptivity is a major area of research in adaptive instruction in general (Aleven et al. 2016a), relatively little work has been pursued in this area using RL-based approaches.

A Historical Perspective

The use of reinforcement learning (broadly conceived) for instructional sequencing dates back to the 1960s. We believe at least four factors led to interest in automated instructional sequencing during the 60s and 70s. First, teaching machines (mechanical devices that deliver step-by-step instruction via exercises with feedback) were gaining a lot of interest in the late 50s and 60s, and researchers were interested in implementing adaptive instruction in teaching machines (Lumsdaine 1959). Second, with the development of computers, the field of computer-assisted instruction (CAI) was forming and there was interest in developing computerized teaching machines (Liu 1960). Third, pioneering work on mathematical optimization and dynamic programming (Bellman 1957; Howard 1960a), particularly the development of Markov decision processes, provided a mathematical literature for studying the optimization of instructional sequencing. Finally, the field of mathematical psychology was beginning to formulate mathematical models of learning (Atkinson and Calfee 1963).

As mentioned earlier, Ronald Howard, one of the pioneers of Markov decision processes, was interested in using decision processes to personalize instruction (Howard 1960b). In 1962, Howard’s PhD student, Richard Smallwood, wrote his dissertation, A Decision Structure for Teaching Machines (Smallwood 1962), which presented what is to our knowledge the first time an RL-induced instructional policy was tested on actual students. Even though the field of reinforcement learning had not yet developed, Smallwood was particularly interested in what we now call online reinforcement learning, where the system could improve over time as it interacts with more students. In fact, he provided preliminary evidence in his dissertation that the policy developed for his computerized teaching machine did in fact change with the accumulation of more data. Smallwood’s PhD student Edward Sondik’s dissertation, The Optimal Control of Partially Observable Markov Decision Processes, was seemingly the first text that formally studied planning in partially observable Markov decision processes (POMDPs). Sondik wrote in his dissertation, “The results obtained by Smallwood [on the special case of determining optimum teaching strategies] prompted this research into the general problem” (Sondik 1971). Thus, the analysis of POMDPs, an important area of research in optimal control, artificial intelligence, and reinforcement learning, was prompted by its application to instructional sequencing.

Around the same time, a group of mathematical psychologists at Stanford, including Richard Atkinson and Patrick Suppes, were developing models of learning from a psychological perspective and were interested in optimizing instruction according to these models, using the tools of dynamic programming developed by Howard and his colleagues. Atkinson and his students tested several instructional policies that optimized various models of learning (Dear et al. 1967; Laubsch 1969; Atkinson and Lorton 1969; Atkinson 1972b; Chiang 1974).

Curiously, there was almost no work on deriving optimal policies from the mid-70s to the early 2000s. While we cannot definitively say why, there seem to be a number of contributing factors. Researchers from the mathematical optimization community (including Howard and his students) stopped working on this problem after a few years and continued to work in their home disciplines. On the other hand, Atkinson’s career in psychology research ended in 1975 when he left for the National Science Foundation (Atkinson 2014), and presumably the field of mathematical psychology lost interest in optimizing instructional policies over time. Research in automated instructional sequencing re-emerged at the turn of the twenty-first century for seemingly three reasons that closely parallel the trends of the 60s. First, there was growing interest in intelligent tutoring systems, a natural testbed for adaptive instructional policies, paralleling the interest in teaching machines and computer-assisted instruction in the 60s. Second, the field of reinforcement learning formally formed in the late 1980s and early 1990s (Sutton and Barto 1998), combining machine learning with the tools of Markov decision processes and dynamic programming built in the 60s. Finally, the fields of Artificial Intelligence in Education (AIED) and, later, educational data mining (EDM) were interested in developing statistical models of learning, paralleling mathematical psychologists’ interest in models of learning several decades earlier.

Even though research on instructional sequencing has continued steadily since the early 2000s, a third wave of research seems to be emerging in recent years. This is due to certain shifting trends in the research landscape that might be attracting a new set of researchers to the problem of data-driven instructional sequencing. First, there is a new “automated” medium of instruction, like the teaching machines, CAI, and ITSs of previous decades: MOOCs and other large-scale online education providers.Footnote 2 And with MOOCs comes the promise of big data. Second, the field of deep reinforcement learning has formed, leading to significantly more interest in the promise of reinforcement learning as a field. Indeed, there were around 35% more papers and books mentioning reinforcement learning in 2017 than in 2016 (as per the number of Google Scholar search hits). While initial advances in deep reinforcement learning have focused largely on playing games such as Atari (Mnih et al. 2015) and Go (Silver et al. 2016, 2017), we have recently seen researchers applying deep reinforcement learning to the problem of instructional sequencing (Piech et al. 2015; Chaplot et al. 2016; Reddy et al. 2017; Wang et al. 2017a; Upadhyay et al. 2018; Shen et al. 2018a). Finally, in tandem with the use of deep reinforcement learning, there is a growing movement within the AIED and EDM communities to use deep machine learning models to model human learning (Piech et al. 2015; Chaplot et al. 2016); this is a significant departure from the more interpretable models of the 1990s and the more psychologically principled models of the 1960s.

Table 1 summarizes the trends that we believe have been responsible for the “three waves” of interest in applying reinforcement learning and decision processes to instructional sequencing. We find a general trend that the methods of instructional sequencing have become more data-driven over time and the media for delivering instruction have become more data-generating. Perhaps researchers are inclined to believe that more computational power, more data, and better reinforcement learning algorithms make this a time where RL can have a demonstrable impact on instruction. However, we do not think these factors are sufficient for RL to leave its mark; we believe there are insights to be gained from the literature about how RL can be impactful, which is where we turn next. Based on the growth of interest in reinforcement learning in general and deep reinforcement learning in particular, we anticipate many more researchers will be interested in tackling instructional sequencing in the coming years. We hope this history and the review of empirical literature that follows will be informative to these researchers.

Table 1 Trends in the three waves of interest in applying reinforcement learning to instructional sequencing

Review of Empirical Studies

To understand how successful RL has been in impacting instructional sequencing, we conduct a broad review of the empirical literature in this area. In particular we are interested in any studies that run a controlled experiment comparing one or more instructional policies, at least one of which is induced by an RL-based approach. We are interested in seeing how often studies find a significant difference between RL-induced policies and baseline policies, and what factors might affect whether or not an RL-induced policy is successful in helping students learn beyond a baseline policy. This review of the empirical literature was prompted by two experiments on RL-induced instructional sequencing that we ran in a fractions intelligent tutoring system. Both experiments resulted in no significant differences among the policies we tested. We were interested in identifying reasons why our experiments led to null results, and how our findings compared to other studies that tested RL-induced policies. Our own experiments have also given us insights into the challenges of applying RL to instructional sequencing, which have informed the discussion following the literature review below; details from one of our experiments along with some of the insights it provided are given in Appendix B.

Inclusion Criteria: Scope of the Review

One challenge of conducting this systematic review is determining what counts as an “RL-induced policy.” First of all, not all studies (especially ones from the 60s and 70s) use the term reinforcement learning, but they are clearly doing some form of RL or at least applying Markov decision processes to the task of instructional sequencing. Second, some studies are not clearly conducting some form of RL, but still have the “flavor” of using RL in that they find instructional policies in a data-driven way or they use related techniques such as multi-armed bandits (Gittins 1979; Auer et al. 2002) or Bayesian optimization (Mockus 1994; Brochu et al. 2010). On the other hand, some studies that do use the language of RL rely on heuristics or approximations in trying to find an instructional policy (such as myopic planning). We included all studies that had the “flavor” of using RL-induced instructional policies, even when the language of RL or related optimization techniques were not used.

There are two components to reinforcement learning: (1) optimization (e.g., MDP planning in the model-based setting) and (2) learning from data (e.g., learning the MDP in the model-based setting). For a study to be considered as using RL for instructional sequencing, it should use some form of optimization and data to find instructional policies. More formally, we included any studies where:

  • The study acknowledges (at least implicitly) that there is a model governing student learning and giving different instructional actions to a student might probabilistically change the state of a student according to the model.

  • There is an instructional policy that maps past observations from a student (e.g., responses to questions) to instructional actions.

  • Data collected from students (e.g., correct or incorrect responses to previous questions), either in the past (offline) or over the course of the study (online), are used to learn either:

    • a statistical model of student learning, and/or

    • an instructional policy.

  • If a statistical model of student learning is fit to data, the instructional policy is designed to approximately optimize that model according to some reward function, which may be implicitly specified.

Notice that this means we consider any studies that might learn a model from prior data and then use a heuristic to find the instructional policy (such as myopic planning rather than long-horizon planning). This also means we did not include any studies that applied planning to a pre-specified MDP or POMDP (e.g., a BKT model with hand-set parameters), since learning is a critical component of reinforcement learning.

Searching for all papers that match our inclusion criteria is challenging, as not all papers use the same language to discuss data-driven instructional sequencing. Therefore, to conduct our search, we began with an initial set of papers that we knew matched our inclusion criteria, in addition to any papers that we became aware of over time. We iteratively added more papers by performing one-step forward and backward citation tracing on the growing pool of papers. That is, for every paper that we included in our review, we looked through the papers that it cited as well as all papers that cited it—as identified by Google Scholar as of December 2018—to see if any of those papers also matched our inclusion criteria. This means if we have missed any relevant studies, they are disconnected (in terms of direct citations) from the studies that we have identified. We found relevant papers coming from a diversity of research communities including mathematical psychology, cognitive science, optimal control, AIED, educational data mining, machine learning, machine teaching, and human-robot interaction.


We found 34 papers containing 41 studies that matched our criteria, including a previously unpublished study that we ran on our fractions tutoring system, which is described in Appendix B. Before discussing these studies in depth, we briefly mention the kinds of papers that did not match our inclusion criteria but are still related to studying RL for instructional sequencing. Among these papers, we found 19 papers that learned policies on offline data but did not evaluate the performance of these policies on actual students (Footnote 3). At least an additional 23 papers learned (and compared) policies using only simulated data, i.e., no data from real learners were used (Footnote 4). At least eight papers simply proposed using RL for instructional sequencing, or proposed an algorithm for doing so in a particular setting, without using simulated or real data (Footnote 5). We also found at least 14 papers that did study instructional policies with real students but did not match our inclusion criteria for various reasons, including not being experimental, varying more than just the instructional policy across conditions, or using hand-set model parameters (Footnote 6). For example, Corbett and Anderson (1995) use a model (BKT) to determine how many remediation exercises should be given for each KC, but they compare this to providing no remediation rather than to another way of sequencing remediation exercises. Finally, many papers have mathematically studied deriving optimal instructional policies for various models of learning, especially during the first wave of optimizing instructional sequencing (e.g., Karush and Dear 1967; Smallwood 1968, 1971). The sheer number of papers that study RL-induced policies in one form or another, coming from a variety of research communities, shows that there is broad interest in applying RL to instructional sequencing.

For the studies that met our inclusion criteria, the first row of Table 2 shows how the studies are divided in terms of varying “levels of significance.” Twenty-one of the 41 studies found that at least one RL-induced policy was statistically significantly better than all baseline policies for some outcome variable, typically performance on a posttest or time to mastery. Four studies found no significant difference overall but a significant aptitude-treatment interaction (ATI) favoring low-performing students (i.e., the RL-induced policy performed significantly better than the baselines for lower performing students, while no significant difference was detected for higher performing students). Four studies found mixed results, namely that an RL-induced instructional policy outperformed at least one baseline policy but not all of them. Ten studies found no significant difference between adaptive policies and baseline policies. Only one study found that a baseline policy outperformed an RL-induced policy (Footnote 7).

Table 2 Comparison of clusters of studies based on the “significance level” of the studies in each cluster: Sig indicates that at least one RL-induced policy significantly outperformed all baseline policies, ATI indicates an aptitude-treatment interaction, Mixed indicates the RL-induced policy significantly outperformed some but not all baselines, Not sig indicates that there were no significant differences between policies, Sig worse indicates that the RL-induced policy was significantly worse than the baseline policy (which for the only such case was an adaptive policy)

Thus, over half of the studies found that adaptive policies outperform all baselines that were tested. Moreover, the studies that found a significant difference, as well as those that demonstrated an aptitude-treatment interaction, often found a Cohen’s d effect size of at least 0.8, which is regarded as a large effect (Cohen 1988). While this is a positive finding in favor of using RL-induced policies, it does not tell us why some studies succeeded in showing that RL-induced policies can help students learn beyond a baseline policy while others did not. To explore this, we qualitatively cluster the studies into five groups based on how they have applied RL. The clusters generally vary in terms of the types of instructional actions considered and how those actions relate to each other. In paired-associate learning tasks, each action specifies the content presented to the student, but each piece of content is assumed to be independent of the rest. In the concept learning tasks cluster, actions are interrelated insofar as they give different bits of information about a particular concept. In the sequencing interdependent content cluster, the various pieces of content are assumed to be interdependent, but not in the restricted form present in concept learning tasks. In the sequencing activity types cluster, the order of content is fixed, and each action specifies the type of instructional activity for the fixed content. In all of these studies, the goal is to maximize how much students learn or how quickly they learn a prespecified set of material. The final cluster contains two studies that maximize objectives other than learning gains or speed.
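For reference, Cohen’s d is the difference between condition means divided by the pooled standard deviation. A minimal computation, using hypothetical posttest scores, might look like:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)  # sample variances
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# hypothetical posttest scores: RL-induced policy vs. a baseline policy
d = cohens_d([0.82, 0.75, 0.90, 0.68], [0.60, 0.55, 0.72, 0.49])
print(round(d, 2))
```

By the conventional interpretation used in the text, any d of 0.8 or more counts as a large effect.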

There are many other ways in which we could have chosen to cluster the studies, including distinctions between the types of RL algorithms used (e.g., model-based vs. model-free, online vs. offline, MDP vs. POMDP), the form of instructional media that content was delivered in (e.g., CAI vs. ITSs vs. online platforms vs. educational games), and the types of baseline policies used. We chose to cluster studies based on the types of instructional actions because the type of MDP or POMDP that underlies each cluster differs drastically from the others. In paired-associate learning tasks, the transition dynamics of the MDP can be factored into separate dynamics for each piece of content, and the key consideration becomes how people learn and forget individual pieces of content over time. If we assume content is interdependent, then the dynamics must capture the dependencies between the pieces of content. If we are trying to sequence activity types, then the dynamics must capture some relationship between activity types and what the student knows. Moreover, as we completed this literature review, it became clear that these differences play a role in the difficulty of sequencing instruction—and relatedly, in the empirical success of applying RL to instructional sequencing. Table 2 shows, for each cluster, the number of studies at each “significance level” identified above (e.g., whether the study showed a significant effect in favor of RL-induced policies, an aptitude-treatment interaction, etc.). The table clearly shows that the different types of studies had markedly different levels of success. Therefore, a qualitative understanding of each cluster will help us understand when and where RL can be most useful for instructional sequencing.

In what follows we describe the five clusters in more depth. For each cluster, we provide a table that gives a summary of all of the studies in that cluster and we describe some of the key commonalities and differences among the studies. In doing so, we will (a) demonstrate the variety of ways in which RL can be used to sequence instructional activities, and (b) set the stage for obtaining a better understanding of the conditions under which RL has been successful in sequencing instruction for students, which we discuss in the next section. Appendix A—including Tables 8 and 9—provides more technical details about the particulars of all the studies, including the types of models and instructional policies used in these studies.

Table 3 Summary of all empirical studies in the paired-associate learning tasks cluster

Paired-Associate Learning Tasks

The studies in this cluster are listed in Table 3. All of the studies that were run in the first wave of instructional sequencing (1960s–70s) belong to this cluster. A paired-associate learning task is one where the student must learn a set of paired associations, such as a set of vocabulary words in a foreign language. In such tasks, a stimulus (e.g., a foreign word) is presented to the student, and the student must attempt to provide the translation of the word. The student is then shown the correct translation. Such tasks may also be referred to as flashcard learning tasks, because the student is essentially reviewing a set of words or concepts using “flashcards.” A key assumption in any paired-associate learning task is that the stimuli are independent of one another. For example, if one learns how to say “chair” in Spanish, it is assumed that this neither helps nor hinders one’s ability to learn how to say “table” in Spanish (Footnote 8). Because of this assumption, we may also think of this cluster as “sequencing independent content,” which clearly contrasts it with some of the later clusters.

The key goal in sequencing instruction for paired-associate learning tasks is to balance between (1) teaching stimuli that the student may not have learned yet, and (2) reviewing stimuli that the student may have forgotten or be on the verge of forgetting. The psychology literature has shown that sequencing instruction in such tasks is important due to the spacing effect (Ebbinghaus 1885), whereby repetitions of an item or flashcard should be spaced apart in time. Thus, a key component of many of the instructional policies developed for paired-associate learning tasks is using a model of forgetting to predict the optimal amount of spacing for each item. Early models used to sequence instruction, such as the One-Element Model (OEM), ignored forgetting and sequenced items based only on predictions of whether students had learned the items or not (Bower 1961; Dear et al. 1967). Atkinson (1972b) later developed the Markov model described earlier in this paper, which accounted for forgetting, and he showed that it could be used to successfully sequence words in a German-to-English word translation task. More recently, researchers have developed more sophisticated psychological models that account for forgetting, such as the Adaptive Control of Thought—Rational (ACT-R) model (Anderson 1993; Pavlik and Anderson 2008), the Adaptive Response Time based Sequencing (ARTS) model (Mettler et al. 2011), and the DASH model (Lindsey et al. 2014). See Appendix A.1 for a brief description of these models. In some of the studies, policies using these more sophisticated models were shown to outperform RL-induced policies that used Atkinson’s original memory model (Pavlik and Anderson 2008; Mettler et al. 2011).
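The core sequencing idea in this cluster, reviewing whichever item a model of forgetting predicts is most at risk, can be sketched as follows. The exponential-decay memory model and its parameters here are illustrative stand-ins, not any of the specific models cited above:

```python
import math

# Illustrative flashcard scheduler: an exponential-forgetting model (a toy
# stand-in for the models discussed above) predicts each item's recall
# probability, and the policy reviews the item whose predicted recall is lowest.

DECAY = 0.1          # hypothetical forgetting rate per unit time
STRENGTH_GAIN = 1.0  # hypothetical memory strength added by each review

def recall_probability(strength, elapsed):
    """P(recall) decays exponentially with time, more slowly for stronger items."""
    if strength == 0.0:
        return 0.0  # never-studied items are treated as unknown
    return math.exp(-DECAY * elapsed / strength)

def choose_item(items, now):
    """Review the item with the lowest predicted recall probability."""
    return min(items, key=lambda it: recall_probability(it["strength"],
                                                        now - it["last_review"]))

items = [
    {"word": "silla", "strength": 2.0, "last_review": 0.0},
    {"word": "mesa",  "strength": 1.0, "last_review": 5.0},
    {"word": "perro", "strength": 0.0, "last_review": 0.0},  # never studied
]

first = choose_item(items, now=10.0)   # the never-studied item is most at risk
first["strength"] += STRENGTH_GAIN     # simulate reviewing it at time 10
first["last_review"] = 10.0
second = choose_item(items, now=12.0)  # now the weakest, least-recent item
print(first["word"], second["word"])
```

The same greedy "review the weakest item" structure appears, with far more sophisticated memory models, throughout the studies in this cluster.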

Thus, aside from the type of task itself, a key feature of the studies in this cluster is their use of statistical psychological models of human learning. As shown in Table 2, in 11 out of 14 studies, RL-induced policies outperformed baseline policies. In the two studies where there were no significant differences between policies (Dear et al. 1967; Katsikopoulos et al. 2001), the model used was the OEM—a simple model that does not account for forgetting and hence cannot space instruction of paired associates over time. Similarly, Laubsch (1969) compared two RL-induced policies to a random baseline policy, and found that the policy using the OEM did not do significantly better than the baseline, while the policy based on the more sophisticated Random-Trial Increment (RTI) model did. Finally, the only study that showed a baseline policy significantly outperforming an RL-induced policy compared a policy based on the ARTS model with a policy based on the model used by Atkinson (1972b). The ARTS model was actually a more sophisticated psychological model than Atkinson’s, but its parameters were not learned from data; therefore, we considered the policy based on ARTS to technically be a non-RL-induced “baseline” policy.

Concept Learning Tasks

Concept learning is another type of task to which several researchers have applied RL-based instructional sequencing. The studies in this cluster are shown in Table 4. Concept learning tasks are typically artificially designed tasks that can be used to study various aspects of human cognition, and as such are commonly studied in the cognitive science literature. In a concept learning task, a student is presented with examples that either belong or do not belong to an initially unknown concept, and the goal is to learn what constitutes that concept (i.e., how to distinguish between positive examples that fit the concept and negative examples that do not). For example, Rafferty et al. (2016a) used a POMDP to sequence instructional activities for two types of concept learning tasks, one of which is called the Number Game (Tenenbaum 2000), where students see numbers that either belong or do not belong to some category of numbers, such as multiples of seven or numbers between 64 and 83. While such tasks are of little direct educational value, the authors’ goal was to show that models of memory and concept learning from cognitive psychology could be combined with a POMDP framework to teach people concepts quickly, which they succeeded in doing. Whitehill and Movellan (2017) extended this idea to a more authentic concept learning task: learning foreign language vocabulary words via images. They call this a “Rosetta Stone” language learning task, as it was inspired by the way the popular language learning software, Rosetta Stone, teaches foreign words via images. Notice that this task differs from teaching vocabulary as a paired-associate learning task, because there are multiple images that might convey the meaning of a foreign word (i.e., the concept), and the goal is to find a policy that can determine at any given time both what foreign vocabulary word to teach and what image to present to convey the meaning of that word.
Sen et al. (2018) also used instructional policies in an educationally relevant concept learning task, namely developing perceptual fluency in identifying whether two chemical molecules shown in different representations are the same. In this task, the student must learn features of the representations that help identify which chemical molecule is being shown.

Table 4 Summary of all empirical studies in the concept learning tasks cluster. Details are as described in the caption of Table 3

Unlike paired-associate learning tasks, the various pieces of content that can be presented in a concept learning task are mutually interdependent, but in a very particular way: seeing different (positive or negative) examples of a concept helps refine one’s idea of that concept over time (Footnote 9). For example, in the Number Game, knowing that 3, 7, and 11 are in a concept might lead one to think the concept is likely odd numbers, while also knowing that 9 is not a member might lead one to believe the concept is likely prime numbers. The exact sequence of examples presented can thus have a large influence on a student’s guess as to what the correct concept might be, and determining that sequence is critical to teaching a given concept as quickly as possible. Moreover, in these tasks, it is often beneficial to use information-gathering activities (e.g., giving a quiz to test which concept the student finds most likely) to determine what examples the student needs to refine their understanding.

As with the paired-associate learning task studies, one common feature among the studies in this cluster is that they have typically used psychologically inspired models of learning coming from the concept learning literature and computational cognitive science literature. For example, Rafferty et al. (2016a) considered three different psychological models of human learning of varying degrees of complexity. The simplest of these models—based on a model from the concept learning literature (Restle 1962)—assumes that students have a (known) prior distribution over concepts and at any given time they posit a concept as the correct one. When presented with an example, they change their concept to be consistent with the example presented, picking a random concept with probability proportional to their prior. In more complex models, students might have some memory of previous examples shown or might maintain a distribution over concepts at any given time. While the dynamics of such models are mostly prespecified by the structure of the model, there are certain model parameters (e.g., the probability of answering a question accurately according to one’s concept) that could be fit to data, as done by Rafferty et al. (2016a).
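A minimal simulation of the simplest model described above, where the learner holds a single hypothesis and resamples a consistent one (in proportion to a prior) whenever an example contradicts it, might look like the following; the hypothesis space and prior are invented for illustration:

```python
import random

# Minimal simulation of a Restle-style concept learner: the learner posits one
# hypothesis and, on seeing a contradicting example, resamples a hypothesis
# consistent with that example in proportion to a (made-up) prior.

HYPOTHESES = {
    "odd":        lambda n: n % 2 == 1,
    "prime":      lambda n: n in {2, 3, 5, 7, 11, 13},
    "multiple_3": lambda n: n % 3 == 0,
}
PRIOR = {"odd": 0.5, "prime": 0.2, "multiple_3": 0.3}

def consistent(hypothesis, number, label):
    """Does this hypothesis agree with the labeled example?"""
    return HYPOTHESES[hypothesis](number) == label

def update(current, number, label, rng):
    """Keep the current hypothesis if consistent; otherwise resample."""
    if consistent(current, number, label):
        return current
    options = [h for h in HYPOTHESES if consistent(h, number, label)]
    weights = [PRIOR[h] for h in options]
    return rng.choices(options, weights=weights)[0]

rng = random.Random(0)
state = "multiple_3"
# a teaching sequence: 3, 7, and 11 are in the concept; 9 is not
for number, label in [(3, True), (7, True), (11, True), (9, False)]:
    state = update(state, number, label, rng)
print(state)  # "prime" is the only hypothesis consistent with the final example
```

A POMDP planner built on such a model would choose the next example (or quiz) to most quickly drive the learner's hypothesis to the target concept, which is the role the learned model parameters play in the studies above.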

As seen in Table 4, the majority of studies in this cluster have been successful in showing that RL-induced policies outperformed some baselines. The studies in this cluster often had several baseline policies, including decent heuristic policies, so they set a higher bar for RL-induced policies. This could explain why two studies found mixed results where RL-induced policies outperformed some but not all baselines. Moreover, Rafferty et al. (2016a) compared the same policies on multiple concept learning tasks, and while their POMDP policies were generally better than the random baseline policies, there was no one POMDP policy that outperformed baseline policies for all concept learning tasks. This study indicates that even though RL-induced policies may be effective, the same model may not be optimal for all tasks.

Sequencing Interdependent Content

This cluster focuses on sequencing content under the assumption that different areas of content are interdependent. The studies in this cluster are shown in Table 5. The sequencing task here is closest to traditional “curriculum sequencing,” or ordering various content areas for a given topic. However, unlike traditional curriculum sequencing, the ordering of content can be personalized and adaptive, for example based on how well students have mastered various pieces of content. While concept learning tasks also have interdependent content, the goal there is to teach a single underlying concept. In this cluster, the goal is to teach a broader scope of content under the assumption that how the content is sequenced affects students’ ability to learn future content. An instructional policy in this context must implicitly answer questions like the following: When teaching students how to make a fraction from the number line, when should we move on to the next topic and what should that topic be? Should the next topic depend on how well the student answered questions about the number line? If the student is struggling with the next topic, should we go back and teach some prerequisites that the student might have missed? When should we review a content area that we have given the student previously?

Table 5 Summary of all empirical studies in the sequencing interdependent content cluster. Details are as described in the caption of Table 3

For these studies, typically a network specifying the relationship between different content areas or KCs (such as a prerequisite graph) must either be prespecified or automatically inferred from data. Appendix B describes one of our studies performed in a fractions tutoring system where the relationships between different KCs were automatically inferred from data. As we see from Table 2, the studies in this cluster have been the least successful, with all of them resulting in either a mixed result or no significant difference between policies. We analyze why this might be in the next section.

Sequencing Activity Types

While the previous three clusters of studies were defined by the way various pieces of content did or did not depend on each other, this cluster is about sequencing the types of activities students engage with rather than the content itself. The studies in the sequencing activity types cluster are shown in Table 6. These studies used RL to determine what activity type to give at any given time for a fixed piece of content, based on the content being taught and the work that the student has done so far. For example, Shen and Chi (2016b), Zhou et al. (2017), Shen et al. (2018a), and Shen et al. (2018b) all consider how to sequence worked examples and problem solving tasks. Similarly, Chi et al. (2009, 2010a) consider, for each step, whether the student should be told the solution or asked to provide it, and, in either case, whether the student should be asked to justify the solution. Notice that Chi et al. (2009, 2010a) consider using RL for step-loop adaptivity, as opposed to the task-loop adaptivity considered by all of the other studies reported in this review.

Table 6 Summary of all empirical studies in the sequencing activity types cluster. Details are as described in the caption of Table 3

For the studies that use RL to sequence worked examples and problem solving tasks, we note the existence of an expertise-reversal effect (Kalyuga et al. 2003), whereby novices benefit more from reviewing worked examples while experts benefit more from problem solving tasks. This suggests an ordering where worked examples are given prior to problem solving tasks (for learners who are initially novices). Renkl et al. (2000) have further shown that fading steps of worked examples over time, such that students fill in incomplete steps of worked examples until they solve problems on their own, is more beneficial than simply pairing worked examples with problem solving tasks. Thus, in this setting, we know that the sequence of instructional activities can make a difference, which could help explain the relative empirical success of studies in this cluster.
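As a purely hypothetical illustration, a heuristic policy motivated by the expertise-reversal effect and example fading might look like this, with invented thresholds and a simple running competence estimate:

```python
# Hypothetical heuristic motivated by the expertise-reversal effect and
# example fading: start novices on worked examples, fade to partially
# completed examples, then move to full problem solving as a running
# competence estimate grows. Thresholds and the update rule are invented.

def choose_activity(competence):
    """Map an estimated competence in [0, 1] to an activity type."""
    if competence < 0.3:
        return "worked_example"
    if competence < 0.7:
        return "faded_example"   # learner fills in the missing steps
    return "problem_solving"

def update_competence(competence, correct, rate=0.2):
    """Nudge the estimate toward 1 on success and toward 0 on failure."""
    target = 1.0 if correct else 0.0
    return competence + rate * (target - competence)

competence = 0.0
history = []
for correct in [True] * 8:   # a learner who answers everything correctly
    history.append(choose_activity(competence))
    competence = update_competence(competence, correct)
print(history)
```

Heuristics of this shape are the kind of "more reasonable" baseline, discussed below, against which RL-induced activity-type policies have rarely been compared.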

In general, most of the studies in this cluster found either that RL-induced policies significantly outperformed baseline policies (four out of ten) or that there was an aptitude-treatment interaction favoring the RL-induced policy (four out of ten). However, the studies in this cluster often compared to a policy that randomly sequenced tasks. Thus, it is not known if the RL-induced adaptive policies explored in this cluster would do better than a more reasonable heuristic (e.g., as suggested by the expertise-reversal effect). Future work in this area is needed to determine whether RL is useful in inducing adaptive policies for sequencing activity types beyond heuristic techniques, or if RL can simply help find one of many decent policies that can outperform randomly sequencing activity types.

Maximizing Other Objectives

There are two studies that do not fit into any of the previous four clusters, because they do not optimize for how much or how fast students learn (see Table 7). Beck et al. (2000) sequence instructional activities in an intelligent tutoring system with the goal of minimizing the time spent per problem, which their resulting policy achieved. While minimizing the time per problem could result in teaching students faster, it could also lead to the policy choosing instructional activities that are less time consuming (but not necessarily beneficial for student learning). Mandel et al. (2014) try to maximize the number of levels completed in an educational game, and their RL policy does significantly increase the number of levels completed over both a random policy and an expert-designed baseline policy. While interesting, these two papers do not shed light on whether RL can be used to significantly improve student learning over strong baseline policies.

Table 7 Summary of all empirical studies in the maximizing other objectives cluster. Details are as described in the caption of Table 3

Discussion: Where’s the Reward?

We now turn to analyzing what the results of this review tell us about how impactful RL has been in the domain of instructional sequencing, and when and where it might be most impactful. We discuss a few factors which we believe have played a role in determining the success of RL-based approaches.

Leveraging Psychological Learning Theory

Our results suggest that RL has been more successful in more constrained and limited settings. For example, the cluster where RL has been most successful is paired-associate learning tasks, where pieces of content are treated as independent of one another. RL has also been relatively successful in sequencing for concept learning tasks, which are typically constrained tasks designed for understanding aspects of cognition in lab studies rather than authentic tasks in traditional classroom settings. Moreover, RL has been relatively successful in sequencing activity types, where the agent must typically choose between only two or three actions. However, when it comes to sequencing interdependent content, there is not yet evidence that RL can induce instructional policies that are significantly better than reasonable baselines. This could be partly because, when content is assumed to be interrelated, the student’s state may be a complicated function of the history of activities done so far, and estimating the parameters of such a model may require an inordinate amount of data.

We believe the relative success of RL in some of these clusters over others could, at least in part, be explained by the ability to draw on psychological learning theory. As mentioned earlier, for both paired-associate learning and concept learning tasks, the models that were used were informed by the psychology literature. On the other hand, for sequencing activity types and interdependent content, the models used were solely data-driven. Moreover, in the case of paired-associate learning tasks, we noted that as psychological models became more sophisticated over time, the instructional policies induced from them also became more successful, to the point that policies from more sophisticated psychological models sometimes outperformed policies from simpler models (see Section “Paired-Associate Learning Tasks” for more details). We also noted that an instructional policy derived from the ARTS model (a psychological model that was not fit to data) outperformed an instructional policy derived from the data-driven model developed by Atkinson (1972b). Thus, in some cases, a good psychological theory might be more useful for finding good instructional policies than a data-driven model that is less psychologically plausible.

In addition, for paired-associate learning tasks and sequencing activity types, there are well-known results from psychology and the learning sciences that show that sequencing matters: the spacing effect (Ebbinghaus 1885) and the expertise-reversal effect (Kalyuga et al. 2003), respectively. On the other hand, for sequencing interdependent content, we do not yet have domain-general principles from the learning sciences that tell us whether and how sequencing matters.

Thus, psychology and the learning sciences can give us insights both into how to make RL more likely to succeed in finding good instructional policies and into when we might hypothesize that the precise sequencing of instructional activities matters. Settings that have been more extensively studied by psychologists—and hence where we have better theories and principles to rely upon—are often more constrained, because such settings are easier for psychologists to tackle. But this does not mean RL should only be used in simple, unrealistic settings. Rather, it suggests that we should leverage existing theories and principles when using RL, rather than simply taking a data-driven approach. We explore this idea further in Section “Planning for the Future”.

Prior Knowledge

RL may have more room for impact in instructional settings where students are learning material for the first time, because students have more room to learn and because there is less variance in students’ experiences. Almost all of the paired-associate learning tasks are in domains where students have never learned the material before, such as foreign language learning. In many of these studies, researchers specifically recruited students who did not have expertise in the foreign language. The same holds for concept learning tasks, where students are learning a concept that is artificially devised and, as such, new to the student. Moreover, many of the studies in the sequencing activity types cluster were also teaching content to students for the first time. For example, Chi et al. (2009, 2010a) explicitly recruited students who had taken high school algebra but not college physics (which is what their dialogue-based ITS covered). Zhou et al. (2017), Shen and Chi (2016b), and Shen et al. (2018a, 2018b) all ran experiments in a university course on discrete mathematics, where the ITS was actually used to teach course content to the students. This may also explain why many of these studies found an aptitude-treatment interaction in favor of low-performing students: students who have more room to improve can benefit more from a better instructional policy than students who have more prior knowledge. On the other hand, almost all of the studies in the sequencing interdependent content cluster were on basic K-12 math skills, where the student was also presumably learning the content outside of using the systems in the studies. The only exceptions were the lab studies run by Green et al. (2011) with university students, which actually showed that RL-induced policies outperformed random policies but not expert hand-crafted or heuristic baselines.

When students are learning material for the first time, there is also less variance in students’ initial state, which makes RL-based approaches more likely to find policies that work for many students from prior data. Furthermore, in many of these cases, students are only exposed to the content via the RL policy, often in a single lab session, rather than learning the content through other materials. This again reduces the variance in the effect of an RL policy and makes it easier to estimate a student’s state. Indeed, only three out of 15 studies that were run in classroom settings found an RL-induced policy was significantly better than baselines, and four found aptitude-treatment interactions.

Baseline Policies

Another factor that might explain why some studies were more likely to obtain significant results is the choice of baseline policies. Among the 24 studies that found a significant effect or aptitude-treatment interaction, 17 (71%) compared adaptive RL-induced policies to a random baseline policy and/or to other RL-induced policies that had not been shown to perform well, rather than to state-of-the-art baselines. By contrast, among the studies that did not find a significant effect, only six (35%) compared solely to random or RL-induced baseline policies. This suggests that while much of this evidence shows the ordering of instructional activities matters, it gives little insight into whether RL-based policies lead to substantially better instructional sequences than relying on learning theories and experts for sequencing. Indeed, in some studies, researchers intentionally compared to baseline policies designed to perform poorly (e.g., by minimizing rewards according to an MDP), in order to determine whether instructional sequencing has any effect on student learning whatsoever (Chi et al. 2010a; Lin et al. 2015; Geana 2015).

Of course, random sequencing is not always unreasonable; in some cases, a random baseline may actually be a fairly decent policy. For instance, when the policy must decide whether to assign worked examples or problem solving tasks, both actions have been shown to be beneficial in general, and hence a policy that sequences them randomly is thought to be reasonable (Zhou et al. 2017; Shen et al. 2018a). Moreover, in paired-associate learning tasks, random policies may be reasonable because they happen to space items fairly evenly. However, given that we now have better heuristics for sequencing worked examples and problem solving tasks (Kalyuga et al. 2003; Kalyuga and Sweller 2005), as well as for paired-associate learning tasks (Pavlik and Anderson 2008; Lindsey et al. 2014), it would be useful to compare RL-induced policies to these more advanced baselines.

The most successful cases of demonstrating that RL-induced policies can outperform reasonable baselines are in the context of paired-associate learning tasks. Lindsey et al. (2014) compared their policy against both a policy that spaced units of vocabulary words over time and a policy that blocked units of vocabulary words. Pavlik and Anderson (2008) compared their policy against a heuristic that learners might naturally use when learning with flashcards. However, even in this context, there are more sophisticated (but not data-driven) algorithms that are commonly used in flashcard software, such as the Leitner system (Leitner 1972) and SuperMemo (Wozniak 1990). Future work should consider comparing to some of these state-of-the-art baselines to determine if RL-induced policies can improve upon current educational practice.
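To make concrete what such a non-data-driven baseline looks like, the following is a minimal sketch of the Leitner system. The box structure and promotion/demotion rules are the standard ones; the review schedule (box b eligible every 2^b steps) is a simplified placeholder, not the schedule of any particular flashcard software.

```python
# Minimal sketch of the Leitner system: each card sits in a numbered box;
# a correct answer promotes the card one box, an error sends it back to
# box 0, and higher boxes are reviewed less often.

class LeitnerScheduler:
    def __init__(self, cards, n_boxes=3):
        self.boxes = {c: 0 for c in cards}  # card -> current box index
        self.n_boxes = n_boxes

    def next_card(self, step):
        # Simplified schedule: box b is eligible for review every 2**b
        # steps; among eligible cards, pick one from the lowest box.
        eligible = [c for c, b in self.boxes.items() if step % (2 ** b) == 0]
        if not eligible:
            eligible = list(self.boxes)
        return min(eligible, key=lambda c: self.boxes[c])

    def record(self, card, correct):
        if correct:
            self.boxes[card] = min(self.boxes[card] + 1, self.n_boxes - 1)
        else:
            self.boxes[card] = 0
```

A usage round might look like `card = sched.next_card(step)` followed by `sched.record(card, correct)` after the learner responds; an RL-induced policy would need to beat schedulers of roughly this form to claim an improvement over current flashcard practice.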

Robust Evaluations

Several of the studies that were successful in using RL performed some kind of robust evaluation to assess, in advance of the study, whether the proposed policy was likely to yield benefits given uncertainty about how students learn. Lindsey et al. (2014) justified their use of a greedy heuristic policy with simulations from prior work (Khajah et al. 2014) showing that the heuristic policy can be approximately as good as the optimal policy under two different cognitive models (ACT-R and MCM). Rafferty et al. (2016a) likewise ran simulations to evaluate how well various policies would perform under three different models of concept learning. Although they ultimately tested all of the simulated policies on actual students (to better understand how effective the various models and policies were), this kind of robust evaluation could have informed which policy to deploy had they not wanted to test them all. These techniques are specific instances of a method we proposed in prior work called the robust evaluation matrix (REM), which involves simulating each instructional policy of interest under multiple plausible models of student learning that were fit to previously collected data (Doroudi et al. 2017a). Mandel et al. (2014) used importance sampling, a technique that can give an unbiased estimate of the value of a policy without assuming any particular model is true, to choose a policy to run in their experiment. On the other hand, several of the studies that did not show a significant difference between adaptive policies and baseline policies, including one of our own, used only a single model to simulate how well the policies would do, and that model overestimated the performance of the adaptive policy (Chi et al. 2010a; Rowe et al. 2014; Doroudi et al. 2017a).
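The idea behind the robust evaluation matrix can be sketched in a few lines: evaluate each candidate policy under several plausible student models and prefer a policy that does well under all of them. The two toy models (recall with and without forgetting) and two toy policies (blocked vs. spaced practice) below are illustrative stand-ins, not the models or policies used in the studies above.

```python
# Robust evaluation matrix (REM) sketch: one expected outcome per
# (policy, model) cell; a robust choice maximizes the worst case.

N_ITEMS, N_STEPS = 5, 30

# Model 1: recall depends only on how often an item was practiced.
def p_recall_no_forgetting(practice_times, test_time):
    s = len(practice_times)
    return s / (s + 1.0)

# Model 2: practice strength decays with time elapsed before the test.
def p_recall_forgetting(practice_times, test_time, decay=0.9):
    s = sum(decay ** (test_time - t) for t in practice_times)
    return s / (s + 1.0)

# Policy A: blocked practice (all repetitions of an item in a row).
def policy_blocked(t):
    return t // (N_STEPS // N_ITEMS)

# Policy B: spaced practice (cycle through the items).
def policy_spaced(t):
    return t % N_ITEMS

def expected_test_score(policy, model):
    """Expected number of items recalled at a test after N_STEPS practices."""
    times = {i: [] for i in range(N_ITEMS)}
    for t in range(N_STEPS):
        times[policy(t)].append(t)
    return sum(model(times[i], N_STEPS) for i in range(N_ITEMS))

policies = {"blocked": policy_blocked, "spaced": policy_spaced}
models = {"no_forgetting": p_recall_no_forgetting,
          "forgetting": p_recall_forgetting}

# The matrix itself: rows are policies, columns are student models.
rem = {p: {m: expected_test_score(pf, mf) for m, mf in models.items()}
       for p, pf in policies.items()}

# A robust choice: best worst-case score across the candidate models.
robust_policy = max(rem, key=lambda p: min(rem[p].values()))
```

Under the no-forgetting model the two policies tie (each item is practiced equally often), while under the forgetting model spaced practice scores higher, so the robust choice favors spacing; the value of the matrix is precisely that it surfaces such disagreements among plausible models before a policy is deployed on students.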

Of course, even robust evaluations are limited by the models considered when doing the evaluation. For example, in our second experiment reported in Appendix B, we used REM to identify a simple instructional policy that was expected to outperform a baseline according to several different models. However, our experiment showed no significant difference between the adaptive policy and the baseline. Post-hoc analyses helped us identify two factors that we had not adequately accounted for in our robust evaluations: (1) the student population in this experiment was quite different from the population in our past data that we used to fit the models, and (2) the order in which problems were presented was quite different from the order in our prior experiments. Despite the null experimental result, these evaluations led to insights about what aspects our models were not adequately considering, which could inform future studies and the development of better models of student learning.


In short, it appears that reinforcement learning has yielded more benefits to students when one or more of the following conditions held:

  • the sequencing problem was constrained in one or more ways (e.g., simple learning task with restricted state space or restricted set of actions),

  • statistical models of student learning were inspired by psychological theory,

  • principles from psychology or the learning sciences suggested the importance of sequencing in that setting,

  • students had fairly little prior knowledge coming in (but enough prior knowledge such that they could learn from the software they were interacting with),

  • RL-induced policies were compared to relatively weak baselines (such as policies that randomly present actions or policies that were not expected to perform well), and

  • policies were tested in more robust and principled ways before being deployed on students.

This gives us a sense of the various factors that may influence the success of RL in instructional sequencing. Some of these factors suggest best practices which we believe might lead to more successfully using RL in future work. Others suggest practices that are actually best to avoid—such as using weak baseline policies when stronger baselines are available—in order to truly determine if RL-induced policies are beneficial for students. We now turn to how we can leverage some of these best practices in future work.

Planning for the Future

Our review of the empirical literature suggests that one exciting potential direction is to further combine data-driven approaches with psychological theories and principles from the learning sciences. Theories and principles can help guide (1) our choice of models, (2) the action space under consideration, and (3) our choice of policies. We briefly discuss the prospects of each of these in turn.

Psychological theory could help inform the use of reasonable models for particular domains, as has been done in the case of paired-associate learning tasks and concept learning tasks in the literature. These models can then be learned and optimized using data-driven RL techniques. Researchers should consider how psychological models can be developed for educationally relevant domains beyond just paired-associate and concept learning tasks. Indeed, such efforts could be productive both in improving student learning outcomes in particular settings and in testing and contextualizing existing or newly developed theories.

Our results also suggest focusing on settings where the set of actions is restricted but still meaningful. For example, several of the studies described above consider the problem of sequencing worked examples and problem solving tasks, which meaningfully restricts the decision problem to two actions in an area where we know the sequence of tasks makes a difference (Kalyuga et al. 2003).

Finally, learning sciences principles can potentially help constrain the space of policies as well. For example, given that the expertise-reversal effect suggests that worked examples should precede problem solving tasks and that it is best to slowly fade away worked example steps over time, one could consider using RL to search over the space of policies that follow such a structure. This could mean that, rather than deciding at each time step what activity type to give to the student, the agent would simply need to decide when to switch to the next activity type. The expertise-reversal effect also suggests such switches should be based on the cognitive load on the student, which in turn can guide the representation used for the state space. Such policies have been implemented in a heuristic fashion in the literature on faded worked examples (Kalyuga and Sweller 2005; Salden et al. 2010; Najar et al. 2016), but researchers have not yet explored using RL to automatically find policies in this constrained space. Related to this, the learning sciences literature could suggest stronger baseline policies with which to compare RL-induced policies, as discussed in Section “Baselines”.
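One way to picture such a constrained policy space: instead of searching over arbitrary activity sequences, parameterize policies by a single cognitive-load threshold at which worked-example steps are faded in favor of problem solving, and let a data-driven search pick the threshold. The student simulator and load dynamics below are hypothetical stand-ins, not a validated model of learning.

```python
import random

# Hypothetical sketch: the policy space for faded worked examples is
# reduced to one parameter, the load threshold at which the policy
# switches from worked examples to problem solving.

def simulate_student(threshold, rng, n_steps=20):
    """Return a simulated learning outcome under one switch threshold."""
    load, skill = 1.0, 0.0
    for _ in range(n_steps):
        if load > threshold:
            # High load: present a worked-example step (lowers load,
            # yields modest learning).
            load *= 0.8
            skill += 0.02
        else:
            # Low load: present problem solving (more learning when the
            # student succeeds, but it raises load).
            if rng.random() < skill + 0.3:
                skill += 0.05
            load = min(1.0, load + 0.05)
    return skill

def best_threshold(candidates, n_students=500, seed=0):
    """Search the one-parameter policy space by simulation."""
    rng = random.Random(seed)
    scores = {th: sum(simulate_student(th, rng) for _ in range(n_students))
              for th in candidates}
    return max(scores, key=scores.get)
```

Because the search space is a single interpretable parameter rather than a full sequencing policy, far less data is needed to fit it, and the learned policy remains consistent with the qualitative structure the expertise-reversal effect prescribes.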

As the psychology and learning sciences literature identify more principles and theories of sequencing, such ideas can be integrated with data-driven approaches to guide the use of RL in instructional sequencing. Given that deep reinforcement learning has been gaining lots of traction in the past few years and will likely be increasingly applied to the problem of instructional sequencing, it seems especially important to find new ways of meaningfully constraining these approaches with psychological theory and learning sciences principles. A similar argument was made by Lindsey and Mozer (2016) when discussing their successful attempts of using a data-driven psychological model for instructional sequencing: “despite the power of big data, psychological theory provides essential constraints on models, and … despite the success of psychological theory in providing a qualitative understanding of phenomena, big data enables quantitative, individualized predictions of learning and performance.”

However, given that finding a single plausible psychological model might be difficult in more complex settings, a complementary approach is to explicitly reason about robustness with respect to the choice of the model. Of course, such robust evaluations are not silver bullets and they can make inaccurate predictions, but even if the results do not match the predictions, this can help prompt new research directions in understanding the limitations of the models and/or instructional policies used.

Beyond these promising directions and suggestions, we note that the vast majority of the work we have reviewed consists of system-controlled methods of sequencing instruction that target cognitive changes. However, for data-driven instructional sequencing to have impact, we may need to consider broader ways of using instructional sequencing. The following are meant to be thought-provoking suggestions for consideration that build on current lines of research in the artificial intelligence in education community. In line with our recommendation to combine data-driven and theory-driven approaches, a common theme in many of these ideas is to combine machine intelligence with human intelligence, whether in the form of psychological theories, student choice, or teacher input.

Learner Control

In this review, we have only considered approaches where an automated instructional policy determines all decisions about what a learner should do. However, allowing for student choice could make students more motivated to engage with an instructional system (Fry 1972; Kinzie and Sullivan 1989) and may benefit from the learner’s own knowledge of their current state. Among the studies reported in our empirical review, only Atkinson (1972b) compared an RL-induced policy to a fully learner-controlled policy, and he found that while the learner-controlled policy was 53% better than random, it was not as good as the RL-induced policy (108% better than random). While this result was taken in favor of system-controlled policies, Atkinson (1972a) suggested that while the learner should not have complete control over the sequencing of activities, there is still “a place for the learner’s judgments in making instructional decisions.”

There are a number of ways in which a machine’s instructional decisions could be combined with student choice. One is for the agent to make recommendations about what actions the student should take, but ultimately leave the choice up to the student. This type of shared control has been shown to successfully improve learning beyond system control in some settings (Corbalan et al. 2008). Green et al. (2011) found that expert policies do better than random policies, regardless of whether either policy made all decisions or gave the student a choice of three actions to take. Cumming and Self (1991) also describe such a form of shared control in their vision of “intelligent educational systems,” where the system is a collaborator to the student rather than an instructor. A related approach is to give students the freedom to select problems, but have the system provide feedback on students’ problem-selection decisions, which Long and Aleven (2016) showed can lead to higher learning gains than system control. Another approach would be for the agent to make decisions where it is confident its action will help the student, and leave decisions that it is less confident about up to the student. RL-induced policies could also take learner decisions and judgments as inputs to consider during decision making (e.g., as part of the state space). For instance, Nelson et al. (1994) showed that learners can effectively make “judgments of learning” in paired-associate learning tasks, and remarked that judgments of learning could be used by MDPs to make instructional decisions for students. Such a form of shared control has recently been considered in the RL framework for performance support (Javdani et al. 2018; Reddy et al. 2018; Bragg and Brunskill 2019), but has not been considered in the context of instructional sequencing to our knowledge.
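The confidence-based variant of shared control admits a very simple sketch: act autonomously only when the system's value estimates clearly separate one activity from the rest, and otherwise offer the top-ranked activities to the student as a choice. The value estimates, the margin rule, and the choice-set size below are hypothetical placeholders, not a mechanism proposed in the studies cited above.

```python
# Hypothetical shared-control rule: the system decides on its own only
# when its estimated values clearly favor one activity; otherwise it
# hands the student a short menu of top-ranked activities.

def shared_control_decision(q_values, margin=0.1, menu_size=3):
    """q_values: dict mapping each candidate activity to its estimated
    value for this student (e.g., predicted learning gain)."""
    ranked = sorted(q_values, key=q_values.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if q_values[best] - q_values[runner_up] >= margin:
        return ("system", best)              # confident: system decides
    return ("student_choice", ranked[:menu_size])  # uncertain: student picks
```

Judgments of learning could plug into the same scheme by shrinking `margin` (deferring more often) for students whose self-assessments have historically been accurate.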

Teacher Control

Building on the previous point, sometimes when an instructional policy does not know what to do, it could inform the teacher and have the teacher give guidance to the student. For example, Beck and Gong (2013) have shown that mastery learning policies could lead to “wheel-spinning,” where students cannot learn a particular skill, perhaps because the policy cannot give problems that help the student learn. Detectors have been designed to detect when students are wheel-spinning (Gong and Beck 2015; Matsuda et al. 2016). These detectors could then relay information back to teachers, for example through a teacher dashboard (Aleven et al. 2016b) or augmented reality analytics software (Holstein et al. 2018), so that teachers know to intervene. In these cases, an RL agent could encourage the teacher to pick the best activity for the student to work on (or recommend a set of activities for the student to choose from). Finding the right balance between learner-control, teacher-control, and system-control is an open and important area of research in instructional sequencing.

Beyond the Cognitive

Almost all of the empirical studies we have reviewed used cognitive models of learning that were designed to lead to cognitive improvements in learning (e.g., how much students learned or how fast they learned). However, RL could also take into account affective, motivational, and metacognitive features in the state space and could also be used in interventions that target these non-cognitive aspects of student learning by incorporating them into reward functions. For example, could a policy be derived to help students develop a growth mindset or to help students develop stronger metacognitive abilities? While detecting affective states is a growing area of research in educational data mining and AIED (Calvo and D’Mello 2010; Baker et al. 2012), only a few studies have considered using affective states and motivational features to adaptively sequence activities for students (Aleven et al. 2016a). For example, Baker et al. (2006) used a detector that predicts when a student is gaming the system to assign supplementary exercises to students who exhibit gaming behavior, and Mazziotti et al. (2015) used measures of both the student’s cognitive state and affective state to determine the next activity to give the student. There has also been work on adaptive learning technologies that improve students’ self-regulatory behaviors, but this work has not aimed to improve self-regulation via instructional sequencing per se (Aleven et al. 2016a). While there is a risk that modeling metacognition or affect may be even harder than modeling students’ cognitive states in a reinforcement learning framework, there may be certain places where we can do so effectively, and the impact of such interventions might be larger than solely cognitive interventions.


We have shown that over half of the empirical studies reviewed found that RL-induced policies outperformed baseline methods of instructional sequencing. However, we have also shown that the impact of RL on instructional sequencing seems to vary depending on what is being sequenced. For example, for paired-associate learning and concept learning tasks, RL has been fairly successful in identifying good instructional policies, perhaps because for these domains, psychological theory has informed the choice of statistical models of student learning. Moreover, when determining the sequence of activity types, RL-induced policies have been shown to outperform randomly choosing activity types, especially for lower performing students. But for sequencing interdependent content, we have yet to see if a data-driven approach can drastically improve upon other ways of sequencing such as expert-designed non-adaptive curricula. While the order of content almost certainly matters for domains with interconnected content (like mathematics), it can be difficult to identify good ways to adaptively sequence content with typical amounts of data.

Even in the cases where RL has been successful, one caveat is that the baseline policies are often naïve (e.g., randomly sequencing activities) and may not represent current best practices in instructional sequencing. For this reason, it does not seem like RL-based instructional policies have significantly impacted educational practice to date. Some studies have shown that RL-induced policies can outperform more sophisticated baselines, but more work is needed in this area.

One of the key recommendations we have drawn from this review is that instructional sequencing can perhaps benefit most by discovering more ways to combine psychological theory with data-driven RL. More generally, we suggested a number of ways in which instructional sequencing might benefit by combining machine intelligence with human intelligence, whether in the form of theories from domain experts and psychologists, a teacher’s guidance, or the students’ own metacognition.

We conclude by noting that the process of using reinforcement learning for instructional sequencing has been beneficial beyond its impact on student learning. Perhaps the biggest success of framing instructional sequencing as a reinforcement learning problem has actually been its impact on the fields of artificial intelligence, operations research, and student modeling. As mentioned in our historical review, investigations in optimizing instruction have helped lead to the formal development of partially observable Markov decision processes (Sondik 1971; Smallwood and Sondik 1973), an important area of study in operations research and artificial intelligence. More recently, in some of our own work, the challenge of estimating the performance of different instructional policies has led to advancements in general statistical estimation techniques (Doroudi et al. 2017b) that are relevant to treatment estimation in healthcare, advertisement selection, and many other areas. Finally, in the area of student modeling, the robust evaluation matrix (Doroudi et al. 2017a) can help researchers not only find good policies but also discover the limitations of the models when a policy under-delivers. Not only should we use theories of learning to improve instructional sequencing; by trying to improve instructional sequencing, we may also gain new insights about how people learn.