1 Introduction

Adaptive learning systems, a type of intelligent tutoring system, are digital tools designed to mimic the experience of studying with a human tutor (VanLehn 2006). Like a human tutor, such a system can respond and adapt to a learner, offering suitable practice problems (e.g. targeting a particular skill that the learner is yet to fully master, or prompting the retrieval of information that the learner is likely to forget otherwise) and providing situation-specific feedback (e.g. guiding the learner towards the next step in a multi-step problem-solving process). Adaptive learning systems typically maintain an internal model of the learner and the task, which they update and refine by recording and interpreting the learner’s interactions with the interface (Desmarais and Baker 2012). Examples of such models include symbolic cognitive models that describe the procedural and declarative knowledge required to perform the learning task, like those used in cognitive tutors (Anderson et al. 1995; Koedinger and Corbett 2006), as well as models based on item response theory that quantify the difficulty of items relative to the learner’s ability (e.g. Wauters et al. 2010; Klinkenberg et al. 2011; Pelánek et al. 2017).

The success of adaptive learning systems critically depends on having a sufficiently accurate model of the learner’s knowledge state, so that the learner is offered an appropriately challenging learning experience proportionate to their current ability. However, whenever a new learner uses an adaptive learning system for the first time, or when a returning learner first studies new materials, there is a “cold-start” problem: without prior knowledge to fall back on, the system does not yet know how well the learner is likely to perform (Park et al. 2018; Pliakos et al. 2019). During this cold-start period, there is likely to be a misalignment between the system’s internal model of the learner and the learner’s actual ability, which can negatively affect the learning process. For example, the system might provide misguided feedback or present problems that are too easy or too hard, thereby delaying the benefits of adaptivity. Not only could misalignment during the initial cold-start period cause the learning system to make suboptimal tutoring choices, it could also lead to learners disengaging from the system altogether if they feel the learning experience is frustratingly easy or hard (Shernoff et al. 2003; Kennedy et al. 2014; Hamari et al. 2016). Presenting problems that are too difficult, and thereby increasing learners’ error rate, can negatively affect learners’ subjective feeling of confidence and satisfaction (Holm and Wells 2023). Experiencing (repeated) errors can also increase the likelihood that learners quit practice (ten Broeke et al. 2022; Alamri et al. 2019). Presenting problems that are too easy, on the other hand, is also not desirable, since it leads to less efficient use of the initial study time (van der Velde et al. 2021) and increases the risk that learners disengage out of disinterest (Pelánek and Effenberger 2022). Mitigating the impact of the cold start is therefore an important consideration in the design of adaptive learning systems. The current study investigates cold-start mitigation in a large-scale naturalistic sample, using predictions based on prior learning data to inform the starting state of an adaptive fact learning system.

Broadly speaking, solutions to the cold-start problem in adaptive learning systems—and other adaptive or recommender systems—take one of two forms. Firstly, there are solutions that, rather than preventing a cold start from happening, focus on alleviating the impact of a cold start once it does occur. Relevant examples in adaptive learning systems include allowing large adjustments to the learner model in the initial stages of the learning session and becoming more conservative with changes as time goes on (e.g. Klinkenberg et al. 2011), as well as prioritising problems that will yield the largest reduction in uncertainty about a learner’s ability (e.g. Chen et al. 2000; Wauters et al. 2010).

Secondly, there are solutions that address the root cause of the cold start: the inaccuracy of a system’s initial estimates (Lika et al. 2014). These solutions use data collected before the current session to inform the starting estimates of the system. In the domain of adaptive learning systems, such solutions often involve predicting a learner’s ability beforehand, for instance, by clustering similar learners based on demographic features or earlier performance in other settings (Ayers et al. 2008; Nedungadi and Remya 2014; Park et al. 2018). This way, learners can start the session at a more appropriate difficulty level, ideally requiring only small adjustments from then on.

Learner-based individualisation of initial parameters has been demonstrated to improve model accuracy in a range of common student modelling paradigms, including in Knowledge Tracing (KT) models (Corbett and Anderson 1995; Pardos and Heffernan 2010; Nedungadi and Remya 2014; Eagle et al. 2018), deep learning-based KT models (Piech et al. 2015; Zhao et al. 2020), and item response theory-based models (Park et al. 2018; Pliakos et al. 2019).

Focusing on the learner’s ability as the main driver of cold-start issues is especially relevant if the relative difficulty of individual problems is reasonably well defined. This is often the case when problems test procedural knowledge—knowing how to do something. In domains that rely on structured procedural knowledge, such as arithmetic or programming, there is typically a logical way of defining the relative difficulty of individual problems (e.g. the number or complexity of steps required to get to a solution; Stocco and Anderson 2008; Anderson et al. 2012; Griffiths et al. 2015). Difficulty is harder to anticipate when the learning task is about declarative knowledge—facts—as is the case when learning vocabulary in a foreign language or studying neuroanatomy, for example. Although one can find reliable differences in memorability between facts, these are often the result of a complex mix of factors, such as lexical, semantic, and affective properties of words (Madan 2021) or diffuse features of a visual stimulus (Broers and Busch 2021).

When the adaptive learning system teaches declarative knowledge of unknown difficulty, a cold start therefore entails a period of sub-optimal learning, while the system figures out the difficulty of each fact. In this scenario, learners can benefit from cold-start mitigation solutions that predict fact difficulty, for example on the basis of prior observations from other learners studying the same material. This way, adaptive fact learning systems, which intelligently schedule the retrieval practice of facts to optimise learning, can immediately adapt their scheduling to the expected difficulty of each fact (Mozer and Lindsey 2016; van der Velde et al. 2021).

Knowing how difficult the facts are is not enough, however. Adaptive fact learning systems also have to contend with individual differences in learners’ ability to memorise facts. One can find general differences in the shape of individuals’ learning curves (Steyvers and Schafer 2020), as well as in the rate at which learners forget information (Sense et al. 2016). The rate of forgetting in particular appears to be a reliable individual trait with substantial variation between learners, one that can be predicted from resting state neural activity and functional connectivity (Zhou et al. 2020; Xu et al. 2021). For adaptive fact learning systems, this inter-individual variability means that some learners will consistently require more frequent repetitions of a fact than others to maintain a desired level of performance. In a cold-start scenario, it would take the learning system some time to recognise a learner’s ability. Initialising the system with a learner’s predicted ability could help mitigate the cold start (Wauters et al. 2010; Mozer and Lindsey 2016).

In a laboratory-based study, we previously tested several approaches to cold-start mitigation in an adaptive fact learning system that participants used to memorise place names on a map (van der Velde et al. 2021). In this system, learning happens through spaced retrieval practice. The adaptive component of the system is a continuous-valued Speed of Forgetting model parameter (SoF; denoted by the symbol \(\alpha \)) that tracks the memorability of each fact for each learner separately, enabling the system to intelligently schedule repetitions (see Sect. 2.3 for details).

Fig. 1

Illustration of cold-start conditions in the adaptive learning system. A The adaptive Speed of Forgetting (SoF) parameter of the memory model captures how quickly a fact is expected to be forgotten after initial encoding (see Eq. 2). By default, all items start out with an initial SoF of \(\alpha = 0.3\). Small differences in \(\alpha \) translate to large differences in the expected retention interval, illustrated here by the time until the expected probability of recall drops below 5% (see Eq. 11). B Distribution of mean accuracy scores and C median response times (correct responses only) on the first three practice attempts of an item, grouped by the SoF estimate reached at the end of practice (data from the training set described in Sect. 2.2). The correlation of both behavioural measures with the eventual \(\alpha \) estimate demonstrates that low-\(\alpha \) items are relatively easy to recall at the time of presentation, while high-\(\alpha \) items are relatively difficult to recall. Ideally, a warm start would ensure uniform recall difficulty. Due to the high variability in both behavioural measures, fast adaptation of \(\alpha \) on the basis of a single noisy measurement is infeasible; multiple responses are required to make an informed estimate of SoF. D A simulated example of model-based scheduling of item presentations within a learning session, for an assumed initial SoF of \(\alpha = 0.3\), with true SoF ranging from \(\alpha = 0.2\) (slow forgetting) to \(\alpha = 0.4\) (fast forgetting). Here, practice trials (vertical lines) occur when the probability of recall is expected to be 50%. A mismatch between assumed and true SoF means that the actual difficulty of responding correctly will be (much) lower or higher than anticipated. This pattern is indeed visible in B and C: recall of higher-\(\alpha \) items is less accurate and slower than that of lower-\(\alpha \) items

The SoF parameter normally starts at a default value of \(\alpha = 0.3\) and is then refined over time, with adjustments starting after an item has been practised three times. Figure 1 illustrates the cold-start conditions that this creates for the first several practice attempts of an item. The estimated SoF determines how quickly the model expects the memory representation of a fact to decay over time, with higher \(\alpha \) values expressing an expectation of faster forgetting (Fig. 1A). The modelled probability of recall affects the scheduling of item repetitions within the learning system: the system aims to repeat items when they have decayed substantially but are still retrievable.

Ideally, the system would perfectly compensate for individual differences in memorability in its scheduling, so that retrieval difficulty is uniform over practice trials. In reality, an underestimation of the true SoF by the model means that a fact will be repeated when its memory activation has decayed further than anticipated, making the repetition more difficult than intended. Conversely, overestimating the true SoF means that a fact will be repeated when it is still easily retrievable from memory. The effects of such under- and overestimation in a cold start are visible in the response data of learners using the system. Figure 1B shows that response accuracy on the first three practice attempts of each item is lower for more difficult items (i.e. those with a higher SoF estimate at the end of practice) than for easier items. Similarly, Fig. 1C shows that, provided retrieval is successful, learners take longer to recall more difficult items than easier items. At the same time, responses to the lowest-\(\alpha \) items are particularly accurate and fast, suggesting a relative lack of challenge compared to items of an average difficulty.

Figure 1D illustrates how this cold-start effect emerges from the learning system, by means of a simulation. Repetitions are scheduled on the basis of the expected probability of recall reaching a certain value. For simplicity, here we assume that items are repeated the moment they reach a 50% recall probability. (The scheduling in the real system is slightly more complex, but adheres to the same principle; see Sect. 2.3.) The first three practice trials (denoted by vertical lines) are scheduled under the default assumption that \(\alpha = 0.3\). When an item’s true SoF is actually \(\alpha = 0.4\), the second and third presentations come relatively late: instead of an expected recall probability of 50%, we should actually expect only a 21% chance that the learner recalls the fact on the second presentation, and only a 13% chance on the third presentation. When the true SoF of an item is \(\alpha = 0.2\) while the system assumes it to be \(\alpha = 0.3\), repetitions are scheduled so soon that the learner is very likely to recall the correct answer without much effort: recall probability becomes 79% and 87% on the second and third presentation, respectively. While slightly more conservative scheduling in the real learning system means that learners’ actual performance is better than in this simulated example, these effects qualitatively match the observed performance differences in Figs. 1B, C.

The high variability in early accuracy and response times among items that later turn out to have a similar difficulty also shows that it is not practically feasible to simply start adjusting \(\alpha \) sooner, as the uncertainty of a single noisy response is too high to make an informed assessment of difficulty. For the system to be able to present items on a schedule that achieves more uniform difficulty, we therefore require more accurate starting estimates for the SoF parameter.

In van der Velde et al. (2021), we initialised \(\alpha \) to a value predicted using prior learning data. Predictions were based on the learning history of the individual learner, observations from other learners on the same fact, or a combination of the two. We found that using fact-level predictions to initialise the model improved learning outcomes, but only if there was sufficient variability in difficulty among the facts being studied. We did not find any benefit from making learner-specific predictions, speculating that this was because the learners in the sample were too similar to one another. We therefore concluded that, given this context, knowing about the difficulty of a fact is more likely to be useful in predicting future performance than knowing about a learner’s general SoF.

Fig. 2

Visual summary of the current study. A The learning system completes a three-step practice loop in each trial. B Each learning sequence contains all trials of a learner L studying a fact F at times \(t_0,\dots ,t_n\) and yields a final Speed of Forgetting (SoF) estimate \({\hat{\alpha }}\). 80% of these estimates are used for training. C SoF predictions are made for the remaining 20% of learning sequences using different subsets of the training data. D Predictions are evaluated by (1) comparing final SoF estimates to predicted values (where \(\alpha _\emptyset \) is the default value of 0.3) and (2) simulating the model’s behavioural predictions when using predicted SoF as starting estimates

1.1 Current study

The current study examines the same cold-start mitigation methods as van der Velde et al. (2021) in a much larger and more heterogeneous real-world sample: almost 100 million adaptive fact learning trials from about 140 thousand secondary school students across the Netherlands, of different ages and at different educational levels, studying about 80 thousand foreign vocabulary items in two different languages. To determine the feasibility of these methods and assess whether fact-specific (“what”) information is still more important than learner-specific (“who”) information in this more heterogeneous sample, we performed a post hoc simulation study (see Fig. 2). Based on our analyses, we confirm that using difficulty estimates derived from previous learning data can considerably improve predictions of future performance, even when the data are collected in an applied setting that comes with additional sources of noise (Fischer et al. 2020).

The broader scope of the current study reveals greater potential benefits of some of the cold-start mitigation methods that failed to improve learning in the lab—including methods that use learner-specific predictions—due to more pronounced differences among learners and facts. Interestingly, and somewhat surprisingly, we still find that predictions involving fact-specific information outperform learner-specific predictions in terms of predictive accuracy, although in many cases the combination of the two does slightly better still (and we expect that the relative importance of learner-specific predictions may grow in even more diverse samples). These findings suggest that, on top of knowing the difficulty of the fact, there can indeed be value in knowing a learner’s general Speed of Forgetting when trying to alleviate the cold-start problem in adaptive fact learning.

2 Methods

2.1 Data set

We used a large retrieval practice data set collected in a real-world educational setting. Participants were secondary school students in the Netherlands who took English or French as a second language. Over a period of two school years (Summer 2018 to Summer 2020), we collected data from these students practising English and French vocabulary items using the online adaptive fact learning system SlimStampen (described in detail in a following section). The sample includes students from year groups 1 (age: 12; corresponds to grade 7 in the US) through 4 (age: 16; corresponds to grade 10 in the US), and from all three levels of secondary education in the Netherlands—pre-vocational (vmbo), general secondary (havo) and pre-university (vwo). The Ethics Committee Psychology of the University of Groningen granted approval for analysis of the anonymised learning data (study code: PSY-1920-S-0397).

The learning system was integrated into the teaching materials from educational publisher Noordhoff. Vocabulary items were grouped into lists corresponding to sections in the accompanying textbook. While there was a logical order to the lists, the decision of which list to practise at what time, and how often, was ultimately up to students and/or their teachers. However, once a student started a learning session, the scheduling of items within that session was entirely determined by the adaptive learning system.

The vocabulary items were paired associates: on each trial, a learner was shown a cue and asked to retrieve the corresponding answer. Trials could be open-answer or multiple-choice, though individual facts were always presented in the same format. Since the learning system is designed around the principle of spaced repetition, learners typically encountered a single fact multiple times across one or more learning sessions. The data set includes the response accuracy (correct or incorrect) and response time (measured from cue onset to the first key/button press) on each trial. The full data set contained 80,845,692 trials from the English course and 34,158,787 trials from the French course.

2.2 Data preprocessing

Due to limitations on data storage, facts did not have persistent unique identifiers; they were only uniquely identifiable when incorporating the time stamp of the learning session. In addition, the cue and expected answer were not stored, only the answer given by the student, and its correctness. Where possible, we merged facts with different time stamps that were otherwise identical (i.e. they had the same book chapter identifier, list identifier, within-session fact identifier, and inferred correct answer). As a correct response was necessary for identifying a fact, we were unable to uniquely identify facts that a student never answered correctly within a given session, resulting in a loss of 41,725 trials from the English data and 174,823 trials from the French data.

We analysed the data for English and French separately. There was likely a degree of overlap in the participant sample between the two language courses, since many Dutch students take both English and French. However, we could not identify participants across languages from their anonymised IDs, since these were assigned separately for each course.

We use the term learning sequence to denote the set of all encounters that a learner had with a given fact across one or more sessions (see Fig. 2B). Learning sequences consisting of only one or two trials were excluded from the analysis (too little information for the algorithm to identify individual differences in difficulty; a loss of 11,537,654 trials from the English data and 2,394,514 trials from the French data), as were sequences of more than 25 trials (the high number of repetitions indicating a lack of task focus; a further reduction by 1,302,150 English trials and 1,588,624 French trials).

Together, the preprocessing steps removed 15.9% (12,839,804 trials) and 11.7% (3,983,138 trials) from the English and French data, respectively. Table 1 provides a summary of the filtered data. The filtered data were randomly split into a training set and a test set: 80% of learning sequences were assigned to the training set, and the remaining 20% to the test set. Although any single learning sequence only appeared in one of the two sets, there was sufficient overlap in learners and facts that the vast majority of them were represented in the test set, as shown in Table 1. The number of trials per learning sequence was right-skewed, with a median of 3 in the English data and a median of 4 in the French data.
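
These preprocessing steps can be sketched in Python as follows. This is a minimal sketch: the column names are illustrative assumptions rather than the actual schema of the data export, and the random seed is arbitrary.

```python
import pandas as pd

trials = pd.read_csv("trials.csv")  # one row per practice trial (assumed schema)

# A learning sequence is the set of all trials of one learner studying one fact.
seq_size = trials.groupby(["learner_id", "fact_id"])["learner_id"].transform("size")

# Keep sequences of 3 to 25 trials, as described above.
filtered = trials[(seq_size >= 3) & (seq_size <= 25)]

# Random 80/20 split at the level of learning sequences.
sequences = filtered[["learner_id", "fact_id"]].drop_duplicates()
train_keys = sequences.sample(frac=0.8, random_state=1)
test_keys = sequences.drop(train_keys.index)
train = filtered.merge(train_keys, on=["learner_id", "fact_id"])
test = filtered.merge(test_keys, on=["learner_id", "fact_id"])
```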

Table 1 Total number of learning sequences, trials, facts, and learners included in the analysis

2.3 Adaptive fact learning system

The data were collected through the SlimStampen adaptive fact learning system (van Rijn et al. 2009; Sense et al. 2016). This system was also used in the van der Velde et al. (2021) laboratory study. The learning system implements a spaced retrieval algorithm for adaptive within-session scheduling of facts (see Fig. 2A). The aim of the algorithm is to test learners on facts that they are about to forget, but before they actually forget them, so that they simultaneously benefit from the spacing effect (i.e. longer intervals between repetitions improve retention; Dempster 1988) and the testing effect (i.e. successful retrievals improve retention; van den Broek et al. 2016).

The system works by maintaining a model of the learner’s memory in which each fact is represented by a separate chunk. Following the ACT-R model of declarative memory (Anderson 2007), chunks have an activation A consisting of a summation of decaying traces, one for each of the times the chunk is encountered (i.e. whenever the learner encodes or retrieves the fact). The activation of a chunk x with encounters at \(t_0,..., t_n\) seconds ago is:

$$\begin{aligned} A_x(t) = \ln \Big (\sum ^{n}_{j = 0} t_j^{-d_x(t)}\Big ) \end{aligned}$$
(1)

At the onset of each trial, the system chooses which fact to present. If one or more facts are projected to fall below the retrieval threshold within the next fifteen seconds, the system selects the one with the lowest projected activation. Otherwise, it introduces a new fact into the session.

The rate at which the activation of a chunk decays varies, partly due to the activation of the chunk at the time of its most recent encounter  (such activation-dependent decay means that a new trace added to a chunk that is already highly active decays more quickly; Pavlik and Anderson 2005), and partly due to a variable Speed of Forgetting parameter (\(\alpha \)) specific to the learner and fact. The latter allows the model to represent differences in memorability that arise from variability in learners’ ability and/or the difficulty of facts. The more difficult a learner finds a fact to remember, the higher the \(\alpha \) should be, so that the fact decays faster and is therefore repeated sooner. The decay d of a chunk x, with SoF parameter \(\alpha \) and scaling constant c, is defined as follows:

$$\begin{aligned} d_x(t) = c * e^{A_x(t_{n-1})} + \alpha _x \end{aligned}$$
(2)

By default, the system starts with the same SoF estimate of \(\alpha = 0.3\) for all facts and then updates the estimates individually on the basis of the learner’s responses.
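
For concreteness, Eqs. (1) and (2) can be sketched as follows. The scaling constant c is not reported in this paper, so the value used below is a placeholder assumption, and the activation at the previous encounter is passed in directly rather than being computed through the full recursive bookkeeping of the actual system.

```python
import math

def decay(prev_activation: float, alpha: float, c: float = 0.25) -> float:
    """Decay rate of a chunk (Eq. 2); c = 0.25 is an assumed placeholder."""
    return c * math.exp(prev_activation) + alpha

def activation(ages: list[float], d: float) -> float:
    """Chunk activation (Eq. 1); ages are seconds since each encounter."""
    return math.log(sum(t ** -d for t in ages))

# Example: a fact encountered 60, 30, and 10 s ago with the default SoF of 0.3.
d = decay(prev_activation=-0.5, alpha=0.3)
print(activation([60.0, 30.0, 10.0], d))
```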

The \(\alpha \) estimate for a fact is adjusted after each encounter, to best account for the observed response accuracy and response time. The estimated activation of a fact at the time of presentation is used to calculate an expected response time with the following equation, where \(t_{er}\) represents a fixed time cost for perceptual encoding of the stimulus and executing the motor response:

$$\begin{aligned} {\mathbb {E}}(RT) = e^{-A_x} + t_{er} \end{aligned}$$
(3)

The \(\alpha \) estimate is adjusted in a stepwise manner to minimise the discrepancy between the expected and observed response times. Responses that are faster than expected suggest that there has been less decay than anticipated, invoking a downward adjustment to the \(\alpha \) estimate. Unexpectedly slow responses or errors suggest that activation has decayed further than expected, and cause the \(\alpha \) to be adjusted upwards. To prevent excessive change on the basis of a single outlier response, adjustments require at least three responses to have occurred and take up to five of the most recent responses into account. Adjustments are also limited to a range from 0.05 below to 0.05 above the current estimate. These constraints make the system more robust to noisy response data, but they do mean that inaccurate starting estimates can take quite some time to correct. As we showed in van der Velde et al. (2021), starting a learning session with more accurate \(\alpha \) estimates can lead to more effective scheduling of facts.
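
The exact fitting routine is internal to the learning system; the sketch below conveys the general idea under simplifying assumptions. It scores candidate values within \(\pm \)0.05 of the current estimate by how well the expected response times (Eq. 3) reproduce the observed response times of up to five recent trials, dropping the activation-dependent part of the decay and using a simple absolute-error criterion.

```python
import math

def expected_rt(activation: float, t_er: float = 0.3) -> float:
    """Expected retrieval latency (Eq. 3); t_er = 0.3 s is the fixed
    encoding/motor cost also assumed in Sect. 3.2.2."""
    return math.exp(-activation) + t_er

def adjust_alpha(current_alpha: float,
                 recent: list[tuple[list[float], float]],
                 step: float = 0.01) -> float:
    """Hedged sketch of the stepwise SoF update. `recent` holds, per trial,
    the ages (in seconds) of all prior encounters at presentation time and
    the observed response time. The production system's exact procedure
    may differ."""
    if len(recent) < 3:                  # no adjustment before three responses
        return current_alpha

    def rt_error(alpha: float) -> float:
        err = 0.0
        for ages, observed_rt in recent[-5:]:
            d = alpha                    # activation-dependent decay term (Eq. 2) omitted here
            a = math.log(sum(t ** -d for t in ages))   # activation (Eq. 1)
            err += abs(expected_rt(a) - observed_rt)
        return err

    candidates = [current_alpha + k * step for k in range(-5, 6)]
    return min(candidates, key=rt_error)

# Example: three prior trials with observed RTs of 1.2, 2.0, and 2.6 s.
history = [([30.0], 1.2), ([60.0, 20.0], 2.0), ([90.0, 50.0, 25.0], 2.6)]
print(adjust_alpha(0.3, history))
```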

2.4 Predicting Speed of Forgetting

We predicted the final Speed of Forgetting estimates (\(\hat{\alpha }\)) for the learning sequences in the test set using the data in the training set. To allow for a direct comparison, we used the same prediction methods as in van der Velde et al. (2021), which are illustrated in Fig. 2C: Default (a fixed value: \(\alpha _\emptyset = 0.3\)), Domain (\(\alpha _D\); an average of all fact-level predictions within the set), Fact (separate fact-level predictions \(\alpha _F\); a collaborative filtering approach), Learner (separate learner-level predictions \(\alpha _L\); a content-based approach), and Fact & Learner (separate predictions \(\alpha _{F\circledast L}\) for each fact-learner pair; a hybrid approach).

The Default method, which serves as a baseline, simply predicts a fixed starting value of 0.3. This is the default setting used in most applications of the system, as the value of 0.3 is a good approximation of the average SoF observed across populations and materials (e.g. van Rijn et al. 2009; Sense et al. 2016; Sense 2021).

The other methods use the final \(\alpha \) estimates from learning sequences in the training set to predict \(\alpha \) for sequences in the test set. Consistent with our previous work, final \(\alpha \) estimates are combined into a single prediction using a Bayesian model. The model assumes that SoF is normally distributed with unknown mean \(\mu \) and precision (the reciprocal of the variance) \(\lambda \):

$$\begin{aligned} \alpha \sim {\mathcal {N}}(\mu , \lambda ^{-1}) \end{aligned}$$
(4)

We can simultaneously infer both parameters of the distribution using its conjugate prior, the joint Normal-Gamma distribution (Murphy 2007):

$$\begin{aligned} p(\mu , \lambda )&= p(\mu \vert \lambda ) * p(\lambda ) \end{aligned}$$
(5)
$$\begin{aligned} p(\mu \vert \lambda )&= {\mathcal {N}}(\mu _0, (\kappa _0\lambda )^{-1}) \end{aligned}$$
(6)
$$\begin{aligned} p(\lambda )&= {\mathcal {G}}(\alpha _0, \beta _0) \end{aligned}$$
(7)

Having a conjugate prior means that the posterior also follows a Normal-Gamma distribution that we can find analytically. As Murphy (2007) shows, the posterior predictive for the next observation, given the previous n observations, is described by a t-distribution:

$$\begin{aligned} p(x \vert D) = t_{2\alpha _n}\left( x \vert \mu _n, \frac{\beta _n(\kappa _n + 1)}{\alpha _n\kappa _n}\right) \end{aligned}$$
(8)

Since the learning system requires a single value for \(\alpha \) for any individual learner-fact pair, we take the mode of this posterior predictive distribution to be the predicted value. It represents the most probable SoF, given the model’s prior and the final \(\alpha \) estimates in the training data. We chose a sensible, weakly informative prior for the model, to reflect our assumption that \(\alpha \) is normally distributed around 0.3, with \(\mu _0 = 0.3\), \(\kappa _0 = 1\), \(\alpha _0 = 3\), and \(\beta _0 = 0.2\). (In the absence of any observations, this means that the Bayesian model would make the same prediction as the Default method.)
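
The corresponding conjugate update (after Murphy 2007) can be sketched as follows; the Gamma hyperparameters are renamed a0 and b0 in the code to avoid confusion with the SoF symbol \(\alpha \). With no observations, the prediction falls back to the prior mean of 0.3, as noted above.

```python
import numpy as np

def predict_sof(observations: np.ndarray,
                mu0: float = 0.3, kappa0: float = 1.0,
                a0: float = 3.0, b0: float = 0.2) -> tuple[float, float, float]:
    """Conjugate Normal-Gamma update (Murphy 2007) over final SoF estimates.
    Returns the location (= mode and predicted SoF), scale, and degrees of
    freedom of the posterior predictive t-distribution in Eq. (8)."""
    n = len(observations)
    xbar = observations.mean() if n > 0 else mu0
    mu_n = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    kappa_n = kappa0 + n
    a_n = a0 + n / 2
    b_n = (b0
           + 0.5 * ((observations - xbar) ** 2).sum()
           + kappa0 * n * (xbar - mu0) ** 2 / (2 * (kappa0 + n)))
    scale = np.sqrt(b_n * (kappa_n + 1) / (a_n * kappa_n))
    return mu_n, scale, 2 * a_n  # the mode of a (symmetric) t is its location

print(predict_sof(np.array([])))                 # no data: prediction is 0.3
print(predict_sof(np.array([0.25, 0.35, 0.4])))  # three observed SoF estimates
```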

Within the training set, each prediction method uses a different subset of the available information (see Fig. 2C). The Fact method predicts the SoF for a single fact using all learning sequences in the training set pertaining to that fact, effectively averaging over the learners who encountered that fact. The Domain method is more general, predicting a single value for all learning sequences by taking the mean of all individual Fact predictions. (One can think of it as a variation on the Default prediction that is directly informed by the training data.) The Learner method makes a prediction for a single learner, using all learning sequences of that learner in the training set, thereby averaging over the different facts encountered by that learner. Finally, the Fact & Learner method produces the most specific prediction, combining the Fact prediction and the Learner prediction to produce a unique prediction for each learner-fact pair. It does so using logarithmic opinion pooling (Genest et al. 1984) on the two separate posterior predictive distributions for Fact and Learner, with \(k = 2\) equal weights w:

$$\begin{aligned} p_{LOP}(x \vert D)&= \frac{\prod _{i=1}^{k} p_i(x \vert D)^{w_i}}{\int _x\prod _{i=1}^{k} p_i(x \vert D)^{w_i}dx} \end{aligned}$$
(9)
$$\begin{aligned}&= \frac{p_{fact}(x \vert D)^{0.5} * p_{learner}(x \vert D)^{0.5}}{\int p_{fact}(x \vert D)^{0.5} * p_{learner}(x \vert D)^{0.5}dx} \end{aligned}$$
(10)

The mode of the combined distribution then becomes the predicted value. This method of combining predictions has the nice property that the relative uncertainty of each predictive distribution affects the degree to which it contributes to the final prediction. If, for instance, the Learner predictive distribution is more spread out than the Fact distribution (e.g. due to having fewer observations, or less agreement between observations), it will have a smaller effect on the mode of the combined distribution.
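
A minimal numerical sketch of this pooling step is given below. Because the normalising integral in Eq. (9) does not change the location of the maximum, the mode can be found directly on the unnormalised log-density; the grid bounds and example values are arbitrary assumptions.

```python
import numpy as np
from scipy.stats import t as t_dist

def pooled_mode(fact_pred, learner_pred, grid=np.linspace(0.0, 0.8, 8001)):
    """Equal-weight logarithmic opinion pool (Eqs. 9-10) of the Fact and
    Learner posterior predictive t-distributions, each given as
    (location, scale, degrees of freedom). Returns the mode of the pooled
    density, found numerically on a grid."""
    loc_f, scale_f, df_f = fact_pred
    loc_l, scale_l, df_l = learner_pred
    log_pool = (0.5 * t_dist.logpdf(grid, df_f, loc=loc_f, scale=scale_f)
                + 0.5 * t_dist.logpdf(grid, df_l, loc=loc_l, scale=scale_l))
    return grid[np.argmax(log_pool)]

# Example: a tight Fact prediction dominates a more spread-out Learner
# prediction, pulling the pooled mode close to its own location.
print(pooled_mode((0.35, 0.02, 60.0), (0.28, 0.10, 20.0)))
```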

In keeping with van der Velde et al. (2021), we only made SoF predictions if there were at least 30 learning sequences available for a given fact (i.e. the fact was seen by at least 30 learners) or learner (i.e. the learner studied at least 30 facts). Figure 3 displays the number of available learning sequences for both. It shows that Fact predictions tended to include more sequences than Learner predictions, though there was substantial variation in both distributions.

Fig. 3

Number of learning sequences used in each fact-specific or learner-specific Speed of Forgetting prediction, by language. Note that the distributions are shown on a logarithmic scale. For Fact predictions (top row), this value represents the number of learners who studied the fact; for Learner predictions (bottom row), it represents the number of facts studied by the learner. All predictions were based on at least 30 sequences, as indicated by the dashed vertical line

2.5 Simulation study

We predicted final Speed of Forgetting for the learning sequences in the test set, using each of the five prediction methods. As Fig. 2D shows, predictions were evaluated in two ways. Since inaccurate predictions are unlikely to be helpful in mitigating the cold-start problem, we first evaluated the accuracy of the predictions by comparing them to the SoF estimates derived at the end of each test set learning sequence (see Fig. 2D, Panel 1). We then simulated what the learning system’s estimates would have been throughout every learning sequence if it had used each of the prediction methods to set the initial SoF. Within every learning sequence, we investigated how accurately the learning system could predict response accuracy and response time on the upcoming trial (see Fig. 2D, Panel 2). The more accurate these predictions were, the closer the correspondence between the system’s model of the learner and the learner’s actual knowledge state—and therefore, the better the expected alleviation of the cold-start problem if applied in practice.

Fig. 4

Comparison of predicted and observed Speed of Forgetting \(\alpha \) by prediction method for the two languages. Coloured points show individual learning sequences. The number of included learning sequences is shown in the top right of each box. The mean absolute error (MAE) between the predicted and observed values is shown in the bottom right. The black line shows a linear regression model fitted to the data

3 Results

3.1 Speed of Forgetting

To evaluate the Speed of Forgetting prediction methods, we first calculated individual \(\alpha \) estimates for each learning sequence in both the training set and the test set. The \(\alpha \) estimates from the training set were then used to predict \(\alpha \) for the sequences in the test set, using each of the five prediction methods. Figure 4 shows how predicted SoF compares to observed SoF in the test set. The predictions of \(\alpha \) are noticeably less variable than the observed values, which is to be expected as predictions are effectively averaging over the individual observations in the training set. Nevertheless, among the methods that make more than a single prediction, there is a clear correspondence between predicted and observed values, as shown by the black linear regression lines in Fig. 4, which fall on or close to the diagonal. The additional opinion pooling step involved in the Fact & Learner prediction results in a further narrowing of the distribution of predicted values (the combined prediction is always a weighted average of its two components), leading to a steeper regression line.

Fig. 5

Absolute Speed of Forgetting prediction error by prediction method (i.e. absolute difference between predicted and observed \(\alpha \)). Methods are ordered from highest to lowest error within each sub-figure. Points show the mean; error bars denoting \(\pm 1\) standard error of the mean are too small to be visible. The lines along the top show the significance of pairwise comparisons using Tukey’s range test

To assess whether the prediction methods differed in how well they predicted Speed of Forgetting, we fitted a linear mixed-effects model to the absolute prediction error on each sequence in the test set (i.e. the absolute difference between predicted and observed \(\alpha \)), with a fixed effect of prediction method and random intercepts for learners and facts (to account for individual differences in predictability). The model was fitted to absolute prediction error since we do not care about the direction of the error, only its magnitude. We fitted separate models for the two languages. Each model was fitted to a random subset of 1,000,000 predictions to keep the computation time tractable. For both languages, this model showed a significant effect of prediction method on absolute prediction error. We performed pairwise comparisons between prediction methods using Tukey’s range test, allowing us to rank the methods in terms of their predictive performance. We used the fitted models to calculate standardised effect sizes for the best-performing prediction methods (Brysbaert and Stevens 2018; Westfall et al. 2014). As Fig. 5 shows, predictive performance in both languages was worst when using Default predictions. There were significant reductions in \(\alpha \) prediction error from Default to Domain predictions, from Domain to Learner predictions, and from Learner to both Fact and Fact & Learner predictions. Prediction error was not significantly different between the Fact prediction and the Fact & Learner prediction, indicating that including learner-specific information in the prediction of \(\alpha \) did not improve predictive accuracy compared to a prediction based solely on fact-specific information. Compared to the Default method, the best performing method (i.e. the method yielding the largest improvement) reduced the SoF prediction error by 0.0079 (SE = 0.00013; about 15.0%; standardised effect size \(d = 0.19\)) in the English data, and by 0.013 (SE = 0.00015; about 20.1%; standardised effect size \(d = 0.25\)) in the French data.
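
The model specification corresponds to a sketch like the one below, in which learners and facts enter as crossed variance components. Column names are assumptions, the sketch runs on a modest subsample for tractability, and the Tukey comparisons are omitted; at the full scale of the data a dedicated mixed-model implementation such as lme4 would be more practical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Absolute SoF prediction error with a fixed effect of prediction method and
# crossed random intercepts for learners and facts (as variance components
# within a single dummy group). Assumed columns: abs_error, method,
# learner_id, fact_id.
df = pd.read_csv("prediction_errors.csv").sample(n=50_000, random_state=1)
df["one"] = 1

model = smf.mixedlm(
    "abs_error ~ C(method)",
    data=df,
    groups="one",
    vc_formula={"learner": "0 + C(learner_id)", "fact": "0 + C(fact_id)"},
)
result = model.fit()
print(result.summary())
```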

While we cannot directly convert these improvements into expected learning gains, we can compare them to our earlier findings in van der Velde et al. (2021). There we found that a reduction in SoF prediction error of a similar magnitude (29.2% and 9.9% in the first and second practice block, respectively) considerably improved average recall accuracy on a subsequent test, by about 6.8 percentage points.

The memory model itself can also give us a sense of these differences in predictive accuracy. As Fig. 1 illustrates, over- or underestimating the true SoF, and thereby the activation of the memory item, can unintentionally change the difficulty of the learning task. A more accurate SoF prediction should mean that repetitions of an item are scheduled at a moment that minimises such unintended deviations in difficulty. Similarly, starting with a more accurate SoF estimate should also improve the memory model’s ability to predict the learner’s response accuracy and speed whenever an item is repeated. This is the focus of the analysis in Sect. 3.2.

A possible reason why Fact predictions outperformed Learner predictions is that the training data tended to include more learning sequences per fact than per learner (see Fig. 3), and that having more sequences to train the model generates a more accurate prediction. To rule out this possibility, we calculated the Pearson correlation between the number of sequences involved in each prediction and the absolute SoF prediction error. For Fact predictions, this correlation was close to zero in both the English data (\(r(3,299,409) = -0.0035, p <.001\)) and the French data (\(r(1,231,545) = 0.038, p <.001\)). The same was true for Learner predictions, in the English data (\(r(3,250,059) = 0.037, p <.001\)) as well as the French data (\(r(1,195,759) = -0.026, p <.001\)). Repeating this correlation analysis on just predictions at the lower end of the distribution, between 30 and 100 sequences, gives the same outcome (in that case, none of the correlation estimates exceeds 0.014 in magnitude), which shows that it is not simply a case of quickly diminishing returns. From this, we conclude that differences in predictive performance between the Fact method and the Learner method are not caused by differences in the amount of available training data, but rather reflect genuine differences in the informativeness of prior observations.

3.2 Predicting behavioural outcomes

Given that all data-driven predictions of the Speed of Forgetting outperformed the Default prediction, we simulated the effect of using these improved predictions as starting estimates for \(\alpha \) in the learning sequences in the test set. For each trial in the learning sequence, we used the adaptive learning system’s model to calculate the associated fact’s activation at that moment, on the basis of the initial SoF estimate and the accuracy and speed of the learner’s responses up to that point. The activation was then used to predict response accuracy and response time. Since learners’ response data were fixed, any change in these behavioural predictions would be due to the initial SoF estimate being used.

We focused our analysis on the third trial in each learning sequence included in the test set. Response accuracy and response time in the first two trials are not representative of recall performance, since the first trial is always a study trial in which the correct answer is already given, and the second trial typically follows directly after the first, at which point the information is still readily available. The third trial is therefore the first that tests delayed recall. Looking at the third trial rather than later trials also maximises the number of learning sequences included in the analysis. Sequence length ranged from 3 to 25 trials, but the majority of sequences were no longer than 3 or 4 trials.

Fig. 6

Absolute prediction error on behavioural measures in the third trial of each learning sequence in the test set, by prediction method. A Absolute difference between predicted and observed response accuracy. B Absolute difference between predicted and observed response time. Methods are ordered from highest to lowest error within each sub-figure. Points show the mean; error bars denoting \(\pm 1\) standard error of the mean are too small to be visible. The lines along the top show the significance of pairwise comparisons using Tukey’s range test. Note that the vertical axes are truncated for readability

3.2.1 Response accuracy

The probability of recalling a fact with an estimated activation A is given by a logistic function, where \(\tau \) is the retrieval threshold and s is transient activation noise (Anderson 2007):

$$\begin{aligned} p = \frac{1}{1 + e^{-(A - \tau )/s}} \end{aligned}$$
(11)

We kept \(\tau \) fixed at the value normally used in the adaptive learning system: \(\tau = -0.8\). The noise parameter does not have a defined value in the adaptive learning system, but for this simulation analysis we set it to a typical value of \(s = 0.2\).
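
For reference, Eq. (11) with these parameter values amounts to the following minimal sketch:

```python
import math

def p_recall(activation: float, tau: float = -0.8, s: float = 0.2) -> float:
    """Expected probability of recall (Eq. 11) with the threshold and noise
    values used in this simulation."""
    return 1.0 / (1.0 + math.exp(-(activation - tau) / s))

# An item whose activation sits exactly at the threshold is a coin flip;
# an item well above threshold is almost certain to be recalled.
print(p_recall(-0.8))  # 0.5
print(p_recall(-0.1))  # ~0.97
```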

We used Eq. (11) to calculate the estimated recall probability at the onset of the third trial in each learning sequence. This probability, a continuous value between 0 and 1, was then compared to the accuracy of the observed response, a binary value of 0 or 1. As before, we looked at the absolute prediction error, which is shown in Fig. 6A. Once again, we fitted separate linear mixed-effects models to a random subset of 1,000,000 predictions from each of the two languages, with a fixed effect for the prediction method and random intercepts for facts and learners. Both models found a significant effect of prediction method on absolute prediction error. Pairwise comparisons between prediction methods yielded similar results in both courses. The Default prediction was outperformed by all data-driven methods, and consistent with the previous analysis, we found significant drops in prediction error from Default to Domain, Domain to Learner, and Learner to Fact. Here, the Fact & Learner method performed best of all, reducing prediction error relative to the Default method by 4.63 percentage points (SE = 0.042; about 10.3%; standardised effect size \(d = 0.33\)) in the English course, and by 4.95 percentage points (SE = 0.045; about 10.4%; standardised effect size \(d = 0.33\)) in the French course.

Fig. 7

Comparison of predicted and observed response times by prediction method for the two languages. The figure only includes the third trial in each learning sequence, excluding trials with incorrect responses, and the axes are truncated to aid readability. The number of included learning sequences is shown in the top right of each box. The mean absolute error (MAE; in seconds) between the predicted and observed response time is shown in the bottom right. The black line shows a linear regression model fitted to the data

3.2.2 Response time

The memory model used in the adaptive fact learning system also supposes a link between the activation of a fact and the time required to retrieve it: the more active the fact, the faster it is to retrieve (Anderson 2007). The time it takes to give a correct response is described by the following equation, where \(t_{er}\) is the additional time cost for encoding the stimulus and performing the motor response:

$$\begin{aligned} RT = e^{-A} + t_{er} \end{aligned}$$
(12)

While it is possible to estimate the value of \(t_{er}\) from a learner’s response data (van der Velde et al. 2022), here we simply used a fixed value of 0.3 s, a reasonable default that was also used in van der Velde et al. (2021).

For each learning sequence in the test set, we used the activation implied by each of the five prediction methods to calculate the expected response time on the third trial with Eq. (12). Looking only at trials in which the learner responded correctly, we then compared this expected response time to the observed response time. Figure 7 shows this comparison. As before, we fitted a linear mixed-effects model to the absolute prediction error, with a fixed effect for prediction method and random intercepts for facts and learners. Separate models were fitted for the two languages, using a random subset of 1,000,000 predictions per language. Given that the overall effect of prediction method on absolute prediction error was significant, we then performed pairwise comparisons between prediction methods using Tukey’s test. As Fig. 6B shows, the absolute error differed significantly between prediction methods. All data-driven methods outperformed the Default prediction, although the relatively high variance of the data meant that standardised effect sizes were small (as is typical with response time data; Brysbaert and Stevens 2018). Although there were slight differences between the two courses, the methods involving fact-level predictions, Fact and Fact & Learner, brought about the largest reduction in prediction error. Compared to the Default prediction method, the best-performing method reduced the absolute prediction error by 64.93 ms (SE = 6.49; about 4.9%; standardised effect size \(d = 0.03\)) in the English course and by 118.48 ms (SE = 6.22; about 8.3%; standardised effect size \(d = 0.06\)) in the French course.
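
A minimal sketch of this comparison is shown below; the activation values, observed response times, and correctness flags are hypothetical inputs that would be produced by the memory model simulation described above.

```python
import numpy as np

def predicted_rt(activation: np.ndarray, t_er: float = 0.3) -> np.ndarray:
    """Predicted retrieval latency in seconds (Eq. 12), using the fixed
    encoding/motor cost of 0.3 s."""
    return np.exp(-activation) + t_er

def rt_mae(activations, observed_rt, correct):
    """Mean absolute error between predicted and observed response times,
    restricted to correct responses (as in Fig. 7)."""
    activations, observed_rt, correct = map(np.asarray, (activations, observed_rt, correct))
    mask = correct.astype(bool)
    return float(np.mean(np.abs(predicted_rt(activations[mask]) - observed_rt[mask])))

# Hypothetical third-trial data for three learning sequences.
print(rt_mae([-0.2, -0.6, 0.1], [1.4, 2.5, 1.1], [1, 1, 0]))
```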

4 Discussion

In a post hoc simulation study conducted on a naturalistic data set of almost 100 million trials generated by about 140 thousand learners, we found that we could predict individual Speed of Forgetting (SoF) from prior learning data and that using these predictions as starting estimates in an adaptive fact learning system yielded more accurate modelling of learners’ memory performance during practice. These improvements can be harnessed to increase the efficiency of the learning system, reduce the negative effects of a cold start, and potentially improve learning outcomes.

By showing that SoF can be predicted using prior learning data, we confirm that our earlier finding of improved predictions in a lab study (van der Velde et al. 2021) holds up in a real-world educational application. Other studies have similarly reported better predictions of learners’ performance in applied settings (e.g. Nedungadi and Remya 2014; Park et al. 2018; Pliakos et al. 2019). Whether such improvements ultimately lead to a meaningful change in students’ learning outcomes—which we did find in the lab but could not test here—remains to be seen.

Encouragingly, the reduction in SoF prediction error we observed here is similar in magnitude to the reduction we found in our laboratory-based study. There, the improved predictions helped the adaptive fact learning system prioritise difficult items over easier ones in its scheduling from the start of the learning session, ultimately leading to a 6.8-percentage-point increase in retention of the studied items (i.e. a gain of over half a grade point on a 10-point scale). One may expect that the SoF parameter is more difficult to estimate and predict outside the laboratory, because of the additional sources of noise present in real-world learning settings, and that this may diminish any practical benefit of cold-start mitigation. However, previous studies in educational settings showed that SoF estimates were indeed accurate enough to improve learning (van Rijn et al. 2009) and were predictive of university students’ examination grades (Sense 2021). In conjunction with the findings from van der Velde et al. (2021), this robustness of the adaptive model parameter in naturalistic settings suggests that we should also expect real-world learning gains from the cold-start mitigation methods presented in this study.

Furthermore, our simulation of a warm start demonstrates that beginning a learning session with data-driven Speed of Forgetting predictions improves the model’s ability to predict the accuracy and speed of learners’ responses during that learning session. While individual response times remained difficult to predict due to their large variance, we did observe a small improvement in predictive accuracy. Improvements were more clearly visible in the prediction of response accuracy. Having a better sense of the state of a learner’s memory, both at the start and during a learning session, means that repetitions can be scheduled at more appropriate moments. As Fig. 1 illustrates, even small changes in SoF estimates can have a sizeable impact on the (expected) difficulty at the start of a learning session. When the effects of consistent over- or underestimation compound across multiple items, this may noticeably affect how challenging learners experience the task to be (Holm and Wells 2023; ten Broeke et al. 2022; Pelánek and Effenberger 2022).

In their DASH model for personalised fact learning, Lindsey et al. (2014) describe memory strength as a combination of fact difficulty, student ability, and study history, finding that each of these factors contributes to recall performance. Our use of both fact-specific and learner-specific information in predicting performance is an application of that idea in the context of cold-start mitigation. The results we report here suggest a larger contribution of difficulty than of ability in predicting performance (given that study history is accounted for in the ACT-R memory model). While we anticipated that learner-specific predictions would be more powerful in this diverse population than in the laboratory, we still see that fact-specific predictions consistently outperformed the similarly fine-grained learner-specific predictions, both in terms of predicting final SoF estimates and in terms of predicting behaviour during practice. We did find some evidence, though not consistently, that predictions combining learner-level and fact-level information improved on predictions that were based on fact-level information alone. Additionally, learner-specific predictions did tend to outperform the more general Domain prediction. In all, these results suggest that knowing a learner’s general SoF (in other words, their ability to memorise facts) can indeed improve predictions, albeit to a lesser extent than knowing about the difficulty of a fact.

It is worth noting that there is a similarity between estimates of fact difficulty and estimates of learners’ prior knowledge: a fact that learners already know can appear to the learning system as being easier to memorise. However, the difficulty estimates we use in the current work are distinct from prior knowledge estimates in that they are based on all encounters a learner has with a fact, rather than just the initial encounter. As such, they reflect how difficult it is to retrieve a fact over multiple spaced repetitions, knowing with certainty that the learner has encountered the fact before. That said, explicitly predicting prior knowledge of facts can also be useful in optimising adaptive learning, and it is worth exploring approaches that combine the two (Pelánek et al. 2017; Wilschut et al. 2022).

The finding that learner-specific prediction can work, even when the learning task is not so much about skills but rather about declarative knowledge, is in line with the idea of a learner’s general forgetting rate being an individual trait. Learners’ SoF has been found to persist over time (Sense et al. 2016), and can be related to resting state measures of brain activity (Zhou et al. 2020; Xu et al. 2021). The larger the individual differences between learners, the more important learner-specific predictions are likely to be.

While we previously tested cold-start mitigation in a very particular population (undergraduate students), the current sample includes a larger and more diverse group of learners, making it much more representative of the educational domain in which an adaptive learning system is supposed to work. It is important to recognise that learning systems are used by, and should be designed for, diverse populations of learners with different abilities and in strongly varying socio-cultural contexts (Ogan and Johnson 2015). Ignoring this diversity can make these systems less effective, particularly for those underrepresented during the development process (Baker et al. 2019). The current sample is varied in some respects (e.g. educational level and socio-economic status), but there are still limitations to consider. For instance, participants in this study were geographically constrained to the Netherlands and limited to a specific age range. The data do not capture age-related declines in memory and information processing, which would cause larger variance among learners’ SoF estimates (Glisky 2007). Learners with disabilities or special educational needs, for whom learner-specific adjustments may make an otherwise difficult to use system more usable, are also likely to be underrepresented in the current sample (Nganji 2012). Addressing such limitations can further extend the applicability of this and other adaptive learning systems.

4.1 Practical implications

The results of this simulation study show that, even in the absence of any prior knowledge about the learner, a fact-specific prediction can already significantly improve the adaptive learning system’s memory model. In contrast to methods that predict performance using additional sources of information, such as demographic properties of the learner or features of the fact, or computationally expensive techniques, the methods tested here are relatively simple. They rely solely on learning data that is already available, and they are agnostic to the type of fact being learnt. These properties make for a general method of cold-start alleviation that is easy to implement. While each prediction in the current study was typically based on quite a large number of learning sequences, we found no indication that predictions that used more data outperformed predictions based on the minimum of 30 learning sequences. This shows that similar cold-start mitigation should be possible in smaller educational settings. (A benefit of using a Bayesian model to predict SoF is that one can set a prior to quantify expectations, so that the model will still make a reasonable prediction when there are fewer observations.)

While adaptive learning systems are seeing more widespread use, their inherent complexity relative to non-adaptive systems can hinder deployment at scale (Aleven et al. 2009; Baker 2016; Pelánek and Effenberger 2022). It is therefore promising that fact-specific predictions already have a meaningful impact on the cold-start problem, as a database of fact-level (“what”) difficulty estimates is easy to maintain and update, and so is—to a lesser extent—the learner-level (“who”) equivalent. Although a combined Fact & Learner prediction may outperform simpler Fact predictions in some cases, the improvement in predictive accuracy may not be worth the additional complexity involved. In a similar vein, while the Bayesian approach we take here has benefits, such as the ability to express the degree of uncertainty in a prediction, a simple average of observations could provide much of the same benefit while being much easier to compute.

Successfully alleviating the cold-start problem can enhance the educational benefits offered by an adaptive learning system (van der Velde et al. 2021). We have demonstrated that relatively simple prediction methods can considerably improve the accuracy of the learning system’s internal model, enabling it to make even better use of the learner’s time.