Introduction

The process of learning increasingly takes place in digital environments. Adaptive learning systems, such as intelligent tutoring systems or cognitive tutors, allow students to learn in a digital environment tailored to them, offering a personalised approach that is often not possible in a traditional classroom setting (VanLehn 2006). By analysing a learner’s actions, an adaptive learning system can provide error-specific feedback at an appropriate level of detail, select suitable practice problems to try next, and provide insight into the learner’s mastery of the required skills. Adaptive learning systems have been used in many domains, such as learning country names on a map (Mettler et al. 2016), solving simple arithmetic problems (Klinkenberg et al. 2011) and algebra exercises (Corbett et al. 1997), writing computer code (Reiser et al. 1985), and responding to cardiac issues (Eliot et al. 1996).

The defining advantage of adaptive learning systems over non-adaptive systems is their ability to adjust to individual learners’ continually evolving knowledge and skills. Nevertheless, if there is no prior information about the capabilities of a learner or about the difficulty of a set of materials, even an adaptive system can initially be poorly attuned to the needs of a learner. In this so-called cold start scenario, adaptation can only occur once a learner starts using the system. It takes some time to identify the skill level of a new learner or to establish the difficulty of a new problem. During this time, the system can be just as poorly adapted to the individual as a non-adaptive learning system, by presenting exercises that are too easy or too difficult, or by providing inadequate or excessive feedback. Such a cold start may discourage new students from continuing to use the system. Studies have shown that in order to stay engaged, students require a learning experience that is challenging but has a difficulty that is proportionate to their own skill level—the so-called balance hypothesis (Shernoff et al. 2003; Kennedy et al. 2014; Hamari et al. 2016). With its potential to create an unbalanced learning experience, the cold start is therefore an important problem to address.

Different adaptive learning systems cope with the cold start problem in different ways. For instance, a system that faces a cold start may initially make large adjustments to its internal model following each response, and then gradually decrease the size of the adjustment with each subsequent response. The dynamic K factor in the Elo rating system used by Klinkenberg et al. (2011) is an example of this: it scales the change in an ability estimate as a function of that estimate’s uncertainty. Alternatively, a system might adapt its presentation schedule to reduce uncertainty (Wauters et al. 2010). This can be done by prioritising items that are most informative for determining a learner’s skill level, rather than optimising for learning or motivation (Chen et al. 2000). However, such methods attempt to deal with the consequences of the cold start problem, rather than addressing the problem at its root: inaccurate initial estimates of item difficulty and learner ability.

Inspiration for addressing the root cause can be found in the field of recommender systems, where an analogous problem exists. Recommender systems make recommendations that are tailored to individual users, typically relying on user-generated data to do so. Well-known examples include the personalised recommendations for products to buy on Amazon, automatically generated playlists on Spotify, and movie suggestions on Netflix. The cold start problem in recommender systems has received significant attention, and various solutions have been put forth. Broadly, these fall into one of three categories (Lika et al. 2014). Firstly, there are collaborative filtering methods, which generate recommendations for a new user on the basis of other users’ preferences (e.g. recommending movies based on others’ ratings, Bobadilla et al., 2012). These methods work well when little or nothing is known about a user. However, because their recommendations are crowd-sourced, such methods do rely on users having similar interests, an assumption that does not always hold. Secondly, there are content-based methods, which make recommendations to existing users based on their profiles and previous interactions with the system (e.g. recommending news articles based on previously read articles; see Lops et al. (2011) for a review). Content-based methods are geared towards the individual, so while they can work well in a heterogeneous population, they miss out on commonalities between users. Finally, there are hybrid methods that combine collaborative filtering and content-based methods in some way, with the aim of overcoming their respective shortcomings (e.g. recommending scientific literature using information about both users and items, Popescul et al., 2001).

Because recommender systems and adaptive learning systems share the problem of creating an accurate user model, cold start mitigation strategies used in recommender systems can also be applicable to adaptive learning systems. For instance, Pardos and Heffernan (2010) used the accuracy of students’ first step in solving an algebra problem to decide on their assumed level of prior knowledge. This stereotype-based approach, a content-based method, entailed quickly making assumptions about a learner’s skill level based on their initial interaction with the system. More recently, Nedungadi and Remya (2014) took a hybrid approach to the cold start problem. They identified clusters of similarly skilled learners, based on previously collected data, and used learners’ membership of these clusters to predict their performance on new problems. In a similar vein, Park et al. (2018) used background information about new learners, such as their grade level and gender, to assign the learners to groups. The learning data collected from other group members was then used to generate an initial ability estimate.

While all of these studies demonstrated ways to mitigate the cold start problem in adaptive learning systems, they did so in post hoc simulations performed on data that had already been collected. To our knowledge, the proposed mitigation strategies have not yet been scientifically tested in an applied setting. The current study addresses this issue by implementing several cold start mitigation strategies in an adaptive learning system, and investigating their effects on users’ performance during a learning session, as well as on subsequent learning outcomes. The study includes collaborative filtering, content-based, and hybrid methods, enabling us to compare their efficacy in a practical application.

We used an adaptive fact learning system to test these mitigation strategies. The system lets students memorise declarative information through retrieval practice. It has been validated in experimental settings and applied successfully in the classroom, helping students learn facts in many domains, including foreign vocabulary, biopsychological terms, and geography (van Rijn et al. 2009; Sense et al. 2016; Sense et al. 2018; van den Broek et al. 2019). The adaptive nature of the system lies in its scheduling of presentations within a learning session. In each trial, the system quizzes the learner on the fact for which the expected learning gains are largest at that point in time. This selection is based on a computational model of the learner’s memory in which each fact is represented. The memory model keeps track of the estimated difficulty of each fact for a particular learner, operationalised in a continuous-valued rate of forgetting estimate. The memory model adjusts its rate of forgetting estimates for individual facts on the basis of the accuracy and latency of the learner’s responses. To avoid getting stuck on a single fact, the system does not allow more than two successive presentations of any fact. Facts are introduced one at a time, with new facts only being added when the model deems all previously encountered facts to be encoded sufficiently well. The average rate of forgetting associated with a particular learner provides an index of that learner’s (domain-specific) ability to memorise facts (Sense et al. 2018).

When a learner first uses the adaptive fact learning system, or begins learning a new set of facts, the memory model starts out with the assumption that all facts are equally difficult and therefore have the same (default) rate of forgetting. This means that there is a cold start, during which difficult facts are not repeated often enough, while easy facts are presented too often. The current study attempts to alleviate this cold start by harnessing known individual differences in rate of forgetting between facts and between learners—some facts are consistently easier than others, and some learners are consistently better at memorisation than others (see e.g. Sense et al. 2016; Zhou et al. 2020)—for changing the initial estimates of the memory model, creating a “warm start” instead. Informed initial estimates are derived from observations of other learners studying the same fact (a collaborative filtering approach), from an earlier session in which the same learner studied different facts (a content-based approach), or from a combination of the two (a hybrid approach). In all cases, the previously observed rates of forgetting are used to update a Bayesian model that yields a new predicted rate of forgetting.

This paper reports on two experiments. In both experiments, participants memorised a set of facts using the adaptive learning system, modified to use one of the cold start mitigation strategies outlined above. We measured participants’ performance during the learning session and on a subsequent delayed recall test. Both experiments were preregistered, and the first was replicated in an online sample. Experiment 1 found that all mitigation strategies were better at predicting the rate of forgetting inferred at the end of the learning session than the system’s default starting estimate, but there was no difference between the mitigation strategies in this regard. Exploratory analysis showed higher response accuracy in learning sessions in which mitigation strategies were used, suggesting that informed initial estimates did indeed lead to a presentation schedule more conducive to successful retrieval. The expected improvement in performance on the delayed recall test did not materialise, however. We hypothesised that this was due to the generally high difficulty of the learning material, which reduced opportunities for introducing new facts and thereby limited the potential positive effect of a more efficient presentation schedule. For this reason, we conducted a second experiment in which participants learned a more naturalistic set of facts that included both easy and difficult items, with a simplified design. This experiment did indeed find higher accuracy on the delayed recall test, along with higher accuracy during the learning session. Together, these experiments demonstrate that mitigating the cold start in an adaptive learning system can benefit learners’ performance, both during and after a learning session, especially when heterogeneous item sets are used.

Adaptive Fact Learning System

The adaptive fact learning system used in this study relies on a computational cognitive model of human memory to create an optimal repetition schedule for facts during a learning session. The system is an extension of the one developed by Pavlik and Anderson (2005, 2008). Learners study a set of paired associates in a sequence of trials, each time typing the answer corresponding to the prompt shown on screen. New items are introduced through an initial study presentation that shows both the prompt and the associated response. A learner’s responses are used to inform the system’s individualised estimate of the difficulty of each fact. The repetition schedule is adjusted accordingly, so that more study time is allocated to difficult facts and less time to easy facts. The system aims to capitalise on both the spacing effect (longer spacing between repetitions improves retention; e.g. Dempster, 1988), by spreading repetitions as far apart as possible, and the testing effect (active, effortful recall improves retention; e.g. van den Broek et al., 2016), by repeating facts at a moment when they can still be recalled.

The cognitive model that underpins the adaptive learning system represents each fact by a memory chunk with a certain activation. This activation, which corresponds to the strength of the declarative representation of the chunk, is boosted whenever the chunk is (re)created. It then decays over time. To calculate a chunk’s activation at a particular time, the model uses the memory strength equation from the ACT-R cognitive architecture (Anderson 2007). The activation A of a chunk x at time t, given n previous encounters at t1,...,tn seconds ago, is:

$$ A_{x}(t) = \ln\left( \sum\limits_{j = 1}^{n} t_{j}^{-d_{x}(t)}\right) $$
(1)

Before the start of each trial, the model chooses a fact to rehearse, based on the activations of the facts that have been encountered so far. It calculates the projected activation of all facts fifteen seconds from the current time and selects whichever fact has the lowest value, since that fact is expected to be forgotten soonest. If, however, this fact still has an activation higher than a predefined retrieval threshold, the model randomly picks a new fact to practice instead. In this way, new facts will be introduced whenever none of the current facts need urgent rehearsal. The system also limits the number of successive presentations of any single fact to two. This means that, if selecting the least active fact would constitute a third successive presentation, the system repeats the next best candidate or, if none is available, introduces a new fact.
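To make the selection rule concrete, the sketch below (in Python; a minimal illustration rather than the system's actual browser-based implementation) shows the scheduling decision for a single trial. It assumes an `activation(fact, t)` helper that returns the projected activation of a fact at time `t`, along the lines of the sketch given after Fig. 1, and an assumed value for the retrieval threshold, which the text only describes as predefined.

```python
import random

LOOKAHEAD = 15.0     # project activation fifteen seconds into the future
THRESHOLD = -0.8     # retrieval threshold (assumed value; only described as "predefined")

def select_next_fact(seen_facts, unseen_facts, history, now, activation):
    """Choose the fact to present on the next trial.

    seen_facts:   facts encountered so far
    unseen_facts: facts not yet introduced
    history:      facts presented so far, most recent last
    activation:   assumed helper implementing Eq. 1 for a given fact (see sketch after Fig. 1)
    """
    # Rank previously seen facts by projected activation; the lowest value is
    # expected to be forgotten soonest.
    candidates = sorted(seen_facts, key=lambda f: activation(f, now + LOOKAHEAD))

    # Never allow a third successive presentation of the same fact.
    if len(history) >= 2 and history[-1] == history[-2]:
        candidates = [f for f in candidates if f != history[-1]]

    if candidates and activation(candidates[0], now + LOOKAHEAD) < THRESHOLD:
        return candidates[0]                      # this fact needs urgent rehearsal
    if unseen_facts:
        return random.choice(unseen_facts)        # nothing is urgent: introduce a new fact
    return candidates[0] if candidates else None  # otherwise fall back to the best candidate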

Some facts are easier to remember than others. The model accounts for these differences by allowing the rate at which the activation decays over time to vary between facts. It assumes that the activation of a difficult fact decays more rapidly than that of an easy fact, resulting in earlier and more frequent repetitions. This principle is illustrated in Fig. 1. The decay dx(t) of a chunk x at time t is a function of the activation of the chunk at the time of its most recent encounter, together with a chunk-specific rate of forgetting αx:

$$ d_{x}(t) = c \cdot e^{A_{x}(t_{n-1})} + \alpha_{x} $$
(2)
Fig. 1 Activation of a declarative chunk over time, at three different rates of forgetting (α; a chunk-specific offset to the decay). In the adaptive learning system, an item is presented for restudy when its activation reaches the retrieval threshold, indicated here by a dashed line. As α increases, an item requires more frequent repetition to stay above the threshold.
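A minimal Python sketch of Eqs. 1 and 2 is given below. The value of the scaling constant c is not stated in the text, so the value used here is an assumption; the recursion bottoms out at the first encounter, for which the decay is simply the rate of forgetting α.

```python
import math

C = 0.25              # decay scaling constant c (assumed value; not specified in the text)
DEFAULT_ALPHA = 0.3   # default rate of forgetting

def activation(encounter_times, t, alpha=DEFAULT_ALPHA):
    """Activation of a chunk at time t (Eq. 1), given encounters at encounter_times (seconds, ascending)."""
    past = [e for e in encounter_times if e < t]
    if not past:
        return float("-inf")      # never encountered: no activation
    d = decay(past, alpha)
    return math.log(sum((t - e) ** -d for e in past))

def decay(encounter_times, alpha):
    """Chunk-specific decay (Eq. 2), based on the activation at the most recent encounter."""
    if len(encounter_times) < 2:
        return alpha              # first encounter: no prior activation to build on
    a_prev = activation(encounter_times[:-1], encounter_times[-1], alpha)
    return C * math.exp(a_prev) + alpha
```

For a difficult fact (high α) the decay is larger, so the summed traces shrink faster and the activation drops towards the retrieval threshold sooner, which is the behaviour illustrated in Fig. 1.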

By default, αx is initialised to a value of 0.3. Over the course of a learning session, the model uses the speed and accuracy of the learner’s responses to update this value whenever the associated fact is rehearsed. At the moment a fact is presented to the learner, the model calculates an expected response time using the ACT-R equation for retrieval time (Anderson 2007), adding a fixed amount of time t0 for perceptual and motor processes:

$$ \mathbb{E}(RT) = e^{-A_{x}} + t_{0} $$
(3)

The discrepancy between expected and observed response time determines how the model adjusts its rate of forgetting estimate for the fact. A faster-than-expected response signals that activation has not yet decayed to the predicted level, so the true rate of forgetting must be lower than assumed. Conversely, an unexpectedly slow or incorrect response implies that activation has decayed further than anticipated, meaning that the current rate of forgetting estimate is too low. An incorrect response is recorded in the model as a slow response (1.5 times the maximum expected response time for a fact with an activation equal to the retrieval threshold). To prevent overcompensation on the basis of a single discordant response, the model takes the five most recent responses for a fact into account when updating its rate of forgetting estimate. No adjustments are made until a fact has been presented at least three times. To update the rate of forgetting estimate, the model performs a binary search within a range of 0.05 below or above the current estimate. The value that minimises the mismatch between the expected and observed response times becomes the new estimate.
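The update step can be sketched as follows, reusing the `activation` helper from the earlier sketch. The perceptual-motor offset t0, the number of search iterations, and the exact mismatch function are assumptions; the text specifies only the ±0.05 search range, the five-response window, and the minimum of three presentations.

```python
import math

T0 = 0.3   # fixed perceptual and motor time t0 in seconds (assumed value)

def expected_rt(a, t0=T0):
    """Expected response time for a chunk with activation a (Eq. 3)."""
    return math.exp(-a) + t0

def update_alpha(encounter_times, observed_rts, alpha, half_range=0.05, steps=10):
    """Re-estimate the rate of forgetting from the mismatch between expected and observed RTs.

    observed_rts are aligned with encounter_times; incorrect answers are assumed to have
    been recoded upstream as slow responses (1.5 times the maximum expected RT).
    """
    if len(encounter_times) < 3:
        return alpha                               # no adjustment before the third presentation
    recent = list(zip(encounter_times, observed_rts))[-5:]   # five most recent responses

    def mismatch(candidate):
        total = 0.0
        for t, rt in recent:
            # activation() is the Eq. 1 sketch above; only encounters before t count
            a = activation(encounter_times, t, candidate)
            total += abs(expected_rt(a) - rt)
        return total

    lo, hi = alpha - half_range, alpha + half_range
    for _ in range(steps):                         # binary search for the minimising value
        mid = (lo + hi) / 2
        if mismatch(mid - 1e-4) < mismatch(mid + 1e-4):
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2
```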

Previous work has shown this adaptive fact learning system to work in controlled laboratory settings (van Rijn et al. 2009; Nijboer 2011; Sense et al. 2016), as well as in applied educational settings (Sense et al. 2018). However, since updates to the rate of forgetting estimate only begin after the third repetition and are constrained to a relatively narrow range around the current value, it can take quite some time to correct an inaccurate starting estimate. Rate of forgetting estimates typically range from about 0.1 to about 0.5. Accurately representing an item at the edge of this range already requires at least six presentations: reaching a value 0.2 away from the default takes four adjustments of 0.05 each, and the first adjustment can only occur at the third presentation. Better initial estimates may therefore lead to a substantially more effective presentation schedule.

Bayesian Prediction of Rate of Forgetting

We evaluated four methods for predicting initial rate of forgetting estimates from previously collected learning data. These methods differ in terms of their granularity. The most granular is Fact & Learner, a hybrid method, which makes an individual prediction \(\alpha _{F_{i} \circledast L_{j}}\) for the rate of forgetting value of every learner-fact pair. Less specific are Fact, a collaborative filtering method that predicts a single value \(\alpha _{F_{i}}\) for each fact, and Learner, a content-based method that predicts a single value \(\alpha _{L_{j}}\) for each learner. The least granular prediction method is Domain, a collaborative filtering method that only predicts a single value αD for all learners and facts by taking the mean of all fact-level predictions. For comparison, we also included a Default condition in which the predicted rate of forgetting \(\alpha _{\varnothing }\) is always the default value of 0.3.

All prediction methods rely on a Bayesian model to arrive at a predicted rate of forgetting. This model assumes that the rate of forgetting αi for a fact or learner i is normally distributed with some unknown mean μi and unknown precision (the reciprocal of the variance) λi:

$$ \alpha_{i} \sim \mathcal{N}(\mu_{i}, \lambda_{i}^{-1}) $$
(4)

Using the conjugate prior of this distribution, the joint Normal-Gamma distribution, the Bayesian model can simultaneously infer the mean and precision (Murphy 2007):

$$ p(\mu_{i}, \lambda_{i}) = p(\mu_{i} \mid \lambda_{i}) \, p(\lambda_{i}) $$
(5)
$$ p(\mu_{i} \mid \lambda_{i}) = \mathcal{N}(\mu_{0}, \kappa_{0}\lambda_{i}^{-1}) $$
(6)
$$ p(\lambda_{i}) = \mathcal{G}(\alpha_{0}, \beta_{0}) $$
(7)

Since the prior is conjugate, the posterior is also a Normal-Gamma distribution, and can be found analytically rather than through computationally expensive Markov chain Monte Carlo, which is an important consideration for our implementation in the Web browser. The posterior predictive for the next observation, after having seen n data points, follows a t-distribution (see Murphy, 2007, for a derivation):

$$ p(x \mid D) = t_{2\alpha_{n}}\left( x \mid \mu_{n}, \frac{\beta_{n}(\kappa_{n} + 1)}{\alpha_{n}\kappa_{n}}\right) $$
(8)

We selected a weakly informative prior, with μ0 = 0.3, κ0 = 1, α0 = 3, and β0 = 0.2. This particular prior was chosen because it reflects our assumption that the rate of forgetting is normally distributed around 0.3, which previous studies have shown to be a reasonable average across materials and learners (e.g. van Rijn et al., 2009, Sense et al., 2016), and because it yields a sensible prior predictive distribution. To arrive at a single predicted value, we take the mode of the posterior predictive distribution, which represents the rate of forgetting with the highest probability of being observed, given the model and the data.
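Because the model is fully conjugate, the update reduces to a handful of closed-form expressions. The sketch below follows the standard Normal-Gamma update equations from Murphy (2007) with the prior stated above; since the posterior predictive is a Student-t centred on μn, its mode is simply μn.

```python
from dataclasses import dataclass

@dataclass
class NormalGamma:
    """Conjugate Normal-Gamma model for a rate of forgetting (prior values from the text)."""
    mu: float = 0.3     # mu_0
    kappa: float = 1.0  # kappa_0
    a: float = 3.0      # alpha_0 (Gamma shape)
    b: float = 0.2      # beta_0 (Gamma rate)

    def update(self, observations):
        """Closed-form posterior after observing a list of rate of forgetting estimates."""
        n = len(observations)
        if n == 0:
            return self
        xbar = sum(observations) / n
        ss = sum((x - xbar) ** 2 for x in observations)
        kappa_n = self.kappa + n
        mu_n = (self.kappa * self.mu + n * xbar) / kappa_n
        a_n = self.a + n / 2
        b_n = self.b + 0.5 * ss + self.kappa * n * (xbar - self.mu) ** 2 / (2 * kappa_n)
        return NormalGamma(mu_n, kappa_n, a_n, b_n)

    def predict(self):
        """Mode of the posterior predictive (Eq. 8): the Student-t is centred on mu_n."""
        return self.mu
```

A fact-level prediction, for example, would then amount to `NormalGamma().update(alphas_observed_for_this_fact).predict()`, while a learner-level prediction would be trained on that learner's final estimates for other facts instead.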

The model is updated using rate of forgetting estimates obtained from prior learning data. Estimates are only included when they are based on at least three presentations, so that the adaptive model has had at least one opportunity for adjustment. Figure 2 illustrates the process of making fact-level, learner-level, and fact- and learner-level predictions on the basis of previous observations.

Fig. 2 The process of predicting rate of forgetting from observations. Left pane: Over the course of a learning session, the adaptive fact learning model refines its rate of forgetting (α) estimate for every fact a learner encounters, ultimately resulting in a final rate of forgetting estimate for each observed learner-fact pair. Right pane: Depending on the type of prediction, a different subset of the observed final rates of forgetting is used to train the Bayesian model. The predicted value is the mode of the posterior predictive distribution of the model. A combined fact- and learner-level prediction combines the two posterior predictive distributions using logarithmic opinion pooling and takes the mode of the resulting distribution as its predicted value.

In the Fact & Learner prediction, the predicted rate of forgetting for a given fact when studied by a particular learner is based on the combination of the Fact prediction (which uses previously estimated rates of forgetting of that fact among other learners) and the Learner prediction (which uses previously estimated rates of forgetting of the learner on other facts). First, a separate posterior predictive distribution is generated for each prediction, following the process outlined above. The two distributions are then combined into a single posterior predictive distribution pLOP, using logarithmic opinion pooling (Genest et al. 1984) with k = 2 equal weights w:

$$ p_{LOP}(x \mid D) = \frac{\prod_{i=1}^{k} p_{i}(x \mid D)^{w_{i}}}{\int \prod_{i=1}^{k} p_{i}(x \mid D)^{w_{i}} \, dx} $$
(9)
$$ p_{LOP}(x \mid D) = \frac{p_{fact}(x \mid D)^{0.5} \cdot p_{learner}(x \mid D)^{0.5}}{\int p_{fact}(x \mid D)^{0.5} \cdot p_{learner}(x \mid D)^{0.5} \, dx} $$
(10)

The mode of this pooled distribution becomes the predicted rate of forgetting of the fact for this learner. By using logarithmic pooling, we ensure that the resulting distribution is always unimodal (given the two unimodal inputs), even when the separate predictive distributions are very different. Because this method weights the two predictive distributions equally, their relative contribution to the final prediction is determined by their uncertainty. If one distribution is relatively spread out—reflecting less agreement in the data, or fewer observations—it exerts a smaller pull on the mode of the combined distribution than the other, more peaked distribution. In this way, the Fact & Learner prediction is biased towards whichever of its two components is more certain.
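One possible implementation of the pooling step, evaluating both posterior predictive densities on a grid of candidate rates of forgetting and taking the mode of the unnormalised log-pooled density, is sketched below. It assumes the `NormalGamma` objects from the previous sketch and reads the second parameter in Eq. 8 as the squared scale of the Student-t, following Murphy (2007).

```python
import numpy as np
from scipy.stats import t as student_t

def pooled_prediction(fact_post, learner_post, grid=np.linspace(0.0, 1.0, 2001)):
    """Fact & Learner prediction: mode of the logarithmic opinion pool (Eqs. 9 and 10)."""
    def predictive_density(post):
        df = 2 * post.a
        scale = np.sqrt(post.b * (post.kappa + 1) / (post.a * post.kappa))
        return student_t.pdf(grid, df, loc=post.mu, scale=scale)

    # Equal weights (w = 0.5); the normalising integral does not move the mode,
    # so the unnormalised log-pooled density suffices for prediction.
    log_pool = 0.5 * np.log(predictive_density(fact_post)) \
             + 0.5 * np.log(predictive_density(learner_post))
    return grid[np.argmax(log_pool)]
```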

Experiment 1: Comparison of Mitigation Strategies

The first experiment consisted of an estimation phase, in which rate of forgetting estimates were collected for facts and learners, and a comparison phase, in which those estimates were used to mitigate the cold start problem in a new learning session via one of the mitigation strategies (see Fig. 3). In the estimation phase, the system obtained rate of forgetting estimates for a set of facts in one participant sample. In addition, using comparable but different facts, it determined individual rate of forgetting estimates for learners in a second participant sample. During the comparison phase, predictions based on the collected rate of forgetting estimates were used as starting values for the adaptive algorithm in a new learning session, completed by the participants from the second sample. We performed a between-subject comparison of five different prediction types in the comparison phase: fact-level prediction, learner-level prediction, fact- and learner-level prediction, domain-level prediction (the average of all fact-level predictions), and, as a control condition, the fixed value used by the current system. We compared the five conditions on learning performance, measured both during the learning session and on a subsequent delayed recall test, as well as on model-based measures.

Fig. 3 Design of experiment 1. Rate of forgetting (α) estimates for facts and learners, obtained in the estimation phase, were applied in the comparison phase as starting values in a new learning session using one of the cold start mitigation strategies. The bottom label and colour of each learning session box indicate the type of prediction used for cold start mitigation; the middle of each box visualises the process by which α was predicted. Learning performance was measured during the learning session and on a delayed recall test.

Methods

Materials

Participants learned the names of two sets of thirty small towns across the contiguous USA, shown as dots on a map (see Fig. 4). To minimise the influence of existing knowledge, all towns, which were selected from the Simplemaps US cities database, had no more than 5000 inhabitants, and each town shared its name with at least four other towns in the database. All towns had single-word names with no special characters and no more than eight letters. A complete list of the towns in both sets can be found in the online supplement.

Participants

We recruited 241 participants from the research participant pool for first-year psychology students at the University of Groningen. They were split into two samples, one for each phase of the experiment. There were 82 participants in the first sample and 159 in the second. Data collection was conducted in the lab and participants were compensated with course credits.

For the online replication, which we conducted to examine how well our findings would extrapolate to a different population, an additional 217 participants were recruited in the USA via Amazon’s MTurk platform. Of these participants, 85 were in the first sample, with the remaining 132 being in the second sample. The MTurk participants were selected to be similar to the student population tested in the lab in terms of age and education. To be eligible for participation, they were required to have a HIT approval rate of 95% or higher, be located in the USA, be born between 1992 and 1999 (age range, 19 to 27 years), and have a US high school diploma. The MTurk participants completed the experiment online and were compensated in accordance with the US federal minimum wage.

The experiment was approved by the Ethical Committee of Psychology at the University of Groningen (study codes: 18215-SO and 18216-SO). All participants gave informed consent.

Procedure

Estimation Phase

The task was implemented in jsPsych (de Leeuw 2015) and ran in a Web browser. A minimum browser window size of 1280 × 768 pixels was required to display the experiment. For participants in the lab, the experiment was presented in full screen on a desktop computer. Since we could not fully control the circumstances under which MTurk participants completed the task, we asked them to ensure that they were in a quiet environment in which they would not be interrupted by phone calls, notifications, or other distractions. In addition, they were warned that clicking away to another tab or window during the experiment would terminate the task. To assess participants’ basic typing proficiency, and to acclimatise them to the response and feedback format, sessions started with a six-item typing test in which participants typed a word shown on screen. They received feedback about their accuracy. All participants were found to have sufficient typing ability.

The typing test was followed by a 10-min learning session using the adaptive fact learning system. The two participant samples studied different sets of facts, but the procedure was otherwise the same. The order in which new facts were introduced was randomised. The adaptive fact learning system determined when and how often each fact was repeated. An example trial is shown in Fig. 4. In each trial, participants saw a map of the USA on which one town was highlighted in red. The twenty-nine other towns in the set were marked with smaller, grey dots. Participants typed the name of the highlighted town in lowercase in a text box below the map, confirming their response by pressing the enter key. If they did not recall the name of the town, they were instructed to simply press enter. New towns were introduced through a study trial in which the town’s name was displayed directly above the text box. Feedback was always shown below the text box directly after a response had been made. The answer that the participant had given remained visible in the text box. If a participant’s response matched the correct answer exactly, the feedback Correct was shown below the text box for 600 ms. A response with a Damerau-Levenshtein distance of 1 from the correct answer was considered almost correct (the assumption being that the participant did in fact know the correct answer, but made a single typing error), and was treated as correct by the model. In this case, the text Almost correct, along with the correct answer, appeared below the text box for 1.2 s. Finally, a response with a Damerau-Levenshtein distance from the correct answer of two or more was considered incorrect, and prompted the text Incorrect, along with the expected answer, to appear below the text box for 4 s.
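The feedback logic amounts to a three-way classification based on the Damerau-Levenshtein distance between the typed response and the correct answer, roughly as in the sketch below. The distance function itself is assumed to come from any standard string-distance implementation and is not defined here.

```python
def classify_response(response, answer, dl_distance):
    """Feedback category for a typed answer.

    dl_distance: an assumed Damerau-Levenshtein distance function (not defined here).
    """
    d = dl_distance(response.strip().lower(), answer.lower())
    if d == 0:
        return "correct"         # feedback shown for 600 ms
    if d == 1:
        return "almost correct"  # treated as correct by the model; shown for 1.2 s
    return "incorrect"           # shown with the expected answer for 4 s
```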

Participants in the first sample were done with the task after completing the 10-min learning session. They saw a debriefing screen showing the number of trials they had completed, the number of towns they had encountered, and their overall response accuracy. Participants in the second sample, however, took a delayed recall test following a 5-min Tetris game that served as a filler task. In the recall test, facts that had been studied were presented one-by-one in the same manner as before. Feedback was withheld until the end of the test.

Comparison Phase

Participants in the second sample completed another block, once again consisting of a 10-min learning session, 5 min of Tetris, and a delayed recall test of the studied facts. This time, participants studied the other set of facts, and the adaptive fact learning system used one of the cold start mitigation strategies. Participants were randomly assigned to one of the five prediction types, which are described in the next section. The block ended with the debriefing screen.

Analysis

Before starting data collection in the second sample, we preregistered a set of analyses at https://osf.io/vwg6u/. The preregistration stated that we would collect data from a minimum of 25 participants per condition by a specified date. We did not reach this minimum by the deadline, so we continued data collection. After the data had been collected, we realised that several of the planned analyses were ill-conceived. The analyses reported in this paper therefore deviate from our preregistration in two ways: they include data collected after the deadline, and in some cases use different methods. An online supplement containing the data and full analysis scripts, including analyses conducted only on the subset of the data collected before the deadline as well as the originally planned analyses, is available at https://osf.io/snfyz/.

We used Bayesian statistics for all analyses because Bayesian methods allow for the quantification of evidence in favour of the null model in the event that no difference is found between conditions (Gallistel 2009). Following Jeffreys (1961), an effect was considered meaningful if the data provided at least moderate evidence in its favour, i.e., a Bayes factor of 3 or higher. All analyses were performed in R (version 3.6.3; R Core Team, 2018). Bayesian ANOVAs and t-tests were done using the BayesFactor package with default settings (version 0.9.12-4.2; Morey and Rouder, 2018).

Bayesian regression models were fitted with the brms package (version 2.12.0; Bürkner, 2017). In cases with multiple measurements per participant and/or item, the models included the appropriate random intercepts. In each analysis, we fitted a maximal model with main effects of prediction type (Default, Domain, Fact, Learner, Fact & Learner) and population (Lab, MTurk) plus the interaction of these terms. Both this maximal model and models with a simpler fixed effects structure were compared to a baseline intercept-only model (see the rows in Table 1) using the bridgesampling package (version 0.7-2; Gronau et al., 2020). Following Rouder and Morey (2012), fixed effects had weakly informative Cauchy(0,1) priors. All model runs used 4 Markov chains with 10,000 iterations each, including 5000 warm-up samples. Other options were left at their default setting.

Table 1 Bayesian model comparisons

Results

Figure 5 provides an overview of performance in the comparison phase of experiment 1. Statistical analyses associated with each subfigure are summarised in Table 1, giving an overview of the effects across various outcome measures at a glance. Further details regarding the type of model used in each analysis as well as the specific findings are given in the following sections.

Fig. 5 Experiment 1 comparison phase results. a: Predicted versus observed rate of forgetting per condition. The predicted values were used as initial rate of forgetting estimates in the learning session; the observed values were the final estimates. The black lines show the best linear fit to the data. b: Number of distinct facts encountered during the learning session. c: Response accuracy during the learning session. d: Number of items answered correctly on the delayed recall test. e: Response accuracy on studied items on the delayed recall test. The box plots show the median and the first and third quartiles.

Rate of Forgetting Prediction

To assess the accuracy of the rate of forgetting predictions—made using the final rates of forgetting observed during the estimation phase—we compared them to the final rates of forgetting estimated at the end of the learning session (Fig. 5a). Prediction accuracy was quantified as the root-mean-square error (RMSE) of the predicted rate of forgetting relative to the rate of forgetting observed at the end of the session. The RMSE was calculated within-subject, and then averaged across subjects within each condition. A Bayesian ANOVA confirmed main effects of prediction type and population on prediction accuracy (Table 1A). The next best model included only an effect of prediction type, suggesting that this factor contributed more to the strength of the winning model than the effect of population. That said, follow-up pairwise comparisons of prediction accuracy between conditions were generally inconclusive. Only a one-sided Bayesian t-test comparing RMSE in all four predictive conditions (M = 0.084, SD = 0.025) to the Default condition (M = 0.102, SD = 0.035) found strong evidence for an improvement in prediction accuracy (BF = 1.0 × 10³). The main effect of population indicated that RMSE was slightly lower in the lab population (M = 0.083, SD = 0.023) than in the MTurk population (M = 0.093, SD = 0.033).
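For clarity, the RMSE measure can be expressed as in the following sketch, assuming a long-format data frame with one row per learner-fact pair and columns `condition`, `subject`, `predicted_alpha`, and `observed_alpha` (illustrative names rather than those used in the analysis scripts).

```python
import pandas as pd

def mean_rmse_per_condition(df: pd.DataFrame) -> pd.Series:
    """Within-subject RMSE of predicted vs. observed rate of forgetting, averaged per condition."""
    per_subject_rmse = (
        df.assign(sq_err=(df.predicted_alpha - df.observed_alpha) ** 2)
          .groupby(["condition", "subject"])["sq_err"]
          .mean()
          .pow(0.5)                        # RMSE within each subject
    )
    return per_subject_rmse.groupby(level="condition").mean()  # average across subjects
```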

Learning Session Performance

The number of distinct facts studied during the learning session did not differ between conditions (Fig. 5b). We modelled the data using a Bayesian Poisson regression. Model comparison showed the data to be most likely under the intercept-only null model (Table 1B). According to this model, participants encountered 17.7 distinct facts on average (95% CI [17.2, 18.2]). In contrast, an analysis of response accuracy during the learning session did find a difference between conditions (Fig. 5c). A comparison of Bayesian logistic mixed-effects models showed that the data were most likely under a model that included main effects of prediction type and population (Table 1C). This model found similarly sized improvements in response accuracy over the Default condition in all four cold start mitigation conditions: on average, response accuracy in these conditions was about 7.3 percentage points higher (95% CI [3.9, 11.0]). In addition, participants in the lab outperformed MTurk participants by about 5.5 percentage points (95% CI [3.1, 8.0]).

Delayed Recall Test Performance

There was no difference between conditions in the number of correctly recalled facts on the delayed recall test (Fig. 5d). A comparison of Bayesian Poisson regression models showed strong evidence in favour of the intercept-only null model (Table 1D), which found an average test score across conditions of 15.0 (95% CI [14.5, 15.4]). We obtained a similar result when looking at response accuracy on just the items that participants had encountered in the preceding learning session (Fig. 5e): the Bayesian logistic mixed-effects model that was most likely to have generated the data was the intercept-only null model (Table 1E). This model showed that overall accuracy on studied items was 87.9% (95% CI [85.2%, 90.3%]) across conditions.

Discussion

In this experiment, we attempted to improve the initial difficulty estimates of the cognitive model underlying our adaptive fact learning system, with the aim of reducing the cold start problem and thereby improving learning outcomes. We predicted the rate of forgetting of individual facts for individual learners by training a Bayesian model on previous data from the learning system. Several prediction methods with different granularity were compared to the fixed starting estimate currently used by the system.

Although the rate of forgetting predictions made by the Bayesian model were closer to observed values than the default prediction, this did not translate into an improvement in preregistered learning outcomes. Exploratory analysis did suggest that response accuracy increased during the learning session, indicating that facts were repeated on a more appropriate schedule than they would have been otherwise.

We expected fine-grained predictions to be more accurate than coarser ones, but comparisons between prediction types were largely inconclusive. It appears that domain-level predictions were already able to capture most of the relevant difficulty information, and predictions at finer levels of detail did not provide sufficient additional information to make a meaningful difference. Fine-grained predictions did show a larger spread than domain-level predictions (Fig. 5a), suggesting that there were facts that were consistently easy or difficult and learners with consistently higher or lower rates of forgetting, in line with earlier work (e.g. Sense et al., 2016). It should be noted that both the participant samples and the set of materials were fairly homogeneous, especially when compared to the more diverse educational settings found in real life. In settings where there are greater differences in skill between students, or differences in difficulty between facts, one would expect a domain-level prediction to fall short.

The stimulus material was also more difficult than expected: in both populations, predicted rates of forgetting were generally higher than the default value, which was found to be a reasonable average in previous studies. A higher rate of forgetting leads to more frequent repetitions of an item during the learning session. The learning system would therefore have had less opportunity to introduce new facts during the learning session than if some of the facts had been easier. The difficulty of the material meant that, despite having more accurate starting estimates, the system could not really capitalise on the differences between facts.

In short, while the use of informed starting estimates in the adaptive learning system may have improved response accuracy during the learning session, test scores remained unaffected, possibly as a result of having a uniformly difficult stimulus set. To test whether this was indeed the case, we conducted a second experiment with a more diverse set of facts that included both easy and difficult items but with a similarly homogeneous participant sample. For simplicity, this experiment focused on fact-level predictions.

Experiment 2: More Variable Fact Difficulty

Figure 6 shows the design of experiment 2. As in experiment 1, the experiment consisted of an estimation phase, in which we collected the rate of forgetting estimates that formed the basis for the difficulty predictions, and a comparison phase, in which those predictions were used as starting estimates in the adaptive learning system. Experiment 2 used a more varied set of stimuli, with the aim of letting the system capitalise on the larger differences in difficulty between facts. For this reason, we only used fact-level predictions. These predictions were derived from the learning sessions of one set of participants during the estimation phase, and subsequently applied in the learning sessions of another set of participants during the comparison phase. To see whether cold start mitigation improved learning outcomes under these circumstances, we performed a within-participant comparison of learning performance between a cold start session that used default predictions and a warm start session that used fact-level predictions.

Fig. 6 Design of experiment 2. Rate of forgetting (α) estimates for facts, obtained during the estimation phase, were applied as starting values in a new learning session during the comparison phase. Learning performance was measured during the learning session and on a delayed recall test.

Methods

Materials

As in experiment 1, participants learned the names of places on a map. To increase the variability in difficulty, we created a set of sixty towns or cities, half of which were small towns and half of which were larger cities that participants were more likely to know already. The small towns had populations of 5000 people or fewer and shared their name with at least four other such towns (as in experiment 1), while the larger cities had at least 100,000 inhabitants. All places had single-word names that were no more than eight letters long and did not contain any special characters. The places were spread evenly over the contiguous USA.

Participants

We tested 197 participants from the research participant pool for first-year psychology students at the University of Groningen, none of whom had participated in experiment 1. They were divided over two samples that consisted of 128 and 69 participants, respectively. Both sample sizes were determined by preregistered stopping rules. Data collection in the first sample continued until each of the sixty facts in the stimulus set had been studied by at least thirty participants (analyses of the data from experiment 1 showed that about thirty observations per fact are needed to obtain a stable rate of forgetting prediction that does not change substantially with additional observations). For the second sample, we used a Sequential Bayes Factors (SBF) design, which allows stopping based on an evidence criterion specified in advance (Schönbrodt et al. 2017). Details of the stopping rule are given in the section “Analysis”. Data were collected in the lab and participants received course credits for taking part.

The experiment was approved by the Ethical Committee of Psychology at the University of Groningen (study codes: 18215-SO and 18216-SO). All participants gave informed consent.

Procedure

Estimation Phase

Participants in the first sample completed a single 10-min learning session using the default adaptive fact learning system. The format of the learning session was identical to that of the learning sessions in experiment 1 in terms of stimulus presentation and the response and feedback mechanism, but using the new stimuli. All sixty places were marked on the map, but each participant was only presented a subset of thirty. To optimise the data collection process, we tracked how many participants had encountered each place. Whenever a new place was to be introduced during a session, it was randomly selected from the least practiced places. A debriefing screen was shown at the end of the learning session, containing the number of completed trials, the number of places studied, and the overall response accuracy.

Comparison Phase

Participants in the second sample completed two 10-min learning sessions with the adaptive fact learning system, each followed by a 5-min game of Tetris and a delayed recall test of the studied items. To allow for a within-subject comparison between conditions, one of the learning sessions used the default rate of forgetting predictions, while the other session used fact-level predictions derived from the observations in the estimation phase. The condition order was counterbalanced (see Fig. 6). The sixty facts were divided over the two learning sessions in a semi-random manner. First, the facts were sorted by their predicted rate of forgetting and split into an “easy” and a “difficult” half. For each participant, we then randomly sampled fifteen facts from the easy half and fifteen from the difficult half to be presented in the first learning session. The remaining facts were used in the second session.
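The assignment of facts to the two sessions can be summarised with the sketch below; variable names are illustrative, and the actual implementation ran in the browser.

```python
import random

def assign_facts_to_sessions(facts, predicted_alpha, rng=random):
    """Split the sixty facts into two session sets of thirty, balanced on predicted difficulty."""
    ordered = sorted(facts, key=lambda f: predicted_alpha[f])
    easy, difficult = ordered[:30], ordered[30:]          # split on predicted rate of forgetting
    first_session = rng.sample(easy, 15) + rng.sample(difficult, 15)
    second_session = [f for f in facts if f not in first_session]
    return first_session, second_session
```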

Analysis

Bayesian mixed-effects regression models were fitted using the same procedure as in experiment 1, but this time the maximal model had main effects of prediction type (Default, Fact) and experimental block (1, 2), along with the interaction of these terms. The simpler alternatives to which this model was compared are listed in Table 2.

Table 2 Bayesian model comparisons

We preregistered one analysis (delayed recall test accuracy; Fig. 7f) at https://osf.io/w4gtd/. The other reported analyses are exploratory. An online supplement containing the data and full analysis scripts is available at https://osf.io/snfyz/.

The preregistration included a stopping rule to determine when to halt data collection in the second sample. As planned, once an initial set of data had been collected from twenty participants per condition order, we used a Bayesian stopping rule (Rouder 2014) that we evaluated after each day of data collection: stop data collection when there is sufficiently strong evidence—a Bayes factor of 10 or higher—for or against the hypothesis that using fact-level predictions improves test accuracy on studied items, or when data have been collected from eighty participants per condition order. Per the preregistration, the evidence for a difference in test accuracy was evaluated in a Bayesian logistic mixed-effects model with main effects of prediction type and block, and random intercepts for participants and items. We used a Savage-Dickey density ratio test (Wagenmakers et al. 2010) to quantify the evidence for the alternative hypothesis (a non-zero coefficient for prediction type, assigned a weakly informative Cauchy(0,1) prior) over the null hypothesis (a coefficient of exactly 0):

$$ BF_{10} = \frac{p(D \mid \mathcal{H}_{1} : \beta_{\text{prediction type}} \neq 0)}{p(D \mid \mathcal{H}_{0} : \beta_{\text{prediction type}} = 0)} $$
(11)
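Given posterior draws of the prediction-type coefficient, the ratio in Eq. 11 can be approximated as in the sketch below: the prior density at zero comes from the Cauchy(0,1) prior placed on the coefficient, and the posterior density at zero is estimated from the MCMC samples. This is a sketch of the computation, not the brms/R code used in the analysis.

```python
import numpy as np
from scipy.stats import cauchy, gaussian_kde

def savage_dickey_bf10(posterior_samples, prior_scale=1.0):
    """Approximate BF10 for a non-zero coefficient via the Savage-Dickey density ratio."""
    posterior_at_zero = gaussian_kde(np.asarray(posterior_samples))(0.0)[0]
    prior_at_zero = cauchy.pdf(0.0, loc=0.0, scale=prior_scale)
    return prior_at_zero / posterior_at_zero
```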

The evidence criterion was satisfied after 3 days of data collection (BF10 = 181.9). We excluded the data from one participant in the second sample for failing to follow the task instructions.

Results

Figure 7 shows the performance in the comparison phase of experiment 2. Table 2 provides a summary of the statistical analyses associated with each subfigure. Further details about the analyses and findings are given in the following sections.

Fig. 7 Experiment 2 comparison phase results. a Predicted versus observed rate of forgetting per condition. The predicted values were used as initial rate of forgetting estimates in the learning session; the observed values were the final estimates. The black line shows the best linear fit to the data. b Predicted rate of forgetting in the Fact condition, by type. c Number of distinct facts encountered during the learning session. d Response accuracy during the learning session. e Number of items answered correctly on the delayed recall test. f Response accuracy on studied items on the delayed recall test. The box plots show the median and the first and third quartiles.

Rate of Forgetting Prediction

Figure 7a compares the predicted rates of forgetting, which were generated from the data of the estimation phase, to the final rate of forgetting estimates at the end of the learning session. As in experiment 1, prediction accuracy (measured as RMSE) differed between conditions. A Bayesian ANOVA found that the data were most likely under a model that assumed main effects of prediction type and block, as well as their interaction. A comparison of RMSE values shows that predictions were less accurate in the Default condition than in the Fact condition, and less accurate in the first block than in the second, though the difference between the two conditions was smaller in the second block (Default, block 1: M = 0.120, SD = 0.044; Default, block 2: M = 0.091, SD = 0.023; Fact, block 1: M = 0.085, SD = 0.023; Fact, block 2: M = 0.082, SD = 0.024).

The second experiment used a mix of small towns and larger well-known cities as stimuli, in an attempt to create a more balanced difficulty distribution that included more easy facts. Figure 7b shows the predicted rates of forgetting for both categories of fact. As expected, big cities tend to have a lower rate of forgetting than small towns, but the average predicted rate of forgetting in both distributions is still higher than 0.3, the default value. This means that, although the stimulus material is easier than in experiment 1, on average it will still be slightly harder for the adaptive fact learning system to introduce new facts during the learning session when using these predictions than when using the default starting estimate.

Learning Session Performance

The number of distinct facts that participants encountered during the learning session was similar between conditions (Fig. 7c). As Table 2C shows, the best Bayesian mixed-effects Poisson regression model included only a main effect of block. According to this model, the number of distinct facts increased by about 2.3 (95% CI [0.98, 3.7]) from the first to the second block, regardless of prediction type.

As Fig. 7d shows, participants did respond more accurately in the Fact condition. The best Bayesian logistic mixed-effects model included main effects of prediction type and block, but no interaction term (Table 2D). This model found that, across blocks, response accuracy was about 7.7 percentage points higher (95% CI [6.2, 9.3]) in the Fact condition. Accuracy also increased from the first to the second experimental block, by about 4.8 percentage points (95% CI [3.3, 6.3]).

Delayed Recall Test Performance

The absolute number of items that participants were able to recall on the delayed recall test was similar between conditions (Fig. 7e). As Table 2E shows, the observed test scores were most likely under a Bayesian mixed-effects Poisson regression model with only a main effect of block. According to this model, test scores increased by about 2.7 points (95% CI [1.5, 3.9]) from the first block to the second, regardless of prediction type.

That said, participants’ response accuracy on items that they had encountered during the preceding session was higher if that session used fact-level predictions (Fig. 7f). In accordance with our preregistration, we used a Bayesian logistic mixed-effects model to assess the evidence for this effect. The model included fixed effects for prediction type and block, and random intercepts for items and participants. A post hoc model comparison showed this model to be preferred over the alternatives (Table 2F). The model indicated that, on the whole, response accuracy was about 6.8 percentage points higher (95% CI [3.4, 10.5]) on the test following the Fact block than on the test after the Default block. Test accuracy was also found to increase with task experience, as learners scored about 6.3 percentage points higher (95% CI [2.8, 9.9]) on the second test than on the first.

Exploratory: Factors Explaining Improved Performance

To better understand where the gains in response accuracy came from, we conducted several exploratory follow-up analyses.

Stronger Differentiation in Presentation Schedule

The memory model (see Eqs. 1 and 2) predicts that a higher rate of forgetting causes a fact to decay faster, reducing the probability of it being successfully recalled in the future. The adaptive fact learning system should compensate for these differences in decay by changing the rehearsal schedule: difficult facts are repeated sooner and more frequently so that they remain sufficiently active in memory. The more successful rehearsal and retention of difficult items in the Fact condition indicate that the adaptive fact learning system was able to schedule item repetitions at more appropriate times when initialised with fact-level predictions. We fitted a Bayesian mixed-effects model to the number of presentations of a fact in the session (modelled as a Poisson distribution). The best model included main effects of prediction type (Default or Fact), predicted rate of forgetting, and their interaction, along with a random intercept for participants. As the model fit in Fig. 8a shows, there was indeed stronger differentiation between facts in the Fact condition. In both conditions, the number of presentations increased as facts became more difficult, showing that the online adaptation works as expected. This increase was stronger in the Fact condition. Initialising the system with fact-level predictions allowed it to present difficult items at a higher rate from the very start of the session, giving participants more opportunities to learn these items. The model also suggests that the easiest facts may have been presented slightly less often in the Fact condition, which would have freed up some rehearsal time.

Fig. 8 Effects of fact difficulty on scheduling and performance. a Number of presentations of a fact during the learning session as a function of predicted fact-level rate of forgetting. b Response accuracy during the learning session and on the subsequent delayed recall test. In both figures, lines and shaded areas show the posterior medians and 95% credible intervals of the best-fitting Bayesian mixed-effects regression model (details provided in the text). Individual points show the mean across participants (± 1 SE) for each fact.

Better Retention of Difficult Items

In an ideal world, we would know each fact’s rate of forgetting exactly and be able to adapt the rehearsal schedule accordingly. Recall success—at least during the learning session—would then be equally high for all facts, regardless of their difficulty. In reality, of course, we can never have perfect knowledge of fact difficulty, and this leads to lower recall accuracy for more difficult facts. We fitted separate Bayesian logistic mixed-effects models to response accuracy during the learning session and on the test. In both cases, a model comparison using bridge sampling yielded a model with main effects of the predicted rate of forgetting, block (1 or 2), and the interaction between predicted rate of forgetting and prediction type (Default or Fact), as well as a random intercept for participants. The model fits are presented in Fig. 8b. (For simplicity, the intercept difference between blocks was left out.) The models suggest that response accuracy did decrease as facts became more difficult, regardless of prediction type, but that this effect was weakened in the Fact condition. While participants were more or less equally successful in recalling easy facts, the use of fact-level predictions in the Fact condition gave participants a better chance of retaining the difficult facts too. As can be seen in the figure, this pattern carried over from the learning session to the test.

Discussion

The second experiment showed that learning outcomes can be improved through mitigation of the cold start experienced with adaptive learning systems. Using a more realistic distribution of easy and difficult facts, we found that participants’ response accuracy on a delayed recall test was higher when they had studied the items with the learning system informed by fact-level predictions. Exploratory follow-up analyses suggest that our cold start mitigation strategy was effective: the informed system gave participants more opportunities to rehearse difficult facts, resulting in more successful retrievals of these facts during the learning session and higher accuracy on the subsequent test.

Since the more difficult facts were identified as such from the start, the system was able to dedicate more rehearsal time to them, giving participants a better chance of retaining these facts. In addition, being immediately aware of the easiest facts may have saved the system some time it would have otherwise spent on their rehearsal, without hurting retention.

General Discussion

We implemented and tested a set of cold start mitigation strategies in an adaptive fact learning system. In two experiments, we found that the tested mitigation strategies made more accurate predictions of difficulty than the fact learning system’s default prediction, allowing the system to dedicate more time to difficult items without sacrificing the learning of easy items. The effect that this improved item scheduling had on learning outcomes depended on there being sufficient variability in fact difficulty. When the study material contained enough easy facts, the use of individualised difficulty predictions created a more pronounced differentiation in the scheduling of easy and difficult items, resulting in better retention of items studied during the learning session.

Limitations

These results showed that better predictions do not invariably lead to better learning, but rather that their effect depends on the dynamics of the adaptive system in which they are used. In addition, they remind us that designers of adaptive systems should be mindful of the system’s dynamics when deciding on appropriate metrics for measuring success. In the case of our fact learning system, high predicted difficulty of the facts being studied can make the scheduling algorithm more inclined to repeat previously rehearsed facts, at the expense of introducing new facts. This shift towards a more conservative presentation schedule is a sign of the system working correctly, but has the unintended consequence of reducing the number of facts presented during the learning session, thereby also lowering the maximum score achievable on a recall test.

We found no difference between the tested mitigation strategies. All performed comparably in terms of prediction accuracy and learning outcomes, which suggests that fine-grained prediction methods are not necessarily better than simpler, less specific methods. It is possible that the additional benefit of a more fine-grained method only becomes apparent when individual differences between learners are more pronounced than in the current sample, or when the study material is more varied in its difficulty. With our relatively homogeneous participant sample learning facts of fairly similar difficulty, it is perhaps unsurprising that population-level predictions were comparable to individualised predictions. The same principle likely holds for learners: the ability to make a priori assumptions about individual learners’ skills becomes more relevant in a heterogeneous population, such as when the target population includes learners at different stages in their education (Klinkenberg et al. 2011) or with learning disorders like dyslexia and dyscalculia (Pliakos et al. 2019).

Implications and Future Directions

We were able to exploit systematic differences in the rate of forgetting of facts to make useful difficulty predictions for new learners encountering these facts. Not only did these differences generalise over participants within a sample, but they were also fairly consistent across populations. Experiment 1, in which participants learned the names of small towns on a map of the USA, was conducted in two populations, one in the Netherlands and one in the USA. Participants in the USA may have been helped by their higher familiarity with the US map, allowing them to integrate the new information with their existing knowledge more easily (Anderson 1981). Nevertheless, facts that were more difficult for one population tended to be more difficult for the other population, too. The ability to find a “universal” rate of forgetting for a fact aligns with the idea that facts are made more or less memorable by their features (see, e.g. Madan 2019; Celikkale et al. 2013; Broers and Busch 2019). In the case of places on a map, these could include spatial features like proximity to borders or other places, as well as semantic and lexical features of the place names.

We were also able to make rate of forgetting predictions at the level of the individual learner which were about as accurate as fact-level predictions. The fact that a learner’s previously observed rates of forgetting are predictive of their rate of forgetting in a future session (albeit on a relatively short timescale in the experiments reported here) is in line with the view of rate of forgetting as a stable, learner-specific trait (Sense et al. 2016; Zhou et al. 2020).

Experiment 2 showed that using difficulty predictions created a more strongly differentiated repetition schedule that improved participants’ response accuracy, particularly on difficult facts. One could trade some of this additional accuracy for more opportunities to introduce new facts by lowering the fact learning system’s retrieval threshold (the activation value below which a fact is scheduled for repetition). A change in the threshold would represent a shift in task difficulty: the lower the threshold is set, the more difficult the learning session becomes. With a lower threshold, the system waits longer before it repeats a fact, and it can fill in any gaps by introducing new facts. As such, learners may be able to cram in a few more facts, albeit at a slightly higher risk of forgetting them. The appropriate retrieval threshold is likely to differ between learners, and giving learners the ability to adjust it to their desired level of difficulty may increase their motivation to use the system (Metcalfe and Kornell 2005; Kennedy et al. 2014).
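As a minimal sketch of how the retrieval threshold governs this trade-off, the snippet below assumes a simplified power-law activation function and an arbitrary threshold value; it is not the exact memory model of Eqs. 1 and 2, nor the system’s actual scheduling code.

```python
import math

def activation(encounter_times, now, decay):
    """Activation of a fact as the log of summed power-law memory traces.

    A simplified stand-in for the memory model: each past encounter at time t
    contributes (now - t) ** -decay, so activation falls as time passes and
    rises with every repetition.
    """
    return math.log(sum((now - t) ** -decay for t in encounter_times if t < now))

def needs_repetition(encounter_times, now, decay, threshold=-0.8):
    """Repeat a fact once its activation drops below the retrieval threshold.

    The default threshold of -0.8 is a hypothetical value for illustration.
    Lowering it lets activation decay further before a repetition is scheduled,
    freeing up time to introduce new facts at a higher risk of forgetting.
    """
    return activation(encounter_times, now, decay) < threshold

# Example: a fact seen 35 and 5 seconds ago, with a moderate decay of 0.4,
# is still active enough and does not yet need to be repeated.
print(needs_repetition([0.0, 30.0], now=35.0, decay=0.4))
```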

Aside from varying the retrieval threshold, the adaptive fact learning system may benefit from using the uncertainty inherent in the rate of forgetting predictions made by the Bayesian model. Currently, this uncertainty, which is expressed through the width of the posterior predictive distribution, plays no role once the model has arrived at a prediction, since that prediction is only a point estimate. However, the Bayesian approach could be extended to the updating of the rate of forgetting during the learning session, thereby integrating the uncertainty of the prediction in the process. The adaptive learning system would then more readily adjust its difficulty estimates on the basis of new evidence for facts that it is less certain about (e.g. because the initial prediction is based on fewer or more varied observations), while requiring more evidence to significantly change estimates in which it has higher confidence. Such an extension to the system could provide more control over how much belief is placed in predictions, while reducing the impact of unlikely response times on difficulty estimates.
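A schematic way to see how such uncertainty-weighted updating would behave is a normal-normal conjugate update, in which the precision of the prior prediction determines how far a single new observation can move the estimate. This is an illustration of the principle only, not the updating rule of the model described here, and all numbers are made up.

```python
def update_rate_of_forgetting(prior_mean, prior_sd, observed_rof, obs_sd):
    """Precision-weighted (normal-normal conjugate) update of a difficulty estimate.

    A wide prior (large prior_sd, i.e. an uncertain initial prediction) lets a
    single observation shift the estimate substantially; a narrow prior barely
    moves, so unlikely responses have little impact on confident estimates.
    """
    prior_prec = 1.0 / prior_sd ** 2
    obs_prec = 1.0 / obs_sd ** 2
    post_prec = prior_prec + obs_prec
    post_mean = (prior_prec * prior_mean + obs_prec * observed_rof) / post_prec
    return post_mean, post_prec ** -0.5

# An uncertain fact-level prediction moves a lot after one noisy observation...
print(update_rate_of_forgetting(0.3, 0.10, observed_rof=0.45, obs_sd=0.05))
# ...while a confident prediction of the same value hardly moves at all.
print(update_rate_of_forgetting(0.3, 0.01, observed_rof=0.45, obs_sd=0.05))
```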

The cold start mitigation methods presented here are not limited to the specific fact learning system used in this paper, as they do not rely on changing any part of the system itself. Indeed, other adaptive learning systems that represent difficulty or skill with a continuous-valued model parameter and are susceptible to the cold start problem could mitigate it with a similar approach. This includes learning systems based on different underlying models, such as the Elo rating system (Klinkenberg et al. 2011), and learning systems teaching procedural rather than declarative knowledge (e.g. Anderson et al. 1995). A benefit of starting a learning session with informed parameter estimates is that it can reduce the need to make sweeping parameter adjustments at the beginning of the session on the basis of only a few, possibly noisy, responses.
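In an Elo-based system, for instance, the same idea amounts to seeding an item’s difficulty rating with an informed value instead of a neutral default. The sketch below uses a generic Elo-style update with an arbitrary step size and scale; it is not the exact formulation of Klinkenberg et al. (2011).

```python
def elo_update(ability, difficulty, correct, k=0.4):
    """One Elo-style update of a learner ability and an item difficulty estimate.

    `k` controls the step size; its value and the rating scale are assumptions
    made for illustration only.
    """
    expected = 1.0 / (1.0 + 10 ** (difficulty - ability))  # expected P(correct)
    ability += k * (correct - expected)
    difficulty -= k * (correct - expected)
    return ability, difficulty

# Cold start: with a default difficulty of 0, the first responses to a hard item
# trigger large corrections. Seeding the item with an informed estimate
# (e.g. +1.2 derived from prior data) makes the expected score realistic from
# the first response onward, so early adjustments stay small.
print(elo_update(ability=0.0, difficulty=1.2, correct=0))
```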

Conclusion

This work has shown that mitigating the cold start problem in adaptive learning systems using data-driven difficulty predictions can improve learners’ performance. We found that starting a learning session with individualised predictions derived from prior learning data increased the adaptive system’s ability to capitalise on differences in item difficulty, which translated to higher response accuracy during a learning session and on a delayed recall test.