The first experiment consisted of an estimation phase, in which rate of forgetting estimates were collected for facts and learners, and a comparison phase, in which those estimates were used to mitigate the cold start problem in a new learning session via one of the mitigation strategies (see Fig. 3). In the estimation phase, the system obtained rate of forgetting estimates for a set of facts in one participant sample. In addition, using comparable but different facts, it determined individual rate of forgetting estimates for learners in a second participant sample. During the comparison phase, predictions based on the collected rate of forgetting estimates were used as starting values for the adaptive algorithm in a new learning session, completed by the participants from the second sample. We performed a between-subjects comparison of five prediction types in the comparison phase: fact-level prediction, learner-level prediction, fact- and learner-level prediction, domain-level prediction (the average of all fact-level predictions), and, as a control condition, the fixed value used by the current system. We compared learning performance, as measured during the learning session and on a subsequent delayed recall test, as well as model-based measures, between these five conditions.
Methods
Materials
Participants learned the names of two sets of thirty small towns across the contiguous USA, shown as dots on a map (see Fig. 4). To minimise the influence of existing knowledge, all towns, which were selected from the Simplemaps US cities database, had no more than 5000 inhabitants, and each town shared its name with at least four other towns in the database. All towns had single-word names with no special characters and no more than eight letters. A complete list of the towns in both sets can be found in the online supplement.
Participants
We recruited 241 participants from the research participant pool for first-year psychology students at the University of Groningen. They were split into two samples, one for each phase of the experiment. There were 82 participants in the first sample and 159 in the second. Data collection was conducted in the lab and participants were compensated with course credits.
For the online replication, which we conducted to examine how well our findings would extrapolate to a different population, an additional 217 participants were recruited in the USA via Amazon’s MTurk platform. Of these participants, 85 were in the first sample and the remaining 132 in the second. The MTurk participants were selected to be similar to the student population tested in the lab in terms of age and education. To be eligible for participation, they were required to have a HIT approval rate of 95% or higher, be located in the USA, be born between 1992 and 1999 (age range, 19 to 27 years), and have a US high school diploma. The MTurk participants completed the experiment online and were compensated in accordance with the US federal minimum wage.
The experiment was approved by the Ethical Committee of Psychology at the University of Groningen (study codes: 18215-SO and 18216-SO). All participants gave informed consent.
Procedure
Estimation Phase
The task was implemented in jsPsych (de Leeuw 2015) and ran in a Web browser. A minimum browser window size of 1280 × 768 pixels was required to display the experiment. For participants in the lab, the experiment was presented in full screen on a desktop computer. Since we could not fully control the circumstances under which MTurk participants completed the task, we asked them to ensure that they were in a quiet environment in which they would not be interrupted by phone calls, notifications, or other distractions. In addition, they were warned that clicking away to another tab or window during the experiment would terminate the task. To assess participants’ basic typing proficiency, and to acclimatise them to the response and feedback format, sessions started with a six-item typing test in which participants typed a word shown on screen. They received feedback about their accuracy. All participants were found to have sufficient typing ability.
The typing test was followed by a 10-min learning session using the adaptive fact learning system. The two participant samples studied different sets of facts, but the procedure was otherwise the same. The order in which new facts were introduced was randomised. The adaptive fact learning system determined when and how often each fact was repeated. An example trial is shown in Fig. 4. In each trial, participants saw a map of the USA on which one town was highlighted in red. The twenty-nine other towns in the set were marked with smaller, grey dots. Participants typed the name of the highlighted town in lowercase in a text box below the map, confirming their response by pressing the enter key. If they did not recall the name of the town, they were instructed to simply press enter. New towns were introduced through a study trial in which the town’s name was displayed directly above the text box. Feedback was always shown below the text box directly after a response had been made. The answer that the participant had given remained visible in the text box. If a participant’s response matched the correct answer exactly, the feedback Correct was shown below the text box for 600 ms. A response with a Damerau-Levenshtein distance of 1 from the correct answer was considered almost correct (the assumption being that the participant did in fact know the correct answer, but made a single typing error), and was treated as correct by the model. In this case, the text Almost correct, along with the correct answer, appeared below the text box for 1.2 s. Finally, a response with a Damerau-Levenshtein distance of two or more from the correct answer was considered incorrect, and prompted the text Incorrect, along with the correct answer, to appear below the text box for 4 s.
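As an illustration of this scoring rule, a response could be classified in R along the following lines, using the stringdist package to compute the Damerau-Levenshtein distance. This is a minimal sketch, not the system’s actual implementation: the function name and the example town name are hypothetical, while the distance thresholds mirror the description above.

```r
# Sketch of the response-scoring rule described above (not the system's
# actual implementation). Requires the 'stringdist' package.
library(stringdist)

classify_response <- function(response, answer) {
  # Damerau-Levenshtein distance between the typed response and the answer
  d <- stringdist(tolower(response), tolower(answer), method = "dl")
  if (d == 0) {
    "correct"          # exact match: "Correct" feedback for 600 ms
  } else if (d == 1) {
    "almost correct"   # single typing error: treated as correct by the model
  } else {
    "incorrect"        # distance of two or more: "Incorrect" feedback for 4 s
  }
}

classify_response("fairveiw", "fairview")  # single transposition -> "almost correct"
```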
For participants in the first sample, the task ended after the 10-min learning session. They saw a debriefing screen showing the number of trials they had completed, the number of towns they had encountered, and their overall response accuracy. Participants in the second sample, however, took a delayed recall test following a 5-min Tetris game that served as a filler task. In the recall test, facts that had been studied were presented one-by-one in the same manner as before. Feedback was withheld until the end of the test.
Comparison Phase
Participants in the second sample completed another block, once again consisting of a 10-min learning session, 5 min of Tetris, and a delayed recall test of the studied facts. This time, participants studied the other set of facts, and the adaptive fact learning system used one of the cold start mitigation strategies. Participants were randomly assigned to one of the five prediction types, which are described in the next section. The block ended with the debriefing screen.
Analysis
Before starting data collection in the second sample, we preregistered a set of analyses at https://osf.io/vwg6u/. The preregistration stated that we would collect data from a minimum of 25 participants per condition by a specified date. We did not reach this minimum by the deadline, so we continued data collection. After the data had been collected, we realised that several of the planned analyses were ill-conceived. The analyses reported in this paper therefore deviate from our preregistration in two ways: they include data collected after the deadline, and in some cases use different methods. An online supplement containing the data and full analysis scripts, including analyses conducted only on the subset of the data collected before the deadline as well as the originally planned analyses, is available at https://osf.io/snfyz/.
We used Bayesian statistics for all analyses because Bayesian methods allow for the quantification of evidence in favour of the null model in the event that no difference is found between conditions (Gallistel 2009). Following Jeffreys (1961), an effect was considered meaningful if the data provided at least moderate evidence in its favour, i.e., a Bayes factor of 3 or higher. All analyses were performed in R (version 3.6.3; R Core Team, 2018). Bayesian ANOVAs and t-tests were done using the BayesFactor package with default settings (version 0.9.12-4.2; Morey and Rouder, 2018).
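To illustrate the kind of analysis reported below, a Bayesian ANOVA and a one-sided Bayesian t-test with the BayesFactor package might be run as follows. The data frame and column names are placeholders; the actual analysis scripts are available in the online supplement.

```r
# Illustrative sketch only; column names are placeholders.
library(BayesFactor)

# prediction_type and population must be factors for anovaBF
bf_anova <- anovaBF(rmse ~ prediction_type * population, data = rmse_data)
bf_anova

# One-sided Bayesian t-test: do the predictive conditions have lower RMSE
# than the Default condition? (nullInterval restricts the effect to x < y)
bf_t <- ttestBF(x = rmse_data$rmse[rmse_data$prediction_type != "Default"],
                y = rmse_data$rmse[rmse_data$prediction_type == "Default"],
                nullInterval = c(-Inf, 0))
bf_t
```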
Bayesian regression models were fitted with the brms package (version 2.12.0; Bürkner, 2017). In cases with multiple measurements per participant and/or item, the models included the appropriate random intercepts. In each analysis, we fitted a maximal model with main effects of prediction type (Default, Domain, Fact, Learner, Fact & Learner) and population (Lab, MTurk) plus the interaction of these terms. Both this maximal model and models with a simpler fixed effects structure were compared to a baseline intercept-only model (see the rows in Table 1) using the bridgesampling package (version 0.7-2; Gronau et al., 2020). Following Rouder and Morey (2012), fixed effects had weakly informative Cauchy(0,1) priors. All model runs used 4 Markov chains with 10,000 iterations each, including 5000 warm-up samples. Other options were left at their default setting.
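Below is a sketch of one such model comparison, corresponding to the maximal logistic mixed-effects model for response accuracy and its intercept-only baseline. The data frame and variable names are placeholders, and the actual scripts are in the online supplement.

```r
# Sketch of the model-fitting and comparison approach (placeholder names).
library(brms)
library(bridgesampling)

# Weakly informative Cauchy(0, 1) priors on the fixed effects
priors <- set_prior("cauchy(0, 1)", class = "b")

# Maximal model: prediction type, population, and their interaction,
# with random intercepts for participants and items
fit_full <- brm(correct ~ prediction_type * population +
                  (1 | participant) + (1 | item),
                data = trial_data, family = bernoulli(),
                prior = priors,
                chains = 4, iter = 10000, warmup = 5000,
                save_all_pars = TRUE)  # required for bridge sampling (brms 2.12)

# Intercept-only baseline model
fit_null <- brm(correct ~ 1 + (1 | participant) + (1 | item),
                data = trial_data, family = bernoulli(),
                chains = 4, iter = 10000, warmup = 5000,
                save_all_pars = TRUE)

# Bayes factor of the full model over the null model via bridge sampling
# (namespaced to avoid the clash with brms::bf, the formula shorthand)
bf_full_null <- bridgesampling::bf(bridge_sampler(fit_full),
                                   bridge_sampler(fit_null))
```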
Table 1 Bayesian model comparisons
Results
Figure 5 provides an overview of performance in the comparison phase of experiment 1. The statistical analyses associated with each subfigure are summarised in Table 1, which shows the effects across the various outcome measures at a glance. Further details regarding the type of model used in each analysis, as well as the specific findings, are given in the following sections.
Rate of Forgetting Prediction
To assess the accuracy of the rate of forgetting predictions (made using the final rates of forgetting observed during the estimation phase), we compared them to the final rates of forgetting estimated at the end of the learning session (Fig. 5a). Prediction accuracy was quantified as the root-mean-square error (RMSE) of the predicted rate of forgetting relative to the rate of forgetting observed at the end of the session. The RMSE was calculated within-subject and then averaged across subjects within each condition. A Bayesian ANOVA confirmed main effects of prediction type and population on prediction accuracy (Table 1A). The next best model included only an effect of prediction type, suggesting that this factor contributed more to the strength of the winning model than the effect of population. That said, follow-up pairwise comparisons of prediction accuracy between conditions were generally inconclusive. Only a one-sided Bayesian t-test comparing RMSE in all four predictive conditions (M = 0.084, SD = 0.025) to the Default condition (M = 0.102, SD = 0.035) found strong evidence for an improvement in prediction accuracy (BF = 1.0 × 10³). The main effect of population indicated that RMSE was slightly lower in the lab population (M = 0.083, SD = 0.023) than in the MTurk population (M = 0.093, SD = 0.033).
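For concreteness, the within-subject RMSE and its per-condition average could be computed as follows; this is a sketch with a placeholder data frame (estimates) and placeholder column names.

```r
# Sketch of the prediction-accuracy measure: RMSE of predicted vs. final
# observed rate of forgetting, computed per participant and then averaged
# within each condition. Column names are placeholders.
library(dplyr)

rmse_by_subject <- estimates %>%
  group_by(condition, participant) %>%
  summarise(rmse = sqrt(mean((predicted_rof - observed_rof)^2)),
            .groups = "drop")

rmse_by_condition <- rmse_by_subject %>%
  group_by(condition) %>%
  summarise(mean_rmse = mean(rmse), sd_rmse = sd(rmse))
```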
Learning Session Performance
The number of distinct facts studied during the learning session did not differ between conditions (Fig. 5b). We modelled the data using a Bayesian Poisson regression. Model comparison showed the data to be most likely under the intercept-only null model (Table 1B). According to this model, participants encountered 17.7 distinct facts on average (95% CI [17.2, 18.2]). In contrast, an analysis of response accuracy during the learning session did find a difference between conditions (Fig. 5c). A comparison of Bayesian logistic mixed-effects models showed that the data were most likely under a model that included main effects of prediction type and population (Table 1C). This model found similarly sized improvements in response accuracy over the Default condition in all four cold start mitigation conditions: on average, response accuracy in these conditions was about 7.3 percentage points higher (95% CI [3.9, 11.0]). In addition, participants in the lab outperformed MTurk participants by about 5.5 percentage points (95% CI [3.1, 8.0]).
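The Poisson regression for the number of distinct facts studied might look as sketched below. Because there is only one count per participant, no random intercepts are included; the data frame and variable names are again placeholders.

```r
# Sketch of the Poisson regression for the number of distinct facts studied
# (one observation per participant). Placeholder names.
library(brms)

fit_facts <- brm(n_facts ~ prediction_type * population,
                 data = session_data, family = poisson(),
                 prior = set_prior("cauchy(0, 1)", class = "b"),
                 chains = 4, iter = 10000, warmup = 5000,
                 save_all_pars = TRUE)  # allows bridge sampling as before
```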
Delayed Recall Test Performance
There was no difference between conditions in the number of correctly recalled facts on the delayed recall test (Fig. 5d). A comparison of Bayesian Poisson regression models showed strong evidence in favour of the intercept-only null model (Table 1D), which found an average test score across conditions of 15.0 (95% CI [14.5, 15.4]). We obtained a similar result when looking at response accuracy on just the items that participants had encountered in the preceding learning session (Fig. 5e): the Bayesian logistic mixed-effects model that was most likely to have generated the data was the intercept-only null model (Table 1E). This model showed that overall accuracy on studied items was 87.9% (95% CI [85.2%, 90.3%]) across conditions.
Discussion
In this experiment, we attempted to improve the initial difficulty estimates of the cognitive model underlying our adaptive fact learning system, with the aim of reducing the cold start problem and thereby improving learning outcomes. We predicted the rate of forgetting of individual facts for individual learners by training a Bayesian model on previous data from the learning system. Several prediction methods with different granularity were compared to the fixed starting estimate currently used by the system.
Although the rate of forgetting predictions made by the Bayesian model were closer to the observed values than the default prediction, this did not translate into an improvement in the preregistered learning outcomes. An exploratory analysis did suggest that response accuracy during the learning session was higher in the cold start mitigation conditions, indicating that facts were repeated on a more appropriate schedule than they would otherwise have been.
We expected fine-grained predictions to be more accurate than coarser ones, but comparisons between prediction types were largely inconclusive. It appears that domain-level predictions were already able to capture most of the relevant difficulty information, and predictions at finer levels of detail did not provide sufficient additional information to make a meaningful difference. Fine-grained predictions did show a larger spread than domain-level predictions (Fig. 5a), suggesting that there were facts that were consistently easy or difficult and learners with consistently higher or lower rates of forgetting, in line with earlier work (e.g. Sense et al., 2016). It should be noted that both the participant samples and the set of materials were fairly homogeneous, especially when compared to the more diverse educational settings found in real life. In settings where there are greater differences in skill between students, or differences in difficulty between facts, one would expect a domain-level prediction to fall short.
The stimulus material was also more difficult than expected: in both populations, predicted rates of forgetting were generally higher than the default value, which was found to be a reasonable average in previous studies. A higher rate of forgetting leads to more frequent repetitions of an item during the learning session. The learning system would therefore have had less opportunity to introduce new facts during the learning session than if some of the facts had been easier. The difficulty of the material meant that, despite having more accurate starting estimates, the system could not really capitalise on the differences between facts.
In short, while the use of informed starting estimates in the adaptive learning system may have improved response accuracy during the learning session, test scores remained unaffected, possibly as a result of having a uniformly difficult stimulus set. To test whether this was indeed the case, we conducted a second experiment with a more diverse set of facts that included both easy and difficult items but with a similarly homogeneous participant sample. For simplicity, this experiment focused on fact-level predictions.