We strive to learn and remember information efficiently and effectively to achieve a variety of goals in everyday life, from a list of items to purchase at the grocery store and where we parked our car to the names of people we meet. We may attempt to do so through elaborate imagery (Bower, 1970; Paivio, 1969), associating newly learned material with prior knowledge (Brod, Werkle-Bergner, & Shing, 2013; Ryan, Moses, Barense, & Rosenbaum, 2013), or making the material relevant to ourselves (i.e., the self-reference effect; Carson, Murphy, Moscovitch, & Rosenbaum, 2016; Symons & Johnson, 1997). Even the simple and often spontaneous act of bringing this information to mind (retrieval practice) can be effective, especially when an interval is imposed between repeated presentations of the material (i.e., spaced repetition), as compared to when the repeated material is presented in immediate succession (i.e., massed repetition). The utility of this spacing effect as a learning technique has been evident in a vast number of laboratory experiments performed using a wide variety of methods in diverse populations (Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006).

Extensive research has also demonstrated that the act of successful retrieval serves as a “memory modifier” to effectively increase the likelihood of successfully retrieving the same information at a future time (Bjork, 1975). The mere act of taking a test, or retrieving information from memory, improves subsequent retrieval, even more than passive repetition or restudy of the information; this is referred to as the testing effect (Bjork, 1975, 1988; Roediger & Karpicke, 2006). Although the effects of spacing and testing are both very well-supported by past studies (e.g., Gerbier & Toppino, 2015; Roediger & Karpicke, 2006), it is unclear how the two may be optimally combined to enhance knowledge retention in a real-world setting (although see the work by Storm, Bjork, & Storm, 2010). By analyzing a large, naturally occurring dataset of over 10,000 employees in the private sector (e.g., pharmaceutical and higher education companies and a large grocery retailer), the present study will help realize the potential of spaced retrieval as it extends to real-world learning in the workplace.

Attempts have been made to determine the efficacy of spaced repetition in the real world, primarily as it applies to classroom teaching (Kapler, Weston, & Wiseheart, 2015; Pashler et al., 2007; Sobel, Cepeda, & Kapler, 2011) and clinical interventions (Balota, Duchek, Sergent-Marshall, & Roediger, 2006; Cermak et al., 1996; Green, Weston, Wiseheart, & Rosenbaum, 2014; Kim, Saberi, Wiseheart, & Rosenbaum, 2018), but its utility beyond these settings and in large, unconstrained population samples has yet to be determined. Workplace training could benefit from learning strategies that employ spaced repetition of information. Whether engaging in safety training or learning of new product information, workplace training often occurs in the context of varying levels of experience and motivation to learn, and changes in content and delivery of that content. Workplace training is also subject to unpredictable learning schedules in terms of time available to devote to learning in a given session or day, length of time between exposures to study material, and number of exposures to study material. This creates ideal conditions for examining the spacing effect and how to optimize it to promote learning in a heterogeneous, real-world context.

Here we analyzed longitudinal data that were collected in the context of workplace training. The data were collected via a learning management system (LMS) that delivers training content to employees across multiple sessions. An important feature embedded into the architecture of the LMS is multiple presentations of a given piece of information, with varying delays and number of intervening items between repetitions, resulting in training/learning sessions that are naturally spread (or spaced) apart and according to various schedules across individuals. In light of the extant literature, supported by numerous laboratory-based studies (for reviews see Cepeda et al., 2006; Gerbier & Toppino, 2015), we hypothesized that our large, real-world dataset would demonstrate a significant interaction between spacing and retention interval. More specifically, we hypothesized that the optimal spacing interval would increase as the retention interval increases. In this way, the present investigation was hypothesis-driven based on past empirical work, unlike the more exploratory nature of most studies using big data.

Method

Axonify Inc.: The LMS

The data analyzed in the present study were collected using an LMS provided by Axonify Inc. As mentioned above, repeated and distributed retrieval practice of training content is integrated into the architecture of the LMS, allowing for investigation of spaced retrieval and its optimization using this large, real-world dataset. Through the LMS, client companies deliver training to employees across multiple sessions, with each session lasting approximately 3 min. In a typical session, employees answer two to three questions. When employees answer questions incorrectly, their learning is reinforced with corrective feedback. Thus, each training session was composed of questions and corrective feedback. The LMS offers the use of multiple question formats for content delivery, as described further below. The question formats used for training, along with the specific content delivered using the LMS (e.g., health and safety rules, product information), were selected at the discretion of the client companies.

Importantly, the LMS allows administrators to keep track of the performance of each individual employee, and thus identify the content for which individual employees need additional training. Training sessions are delivered on an individual basis, allowing content delivery to be personalized to individual employees so that training is focused on content that employees have yet to sufficiently acquire. Since the LMS platform enables training to be delivered to employees by way of computers, laptops, smartphones, point of sale systems, and security terminals, training can take place anywhere and at any time. Typically, however, employees receive training in the workplace. Heterogeneity in the nature of the corresponding data comes with the real-world settings in which the data were collected. For example, some employees work on a part-time basis and consequently receive less training than full-time employees. Despite differences in the frequency of content delivery for a given employee and the nature of the content, all training delivered using the LMS implemented spaced retrieval and measured the impact it has on knowledge retention, which is central to the present investigation.

Data

We analyzed longitudinal data from 10,514 individuals (employees), collected in the context of workplace training from five client companies. Companies 1 and 5 are retailers, and, in both cases, the employees who received training are associates who work in their stores. Specifically, Company 1 is a high-end department store retailer, and employees received training on safety and store operations. Company 5, on the other hand, is a US grocery retailer, and employees received training on loss prevention, safety, and store operations. The data contributed by Companies 2 and 4 are from sales teams for two different pharmaceutical companies, and the employees from both companies received training on product knowledge. Finally, Company 3 is a higher-education institution, and the corresponding data are from internal staff trained on training procedures. These individuals varied in age, gender, and occupation. Unfortunately, however, demographic data were not available for the present research, due to privacy policies and the nature of the data collection. Since we did not receive any personal or identifying information with the data, we were not required to obtain informed consent from the employees whose data we analyzed or approval from our institutional ethics committees.

As mentioned above, the individual companies provided the content on which their employees were trained using the LMS and also selected the question format used throughout training (e.g., multiple choice, multiple answer, advanced multiple choice, matching, and/or fill in the blanks). Table 1 shows the breakdown (in proportions) of the question formats used by each company. The client companies also determined the number of unique questions (vs. occurrences of a question) delivered to an individual during training. Table 2 presents summary statistics for the number of unique questions delivered to employees for each individual client company using the LMS described above. Table 3 presents a breakdown of the numbers of correct and incorrect responses on a first retrieval event (training section) as a function of question format and company. The Appendix lists sample questions for each of the question formats used by each company.

Table 1 Proportions of question formats used by each client company
Table 2 Summary statistics for the number of unique questions presented to employees in each company
Table 3 Numbers of correct/incorrect responses on first retrieval event as a function of question format and client company

As we describe further below, our analyses only took into account individuals’ performance on the first three occurrences of a question. Following the work of Cepeda et al. (2009), our analyses only considered data for which individuals provided an incorrect response to the first occurrence of a question, followed by a correct response for the second occurrence of a question (please refer to Fig. 1). We focused on items that were first answered incorrectly because we could not be certain whether those items that were first answered correctly were already known prior to the learning event in question. For this reason, we considered these data (i.e., items that were first answered correctly) to be potentially confounded, and their exclusion minimized the possibility that an individual employee was reviewing material that had already been learned, and possibly mastered, rather than learning new material. This type of confound does not apply in laboratory studies, since the type of information that is typically learned in laboratory studies is purposefully made to be random (e.g., pairs of words that are assembled to have little or no preexperimental relation). In the present study, however, the material was not random, and some of it is likely to have been previously known by the participant.

Fig. 1

The data considered in our analyses consisted of responses to questions to which individuals provided an incorrect response on its first occurrence, followed by a correct response on the second occurrence of the question. Responses to the third occurrence of a given question served as our response variable, which was dependent on a set of fixed and random factors.

Because the focus of the present investigation was to assess the impact of spaced retrieval on learning new information, the abovementioned reasoning led us to exclude those items that were first answered correctly from our analyses. A correct response on the second occurrence of a question demonstrated that the participant had successfully learned the item. As we describe further below, responses to the third occurrence of a given question served as our response variable, which was dependent on a set of fixed and random factors. Aside from the criteria that we used to filter the data, we did not have any control over the amount of data we had to work with, and were thus unable to conduct a power analysis or otherwise to control sample size.

Model

In the present study, we investigated how to optimally distribute learning/training events by analyzing a large, real-world dataset. Our primary research interest centered on the impact of spacing and retention intervals on employees’ knowledge retention, as well as on a potential interaction between these two variables. As a secondary research interest, we also investigated whether question format impacted employees’ knowledge retention and whether this variable interacted with spacing and retention interval.

Both spacing and retention interval were used as fixed factors. Both of these fixed factors were continuous and quantified using number of days as a unit of measure. The spacing interval corresponded to the time interval between the first and second occurrences of a question (O1 and O2, respectively; please refer to Fig. 1). The retention interval corresponded to the time interval between the second and third occurrences of a question (O2 and O3, respectively). Individuals’ responses to a given question at O3 were taken as a measure of knowledge retention and dichotomously coded as being either correct (1) or incorrect (0). Since the duration of the spacing and retention intervals depended on the timing of O1 versus O2 and of O2 versus O3, respectively, we did not have any control over these timing-related variables. However, the inherent variability of these variables contributed to the scope of data accounted for by our model.
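The filtering rule and the two interval definitions above can be sketched as follows. This is a minimal illustration only: the log format (tuples of employee, question, date, and correctness) and the function name `build_records` are assumptions made for the example, not the structure of the LMS export.

```python
from datetime import date

def build_records(occurrences):
    """Derive spacing and retention intervals from a response log.

    occurrences: iterable of (employee, question, day, correct) tuples.
    This layout is hypothetical; it is not the LMS export format.
    """
    by_item = {}
    for employee, question, day, correct in occurrences:
        by_item.setdefault((employee, question), []).append((day, correct))

    records = []
    for (employee, question), events in by_item.items():
        events.sort(key=lambda event: event[0])  # chronological order
        if len(events) < 3:
            continue  # O1, O2, and O3 are all required
        (d1, c1), (d2, c2), (d3, c3) = events[:3]
        if c1 or not c2:
            continue  # keep only incorrect at O1 followed by correct at O2
        records.append({
            "employee": employee,
            "question": question,
            "spacing_days": (d2 - d1).days,    # O1 to O2
            "retention_days": (d3 - d2).days,  # O2 to O3
            "correct_o3": int(c3),             # response variable (1/0)
        })
    return records

log = [
    ("e1", "q1", date(2020, 1, 1), False),
    ("e1", "q1", date(2020, 1, 4), True),
    ("e1", "q1", date(2020, 1, 18), True),
    ("e1", "q2", date(2020, 1, 1), True),   # correct at O1, so excluded
    ("e1", "q2", date(2020, 1, 2), True),
    ("e1", "q2", date(2020, 1, 9), False),
]
records = build_records(log)
```

In this toy log only the first item survives the filter, with a spacing interval of 3 days and a retention interval of 14 days.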

In addition to spacing and retention interval, question format was also treated as a fixed factor. This third fixed factor was categorical and consisted of five types of question format: multiple choice (MC), multiple answer (MA), advanced multiple choice (AMC), fill in the blanks (FIB), and matching (Match). The MC and AMC formats required individuals to select one of multiple options provided to them as the most appropriate response to the given question. The difference between MC and AMC is that employees receive more detailed feedback for AMC than for MC questions; whereas employees receive a general explanation for incorrect responses to an MC question, for AMC questions they receive specific explanations outlining why each incorrect option is incorrect. For MA, multiple responses were provided and individuals were to select all of the correct responses. For FIB and Match, individuals were tasked with filling in missing words and identifying appropriate pairings, respectively. Since question format was a categorical factor with five possible levels (MC, MA, AMC, FIB, and Match) and MC was set as the reference, four binary variables were required to account for the effect of each remaining level. For example, to assess the effect of FIB, the binary variable for FIB would be set to one and the remaining three variables (MA, AMC, and Match) set to zero, with MC serving as the reference. In this way, a beta coefficient was derived for each of the four binary variables in the model.
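The reference coding described above can be illustrated with a short sketch. The function name `dummy_code` is hypothetical and serves only to show how a five-level factor with MC as the reference maps onto four binary indicators.

```python
LEVELS = ["MC", "MA", "AMC", "FIB", "Match"]
REFERENCE = "MC"

def dummy_code(question_format):
    """Return the four binary indicators for one question format.

    The reference level (MC) is represented by all indicators being zero.
    """
    if question_format not in LEVELS:
        raise ValueError(f"unknown question format: {question_format!r}")
    return {
        level: int(level == question_format)
        for level in LEVELS
        if level != REFERENCE
    }

fib_indicators = dummy_code("FIB")  # {'MA': 0, 'AMC': 0, 'FIB': 1, 'Match': 0}
mc_indicators = dummy_code("MC")    # {'MA': 0, 'AMC': 0, 'FIB': 0, 'Match': 0}
```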

The data in our analyses had a three-level hierarchical structure: questions (Level 1) were clustered within a given individual (Level 2), who was, in turn, clustered within a specific company (Level 3). Responses of a given individual would therefore likely show some dependence, as would responses of individuals from a given company. To account for these dependencies, the following random factors were included in the model: the specific components of content that employees were trained on (i.e., question), the employee (i.e., individual), and the company.

A logistic generalized linear mixed model (GLMM) was fit to the employees’ O3 response data. Spacing (the interval of time between O1 and O2), retention interval (the interval of time between O2 and O3), and question format were set as fixed predictor effects, whereas question, user, and company were set as random effects. Our logistic GLMM allowed us to examine the relation between a binary outcome variable (correct response = 1, incorrect response = 0) and our clustered predictor variables. The odds of a correct response were defined as the ratio of the probability of a correct response to the probability of an incorrect response. The logit transformation of the probability of a correct response was used to establish a linear relationship between our predictor variables and the log-odds of a correct response. Outliers were removed to meet the assumption of homogeneity of variance. All other model assumptions were met.

The log-odds of a correct response were modeled as

$$ \log\left(\frac{p_{ji}(x)}{1-p_{ji}(x)}\right)=\beta_0+\beta_{Spacing}X_{1ji}+\beta_{RI}X_{2ji}+\beta_{QF}X_{3ji}+\beta_{S:RI}X_{1ji}X_{2ji}+\beta_{RI:QF}X_{2ji}X_{3ji}+\gamma_j+\gamma_i+\gamma_k $$

where pji(x) was the probability of a correct response, and 1 – pji(x) was the probability of an incorrect response. β0 was the intercept, βSpacing was the coefficient for spacing, βRI was the coefficient for retention interval, βQF was the coefficient for question format, βS:RI was the coefficient for the interaction between spacing interval and retention interval, βRI:QF was the coefficient for the interaction between retention interval and question format, and γj, γi, and γk were the random-effect terms for individual, company, and question, respectively.
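To make the structure of the linear predictor concrete, the sketch below evaluates the fixed-effects part of a model of this form and converts the resulting log-odds into a probability. All coefficient values are made-up placeholders chosen for illustration; they are not the estimates reported in Table 4, and the function names are hypothetical.

```python
import math

# Hypothetical placeholder coefficients, for illustration only.
# They are NOT the estimates reported in Table 4.
COEF = {
    "intercept": 1.0,
    "spacing": -0.02,
    "retention": -0.02,
    "spacing:retention": 0.0003,
    "format": {"MA": -1.0, "AMC": 2.0, "FIB": -1.0, "Match": -4.0},
    "retention:format": {"MA": 0.0, "AMC": -0.01, "FIB": -0.02, "Match": 0.0},
}

def log_odds(spacing, retention, question_format, coef=COEF, random_effects=0.0):
    """Linear predictor: main effects, two interactions, random effects."""
    eta = (
        coef["intercept"]
        + coef["spacing"] * spacing
        + coef["retention"] * retention
        + coef["spacing:retention"] * spacing * retention
    )
    if question_format != "MC":  # MC is the reference level
        eta += coef["format"][question_format]
        eta += coef["retention:format"][question_format] * retention
    return eta + random_effects

def probability(eta):
    """Inverse logit: map log-odds to the probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-eta))

p_fib = probability(log_odds(spacing=30, retention=60, question_format="FIB"))
```

The `random_effects` argument stands in for the sum of the individual, company, and question terms; setting it to zero gives the prediction for a typical cluster.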

Results

The results of our logistic GLMM, based on the penalized quasi-likelihood method, are presented in Table 4. Spacing, retention interval, and question format (specifically, AMC, MA, and Match) significantly affected the log-odds of a correct response to a given question within a training session (all z values significantly different from 0, p < .005). Second-order interactions were tested using the multivariate Wald test and revealed a significant interaction between spacing and retention intervals, as well as between retention interval and question format (p < .05).

Table 4 Summary of the results of the mixed model

Inverse log transformations of the predictor-variable estimates were conducted in order to interpret the results of our model as odds ratios. The transformed estimates can also be found in Table 4. Our results showed that when retention interval is set to zero and question format is set to multiple choice (the selected reference for our model), increasing spacing by one day multiplies the odds of an individual answering correctly by .98. An odds ratio less than 1 indicates that the odds of responding correctly decrease as the predictor increases. Given that past laboratory-based research has generally demonstrated better memory performance with increased spacing, one might have expected an odds ratio greater than 1 for spacing. However, it is important to note the significance of the interaction between spacing and retention interval, as we discuss further below, which past work has shown to be an important contextual factor for detecting a spacing effect.

Our findings also show that when spacing is set to zero, question format is set to multiple choice, and retention interval is increased by one day, the odds of an individual answering correctly are multiplied by .99. Again, an odds ratio less than 1 indicates that the odds of responding correctly decrease under the specified parameters. To home in on the impact of question format, retention interval was set to zero and spacing was held constant, and question format was varied relative to the MC reference. Setting question format to AMC increased the odds of answering correctly, whereas all the other levels of question format reduced the odds. Using these parameters for spacing and retention interval resulted in the following odds ratios for answering correctly: 8.12 for AMC questions, 0.32 for FIB questions, 0.02 for matching questions, and 0.31 for MA questions.
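An odds ratio describes a multiplicative change in odds, so its effect on the probability of a correct response depends on the baseline. The sketch below illustrates this with a hypothetical baseline probability of .80, which is an assumption for the example and not a value from the dataset.

```python
def apply_odds_ratio(p_baseline, odds_ratio):
    """Return the probability obtained after scaling the baseline odds."""
    odds = p_baseline / (1.0 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)

# Hypothetical baseline of .80 and a one-day odds ratio of .98.
p_after_one_day = apply_odds_ratio(0.80, 0.98)  # approximately 0.797
```

With this baseline, an odds ratio of .98 lowers the probability only slightly, and a correct response remains far more likely than an incorrect one.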

As we mentioned above, we found a significant interaction between spacing and retention interval, both of which were continuous variables. Figure 2 shows the odds of answering correctly at O3 as spacing is increased and retention interval is fixed at various values. At shorter retention intervals (e.g., 0 and 7 days), the odds of answering correctly decreased as the spacing interval increased. In contrast, at longer retention intervals (e.g., 120 to 210 days), the odds of answering correctly increased as the spacing interval increased.

Fig. 2

Effects of spacing interval on the odds of answering correctly across retention intervals.

As was mentioned above, we also found a significant interaction between question format and retention interval. Figure 3 shows that when the spacing interval is held constant (e.g., at 30 days), the relation between retention interval and the odds of answering correctly differs across the various levels of question format. For example, when question format was set to multiple choice, the odds of answering correctly increased as retention interval increased. However, when the question format was set to AMC, the odds of answering correctly decreased as retention interval increased.

Fig. 3

Effects of retention interval, in days, on the odds of correctly answering across question formats when the spacing interval was set to 30 days.

To unpack this interaction further, the odds of answering correctly at different retention intervals can be calculated for different question formats. For example, the following equation can be used to calculate the odds of answering correctly for the FIB question format, where RI refers to retention interval and R:FIB refers to the interaction between retention interval and the FIB question format. For demonstration purposes, the equation is set so that the retention interval is increased by one day and the spacing interval is held at 0.

$$ \log\left(\frac{\mathrm{odds}\left(Y=1 \mid \mathrm{Retention}=r+1,\ \mathrm{Question\ Format}=\mathrm{FIB}\right)}{\mathrm{odds}\left(Y=1 \mid \mathrm{Retention}=r,\ \mathrm{Question\ Format}=\mathrm{FIB}\right)}\right)=\beta_{\mathrm{RI}}+\beta_{\mathrm{R}:\mathrm{FIB}}=-0.0189-0.0230=-0.0419. $$

The log-odds of answering correctly, derived using the formula above, can then be transformed into odds by using an inverse log transformation. Using this transformation, we found that the odds of answering correctly for the FIB question format and the specified parameters (retention interval increased by one day and spacing interval set to 0) was approximately 0.958. Table 5 presents the log-odds and odds of answering correctly for each level of question format, using the same parameters as those used above (retention interval increased by one day and spacing interval set to 0). Along these lines, the log-odds of answering correctly for each level of question format can also be calculated using different spacing-interval values. Again, using the example of the FIB question format, the log-odds of answering correctly can be calculated using the following formula:

$$ \log\left(\frac{\mathrm{odds}\left(Y=1 \mid \mathrm{Spacing}=s,\ \mathrm{Retention}=r+1,\ \mathrm{Question\ Format}=\mathrm{FIB}\right)}{\mathrm{odds}\left(Y=1 \mid \mathrm{Spacing}=s,\ \mathrm{Retention}=r,\ \mathrm{Question\ Format}=\mathrm{FIB}\right)}\right)=\beta_{\mathrm{RI}}+\beta_{\mathrm{R}:\mathrm{FIB}}+\beta_{\mathrm{S}:\mathrm{R}}\,s=-0.0419+0.0003\,s. $$
Table 5 Log-odds and odds of correctly responding when retention interval was increased by one day and spacing was held at 0

Using the inverse log transformation, the log-odds of answering correctly can be transformed into odds. Thus, the odds of answering correctly for FIB when the retention interval is increased by one day and spacing is not held at zero is approximately 0.9589e^{0.0003s}. It then follows that if the spacing interval is set to 120 days, the odds of answering correctly for a FIB question is

$$ {\displaystyle \begin{array}{l}0.9589{e}^{0.0003s}=0.9589{e}^{0.0003\ast 120}\\ {}=1.0270\end{array}} $$

Consequently, when the spacing interval is set to 120 days and the retention interval is increased by one day, the odds of answering correctly for FIB is 1.0270. Table 6 presents both the log-odds and odds of answering correctly for each level of question format when the spacing interval is not held at 0 (e.g., spacing interval = 120 days, as in the example above).

Table 6 Log-odds and odds of correctly responding when the retention interval was increased by one day and spacing was not held at 0
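The FIB calculation above can be written as a small function of the spacing interval. This sketch uses only the rounded coefficients quoted in the text (βRI = −0.0189, βR:FIB = −0.0230, βS:R = 0.0003); because they are rounded, values computed this way need not match the tabled values exactly. The function names are hypothetical.

```python
import math

# Rounded coefficients as quoted in the text for the FIB example.
BETA_RI = -0.0189      # retention interval
BETA_RI_FIB = -0.0230  # retention interval x FIB interaction
BETA_S_RI = 0.0003     # spacing x retention interval interaction

def fib_log_odds_ratio_per_day(spacing_days):
    """Change in log-odds for a one-day increase in retention interval (FIB)."""
    return BETA_RI + BETA_RI_FIB + BETA_S_RI * spacing_days

def fib_odds_ratio_per_day(spacing_days):
    """Exponentiate the log-odds change to obtain the odds ratio."""
    return math.exp(fib_log_odds_ratio_per_day(spacing_days))

ratio_at_zero_spacing = fib_odds_ratio_per_day(0)  # approximately 0.959
```

Because the spacing by retention interval coefficient is positive, the odds ratio computed this way rises as the spacing interval grows.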

Discussion

In the present study, we modeled a large, real-world dataset to investigate the impact of spaced retrieval on knowledge retention beyond the lab. The data were collected via an LMS that delivered workplace training materials to employees in a naturally occurring, heterogeneous, spaced retrieval manner. Upon each occurrence of a given piece of information, employees were tested on their knowledge of the information. We organized these data to align with the simplest research design that can be used to investigate the impact of spaced retrieval on learning. Our analyses were conducted on individuals’ performance on the first three occurrences of a question. The first occurrence corresponded to an initial learning session (material was delivered to the employee for the first time using the LMS), and the second occurrence corresponded to a relearning session (the same material was revisited by the employee). Thus, the interval between these occurrences was operationally defined as the spacing interval. The third occurrence of a question corresponded to the final test. Accordingly, the interval between the second and third occurrences was operationally defined as the retention interval. In addition to significant main effects for all the variables included in the model (spacing interval, retention interval, question format), the results of the present study also revealed a significant interaction between spacing interval and retention interval, as well as between retention interval and question format. These findings are in line with the results of laboratory studies, demonstrating the relevance and transferability of laboratory-based research to real-world contexts.

The interaction we found between spacing interval and retention interval was of prime interest and consistent with past laboratory studies demonstrating that the optimal amount of spacing between an initial learning event and a relearning session varies depending on the length of the retention interval (Cepeda et al., 2009; Cepeda et al., 2006; Cepeda, Vul, Rohrer, Wixted, & Pashler, 2008). More specifically, the probability of retaining information in memory for a longer period of time (e.g., a month or longer) is higher if the spacing interval is also long (e.g., 11 days or longer) (Cepeda et al., 2009; Glenberg & Lehmann, 1980; Küpper-Tetzel & Erdfelder, 2012; Küpper-Tetzel, Erdfelder, & Dickhäuser, 2013). In contrast, shorter spacing intervals (e.g., one day) have been found to be more beneficial for shorter retention intervals (e.g., one week). Thus, the optimal amount of spacing between an initial learning event and a relearning event increases as the retention interval increases. In fact, an influential review aimed at informing best practices in teaching and learning in educational settings concluded that the interval between two study occasions should be approximately 10% to 20% of the retention interval (Pashler et al., 2007). This coincides with the finding that although spacing out repeated study sessions generally benefits knowledge retention, excessive spacing results in decreased retention.
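The 10% to 20% guideline attributed above to Pashler et al. (2007) amounts to simple arithmetic. The sketch below is only an illustration of that rule of thumb; the function name and the validation are assumptions, and the output is not a recommendation derived from the present dataset.

```python
def suggested_spacing_range(retention_days, low=0.10, high=0.20):
    """Return (min, max) spacing in days as a fraction of the retention interval."""
    if retention_days < 0:
        raise ValueError("retention_days must be non-negative")
    return (low * retention_days, high * retention_days)

# For a 30-day retention interval the guideline suggests roughly 3 to 6 days.
shortest, longest = suggested_spacing_range(30)
```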

In workplace environments, corporations typically aim to onboard new employees as quickly as possible, and in general, aim for their employees to rapidly acquire the information they need to be successful in their positions, thereby benefiting the company. In this context, the interval between a given training event and the actual use of that knowledge on the job may be quite short (e.g., one week). Thus, on the basis of the results of the present study and the abovementioned laboratory research, a shorter spacing interval (e.g., one day) between repeated training events would be more beneficial than a longer spacing interval (e.g., one week). Additionally, however, corporations benefit and aim to support their employees in retaining information over much longer stretches of time even though employees may not make use of this information on a daily basis (e.g., emergency procedures). In these cases, the results of the present study and the abovementioned laboratory research suggest that longer spacing intervals between repeated training events would be most beneficial. Generally, optimizing spacing intervals between repeated training events requires consideration of how long the information is intended to be retained.

Interestingly, retention interval also seems to impact the optimal spacing schedule or the optimal amount of spacing between repetitions when there is more than one relearning event following the initial learning event and preceding the final test. An equal-interval schedule, for example, involves equally spaced out study episodes, whereas an expanding schedule consists of learning episodes that are spaced apart by incrementally increasing intervals. Mixed findings have been reported in the literature regarding whether an equal-interval schedule or expanding schedule is more beneficial for retention. For example, some studies do not show a difference between the two types of spacing schedules (e.g., Balota et al., 2006; Carpenter & DeLosh, 2005; Karpicke & Bauernschmidt, 2011; Kim et al., 2018), whereas other studies demonstrate a larger benefit from expanding over equal-interval spacing schedules in specific contexts (Gerbier & Koenig, 2012; Karpicke & Roediger, 2007; Nakata, 2015). Interestingly, Karpicke and Roediger (2007) found an interaction between retention interval and the benefit of different spacing schedules on memory performance in young adults: following a short (10-min) retention interval, the expanding schedule resulted in a larger benefit than did an equal-interval schedule. However, following a longer (two-day) retention interval, the equal-interval schedule resulted in a larger advantage than did the expanding schedule. Logan and Balota (2008) and Tsai (1927) have reported consistent findings. However, other studies that used long, multiday study sessions and retention intervals have produced mixed findings. Moreover, Storm, Bjork, and Storm (2010) and Balota et al. (2006) both found that an expanding retrieval practice schedule was most effective when the to-be-remembered materials were subject to rapid forgetting. Future research should investigate this further and would benefit from investigations using real-world data to assess ecological validity.

The results of our model also demonstrated an interaction between question format and retention interval, which may reflect differences in difficulty and retrieval cues available across the different types of question format (Tulving & Osler, 1968; Tulving & Pearlstone, 1966). For example, MC and MA questions included the correct answer within the questions, which serves as a strong retrieval cue, whereas FIB questions did not and required employees to retrieve the corresponding information from memory. The corresponding main effects of retention interval and question format on the odds of employees answering correctly on final tests were not surprising. Our model revealed a negative coefficient for retention interval, which is in line with the vast literature on forgetting curves that demonstrates that the probability of retaining information decreases exponentially from the time of the original learning event (Ebbinghaus, 1885/1964). In terms of question format, differences in the level of retrieval difficulty and retrieval cues may account for the differing effects that the various types of question format had on employees’ odds of answering correctly on the final test.

As we described above, our analyses focused on items that were answered incorrectly on the first retrieval attempt, because we could not be certain whether items that were answered correctly were already known prior to the learning event in question. For this reason, we considered these data (items that were first answered correctly) to be potentially confounded. Although it is beyond the focus of the present study, an intriguing question is whether the odds of providing a correct response at final recall vary as a function of a participant’s retrieval success on the first and second occurrences of an item, across the four potential outcomes: correct first retrieval–correct second retrieval; correct first retrieval–incorrect second retrieval; incorrect first retrieval–incorrect second retrieval; incorrect first retrieval–correct second retrieval.

In a relevant study by Kapler et al. (2015), participants engaged in spaced retrieval, with two instances of retrieval practice for each study item before the final test. The final test data were analyzed in two ways: (1) considering only items to which participants provided a correct response in the first and second retrieval events; and (2) considering all items regardless of whether participants provided a correct response in the first retrieval event. Importantly, participants could only progress in the study after providing a correct response on the second retrieval event, such that all instances of a second retrieval event for an item were coded as a correct response. The analyses conducted on this latter dataset showed that retrieval accuracy on the first retrieval event was approximately 55%. Both sets of data demonstrated a spacing effect, with final test performance being moderately higher and the effect size moderately larger when all data were included (regardless of participants’ accuracy on the first retrieval event; factual-level condition: p = .02, η² = .026; higher-level condition: p = .003, η² = .018) than when the data were filtered to only include items that were correct on the first retrieval event (factual-level condition: p = .01, η² = .016; higher-level condition: p = .06, η² = .009). An important difference between the study by Kapler and colleagues and the present study is that participants in the former study did not know the material beforehand, whereas participants in the present study may very well have already been familiar with the content. Future research should further investigate the impact of consecutive retrieval successes on a final memory test.

Although working with a large dataset collected in a real-world context provides an opportunity to test the ecological validity of laboratory findings, some limitations are intrinsic to the nature of these data. For example, in contrast to the controlled laboratory environment, the data analyzed in the present study were collected in a variety of uncontrolled environments. As mentioned above, the LMS allows the training, and thus data collection, to occur on a variety of electronic devices, including laptops and smartphones. Consequently, the training could have taken place anywhere and at any time. Moreover, we cannot be sure whether employees answered the questions on their own as opposed to working with a colleague, whether they understood the training task, or how seriously they took it. Although one cannot be sure of the internal motivations and attitudes of participants in laboratory studies either, participants in these studies are observed and in the company of an experimenter while performing the experimental task. In contrast, employees who were engaged in workplace training were not necessarily in the company of anyone, let alone an authority figure. Thus, employees may have been less motivated to focus on and complete the training task to the best of their ability. Finally, we had no control over, or measure of, how much employees used the learned information between training sessions.

It is also important to note that learners received corrective feedback after answering questions incorrectly. Thus, we cannot be certain whether the act of spaced retrieval, the spacing of the feedback, or an interaction between the two led to the present results. Additionally, each of the client companies trained its employees on different information, using different questions and question formats. Although we accounted for company and the specific components of content on which employees were trained (e.g., question) by setting them as random factors in our model, in addition to including question format as a variable, company and question still served as sources of variance. Another limitation is the lack of demographic information about the employees whose data we analyzed, particularly age. Without such information, we were unable to account for the impact of age on employees’ learning trajectories, particularly with respect to spacing and retention intervals. Past work has shown that age is an important factor that affects learning and memory (e.g., Burke & Barnes, 2006; Old & Naveh-Benjamin, 2008), and had this information been available, we could have added greater precision to the model and possibly gained a greater understanding of age interactions. Despite the abovementioned sources of variability, our model revealed significant effects of the variables tested on employees’ learning behavior, including an interaction between spacing interval and retention interval, as supported by the results of laboratory research. The present investigation was guided by the extant literature on the spacing effect, which largely consists of small- to moderate-size studies (for reviews, see Cepeda et al., 2006; Gerbier & Toppino, 2015). Thus, in contrast to the more exploratory nature of most studies using big data, the present investigation was hypothesis-driven and based on past empirical work. Future research should continue to employ large real-world datasets to investigate the ecological validity of memory-related phenomena that are traditionally studied in the laboratory.

Author note

We are grateful to Axonify Inc. for their continued research collaborations. This research was supported by grants from the Natural Sciences and Engineering Research Council of Canada to R.S.R. and A.S.N.K., as well as a fellowship from the Natural Sciences and Engineering Research Council of Canada to A.S.N.K. The authors declare no conflicts of interest.