Introduction

Learning often occurs under pristine conditions, such as when a geologist-in-training studies an illustrated guide to rocks, a birder listens to labeled recordings of bird calls, or a student reviews stars and constellations in a diagram of the night sky. Clear and complete (often decontextualized) examples are the norm in guidebooks and textbooks, as well as in many psychology experiments. Robust learning, however, must transfer to different, often less than ideal conditions. Oftentimes rocks are muddy, bird calls are intermingled with other sounds, and light pollution obscures the night sky. Here we examine whether strategies known to promote category learning also support transfer of that knowledge to the identification of new exemplars in impoverished contexts.

At the broadest level, transfer refers to learning that persists despite differences between encoding and test in the items – the content of what is learned – or the conditions – the context of learning (see Barnett & Ceci, 2002; Taatgen, 2013). Close transfer involves similar items or conditions at study and test (e.g., identifying a new photo of a rock after studying guidebook photos), whereas far transfer involves very different items or conditions between study and test. The category learning literature has typically defined transfer as people's ability to classify new, non-studied exemplars (e.g., classifying a new rock as granite) – that is, relatively close transfer.

A large literature has revealed many principles that promote this form of transfer. For example, people correctly classify more new exemplars following interleaved learning, where exemplars from different categories are intermixed, than following blocked study, where exemplars from the same category are studied in sequence (for a recent review and meta-analysis, see Dunlosky et al., 2013; see also Brunmair & Richter, 2019; Kornell & Bjork, 2008; cf. Flesch et al., 2018; Goldstone, 1996; Tauber et al., 2013). Furthermore, transfer is improved following distributed practice – the spreading of study opportunities over a greater period of time – as compared to temporally massed practice (for review, see Benjamin & Tullis, 2010; Cepeda et al., 2006; Dunlosky et al., 2013). Transfer also improves following study of exemplars labeled with items' key diagnostic features, as compared to studying unlabeled exemplars (Miyatsu et al., 2019). Many of the principles that support transfer of categorical knowledge are the same as those that support memory. For example, learning through attempted labeling (with feedback) is more effective than studying labeled exemplars (Levering & Kurtz, 2015), paralleling the large literature showing that retrieval practice is a more effective learning strategy than rereading (Roediger & Butler, 2011).

Transfer of category learning in the real world, however, is rarely as forgiving as in a textbook – perceptual and cognitive demands increase in cases of far transfer, where test exemplars and contexts differ greatly from learning. For example, naturally encountered pieces of granite differ in shape and size, can be broken or partially obscured, and appear different depending on lighting or environmental factors (mud, dirt, etc.). These perceptual obfuscations can indiscriminately affect both discriminative and characteristic features of stimuli, altering between-category differences as well as within-category similarities (Carvalho & Goldstone, 2017). To date, a small but growing literature has examined how people learn to categorize when learning conditions are less than ideal, examining the occlusion of study exemplars' perceptual features (e.g., Hornsby & Love, 2014; Meagher et al., 2018; Taylor & Ross, 2009) or the restriction of the training range to typical cases (Hornsby & Love, 2014). But no studies have examined difficulties introduced at test, which may or may not have effects similar to those observed during learning – memory, for example, is more affected by divided attention at study than at test (Craik et al., 1996).

Here we focus on the potential benefits of interleaved study, as opposed to blocked study (see Dunlosky et al., 2013). To examine transfer in impoverished contexts, we simulated a real-world impoverished context common in aviation and maritime operations: night vision (Gauthier et al., 2008; Johnson, 2004; Ruffner et al., 2001; Salazar et al., 2003). We adapted rock stimuli (Miyatsu et al., 2019) to simulate their appearance through night-vision goggles, partially occluding two diagnostic features: color and, to a lesser extent, granularity (see Fig. 1). Thus, this night-goggle simulation affected both discriminative features (ones that differentiate between categories) and characteristic features (those shared within a category). For example, color is discriminative when identifying rock gypsum (almost always white) and obsidian (almost always black), as those colors are relatively unique in the set of rocks used. In contrast, sandstone rocks are similarly colored to each other, but their color palette is also similar to that of many other rocks – making color characteristic but not discriminative (see Carvalho & Goldstone, 2017; Nosofsky et al., 2017).

Fig. 1

An example (a) of the interleaved and blocked study sequences; for visualization purposes, the example uses fewer exemplars than were actually used in Experiments 1 and 2. (b) An example of the same rock exemplar in the control and impoverished contexts. (c) An example of a rock as studied in the feature descriptions condition

Compared to blocked study, interleaving often leads to better learning (Dunlosky et al., 2013). This is especially true for highly overlapping categories (Carvalho & Goldstone, 2014) and more difficult-to-learn categories (Zulkiply & Burt, 2013). Sequential Attention Theory (SAT) posits that such effects occur because interleaving directs attention to features that differentiate between categories (discriminative features), whereas blocked learning highlights shared features within a category (characteristic features; Carvalho & Goldstone, 2015; Carvalho & Goldstone, 2017). While task-specific circumstances and variability influence the effectiveness of each study strategy, one prediction of SAT is that a blocked study schedule may be a more effective learning strategy when the nature of the final test is unknown (Carvalho & Goldstone, 2015). This prediction arises because SAT posits that blocked sequences lead to more localized representations of categories, absent inter-category context. Conversely, an interleaved sequence contextualizes category representations, promoting interconnected, between-category representations through the learning of discriminative features. Simply put, the larger learning context exerts less influence on representations extracted during blocked learning than during interleaved learning, predicting that blocked learning will yield representations that are more context flexible and more useful when test conditions are unknown during learning. Therefore, one prediction of SAT is that blocked learning will benefit far transfer to a testing environment that differs from the learning context – here, one where perceptual obfuscations indiscriminately affect both discriminative and characteristic features (Carvalho & Goldstone, 2015).

Therefore, in two studies where learning occurred under ideal, unaltered conditions, we manipulated two strategies proposed to improve learning and transfer of rock classification: interleaving (vs. blocked study) and, in an exploratory manipulation, feature descriptions, which described and circled rocks' key features (Miyatsu et al., 2019). Here we aimed to replicate two findings (pre-registered): that under standard test conditions – i.e., control contexts – both memory for studied items and identification of novel rocks would be greater following interleaved vs. blocked practice (see Dunlosky et al., 2013). Further, we pre-registered two novel questions: Under impoverished contexts, will (1) memory for studied instances and (2) transfer to new exemplars be greater following interleaved versus blocked practice? As an exploratory (pre-registered) question, we manipulated whether features were labeled during learning to investigate whether any of the above effects are modulated by the use of feature descriptions (see Miyatsu et al., 2019).

We investigated these questions in two experiments that differed only in the amount of time separating study and test. The test occurred almost immediately after study in Experiment 1, but was delayed by 2 days in Experiment 2, as research on the relationship between study-test delay and the benefits of interleaving has produced mixed findings (for reviews, see Brunmair & Richter, 2019; Dunlosky et al., 2013).

Experiment 1

Method

Participants and design

Duke University's institutional review board approved both experiments. Both were preregistered on the Open Science Framework (https://osf.io/3fmxj/). We conducted an a priori power analysis using G*Power 3.1.9.2 (Faul et al., 2007) for 2 (between: interleaved vs. blocked) × 2 (between: feature descriptions vs. no feature descriptions) ANOVAs with power set at .9, α = .05, and Cohen's f = .20, which suggested a sample size of 265. Our targeted sample size was 280 participants, given that we expected some attrition and non-compliance.
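For readers without G*Power, the computation can be approximated in R. The sketch below substitutes the pwr package for G*Power (an assumption on our part); note that conventions for converting error degrees of freedom into a total N differ slightly across tools.

```r
# Approximate R analogue of the a priori power analysis (the original used
# G*Power 3.1.9.2; pwr is a substitution here). For a 1-df effect in a
# 2 x 2 between-subjects ANOVA, Cohen's f = .20 corresponds to f^2 = .04.
library(pwr)

res <- pwr.f2.test(u = 1, f2 = 0.20^2, sig.level = .05, power = .90)
res$v  # error (denominator) df ~ 261; adding the four group cells gives N ~ 265
```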

In total, Experiment 1 included 280 participants recruited from Prolific in February/March 2020 (www.prolific.co; Palan & Schitter, 2018). Participants were required to (1) currently reside in the USA or UK, (2) speak English as a first language, (3) have a 90% study approval rate, (4) have a minimum of 100 study submissions, and (5) complete the study using a desktop computer. Data for 31 participants were excluded from analysis due to failure to complete the full study (n = 5) or failure of one or both attention checks (n = 26). Thus, the final sample included 249 participants (M age = 38 years, SD = 13; 57% female; 87% White).

Materials

Materials included images of rocks from Miyatsu et al. (2019) from 12 different rock categories (i.e., amphibolite, breccia, conglomerate, gneiss, granite, obsidian, marble, pegmatite, pumice, rock gypsum, sandstone, and slate). Out of the 144 exemplars used in Miyatsu et al. (2019), we randomly selected 120 exemplars for the current research (ten exemplars per category). Across all participants, 72 of these exemplars were randomly selected to serve as study stimuli and the remaining 48 served as novel rocks on the final classification test.

At study, images were presented either with or without highlighted feature descriptions, depending on group assignment. Images with feature descriptions included the rock with key category features circled and annotated, whereas images without feature descriptions did not (see Miyatsu et al., 2019, Exp. 2). At test, all images were presented without feature descriptions. Out of the 72 studied rocks, 48 were tested: 24 were randomly selected to be tested in the original control context and another unique set of 24 was tested in the impoverished context. Similarly, out of the 48 novel rocks (the transfer items), we randomly selected 24 to be tested in the original control context and 24 to be tested in the impoverished context. Rocks tested in the control context were presented in their original condition in full color (as in Miyatsu et al., 2019). For impoverished-context test stimuli, we adapted images from Miyatsu et al. (2019) to appear as if they were being viewed through a night-vision filter, using the following procedure. First, each image was converted to a monochrome scale. Then the mixing of the RGB source channels was set to 0%, 100%, and 0%, respectively. Images were then adjusted in HSL color space to values of 75, 100, and −25 for hue, saturation, and lightness, respectively. Finally, for each image, the black value in CMYK color space was set to 100%. This effectively simulated each rock under a night-vision context. The use of a "night vision" filter therefore does not simply darken stimuli, but mimics the complex and multidimensional way lighting and other environmental features might influence the color of real-world stimuli. For these adapted stimuli, please see https://osf.io/3fmxj/.
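To make this pipeline concrete, a rough R sketch of such a filter is shown below. It assumes the editor's channel-mixer and HSL operations can be approximated with simple array arithmetic: the colorizing step uses HSV rather than HSL math, and the final CMYK black adjustment is omitted, so this is illustrative rather than the exact tool chain used.

```r
# Rough approximation of the night-vision adaptation with base R arrays;
# parameter mappings are illustrative, not the exact editor operations.
library(png)  # readPNG()/writePNG() use numeric arrays scaled to [0, 1]

night_vision <- function(in_path, out_path) {
  img <- readPNG(in_path)                    # expects an RGB(A) source image
  if (dim(img)[3] == 4) img <- img[, , 1:3]  # drop alpha channel if present

  # Channel mixer at 0%/100%/0%: take luminance entirely from green
  lum <- img[, , 2]

  # Approximate the -25 lightness shift, clamped to [0, 1]
  v <- pmin(pmax(lum * 0.75, 0), 1)

  # Colorize at hue ~75 deg with full saturation (HSV arithmetic):
  # at 75 deg, R = 0.75 * V, G = V, B = 0, yielding the green tint
  out <- array(0, dim = c(dim(img)[1:2], 3))
  out[, , 1] <- 0.75 * v
  out[, , 2] <- v
  writePNG(out, target = out_path)
}
```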

Procedure

After reserving a spot in the study on Prolific, participants were redirected to Gorilla (www.gorilla.sc), an online experiment builder that presented all tasks and instructions (Anwyl-Irvine et al., 2020). Participants then completed the consent form and were asked to minimize distractions prior to beginning the experiment. Next, participants were told that they would be asked to learn 12 types of rocks and that the experiment would have two major phases (i.e., a learning phase and a testing phase).

During the learning phase, participants studied six different exemplars of each rock category; they saw each exemplar twice during the study phase, for 144 study trials in total. On each study trial, participants passively viewed a rock image presented with its category name for 6 s (either with or without feature descriptions, depending on group assignment). Images were presented in either a blocked or an interleaved study schedule, depending on group assignment. For each group, we created a fixed, randomized order for study presentation (see Fig. 1). In the blocked group, participants studied all six exemplars for a given category back-to-back in the same order twice before moving on to the next category. In the interleaved group, participants studied one exemplar per category in six blocks of 12. Once all six blocks were presented, participants received the six blocks for a second round of study. At least three exemplars from different categories were presented before a new instance from the same category appeared.
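As an illustration of these scheduling constraints, the R sketch below (with hypothetical names; the actual orders were generated once and fixed per group) builds one blocked and one interleaved order, re-shuffling the interleaved order until every repetition of a category within a round is separated by at least three other trials.

```r
# Illustrative construction of the two study sequences (names hypothetical)
set.seed(1)
pool <- expand.grid(category = paste0("cat", 1:12), exemplar = 1:6)

# Blocked: all six exemplars of a category back-to-back, in the same order
# twice, before moving on to the next category (12 x 12 = 144 trials)
blocked <- do.call(rbind, lapply(split(pool, pool$category),
                                 function(b) rbind(b, b)))

# Interleaved: six blocks of 12 (one exemplar per category per block),
# re-shuffled until same-category items are at least four positions apart
repeat {
  interleaved <- do.call(rbind, lapply(split(pool, pool$exemplar),
                                       function(b) b[sample(nrow(b)), ]))
  gaps <- tapply(seq_len(nrow(interleaved)), interleaved$category,
                 function(pos) min(diff(pos)))
  if (min(gaps) >= 4) break  # >= 3 intervening items from other categories
}
interleaved <- rbind(interleaved, interleaved)  # second round of study
```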

In both groups, two pictures of unrelated objects (i.e., a stack of books and a fork) served as attention check trials. These trials were presented at the ends of blocks 4 and 11 of study (after trials 48 and 132; see Meade & Craig, 2012, for recommendations to use one attention check item per every 50–100 trials in survey research). On each attention check trial, the image and its name were also presented for 6 s. Participants were then immediately prompted to type in the name of the object on the next screen prior to continuing the study. Participants were unaware that these attention check trials would occur. After completing the learning phase, participants played Tetris for 3 min as a distractor task.

During the testing phase, participants took classification tests that required them to classify each of a series of rocks by selecting from a list of 12 possibilities (presented alphabetically). Participants indicated their choice by clicking on it with the mouse cursor. Participants were asked to avoid outside sources and to simply try their best. Participants were not instructed that some rocks would be new and some would be those they previously studied. The test was self-paced.

The order of the standard and impoverished context tests was counterbalanced evenly within each group. For both tests, novel items were always tested prior to studied items, but each half of the tests was randomized anew for each participant. For the impoverished context test, participants were informed that each rock would be presented under a night-vision filter during classification. After completing the final tests, participants filled out a demographics questionnaire and were awarded $6.75 for their participation.

Analysis

All primary analyses for both experiments were pre-registered at https://osf.io/3fmxj/. Our pre-registered analysis plan stated that we would conduct four 2 × 2 between-subjects ANOVAs (study schedule × feature descriptions), one for each of the four classification item types (i.e., studied items in the same context, studied items in a different context, novel items in the same context, novel items in a different context). However, upon inspection of the data, we deviated from our pre-registered analysis plan in favor of a linear mixed-models approach.

Classification accuracy data from the testing phase were fit with five generalized mixed-effects logit models using the lme4 package in R (Bates et al., 2015). Each model had an identical crossed random-effects structure, with a random intercept for each subject as well as for each rock category. The first, null model included only the random-effects structure.

Our hypotheses focused on the effects of four fixed factors on categorization performance at test: Study Schedule (Interleaved vs. Blocked), Study Status of test items (Studied vs. Novel exemplars), Context (Standard vs. Impoverished), and Feature Descriptions (Present vs. Absent at study). We structured a set of four hierarchical generalized mixed-effects models to test whether including each of these fixed effects was warranted. We iteratively added each fixed factor to successive models to test whether its inclusion improved model fit, indicating whether that factor provided significant explanatory power in characterizing categorization performance at test. The fixed-effects structure of these models can be summarized as: Null: random effects only; Model 1: Study Schedule; Model 2: Study Schedule × Study Status; Model 3: Study Schedule × Study Status × Context; Model 4: Study Schedule × Study Status × Context + Feature Descriptions. The fit of these mixed models was assessed using the anova() command in R to calculate AIC scores and conduct a chi-squared test of each model against its hierarchically subordinate model (i.e., null vs. 1-factor model).
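In lme4 syntax, this model set can be written as follows (a sketch with hypothetical variable names; the data are trial-level, with a 0/1 accuracy outcome and crossed random intercepts for subjects and rock categories).

```r
# Sketch of the hierarchical model set (variable names hypothetical)
library(lme4)

m0 <- glmer(accuracy ~ 1 + (1 | subject) + (1 | rock_category),
            data = test_data, family = binomial)
m1 <- update(m0, . ~ . + schedule)
m2 <- update(m0, . ~ . + schedule * study_status)
m3 <- update(m0, . ~ . + schedule * study_status * context)
m4 <- update(m0, . ~ . + schedule * study_status * context + features)

# AIC plus a chi-squared (likelihood-ratio) test of each model against its
# hierarchically subordinate model, as reported in Tables 2 and 5
anova(m0, m1, m2, m3, m4)
```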

Results

Means and standard errors for all groups are presented in Table 1; these results are visualized in Fig. 2. After the data were fit to each model, the model-fit test indicated that Model 3, in which the Study Schedule, Study Status, and Context factors were included as main effects and interactions, was the best-fitting model (∆AIC = 276.28; p < .0001; Table 2). The inclusion of the Feature Description factor in Model 4 as a main effect did not significantly improve model fit (∆AIC = −1.79; p = .643; Table 2). As such, we concluded that the inclusion (or omission) of Feature Descriptions during study did not impact subsequent memory or transfer performance, and no follow-up tests of interactions were considered. The results of Model 3 can be seen in Table 3.

Table 1 Means and standard errors for final test performance for Experiment 1 (3-min delay)
Fig. 2

Categorization accuracy (%) in Experiment 1, as a function of the between-participants study schedule, context (control vs. impoverished), and whether items were previously studied or novel. Error bars are standard error, and data points are individual participants

Table 2 Results of the model comparison for hierarchical models of classification accuracy in Experiment 1
Table 3 Summary results of the Study Schedule × Study Status × Context model for Experiment 1

To explicitly test our four pre-registered hypotheses, we performed follow-up contrasts of Model 3. We found that interleaved versus blocked study schedules led to greater classification accuracy in control contexts for studied items as well as for novel, transfer items (Memory: β = –0.60, p < .0001; Transfer: β = –0.41, p = .0002). This confirmed our first two hypotheses and replicated previous findings (see Dunlosky et al., 2013). Our latter two hypotheses focused on the benefits of interleaving for memory and transfer in impoverished contexts. Here, follow-up contrasts of Model 3 demonstrated that interleaved versus blocked study schedules led to greater classification accuracy in impoverished contexts for studied (memory) items as well as for novel (transfer) items (Memory: β = –0.55, p < .0001; Transfer: β = –0.31, p = .0051). These results demonstrate for the first time that an interleaved study schedule benefitted later categorization in an impoverished context, for both studied and novel (transfer) rocks.
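Such follow-up contrasts can be computed in several ways; one plausible approach, sketched below, applies the emmeans package to Model 3 (the specific contrast machinery used is an assumption here).

```r
# One plausible way to obtain the follow-up contrasts on Model 3
# (the package used for the contrasts is an assumption)
library(emmeans)

# Interleaved vs. blocked within each study-status-by-context cell,
# on the log-odds scale on which the betas above are reported
emm <- emmeans(m3, ~ schedule | study_status * context)
pairs(emm)
```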

Experiment 2

Experiment 2 was designed to replicate the findings of Experiment 1 with one major change: the time between study and test was extended from 3 min to 48 h. Some effects in cognitive science depend upon the length of the delay between learning and testing. For instance, the benefits of retrieval practice (see Congleton & Rajaram, 2012; Karpicke & Roediger, 2007; Roediger & Karpicke, 2006) are normally observed on delayed tests (e.g., 2 days), whereas the effect disappears or even reverses in favor of rereading on relatively immediate tests (e.g., 3 min). Existing research evaluating the extent to which interleaving benefits depend on test delay is mixed (for reviews, see Brunmair & Richter, 2019; Dunlosky et al., 2013). A recent meta-analysis by Brunmair and Richter (2019) suggests that interleaving benefits are not moderated by test delay or by whether test items are studied or novel. However, given that transfer across different contexts is vastly understudied, the extent to which this meta-analysis applies to the current research is unclear. Thus, we increased the delay between study and test to approximately 48 h in Experiment 2 to evaluate the extent to which effects and/or effect sizes may change at longer delays.

Method

Participants and design

As in Experiment 1, our minimum required sample size was 265 participants. We again oversampled, recruiting 276 participants from Prolific in May/June 2020, given that we expected some attrition and non-compliance. Eligibility requirements were the same as in Experiment 1. Data for 36 participants were excluded from analysis due to failure to complete the full study (n = 4) or failure of one or both attention checks (n = 32). Thus, the final sample included 240 participants (M age = 36 years, SD = 13; 69% female; 88% White).

Procedure

The procedure for Experiment 2 was exactly the same as Experiment 1, except for one change. After participants completed study, they did not play Tetris. Instead, they were told that they were done for the day and would receive a reminder email in approximately 48 h to complete the second part of the study (i.e., the testing phase). Upon completion of both sessions, participants were awarded $10.00 for their participation.

Analysis

As in Experiment 1, we deviated from our pre-registered analysis plan and fit the classification accuracy data during the testing phase to a set of generalized linear mixed logit models.

Results

Means and standard errors for all groups are presented in Table 4; these results are visualized in Fig. 3. After the data were fit to each model, the model-fit test indicated that Model 3, in which the Study Schedule, Study Status, and Context factors were included as main effects and interactions, was the best-fitting model (∆AIC = 199; p < .001; Table 5). The inclusion of the Feature Description factor in Model 4 as a main effect did not significantly improve model fit (∆AIC = −1; p = .639; Table 5). As such, we concluded that the inclusion (or omission) of Feature Descriptions during study did not impact subsequent memory or transfer performance, and no follow-up tests of interactions were considered. The results of Model 3 can be seen in Table 6.

Table 4 Means and standard errors for final test performance for Experiment 2 (2-day delay)
Fig. 3

Categorization accuracy (%) in Experiment 2, as a function of the between-participants study schedule, context (control vs. impoverished), and whether items were previously studied or novel. Error bars are standard error, and data points are individual participants

Table 5 Results of the model comparison for hierarchical models of classification accuracy in Experiment 2
Table 6 Summary results of the Study Schedule × Study Status × Context model for Experiment 2

We again explicitly tested our four hypotheses by conducting follow-up contrasts of Model 3. In control contexts, interleaved study again led to better classification performance than did blocked study, regardless of whether the test items were studied or novel, transfer items (Memory: β = –0.62, p < .0001; Transfer: β = –0.55, p < .0001). Further, interleaved vs. blocked study schedules led to greater classification accuracy in impoverished contexts, for both studied items and novel, transfer items (Memory: β = –0.56, p < .0001; Transfer: β = –0.35, p = .0046). Importantly, these results replicate and extend our findings from Experiment 1, indicating that interleaved study leads to more accurate classification of studied and novel rocks tested in impoverished contexts, up to ~48 h after study.

General discussion

Across two experiments, we demonstrated far transfer: participants successfully identified novel rocks in a simulated night-vision environment that obscured rock color and granularity. Performance (Exp. 1: 50%; Exp. 2: 48%) was much higher than chance (8.33%). Critically, interleaving (vs. blocked study) led to better identification of novel, transfer items in this impoverished context. Interleaving also benefited the identification of studied rocks in the impoverished context (Figs. 2 and 3, Tables 1 and 4). As expected, we replicated previous work showing the benefits of interleaving in a typical test environment, where no features were obscured. All benefits of interleaving occurred both immediately and after a 2-day delay, congruent with a recent meta-analysis by Brunmair and Richter (2019) suggesting that the benefits of an interleaved study schedule do not depend on the length of the test delay. Unrelated to our pre-registered hypotheses, we also saw a consistent, significant interaction between old/new (study) status and context. Typically, people are much better at classifying studied items than novel, transfer ones – a finding we observed under control contexts at test. But this benefit was reduced under impoverished test contexts; there, performance on studied and novel items was similar (Figs. 2 and 3, Tables 3 and 6).

In contrast, feature descriptions (or the lack thereof) did not affect later transfer (Tables 1, 2, 3, 4, 5 and 6), a finding inconsistent with Miyatsu et al. (2019, Experiments 2 and 3). There, labeling features during learning improved performance on a standard transfer test 2 days later (which required participants to identify new exemplars, but in the same unobstructed context as the studied items). In our work, feature descriptions had almost no effect on performance regardless of final test context or item type (studied vs. novel). To be very clear, this is not to say the present work finds evidence against the benefits of feature descriptions during study. Feature highlighting is thought to benefit category learning by biasing attention toward category-relevant features or by promoting the learning of qualitative differences in category representations (Miyatsu et al., 2019). The benefits of the other study strategies examined here (i.e., interleaved vs. blocked study schedules) are not necessarily additive with those of feature descriptions. Therefore, the benefits of feature descriptions during study may be rendered ineffective when paired with other study strategies.

To our knowledge, this is the first demonstration that interleaved (vs. blocked) study schedules lead to better categorization of novel exemplars in a new, impoverished context. This finding adds to a growing literature focused on how changes in perceptually available features between learning and test impact performance (Hornsby & Love, 2014; Meagher et al., 2018; Taylor & Ross, 2009). While previous literature has investigated the effects of manipulating available perceptual features during learning (see Meagher et al., 2018), understanding how changes in the testing environment (i.e., where perceptual features of stimuli are occluded) affect transfer remains understudied (see Hornsby & Love, 2014).

These results are inconsistent with SAT's prediction about the ideal study strategy when test conditions are unknown, namely that blocked learning is more advantageous when transfer conditions at test differ from study conditions. SAT posits that different learning strategies draw attention to different features of to-be-learned exemplars, with consequences for later performance: interleaving promotes the learning of differences between exemplars from different categories, while blocking promotes the learning of similarities between items of the same category (Carvalho & Goldstone, 2017; Dunlosky et al., 2013; Nosofsky, 2011). Given the benefits of interleaving observed in both of our experiments, the learning of discriminative features appears to matter more when the goal is transfer to a situation in which discriminative and characteristic features are both occluded (see Carvalho & Goldstone, 2017, Figs. 2 and 3). Contrary to SAT's prediction, the robust, interrelated networks of categories created by learning discriminative features during interleaved study may be more resilient to broad contextual changes, whereas the locally segregated networks of characteristic, within-category features promoted by blocked study might be less adaptable (Zulkiply & Burt, 2013; see also Goldstone, 1996).

The current results could alternatively be explained via differences in the initial, baseline quality of learning. While we do not have a direct measure of initial learning (as participants studied image-label pairs without making any responses), it is reasonable to assume learning was higher in the interleaved condition given past research. In both studies (Tables 3 and 6), interleaved study always led to better categorization performance, for both studied and novel items. However, the data observed in the impoverished condition do not perfectly mirror those in the control context (as might be expected if the amount learned by the end of the learning phase were the main predictor of final test performance). Interleaving led to better performance in the impoverished condition, with the expected decline for new exemplars (as compared to studied rocks; Figs. 2 and 3). Importantly, this was not true for the blocked study schedule: performance on studied and novel items was quite similar, albeit low, in the impoverished context (Figs. 2 and 3). Perhaps a blocked study schedule protects against a decline in categorization performance for novel items relative to studied items, but the content is simply not learned as well overall as under an interleaved schedule. Conversely, this difference could also be interpreted as a memory boost specifically for studied items learned under an interleaved schedule, resulting from the benefit of distributed practice on memory retention (for review, see Dunlosky et al., 2013). However, whether these effects are present under less passive study conditions remains to be studied (see Carvalho & Goldstone, 2015).

While previous literature has explored the far transfer of category learning for relational category structures (Patterson & Kurtz, 2020; see also Loewenstein, 2010), these studies often conceptualize far transfer as obscuring only characteristic features of stimuli, while leaving intact the diagnostic and relational features. Here, our conceptualization of the contextual far transfer of category learning seeks to account for the perceptual conditions of new, real-world contexts, which often indiscriminately obscure both characteristic and diagnostic features of an item. A salient, but imperfect, analogy is how the changing light at sunset alters the visible color of an entire mountain rockface, rather than a specific set of features – altering all colors of a granite rockface instead of only the red specks in the granite. The rockfaces of the Sandia Mountains outside Albuquerque, New Mexico, for example, change color in the evening light without regard to the distinction between discriminative and characteristic features. Together, we propose that these results speak both to current debates within the literature regarding the usefulness of blocked versus interleaved learning for complex real-world stimuli (for review, see Hughes & Thomas, 2021; see also Flesch et al., 2018), and more generally to the usefulness of discriminative versus characteristic features for the far transfer of category learning in impoverished, real-world contexts (see also Murphy, 1982; Murphy & Ross, 2005).

Conclusion

The current research demonstrated the benefits of interleaved study schedules on test performance when the physical context at test differs from the learning context, both for studied and for novel exemplars. Specifically, we replicated the previous, pre-registered finding that under control test conditions, memory for studied instances and transfer to new exemplars are greater following interleaved vs. blocked practice (see Dunlosky et al., 2013), and extended it by showing that this benefit carries over to new, impoverished contexts. While we also manipulated the use of feature descriptions during study to investigate potential performance benefits at test, we found no effect of feature descriptions (or their absence) on performance at any level. The current work highlights the importance of manipulations at learning for promoting transfer of category learning at test, showing that interleaved study schedules promote better far transfer of category learning. Future work should continue to investigate the benefits of different learning strategies and manipulations for promoting the far transfer of category learning.