Individuals encode an overwhelming amount of information every day, but not all is remembered and only some is strengthened during sleep (Ellenbogen et al., 2006). A question remains as to what information is subject to sleep-dependent consolidation processes. Current research largely focuses on consolidation of information that is remembered for a subsequent test. When participants study paired associates for a test, recall performance is better after a period of sleep than waking (Ellenbogen et al., 2009; Fenn & Hambrick, 2012; Marshall et al., 2006; Tucker et al., 2006).

It is not clear if the expectation of a memory test influences sleep-dependent consolidation. One study found that sleep benefits information only if a test was expected (Wilhelm et al., 2011). After training on paired associates, participants who were told that there would be a test showed less forgetting after sleep than wake, but for participants who did not expect a test, performance was similar after sleep and wake. This finding suggests that sleep selectively consolidates memory based on factors such as intentionality (Born & Wilhelm, 2012; Rasch & Born, 2013), but was not replicated in later work (Ashton & Cairney, 2021). Due to these mixed results, it remains unclear if sleep affects memory for information that is not being remembered for a subsequent test.

Most studies on sleep-dependent consolidation focus on declarative memory for information that has been intentionally studied. Less is known about the extent to which sleep affects memory for information that is simply encountered. Although sparse, some studies have investigated memory for information that was incidentally encoded, meaning that individuals worked with information in some way but did not purposefully study it. The general finding is that sleep strengthens veridical memory after deep processing. Sleep strengthened memory for negative and neutral images when participants made arousal judgments (Baran et al., 2012), and strengthened recall after deep encoding but did not affect recognition after shallow encoding (Jurewicz et al., 2016). Importantly, there has not been a study that directly compared the effect of sleep after deep versus shallow encoding, using the same test. Thus, it remains unclear if sleep affects information similarly based on depth of processing and what factors influence consolidation after incidental encoding.

Furthermore, although most consolidation research focuses on veridical representations, evidence from a diverse set of paradigms shows that sleep can abstract beyond studied information. In the number reduction task (NRT), seven steps are required to obtain the correct answer, but the task can be successfully solved in two steps; the answers to the second and seventh steps are identical. Participants are more likely to gain insight into this hidden rule after sleep than wake (Verleger et al., 2013; Wagner et al., 2004). Sleep also promotes resolution of remote associate test (RAT) problems. Participants are more likely to solve problems that they could not initially solve, if tested after sleep than wake (Cai et al., 2009; Sio et al., 2013). Thus, sleep can abstract conceptual representations that may not have been explicitly recognized during encoding.

Similarly, sleep can increase abstraction in the Deese–Roediger–McDermott (DRM) paradigm (Deese, 1959; Roediger III & McDermott, 1995), wherein participants study lists of semantically related words (door, glass, pane) that converge on a common theme, the critical lure (window). This critical lure is often falsely retrieved at test, potentially due to the use of gist-based representations to retrieve information (Brainerd & Reyna, 2001). When a recall test is used, participants show higher false recall of critical lures after sleep than wake (Chatburn et al., 2014; Diekelmann et al., 2010; Newbury & Monaghan, 2019; Payne et al., 2009). When a recognition test is used, sleep reduces false recognition of critical lures (Fenn et al., 2009; Lo et al., 2014), although a recent meta-analysis has cast doubt on the strength of these recognition effects (Newbury & Monaghan, 2019). There is no reason to assume these disparate results are due to differences in encoding or consolidation processes because the methods between these studies are similar until the test. Instead, differences in performance may suggest that different processes govern retrieval at test. During recall, participants may rely more on gist-based than veridical memory, especially at later positions in response output (Colbert & McBride, 2007; Roediger III & McDermott, 1995). In the present study, we were not interested in retrieval processes; instead, we focused on encoding, an underexplored area in the field.

In two experiments, we investigated consolidation of veridical and gist-based memory after incidental encoding and assessed the extent to which depth of processing affected sleep-dependent consolidation. Participants rated DRM words in a deep or shallow encoding task and completed a surprise recognition test after a 12-hour interval containing wake or sleep. In Experiment 1, participants rated words in order of descending associative strength with the critical lures; in Experiment 2, participants rated the same words in random order across the experiment. In Experiment 1, we predicted sleep would strengthen gist-based memory for list words and critical lures after deep, but not shallow, encoding. In Experiment 2, we predicted sleep would strengthen veridical memory for list words after deep, but not shallow, encoding and would not consolidate gist.

Experiment 1

Method

Participants

We recruited native English-speaking undergraduate students from Michigan State University (N = 193) with no history of memory or sleep disorders. Although there are a few studies on sleep and memory for incidentally encoded information in the literature, we began data collection prior to their publication. Therefore, we used our prior work on sleep and memory for DRM stimuli (Fenn et al., 2009, Experiment 3) as a guide, and we aimed to recruit approximately the same number of participants (i.e., 30) in each of our Sleep and Wake groups as in the prior work. Thus, we aimed to recruit 120 participants across the four experimental groups (Sleep, Deep; Sleep, Shallow; Wake, Deep; Wake, Shallow), with 30 participants in each cell. We aimed to recruit an additional 120 participants for our control groups (Morning and Evening). Our goal was to collect all of our data in one year; therefore, we also instituted a stopping rule such that data collection would end at the end of one academic year.

In total, we only collected data from 193 participants in the time we allotted for this experiment and stopped data collection at the end of one academic year. This resulted in 47 fewer participants than we originally planned. Some participants were excluded from analyses due to attrition (N = 13), incomplete data (i.e., experimental error or not performing the encoding task correctly, N = 5), or for napping during the waking retention interval (N = 2). The final sample contained 173 participants (101 female; Mage = 20.22 years, SDage = 3.0). Participants provided informed consent and were compensated with course credit. This study was approved by the Michigan State University Institutional Review Board.

Design

Participants were assigned to either an experimental group (Wake, Sleep) or a time of day control group (Morning, Evening). Within each group, participants were randomly assigned to one of two encoding tasks (Deep, Shallow).

All participants completed an encoding phase and a surprise recognition test. For the experimental group, half of the participants completed encoding in the morning (09:00–10:00) and the test in the evening (21:00–22:00), following a period of wakefulness (Wake). The other half completed encoding in the evening (21:00–22:00) and the test in the morning (09:00–10:00), after a night of sleep (Sleep). In the control group, participants completed encoding and test in a single session; half did so in the morning (09:00) and the other half, in the evening (21:00).

Materials

We chose 20 DRM lists based on backward associative strength (BAS), which is a measure of the association of each list word to its respective critical lure (Roediger III et al., 2001). These 20 lists were then divided into two sets, each containing 10 lists (Table S1). Sets were normed for BAS, false recognition rates, and connectivity (see the Supplemental Online Materials [SOM] for more information). List set was counterbalanced within each condition and performance across sets was similar, so we collapsed across the two versions in our analyses (for more information, see the SOM).

The test contained 80 items. All participants completed the same test, but the items varied based on which set of lists participants studied; items that were targets for some participants were lures for others. Of the 80 items, 30 were list words (from list positions 1, 8, and 10) and 10 were critical lures from the studied lists. The other 40 items were unrelated lures: 10 critical lures and 30 list words from the unstudied lists.
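The composition of the test can be made concrete with a short sketch. The code below is illustrative only and is not the authors' materials script: the function name `build_test`, the dictionary layout, and the labels are assumptions introduced here. It assembles the items described above from ten studied and ten unstudied lists and tags each item by type.

```python
# Illustrative sketch only; names and data layout are hypothetical.
TESTED_POSITIONS = (1, 8, 10)  # list positions sampled for the test (1-indexed)


def build_test(studied, unstudied):
    """Assemble test items from two sets of DRM lists.

    Each argument maps a critical lure to its 15 list words, ordered by
    list position. Returns a list of dicts describing every test item.
    """
    items = []
    for status, lists in (("studied", studied), ("unstudied", unstudied)):
        for lure, words in lists.items():
            # Three list words per list, taken from the tested positions.
            for position in TESTED_POSITIONS:
                items.append(
                    {"word": words[position - 1], "type": "list_word", "list": status}
                )
            # One critical lure per list (never presented at encoding).
            items.append({"word": lure, "type": "critical_lure", "list": status})
    return items
```

With ten lists per set, this yields 30 list words and 10 critical lures from the studied lists plus 30 list words and 10 critical lures from the unstudied lists, for 80 items in total.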

Procedure

Prior to the start of the experiment, we told participants that the experiment was investigating how time of day affects subjective word ratings and that they would be doing the same task in both sessions. This cover story was designed to ensure that encoding was incidental and that participants did not try to memorize the words.

During the encoding phase, participants first rated subjective sleepiness using the Stanford Sleepiness Scale (Hoddes et al., 1973; for details and data, see SOM) and then rated individual words in either a deep or shallow encoding task. In the deep encoding task, participants assessed how abstract each word was on a 7-point Likert scale ranging from concrete to abstract. We defined concrete as something that physically exists (e.g., piano) and abstract as a quality that does not physically exist in nature (e.g., thought). In the shallow encoding task, participants simply indicated the number of vowels in each word on a 7-point scale. Thus, in the deep encoding task, participants encoded semantic features of each word, whereas in the shallow encoding task, participants simply encoded visual features. List words were presented individually on the computer screen, in order of descending associativity with the critical lure. No time limit was imposed; participants had as much time as they needed to make a decision. Each experimental block contained the 15 words from a single DRM list (Roediger III et al., 2001). After each block, participants had the opportunity to take a five-second break. After the encoding phase, experimental participants left the laboratory and control participants listened to a 2-minute audio clip to reduce rehearsal.

During the test phase, either immediately after the audio clip (control group) or after a 12-hour retention interval (experimental group), participants were informed that they would take a surprise recognition memory test. Experimental participants first completed the Stanford Sleepiness Scale and then the test. On the test, participants provided old/new judgments on each word, presented in a random order. They were instructed to indicate that a word was “old” if they had rated it during the encoding phase and to indicate that a word was “new” if they had not rated it. After each “old” response, participants were asked to rate their confidence on a 7-point Likert scale ranging from “Not confident at all” to “Extremely confident.” Participants were instructed to take as much time as needed to maximize accuracy. After completing the test, participants completed the Morningness-Eveningness Questionnaire, an 18-item survey that measures chronotype or preferred time of day (Horne & Östberg, 1976; see SOM for more details) and a demographic questionnaire.

Results

To assess memory discrimination, we used hits and false alarms to compute d-prime (d′); higher d′ represents better discriminability (Macmillan & Creelman, 2004). For list words, hits were the proportion of times participants responded “old” to list words from studied lists and false alarms were the proportion of times participants responded “old” to list words from unstudied lists. Because we were interested in gist-based representations, we computed d′ for critical lures by using the proportion of “old” responses to critical lures from studied lists as hits (even though these are actually false alarms) and the proportion of “old” responses to critical lures from unstudied lists as false alarms. This is the appropriate comparison because critical lures tend to be more frequent and familiar in the English language and have higher false-alarm rates than list words (Roediger III & McDermott, 1995). In cases where the hit rates were 1 or false-alarm rates were 0, we replaced these values with 1 − 1/(2n) and 1/(2n), respectively, where n is the number of trials on which the rate is based (Macmillan & Creelman, 2004; Macmillan & Kaplan, 1985).
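As a minimal sketch of this computation (not the authors' analysis code), d′ can be obtained as the difference between the z-transformed hit and false-alarm rates, with extreme rates replaced as described above. The function names and the response counts in the usage note are hypothetical. The sketch also applies the correction symmetrically (a hit rate of 0 or a false-alarm rate of 1), which is an assumption added here so that the z-transform is always defined.

```python
from statistics import NormalDist


def corrected_rate(n_old: int, n_items: int) -> float:
    """Proportion of "old" responses, with 0 and 1 replaced by 1/(2n) and 1 - 1/(2n)."""
    if n_old == 0:
        return 1 / (2 * n_items)
    if n_old == n_items:
        return 1 - 1 / (2 * n_items)
    return n_old / n_items


def d_prime(n_hits: int, n_signal: int, n_false_alarms: int, n_noise: int) -> float:
    """d' = z(hit rate) - z(false-alarm rate), using the standard normal quantile."""
    z = NormalDist().inv_cdf
    hit_rate = corrected_rate(n_hits, n_signal)
    fa_rate = corrected_rate(n_false_alarms, n_noise)
    return z(hit_rate) - z(fa_rate)
```

For example, a hypothetical participant who responds “old” to 24 of 30 studied list words and to 6 of 30 unstudied list words has `d_prime(24, 30, 6, 30)` of about 1.68. The same function applies to critical lures by passing the counts for critical lures from studied and unstudied lists (10 items each).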

We first performed a mixed-design ANOVA on d′ with item type (List Words, Critical Lures) as a within-subjects factor and condition (Wake, Sleep) and encoding (Deep, Shallow) as between-subjects factors. Descriptive statistics are displayed in Table 1, and results are displayed in Fig. 1. As expected, there was a main effect of encoding, F(1, 89) = 90.50, p < .001, ηp2 = 0.50; deep encoding led to higher d′ than shallow, and a main effect of item type, F(1, 89) = 89.36, p < .001, ηp2 = 0.50; d′ was higher for list words than critical lures. Importantly, there was a main effect of condition, F(1, 89) = 19.41, p < .001, ηp2 = 0.18. Participants showed higher d′ after sleep than waking. There was not an interaction between condition and encoding, F(1, 89) = 0.92, p = .34. There was an interaction between encoding and item type, F(1, 89) = 13.43, p < .001, ηp2 = 0.13; but this interaction was not relevant to our research question. Finally, there were no interactions between condition and item type or condition, item type, and encoding, Fs < 1. Results regarding differences in response bias and confidence did not inform the main analyses and can be found in the Supplemental Online Materials (SOM).
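For readers who wish to check the reported effect sizes, partial eta squared can be recovered from an F ratio and its degrees of freedom through the identity ηp2 = (F × df_effect) / (F × df_effect + df_error). The sketch below is a reader-side check, not part of the original analysis; the function name is an assumption introduced here.

```python
def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Values reported above for Experiment 1.
print(f"{partial_eta_squared(90.50, 1, 89):.2f}")  # main effect of encoding -> 0.50
print(f"{partial_eta_squared(19.41, 1, 89):.2f}")  # main effect of condition -> 0.18
```

Both recovered values agree with the effect sizes reported in the text.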

Table 1 Experiment 1 means (and standard deviations) for the proportion of “Old” responses to the various types of test stimuli and d′ scores for list words and critical lures across experimental and control groups

Fig. 1 Sensitivity (d′) for the Wake and Sleep conditions after Deep and Shallow encoding for list words and critical lures in Experiment 1. Note. Error bars represent the standard error of the mean

To ensure our results were not affected by diurnal or circadian effects, we conducted a multivariate repeated-measures ANOVA on d′ for control participants with item type (List Words, Critical Lures) as a within-subjects factor, and encoding (Deep, Shallow) and time (Morning, Evening) as between-subjects factors (Table 1). Importantly, there was not a main effect of time, F(1, 76) = 0.03, p = .87, and there were not any interactions between time and encoding or item type, Fs < 1 (Fig. S4). Finally, there was no evidence that sleepiness or chronotype affected our primary results (see SOM).

Thus, we found evidence of sleep-dependent consolidation after both deep and shallow encoding for list words and critical lures. This suggests that sleep may have consolidated the overall theme of the list, or gist, which can account for both increased correct memory and increased false memory of critical lures (Brainerd & Reyna, 1990; Cann et al., 2011). Because gist consolidation alone could produce the advantage for list words, it remains unclear whether sleep also consolidated veridical memory. We explore this possibility in Experiment 2.

Experiment 2

To better elucidate the effect of sleep on veridical memory following incidental encoding, we used the same words and encoding tasks as in Experiment 1; however, to reduce gist-based processing, participants rated words in random order across the encoding phase (Mather et al., 1997). We predicted that after deep encoding, the Sleep group would show higher sensitivity to list words than Wake. We were unsure if sleep would affect memory in the shallow encoding task. We also expected critical lure sensitivity would not differ between Wake and Sleep after either encoding task.

Method

Participants

We conducted an a priori power analysis in G*Power to estimate sample size for the experimental groups (i.e., participants in the Sleep and Wake conditions). The power analysis was conducted for an ANOVA with fixed effects and interactions to find a moderate effect (f = 0.35) with similar power to that in Experiment 1 (1 – β = .945). This analysis revealed a necessary sample size of 141 total participants across the two delay conditions and two encoding tasks. To balance groups, we aimed to have a sample of 144 participants in the experimental group and another 144 participants in the control group, for a total of 288 participants. To account for attrition and data loss due to napping, we recruited 393 undergraduates from Michigan State University who did not participate in Experiment 1. All participants were native English speakers with no history of memory or sleep disorders. Participants did not have strong time of day preferences (scores on the Morningness-Eveningness Questionnaire between 42 and 58; Horne & Östberg, 1976). Additionally, all participants had generally healthy sleep quality (score between 0 and 12 on the sleep disturbances scale of the Pittsburgh Sleep Quality Index [PSQI]; Buysse et al., 1989). Several participants were excluded from all analyses due to attrition (N = 24), napping during the waking retention interval (N = 22), or data loss caused by a program error (N = 13) or experimenter error (N = 35). The final sample included 299 participants (200 females; Mage = 19.32, SDage = 1.21).

Procedure

This experiment was nearly identical to Experiment 1, with the following exceptions. Critically, in Experiment 1, we presented words grouped by list and in order of descending associativity with the critical lure. In this experiment, we presented the same words, but the words were not grouped in lists. Instead, they were presented in random order across the encoding phase. The basic structure of the experiment was the same as in Experiment 1; participants rated 15 words in each of ten blocks, with a five-second break between blocks. At the beginning of each session, participants completed both the Stanford Sleepiness Scale and the Positive and Negative Affect Schedule, which assesses mood (Thompson, 2007; see SOM for more information and data). The recognition memory test was the same test used in Experiment 1, except that participants rated their confidence for both “old” and “new” judgments instead of only “old” judgments.
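The difference between the two presentation schemes can be sketched as follows. This is an illustration under stated assumptions rather than the experiment software: the function names are hypothetical, each input list is assumed to be already sorted by descending associative strength with its critical lure, and the optional seed exists only to make the example reproducible.

```python
import random


def blocked_order(lists):
    """Experiment 1 style: one block per DRM list, words kept in their given order."""
    return [list(words) for words in lists]


def randomized_order(lists, block_size=15, seed=None):
    """Experiment 2 style: shuffle all words across lists, then cut into blocks."""
    rng = random.Random(seed)
    pool = [word for words in lists for word in words]
    rng.shuffle(pool)
    # Consecutive slices keep the block structure (e.g., ten blocks of 15 words).
    return [pool[i:i + block_size] for i in range(0, len(pool), block_size)]
```

Because the shuffle is unconstrained in this sketch, several words from the same DRM list can still land in the same block by chance, even though the lists are no longer presented intact.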

Results

We used a mixed-design ANOVA on d′ with item type (List Words, Critical Lures) as a within-subjects factor and condition (Wake, Sleep) and encoding (Deep, Shallow) as between-subjects factors (Fig. 2). There were main effects of encoding, F(1, 140) = 57.06, p < .001, ηp2 = 0.29, and item type, F(1, 140) = 68.01, p < .001, ηp2 = 0.33; d′ was higher after deep than shallow encoding and higher for list words than critical lures (Table 2). There was not a significant main effect of condition, F(1, 140) = 2.24, p = .14, but there was a three-way interaction between condition, item type, and encoding, F(1, 140) = 4.33, p = .04, ηp2 = 0.03. To follow up this interaction, we ran separate ANOVAs on d′ for list words and for critical lures, with condition and encoding as between-subjects factors. For list words, there was a main effect of encoding, F(1, 140) = 55.16, p < .001, ηp2 = 0.28; d′ was higher after deep than shallow encoding. There was not a main effect of condition, F(1, 140) = 1.24, p = .27. The interaction between encoding and condition was significant, F(1, 140) = 6.42, p = .01, ηp2 = 0.04; d′ was higher after sleep than wake following deep encoding, t(71) = 2.21, p = .03, d = 0.52, but not following shallow encoding, t(69) = 1.28, p = .20. For critical lures, the corresponding ANOVA showed a main effect of encoding, F(1, 140) = 18.94, p < .001, ηp2 = 0.12. There was not an effect of condition, F(1, 140) = 1.43, p = .23, or an interaction between condition and encoding, F(1, 140) = 0.04, p = .84. Although we designed this experiment to minimize gist, critical lure sensitivity was greater than zero in the Sleep group after both deep, t(37) = 6.49, p < .001, d = 1.05, and shallow encoding, t(35) = 2.47, p = .02, d = 0.42, and in the Wake group after deep encoding, t(36) = 5.06, p < .001, d = 0.86, but not after shallow encoding, t(35) = 1.38, p = .18.

Fig. 2 Sensitivity (d′) to list words and critical lures across condition and encoding task when DRM list words were randomly presented at encoding in Experiment 2. Note. Error bars represent the standard error of the mean

Table 2 Experiment 2 means (and standard deviations) for the proportion of “Old” responses to the various types of test stimuli across all experimental and control groups and encoding tasks

In the omnibus ANOVA, there was also an interaction between encoding and item type, F(1, 140) = 4.34, p = .04, ηp2 = 0.03, but the interactions between condition and item type, F(1, 140) = 0.02, p = .90, and between condition and encoding, F(1, 140) = 2.11, p = .15, were not significant. Results regarding sleepiness, mood, confidence, and response bias did not inform the primary analyses (see SOM).

Again, we conducted a repeated-measures ANOVA on d′ for the control data with item type (List Words, Critical Lures) as a within-subjects factor and time (Morning, Evening) and encoding (Deep, Shallow) as between-subjects factors. Importantly, there was not a main effect of time, F(1, 151) = 0.75, p = .39 (Fig. S8), or an interaction between encoding and time, F(1, 151) = 0.60, p = .44, or between time, encoding, and item type, F(1, 151) = 3.03, p = .08 (SOM).

General discussion

This work provides the first evidence that sleep consolidates gist, as well as veridical memory, following incidental encoding. In Experiment 1, list words were presented in order of associative strength, and gist-based memory was stronger after sleep than wake following both deep and shallow encoding; sensitivity was higher to list words and critical lures. Although the effect in list words could reflect consolidation of veridical memory, the effect in critical lures suggests gist consolidation. In Experiment 2, list words were presented in a random order across blocks, and sleep strengthened veridical memory for list words after deep but not shallow encoding, and there was no evidence of gist consolidation after sleep for either encoding task.

Thus, across two experiments, we found that sleep increased gist-based memory when list words were presented in order of associative strength, and veridical memory when list words were presented randomly. The veridical memory effect is consistent with prior work showing veridical consolidation after deep but not shallow encoding (Jurewicz et al., 2016), potentially due to memory strength. It is possible that there is a strength threshold that needs to be reached for memories to be consolidated; sleep may consolidate memories above this threshold but not weaker memories. Deep encoding produces stronger memory than shallow (Craik & Lockhart, 1972; Goldman & Pellegrino, 1977) and may produce memories that reach the threshold for consolidation.

The results for gist memory are novel and somewhat more surprising. First, the increased false memory of critical lures after sleep in Experiment 1, suggestive of gist consolidation, more strongly resembles DRM studies that use recall than those that use recognition. When DRM lists are intentionally remembered, sleep consolidates gist-based memory of critical lures in free recall (Chatburn et al., 2014; Diekelmann et al., 2010; Payne et al., 2009), but when recognition is tested, studies show either lower critical lure false recognition after sleep (Fenn et al., 2009; Lo et al., 2014) or no effect of sleep (Newbury & Monaghan, 2019). One possible explanation for these disparate results is that participants may rely more on gist-based representations to generate target words during recall, whereas during recognition, they may be more likely to use veridical representations to monitor memory and to accept or reject words. We propose that incidental encoding may also affect strategies used at test and encourage individuals to rely more on gist-based representations. During incidental encoding, individuals have no reason to suspect that the information will be important in the future. As such, they may not encode source information or item-specific information. Prior research has shown that participants are more likely to use gist-based representations at test if they do not have access to source information, or if veridical memory is weak (Dodhia & Metcalfe, 1999; Johnson et al., 1993; Lindsay & Johnson, 2000). Thus, conditions at encoding, such as intentionality, may affect retrieval processes and strategies.

The gist results are also surprising because we did not find increased gist memory after sleep in Experiment 2, even though the same encoding tasks were used and critical lure sensitivity (measured by d′) was reliably greater than zero in three of the four groups. Thus, despite random presentation, some gist memory was formed that could have been consolidated by sleep, likely because participants rated some words from the same DRM lists in the same block. A question, therefore, remains as to why gist memory was consolidated in Experiment 1 and not Experiment 2. It is possible that the strength of the gist memory in Experiment 2 was simply not sufficient for consolidation, similar to the lack of an effect of sleep on veridical memory after shallow encoding. It is also possible that the results of Experiment 1 do not actually reflect gist consolidation. Critical lure sensitivity in Experiment 1 may be explained by familiarity at retrieval, since memory for list words was better after sleep than wake in both encoding groups. Given that participants were tested on three list words and one critical lure from each list, participants may have responded based on familiarity with other test items or based on their responses to similar test items (Coane & McBride, 2006; Dewhurst et al., 2011). An alternative, albeit not exclusive, explanation is that spreading activation (Anderson, 1983; Collins & Loftus, 1975) and internal generation of the critical lure at encoding (Marsh & Bower, 2004) may have contributed to critical lure consolidation in Experiment 1. The order of words in each block may have caused the critical lure to be activated at encoding and subsequently consolidated during sleep. In Experiment 2, it is less likely that the critical lure would be internally generated. Further research is necessary to distinguish between these alternatives.

The above arguments are all suggestive of an active consolidation process, such as memory replay during sleep. However, it is also possible that our results reflect a passive account. Specifically, individuals experience very little interference during sleep compared to waking; memories encoded prior to a period of sleep are less likely to be disrupted by interference than memories encoded prior to waking (Yonelinas et al., 2019). This account would also predict better memory after sleep than waking, albeit by a different mechanism. If passive protection from interference were the only mechanism underlying these results, we would have expected stronger performance after sleep for all types of items (both list words and critical lures) after both encoding tasks, which we did not find. The present experiments were not designed to distinguish between active and passive accounts, but we believe that our results more strongly suggest an active process.

An important consideration when interpreting our results with respect to the literature is our measure of sensitivity. We analyzed our data using d′; however, some (Lo et al., 2014; Pardilla-Delgado & Payne, 2017; Shaw & Monaghan, 2017), but not all (Diekelmann et al., 2008; Huan et al., 2022; Jano et al., 2021), studies on sleep and false memory use A′, an alternative signal detection measure. We chose to use d′ because it is the most widely used and best understood measure. Importantly, we replicated our primary results in both experiments using A′ (see SOM for further detail).
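For comparison with d′, the sketch below computes A′ from a hit rate and a false-alarm rate using the widely used nonparametric computational formula. The text does not state which formulation of A′ was used in the supplemental analyses, so this choice is an assumption, as is the function name.

```python
def a_prime(hit_rate: float, fa_rate: float) -> float:
    """Nonparametric sensitivity index A' (0.5 = chance, 1.0 = perfect discrimination)."""
    if hit_rate == fa_rate:
        return 0.5
    if hit_rate > fa_rate:
        diff = hit_rate - fa_rate
        return 0.5 + (diff * (1 + diff)) / (4 * hit_rate * (1 - fa_rate))
    # Below-chance performance: mirror the formula around 0.5.
    diff = fa_rate - hit_rate
    return 0.5 - (diff * (1 + diff)) / (4 * fa_rate * (1 - hit_rate))
```

For a hypothetical hit rate of .80 and false-alarm rate of .20, `a_prime(0.8, 0.2)` is 0.875. Unlike d′, A′ does not require a z-transform, so it remains defined when a hit rate is 1 or a false-alarm rate is 0.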

In conclusion, we found that sleep consolidated information that is processed, but not actively remembered. Our results suggest that what individuals remember after sleep likely depends on processes at encoding, such as the presence of shared context and strength of veridical memory prior to sleep. Thus, the present study provides new insight into the nature of sleep-dependent consolidation processes and suggests that memory about everyday life may be strengthened during sleep. Although memory advantages of sleep-dependent consolidation processes may not be ubiquitous, they act on more information than was previously demonstrated.