Retrieval practice enhances the accessibility but not the quality of memory
Numerous studies have demonstrated that retrieval from long-term memory (LTM) can enhance subsequent memory performance, a phenomenon labeled the retrieval practice effect. However, the almost exclusive reliance on categorical stimuli in this literature leaves open a basic question about the nature of this improvement in memory performance. It has not yet been determined whether retrieval practice improves the probability of successful memory retrieval or the quality of the retrieved representation. To answer this question, we conducted three experiments using a mixture modeling approach (Zhang & Luck, 2008) that provides a measure of both the probability of recall and the quality of the recalled memories. Subjects attempted to memorize the color of 400 unique shapes. After every 10 images were presented, subjects either recalled the last 10 colors (the retrieval practice condition) by clicking on a color wheel with each shape as a retrieval cue or they participated in a control condition that involved no further presentations (Experiment 1) or restudy of the 10 shape/color associations (Experiments 2 and 3). Performance in a subsequent delayed recall test revealed a robust retrieval practice effect. Subjects recalled a significantly higher proportion of items that they had previously retrieved relative to items that were untested or that they had restudied. Interestingly, retrieval practice did not elicit any improvement in the precision of the retrieved memories. The same empirical pattern also was observed following delays of greater than 24 hours. Thus, retrieval practice increases the probability of successful memory retrieval but does not improve memory quality.
Keywords: Cued recall; Memory; Mnemonic precision; Testing effect
Numerous studies have demonstrated that retrieval from long-term memory (LTM) can enhance subsequent memory performance, a phenomenon labeled the retrieval practice effect (Carrier & Pashler, 1992). The benefits of retrieval practice have been observed with a wide variety of memoranda (Roediger & Karpicke, 2006), including word pairs (Pyc & Rawson, 2009), pictures (Wheeler & Roediger, 1992), and spatial positions (Carpenter & Pashler, 2007; Rohrer, Taylor, & Sholar, 2010; Carpenter & Kelly, 2012).
Varying explanations have been offered for how retrieval practice enhances memory performance. Some have focused on increased elaborative retrieval during testing (Carpenter, 2009), whereas others have emphasized the narrowing of the retrieval search space via helpful contextual associations (Lehman, Smith, and Karpicke, 2014). One common assumption of these accounts is that retrieval practice enhances the probability of access to a memory rather than the quality of the memory. This focus on accessibility over fidelity may be attributable in part to the fact that past studies have typically used discrete word or picture stimuli (and all-or-none measures of accuracy) that do not allow clear measurements of memory fidelity. That said, some past findings may be consistent with a putative effect of retrieval practice on memory quality. For example, Chan and McDermott (2007) found that retrieval practice improved participants’ ability to avoid semantically similar lures during a recognition test and improved source memory. Likewise, Szpunar, McDermott, and Roediger (2008) found that testing improves list discrimination. However, while each of these findings could reflect a more precise memory (e.g., of specific semantic content, or of the temporal context associated with an item), the binary nature of the responses in these studies also allows for an interpretation based on retrieval probability.
An approach that may provide more traction for understanding the effect of retrieval practice on the quality of item-specific memory is to allow participants to report remembered information along a continuous response space. For example, Carpenter and Kelly (2012) used a continuous response space in a task where subjects recalled the precise positions of different objects. Retrieval practice resulted in a decrease in the average response error for retrieved locations relative to restudied locations. However, although a change in memory quality provides an intuitive explanation of these findings, a reduced guessing rate in the retrieval practice condition also would yield lower average response errors. Thus, the goal of the present work was to examine the retrieval practice effect using an analytic approach that can estimate both the probability of retrieval and the quality of the retrieved representations.
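The ambiguity described above can be illustrated with a small simulation. The sketch below is not the authors' analysis; it assumes that memory-guided errors are roughly normal around the target and that guesses are uniform on the color wheel, and the parameter values (40% versus 55% retrieval, a fixed SD of 22°) are hypothetical.

```python
import random


def simulate_mean_abs_error(p_mem, sd_deg, n_trials, rng):
    """Mean absolute response error (degrees) for a mix of remembered and guessed items."""
    total = 0.0
    for _ in range(n_trials):
        if rng.random() < p_mem:
            # Memory-guided response: error centered on the target, wrapped to the circle.
            err = rng.gauss(0.0, sd_deg)
            err = (err + 180.0) % 360.0 - 180.0
        else:
            # Guess: every position on the wheel is equally likely.
            err = rng.uniform(-180.0, 180.0)
        total += abs(err)
    return total / n_trials


rng = random.Random(0)
# Hypothetical conditions: identical precision, different probability of retrieval.
low_access = simulate_mean_abs_error(p_mem=0.40, sd_deg=22.0, n_trials=20000, rng=rng)
high_access = simulate_mean_abs_error(p_mem=0.55, sd_deg=22.0, n_trials=20000, rng=rng)
print(f"mean |error| at 40% retrieval: {low_access:.1f} deg")
print(f"mean |error| at 55% retrieval: {high_access:.1f} deg")
```

Under these assumptions the expected mean absolute error falls from roughly 61° to roughly 50° even though precision never changes, which is why average error alone cannot separate accessibility from fidelity.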
We measured performance in a shape/color recall task in which the possible colors were drawn from a continuous 360-degree space, and we used a mixture-modeling approach (Zhang & Luck, 2008) that provided separate measures of the probability of recall and the quality of the retrieved memories. This analytic approach has been widely applied to the field of working memory (see Luck & Vogel, 2013 for review), and has recently been applied to the study of LTM (Brady et al., 2013). To anticipate our conclusions, retrieval practice elicited robust improvements in the probability of memory access, but absolutely no improvement in the fidelity of the retrieved memories.
Experiment 1: Test versus no test
Twenty-two undergraduates at the University of Oregon completed the experiment for course credit. All participants gave informed consent according to procedures approved by the University of Oregon institutional review board.
Stimuli were generated in MATLAB using the Psychophysics Toolbox extension (Brainard, 1997; Pelli, 1997) and were presented on a 17-in. flat CRT computer screen (60-Hz refresh rate). The viewing distance was ~80 cm. Stimuli subtended 9.2° × 9.2° of visual angle.
Four hundred nameable pictures (e.g., animals, plants, shapes, countries, U.S. states, and symbols) were obtained via a web search for royalty-free clip art. One of 360 continuous colors was assigned to each image, with different color/shape sets for each subject.
Task and procedure
After viewing all 200 images with retrieval practice for half of the items in the run (~20-30 minutes), subjects were asked to recall the color of each image by clicking on a color wheel that represented all of the presented colors. Images were tested in a random order relative to their initial presentation. Participants received feedback consisting of the presentation of the shape filled with the correct color and a number denoting the magnitude of the error.
During recall, a white shape cue was displayed for 1 second before the cursor and color wheel appeared (Fig. 1B). During response selection, the color of the shape cue shifted continuously to match the hue that was indicated by the mouse cursor on the color wheel. Participants indicated their color choice by clicking the mouse. Responses were unspeeded and accuracy was given highest priority; subjects were instructed to choose a response even if they felt they were guessing. When they thought they were guessing, they were instructed to click with the right mouse button rather than the left. The color wheel was randomly rotated across trials (so that position information was irrelevant to the color response). Following completion of the first run of 200 images, the remaining 200 images were presented and tested using the same procedure (i.e., a learning period and delayed-retrieval period) with 200 new images. One image was presented twice during the learning period of run one and was dropped from the delayed analyses.
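Because the color space is circular and the wheel is rotated on every trial, response error has to be computed as an angular difference after the rotation is undone. The sketch below shows one way to do this; the function names and the convention that the rotation is a simple angular offset are assumptions for illustration, not details taken from the experiment code.

```python
def clicked_hue(click_angle_deg, wheel_rotation_deg):
    """Convert the screen angle of a click into a hue, undoing the trial's wheel rotation."""
    return (click_angle_deg - wheel_rotation_deg) % 360.0


def circular_error(response_deg, target_deg):
    """Signed difference between response and target hue, in the range [-180, 180)."""
    return (response_deg - target_deg + 180.0) % 360.0 - 180.0


# Hypothetical trial: the wheel is rotated by 90 degrees and the subject clicks at 30 degrees.
hue = clicked_hue(30.0, 90.0)        # 300.0
error = circular_error(hue, 310.0)   # -10.0
print(hue, error)
```

Wrapping the difference keeps a response of 350° to a target of 10° at an error of −20° rather than 340°.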
These parameters are calculated using the distribution of all responses, which is a mixture of responses not guided by memory (guesses) and responses guided by memory. Thus, we can determine the proportion of remembered items and the precision of responses guided by memory, but it is not possible to determine if any individual response was guided by memory.
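A minimal sketch of this kind of mixture model is given below, assuming the standard form in which memory-guided errors follow a von Mises distribution centered on the target and guesses are uniform on the circle. It uses a maximum-likelihood grid search on simulated data, which is a stand-in technique chosen for brevity and not the fitting procedure used for the reported estimates. The function names, grids, and true parameter values are hypothetical.

```python
import math
import random


def bessel_i0(x):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    term, total, k = 1.0, 1.0, 1
    while term > 1e-12 * total:
        term *= (x / (2.0 * k)) ** 2
        total += term
        k += 1
    return total


def fit_mixture(errors_deg, p_grid, kappa_grid):
    """Grid-search maximum-likelihood fit of P(memory) and von Mises concentration."""
    cosines = [math.cos(math.radians(e)) for e in errors_deg]
    uniform = 1.0 / (2.0 * math.pi)          # density of a random guess
    best = (-math.inf, None, None)
    for kappa in kappa_grid:
        norm = 2.0 * math.pi * bessel_i0(kappa)
        von_mises = [math.exp(kappa * c) / norm for c in cosines]
        for p_mem in p_grid:
            loglik = sum(math.log(p_mem * v + (1.0 - p_mem) * uniform) for v in von_mises)
            if loglik > best[0]:
                best = (loglik, p_mem, kappa)
    return best[1], best[2]


# Simulated data with hypothetical true values (kappa = 7 is an SD of about 22 degrees).
rng = random.Random(1)
true_p_mem, true_kappa = 0.6, 7.0
errors = []
for _ in range(1000):
    if rng.random() < true_p_mem:
        errors.append(math.degrees(rng.vonmisesvariate(0.0, true_kappa)))
    else:
        errors.append(rng.uniform(0.0, 360.0))

p_grid = [i / 50 for i in range(1, 50)]          # 0.02 ... 0.98
kappa_grid = [0.5 * k for k in range(1, 41)]     # 0.5 ... 20.0
p_hat, kappa_hat = fit_mixture(errors, p_grid, kappa_grid)
print(f"estimated P(memory) = {p_hat:.2f}, estimated kappa = {kappa_hat:.1f}")
```

The fit recovers the two parameters from the pooled error distribution only; no trial is ever labeled as a guess or a memory-guided response, which mirrors the limitation noted above. Larger concentration values correspond to smaller SD, that is, to higher precision.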
All participants’ responses were combined into an aggregate error histogram (Fig. 2A) and fit using the “memfit” function of MemToolbox (Suchow et al., 2013) to obtain parameter estimates and 95% credibility intervals (CrI); that is, there is a 95% chance that the true value of the parameter for the sample lies within the credibility interval. We will refer to parameters with overlapping credibility intervals as “not significantly different” and parameters with nonoverlapping credibility intervals as “significantly different.” Unlike confidence intervals, Bayesian credibility intervals are not necessarily symmetrical.
The mixture modeling analysis revealed that 70.7% (CrI: −1.7%, +2.0%) of the items were recalled during the initial test. SD, our operational definition of mnemonic precision, was 21.4° (CrI: −0.8°, +1.1°). At delayed test, subjects recalled significantly more items that they had previously retrieved (53.8%, CrI: −1.9%, +2.3%) than items that were previously untested (37.9%, CrI: −2.2%, +2.8%; Fig. 2). Mnemonic precision was not significantly different between tested (22.9°, CrI: −1.0°, +1.5°) and untested (24.2°, CrI: −1.6°, +2.6°) items.
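The overlap rule for credibility intervals can be applied directly to the delayed-test values reported above. The sketch below is only a worked illustration of that decision rule; the helper names are hypothetical.

```python
def interval(estimate, minus, plus):
    """Turn an estimate with asymmetric offsets into (lower, upper) bounds."""
    return (estimate - minus, estimate + plus)


def overlaps(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Delayed-test estimates from Experiment 1 (probability of recall in %, SD in degrees).
p_tested = interval(53.8, 1.9, 2.3)      # about 51.9 to 56.1
p_untested = interval(37.9, 2.2, 2.8)    # about 35.7 to 40.7
sd_tested = interval(22.9, 1.0, 1.5)     # about 21.9 to 24.4
sd_untested = interval(24.2, 1.6, 2.6)   # about 22.6 to 26.8

print("probability of recall differs:", not overlaps(p_tested, p_untested))
print("precision differs:", not overlaps(sd_tested, sd_untested))
```

The recall-probability intervals do not overlap, whereas the precision intervals do, matching the pattern described in the text.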
Individual parameter comparisons (Delayed Test)
Analysis of the subset of subjects who successfully retrieved 40% or more items in both conditions (n = 12) also showed higher Pmem for retrieved items (M = 69.2%, SD = 13.4%) compared with untested items (M = 53.2%, SD = 9.9%, t(11) = −6.03, p < 0.001). Also in line with the aggregate data, subjects did not exhibit superior mnemonic precision for items that they had previously retrieved (M = 24.0°, SD = 5.3°) compared with items that were not retrieved (M = 23.7°, SD = 5.7°, t(11) = −0.22, p = 0.83).
Experiment 1 suggests that retrieval practice increases the probability that an item can be retrieved in the future but does not improve the precision of that memory. In Experiment 2, we equated the number of times that participants saw and responded to each item by comparing the retrieval practice condition with a restudy condition (Carrier & Pashler, 1992).
Twenty-eight students from the University of Oregon participated in Experiment 2 for course credit or monetary compensation. Six participants were excluded: two did not complete the session, one was excluded during the session for not following instructions, and three participants who completed the session were excluded for responding randomly on restudy trials. Twenty-two participants were included in the analysis of Experiment 2. Six subjects who did not complete all trials in the time allotted were included in the experiment, because they had completed the session and followed instructions.
Similar to Experiment 1, we relied on an aggregate fit to assess the mnemonic precision for all subjects and then looked at individual fits for subjects who retrieved at least 40% of the items (Pmem > 40%). Additional simulations with fewer trials revealed that this also was an appropriate cutoff for subjects who did not complete all trials (Figure S1).
Seventy-four percent (CrI: −2.0%, +1.5%) of the items were recalled during the initial test, and as expected, participants correctly selected responses for more than 99% (CrI: −0.4%, +0.3%) of the items during the restudy task when the stimuli were physically present to guide responses. Not surprisingly, precision was substantially higher for the restudy (SD = 7.2°, CrI: −0.2°, +0.2°) than for the memory task (18.6°, CrI: −0.8°, +0.6°).
Individual parameter comparisons (Delayed Test)
Analysis of the subset of subjects who remembered at least 40% of items (n = 17) in both conditions revealed that subjects recalled a significantly higher proportion of items they had previously retrieved (M = 69.1%, SD = 14.8%) relative to items that were previously restudied (M = 63.4%, SD = 14.4%, t(16) = −3.06, p = 0.008). As in the aggregate data, subjects exhibited superior mnemonic precision for items that they had previously restudied (M = 22.5°, SD = 8.2°) relative to retrieved items (M = 24.3°, SD = 9.2°, t(16) = −2.60, p = 0.02).
As in Experiment 1, retrieval practice improved the probability of successful delayed recall but not mnemonic precision. Thus, the benefits of retrieval practice on probability of retrieval were robust when the control condition allowed extra time to restudy the memoranda. In Experiment 3, we tested whether a similar empirical pattern would emerge when we equated the amount of exposure time between retrieval and restudy and whether the same pattern would emerge following a >24-hour retention interval.
Twenty-three students from the University of Oregon participated in Experiment 3 for course credit. Two participants who did not complete all trials in the time allotted were included in the experiment. All participants gave informed consent according to procedures approved by the University of Oregon institutional review board.
The task in Experiment 3 was the same as Experiment 2 except for two differences. First, to equate total presentation time with that in the testing condition, restudied items were displayed for 1 s before subjects could respond (Fig. 4B). Second, to determine if the same pattern of results would emerge over a longer delay, the two runs of the task were completed on separate days, 1-4 days apart. This allowed for a >24-hr delayed retrieval of the items learned from the first run before subjects completed the second run of the experiment on day 2. Twenty subjects completed the surprise second retrieval period (3 subjects arrived late for the session and skipped the >24-hr retrieval to ensure a prompt finish).
Analyses were identical to Experiments 1 and 2. Only the aggregate analysis was applied to the >24-hr delayed test, because subjects were only tested on 100 items in each condition and probability of retrieval was low.
Sixty-three percent of the items (CrI: −1.8%, +2.3%) were recalled during the initial test, and as expected, participants correctly selected responses for 99.7% (CrI: −0.2%, +0.1%) of the items during the restudy task. Also as expected, precision was significantly higher for restudy (SD = 7.7°, CrI: −0.2°, +0.2°) than for retrieval (21.3°, CrI: −0.9°, +1.1°).
The pattern of results observed during the test after more than 24 hr was similar to the pattern of results for the first delayed test. Subjects recalled a significantly higher proportion of items that they had previously retrieved (34.4%, CrI: −3.2%, +5.0%) than items that they had restudied (26.1%, CrI: −3.1%, +3.5%; Fig. 6). Estimates of mnemonic precision were not significantly different for retrieved (25.7°, CrI: −3.2°, +5.5°) and restudied (20.4°, CrI: −3.0°, +3.5°) items.
Individual parameter comparisons (Delayed Test)
Analysis of the subset of subjects who successfully retrieved at least 40% of items (n = 12) revealed that subjects recalled a significantly higher proportion of previously retrieved items (M = 66.8%, SD = 11.4%) relative to previously restudied items (M = 59.5%, SD = 10.4%, t(11) = −2.35, p = 0.039). In contrast to the findings from Experiment 2, subjects exhibited similar precision for items that they had previously restudied (M = 25.5°, SD = 7.3°) relative to previously retrieved items (M = 26.6°, SD = 5.0°, t(11) = 0.62, p = 0.54).
The findings from Experiment 3 are in line with the findings from Experiment 1 and Experiment 2. Retrieval practice improves the probability of successful delayed recall but does not improve mnemonic precision. Thus, the benefits of retrieval practice on recall probability were robust when the control condition allowed extra time to restudy the memoranda and when delayed recall did not take place for more than 24 hours.
In three experiments, we demonstrated that retrieval practice improves probability of retrieval but not mnemonic precision. Furthermore, in Experiments 2 and 3 subjects provided a response to restudied items by selecting the color they were viewing on the color wheel. Thus, we were able to replicate a critical finding of Carpenter and Kelly (2012) that testing effects are still observed when subjects are required to make a response to restudied items. This line of results supports the idea that the benefits of retrieval practice are due to the act of retrieving information from long-term memory and not simply to subjects making a response for tested material but not for restudied material.
Ruling out a verbal code
Extant models of the retrieval practice effect have asserted that testing enhances the accessibility of learned associations rather than the fidelity of the retrieved memories (Carpenter, 2009; Lehman, Smith, and Karpicke, 2014). The evidence for this assertion has been inconclusive, however, because of a heavy reliance on discrete word or picture stimuli that preclude a clear measure of item-specific mnemonic precision. We measured performance in a test that required recall of colors from a continuous 360-degree space, and we used an analytic approach that enables distinct estimates of the probability of successful retrieval and the precision of the retrieved representations. The results were clear at both the aggregate and individual-subject levels: retrieval practice selectively enhances the probability of recall without improving mnemonic precision. Thus, even though both accessibility and fidelity can determine memory performance, the selective effect of retrieval practice on the former highlights the utility of distinguishing these aspects of memory function.
This work was funded in full by National Institutes of Health RO1-MH087214 to Edward Awh. The authors thank Anubhav Gupta and Dylan Sietz for help with data collection. Both authors conceived and designed the experiments and contributed to writing the manuscript. D.W.S. collected and analyzed the data.
- Lehman, M., Smith, M. A., & Karpicke, J. D. (2014). Toward an episodic context account of retrieval-based learning: Dissociating retrieval practice and elaboration. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(4), 1–8.