Imagine you meet with friends and end up reminiscing about old times and events from your shared past. Does listening to somebody else recall specific details of a previously experienced episode affect your own memory of the event? Several studies on memory retrieval in a social context indicate that this is indeed the case, and that being exposed to another person’s recall can change the listener’s memory. Critically, the simple act of listening to somebody else retell the past may not just re-emphasize the recounted events in memory, but induce forgetting of related, but unmentioned details (Cuc et al., 2007). Such socially mediated forgetting effects can be relevant in conversations with friends, but potentially also in other applied scenarios in daily life, like when listening to news anchors, politicians, teachers, or other types of speakers (for a review, see Hirst & Echterhoff, 2012).

The focus of the present study is not on such applied aspects of socially mediated forgetting but on two more basic issues that have largely been ignored in prior work. The first issue is the question of which cognitive mechanisms underlie this form of forgetting. While there is considerable knowledge about the cognitive mechanisms contributing to this type of forgetting in individuals, not much is yet known about the mechanisms driving socially mediated forgetting. The second issue is the possible role of testing format, that is, whether the forgetting is restricted to recall tests, which is the testing format employed in prior work, or whether it generalizes to item recognition tests. Because many types of episodic forgetting arise primarily on recall tests and much less on tests of item recognition (e.g., Baddeley et al., 2015), it is important to examine whether socially mediated forgetting is present in item recognition as well. In fact, the role of testing format is also relevant for the question of which cognitive mechanisms mediate the forgetting. Before addressing these issues in more detail, we will first introduce the experimental task and report some of the main findings on this form of forgetting.

The task typically used to examine socially mediated forgetting is borrowed from research on individual remembering and called the retrieval-practice task (Anderson et al., 1994). In this task, participants are often asked to study category-exemplar pairs with several exemplars from several different categories (e.g., flower-daffodil, flower-tulip, instrument-tuba, instrument-horn). During a subsequent practice phase, subjects receive category-plus-wordstem cues for some of the exemplars from some of the categories (e.g., flower-tu—?) and are asked to recall the appropriate items. Such selective retrieval creates three item types: practiced items (called Rp+ items; here flower-tulip), unpracticed items from the same categories (called Rp- items; here flower-daffodil), and unpracticed control items from categories that were not practiced at all (called Nrp items; here instrument-tuba, instrument-horn). On the final test, on which subjects are asked to recall all initially studied items, recall of the practiced Rp+ items is typically enhanced and recall of the unpracticed Rp- items typically impaired, relative to recall of the Nrp control items. The two effects are referred to as retrieval-induced enhancement and retrieval-induced forgetting (RIF) in the following (for recent reviews, see Bäuml & Kliegl, 2017; Storm & Levy, 2012).
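The three item types follow mechanically from which items are practiced. As a minimal sketch (the function name and data layout are illustrative assumptions, not part of the original materials), the classification can be written as:

```python
def classify_items(study_list, practiced):
    """Label each studied (category, exemplar) pair as Rp+, Rp-, or Nrp.

    study_list: list of (category, exemplar) pairs shown at study.
    practiced:  pairs that receive retrieval practice.
    """
    practiced = set(practiced)
    # A category counts as practiced as soon as one of its exemplars is practiced.
    practiced_categories = {category for category, _ in practiced}
    labels = {}
    for pair in study_list:
        category, _ = pair
        if pair in practiced:
            labels[pair] = "Rp+"   # practiced item
        elif category in practiced_categories:
            labels[pair] = "Rp-"   # unpracticed item from a practiced category
        else:
            labels[pair] = "Nrp"   # item from an unpracticed control category
    return labels


# Example pairs taken from the task description above.
study_list = [
    ("flower", "daffodil"),
    ("flower", "tulip"),
    ("instrument", "tuba"),
    ("instrument", "horn"),
]
labels = classify_items(study_list, practiced=[("flower", "tulip")])
```

With these inputs, flower-tulip is labeled Rp+, flower-daffodil Rp-, and both instrument items Nrp.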

RIF has mostly been studied in single individuals who engage in retrieval practice by themselves (which results in within-individual RIF, or WI-RIF), but the task has recently been adapted to also capture social situations (Cuc et al., 2007; for a review, see Hirst & Coman, 2018). In the adapted version of the task, two subjects are tested together. Initial study and final test phase are still completed individually, but the intermediate retrieval-practice phase is completed together. Here, one subject is asked to act as speaker, i.e., to engage in overt retrieval practice and recall the to-be-retrieved items aloud. The other subject is asked to act as listener and monitor the speaker’s performance for accuracy. Several studies have applied this adapted version of the task and reported forgetting of unpracticed Rp- items relative to Nrp items not only in speakers, but also in listeners; not only for semantically categorized word lists, but also for coherent stories and even autobiographical memories (e.g., Barber & Mather, 2012; Cuc et al., 2007; Stone et al., 2010, 2013; see also Abel & Bäuml, 2015). The speaker’s forgetting is consistent with regular WI-RIF observed in the standard retrieval-practice task. The listener’s forgetting is termed socially shared RIF (SS-RIF) and suggests that the consequences of a speaker’s selective retrieval can be socially transmitted to listeners (Cuc et al., 2007).

In the literature on SS-RIF, it has generally been proposed that WI-RIF and SS-RIF may arise on the basis of the same cognitive mechanism(s). This view rests on the finding that SS-RIF emerges only when listeners are motivated to engage in concurrent (covert) retrieval along with speakers. Consistently, when listeners are not asked to monitor a speaker’s retrieval for accuracy, but for smoothness and fluidity of recall instead, no SS-RIF emerges (Cuc et al., 2007; for related findings, see Coman & Hirst, 2015; Koppel et al., 2014). In the literature on WI-RIF, there are different views on exactly which cognitive mechanisms may cause the forgetting (e.g., Anderson, 2003; Jonker et al., 2013; Raaijmakers & Jakab, 2012). While none of these accounts seems to be able to explain the whole range of RIF findings (see Bäuml & Kliegl, 2017), a two-factor account that assumes a role of inhibition and blocking may be able to explain most of the findings.

The two-factor account assumes that inhibition is recruited during retrieval practice to reduce the interference that arises due to coactivation of the not-to-be-retrieved (Rp-) items. This inhibition is supposed to impair the item representation of these items and reduce retrieval of the items over a wide range of memory tasks (e.g., Anderson, 2003). The account also assumes that blocking can operate at test. Retrieval practice strengthens the category-exemplar associations of the Rp+ items and this strengthening may lead to blocking of the (weaker) Rp- items (e.g., Raaijmakers & Jakab, 2012; Verde, 2013). Critically, the contribution of the two mechanisms is assumed to vary with testing format: Whereas inhibition is supposed to impair retrieval of Rp- items in more or less all testing formats, blocking is supposed to impair retrieval of Rp- items mainly in test formats in which no item-specific cues are provided, like in free recall or category-cued recall. In contrast, when item-specific cues are provided—as, for instance, occurs when items’ unique initial letters are provided as retrieval cues or in item recognition—the role of blocking should be reduced, if not eliminated, with inhibition mainly contributing to the forgetting. Results from several recent studies support this two-factor account (Anderson & Levy, 2007; Rupprecht & Bäuml, 2016, 2017; Schilling et al., 2014).

Following the two-factor account, it seems likely that blocking does not only contribute to WI-RIF in speakers, but also to SS-RIF in listeners. Prior studies examining RIF in a social context have consistently shown that retrieval practice strengthens Rp+ relative to Nrp items not only in speakers, but also in listeners. Even though some prior studies reported comparable, and others reported different degrees of retrieval-induced enhancement in speakers and listeners (e.g., Cuc et al., 2007; Koppel et al., 2014; Stone et al., 2010), the Rp+ items seem to benefit from strengthening in listeners as well, suggesting that blocking can play a role for SS-RIF in listeners. It is less clear, however, if inhibition contributes to SS-RIF in listeners as well.

There are at least two reasons why inhibition might not be critically involved in SS-RIF in listeners. The first reason is that, in contrast to speakers, listeners during the practice phase are not only exposed to the provided retrieval cues (e.g., flower-tu—?), but they are additionally asked to monitor the speaker’s retrieval practice, e.g., for accuracy, and to provide ratings on corresponding Likert scales for every single item (see Cuc et al., 2007). Effectively, this may create a dual-task situation for listeners. Because prior work on WI-RIF has shown that a secondary task during retrieval practice can decrease cognitive capacities and result in a reduction in inhibition and, consequently, in RIF (Ortega et al., 2012; Román et al., 2009), arguably, inhibition might still contribute to WI-RIF in speakers, but not, or to a lesser degree, to SS-RIF in listeners. The second reason is that listeners are assumed to engage in concurrent retrieval along with speakers, but possibly such covert retrieval might boil down to a rather passive item recognition task. Instead of trying to retrieve the correct answer themselves, listeners may simply wait for the speaker’s response and then try to recognize whether the speaker’s response corresponds to an “old” item or not. Because strength-based interference is reduced on recognition tests (e.g., Ratcliff et al., 1990; Rupprecht & Bäuml, 2016) and because inhibitory processes are assumed to be recruited only when interference is detected (e.g., Anderson, 2003), the involvement of inhibition might be reduced, if not eliminated, in listeners who engage in a monitoring task like this.

If SS-RIF in listeners depends primarily on blocking, then the forgetting should vary with final test format. Following the two-factor account of RIF, forgetting due to blocking should arise mainly on final tests that do not include item-specific cues (like category-cued recall tests), and should be reduced, if not eliminated, on final tests that provide strong item-specific cues (like item recognition). Yet, if SS-RIF in listeners relies on blocking and inhibition, there should be not much difference between speakers and listeners in terms of RIF, and both speakers and listeners should show RIF regardless of final test format. Studies that compared WI-RIF and SS-RIF mostly relied on final category-cued recall tests in the absence of any item-specific retrieval cues (e.g., Barber & Mather, 2012; Cuc et al., 2007; Stone et al., 2013; see Footnote 1), so that, so far, it is unclear whether SS-RIF generalizes to, say, item recognition tests. Similarly, it is unclear whether RIF in speakers and listeners is caused by the same cognitive mechanisms or whether the critical contribution of inhibition is indeed reduced, or eliminated, in listeners.

The goal of the present study was to address the two issues. Four experiments were conducted. All experiments used the social version of the retrieval-practice task as introduced by Cuc et al. (2007), and pairs of subjects completed the task together, acting as speakers and listeners during the retrieval-practice phase. The four experiments differed in which type of final test was applied and whether item-specific cues were present or absent at test. In Experiment 1, a category-cued recall test without any item-specific retrieval cues was applied to replicate prior findings from the literature on SS-RIF. In Experiment 2, items’ initial letters were provided as additional retrieval cues, which offers item-specific information as well as the opportunity to control output order at test, i.e., to exclude the possibility that the prior recall of (stronger) Rp+ items at test induces RIF for (weaker) Rp- items (e.g., Smith, 1971; Tulving & Arbuckle, 1963). Finally, Experiments 3 and 4 applied item recognition tests. Experiment 3 used old/new recognition testing, whereas Experiment 4 applied a recognition test with confidence ratings that allowed construction of Receiver Operating Characteristic (ROC) curves. Both recognition tests comprise strong item-specific cues and allow for a control of output order at test.

The results of the four experiments will show for the first time if SS-RIF in listeners is limited to category-cued-recall tests or is present irrespective of final test format. If RIF is found to be present across all final test formats not only for speakers, but also for listeners, this would indicate that similar cognitive mechanisms cause RIF in speakers and listeners, and blocking and inhibition contribute to RIF in both participant groups. In contrast, if RIF is present across all final test formats only in speakers, but is reduced or even eliminated in listeners when strong item-specific cues are included, this would indicate that inhibition plays a reduced role in listeners, suggesting that WI-RIF and SS-RIF may rely on (partially) different cognitive mechanisms.

Experiment 1

The goal of Experiment 1 was to first replicate previous studies on WI-RIF and SS-RIF that mostly applied category-cued recall tests without any item-specific cues (e.g., Cuc et al., 2007; Stone et al., 2013). Subjects completed the experiment in pairs, studying and practicing category-exemplar word pairs (e.g., flower-daffodil, instrument-tuba). At test, category labels were provided as retrieval cues and subjects were asked to recall all previously studied items belonging to the category labels. On the basis of the prior work on regular WI-RIF on the one hand and SS-RIF on the other hand, we expected to observe RIF effects in both speakers and listeners.

Method

Participants

Sample sizes in Experiments 1 and 2 were determined on the basis of prior recall experiments on retrieval-induced forgetting from our lab (e.g., Rupprecht & Bäuml, 2016). Eighty subjects participated in pairs. Mean age was 20.55 years (SD = 2.27); 17 subjects were male, 63 female.

Material

Item material was created by selecting six exemplars from each of eight semantic categories, taken from published category norms (Mannhaupt, 1983; Scheithe & Bäuml, 1995; Van Overschelde et al., 2004). Within categories, all items had unique initial letters. Across subjects, each category served equally often as to-be-practiced and as control category. Within categories, however, we kept constant which items were used as Rp+ and as Rp- items. Following Anderson et al. (1994), items with a lower frequency in the published category norms were used as Rp+ items, items with a higher frequency as Rp- items.

Design

The experiment had a 2 × 3 mixed-factorial design. The factor role during retrieval practice (speaker, listener) was manipulated between-subjects. During the retrieval-practice phase, one subject in each pair was asked to act as speaker and to practice some of the initially studied items by trying to recall them out loud. The other subject was asked to act as listener and to monitor the speaker’s overt practice for accuracy. The factor item type (Rp+, Rp-, Nrp) was manipulated within-subjects. During retrieval practice, only some items from some of the categories were presented for practice, thus creating the three different item types: practiced Rp+ items, related, but unpracticed Rp- items, and unrelated and unpracticed Nrp items.

Procedure

Study phase

Subjects in each pair were seated in front of the same computer. During study, items were presented one at a time, together with their semantic category, and for 5 s each centrally on a computer screen. Subjects were asked to try to memorize them for a later test. Sequence of items was controlled via blocked randomization. Item material was organized into six lists, each containing one item from each semantic category. During study, the sequence of lists as well as the sequence of items within lists were set to random, so that a maximum of two items from the same category could be presented back to back.
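The blocked randomization described here can be sketched as follows. This is an illustration under stated assumptions, not the authors' software: the function names are hypothetical, the category and item names are placeholders, and the assignment of exemplars to lists (the i-th exemplar of every category goes to list i) is an arbitrary choice for the sketch. Because each list contains every category exactly once, items from the same category can only be adjacent at a boundary between two lists, which is why at most two can appear back to back.

```python
import random


def blocked_random_sequence(categories, rng):
    """Build a study sequence via blocked randomization.

    categories: dict mapping category name -> list of exemplars (equal lengths).
    rng:        a random.Random instance, so sequences are reproducible.
    """
    n_blocks = len(next(iter(categories.values())))
    # Each block (list) holds exactly one exemplar from every category.
    # Assumption for this sketch: the i-th exemplar goes into block i.
    blocks = [
        [(category, items[i]) for category, items in categories.items()]
        for i in range(n_blocks)
    ]
    rng.shuffle(blocks)              # random sequence of lists
    sequence = []
    for block in blocks:
        rng.shuffle(block)           # random sequence of items within a list
        sequence.extend(block)
    return sequence


def longest_category_run(sequence):
    """Length of the longest run of consecutive items from one category."""
    longest = run = 0
    previous = None
    for category, _ in sequence:
        run = run + 1 if category == previous else 1
        longest = max(longest, run)
        previous = category
    return longest


# Placeholder materials: 8 categories with 6 exemplars each, as in Experiment 1.
categories = {
    f"category{c}": [f"item{c}_{i}" for i in range(6)] for c in range(8)
}
sequence = blocked_random_sequence(categories, random.Random(1))
```

For any seed, `longest_category_run(sequence)` is at most 2, matching the constraint stated above.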

Retrieval-practice phase

During retrieval practice, subjects in both experiments were asked to practice half of the items from half of the categories, on two consecutive practice cycles. For each to-be-practiced item, the item’s initial letters were presented below the corresponding category name. Items were again presented via blocked randomization and were shown one at a time and for 8 s each. When all to-be-practiced items had been presented once, the second retrieval-practice cycle began. Importantly, one of the two subjects was asked to act as speaker, and to practice some of the studied items out loud, by trying to complete the retrieval cues presented on the screen. The other subject was asked to act as listener. In particular, the listener’s task was to monitor the speaker’s practice, and to judge for each item how accurate the speaker’s response was (for details, see Cuc et al., 2007).

Test phase

Subjects completed the final test alone and worked at separate computers. The test was a category-cued recall test, i.e., subjects received category cues only for recall. Category names were presented one at a time, in random sequence, and for 48 s each. Subjects received a small test booklet, with one page for recall of each of the eight semantic categories. On each page, subjects were asked to write down the category name and to then try to list as many of the studied category exemplars as they could remember.

Results

Success rates during retrieval-practice cycles

On average, speakers were able to correctly recall 90.22% (SD = 10.74) of Rp+ items on the first retrieval-practice cycle, and even improved to 92.34% (SD = 9.75) on the second retrieval-practice cycle, t(39) = 3.20, p = .003, d = 0.51.

Final test performance

Figure 1 shows mean recall performance on the final test, separately for speakers and listeners.

Fig. 1. Mean recall in Experiment 1, plotted separately for speakers and listeners. Panel a contrasts Rp- and Nrp items, whereas panel b contrasts Rp+ and Nrp items. Error bars represent ± 1 standard error

Retrieval-induced forgetting

A 2x2 ANOVA with the factors of item type (Rp-, Nrp) and role during practice (speaker, listener) showed a significant main effect of item type, F(1,78) = 14.66, MSE = 0.03, p < .001, \({\eta _{p}^{2}} = .16\), but no significant main effect of role during practice, F(1,78) < 1.0, p = .418, \({\eta _{p}^{2}} = .008\), and also no significant interaction between the two factors, F(1,78) < 1.0, p = .949, \({\eta _{p}^{2}} < .001\). Recall did not differ between speakers and listeners (51.70% vs. 54.64%), but was higher for Nrp than Rp- items (58.76% vs. 47.58%). Follow-up t tests confirmed that this retrieval-induced forgetting was significant for both speakers (57.38% vs. 46.01%), t(39) = 2.58, p = .014, d = 0.41, and listeners (60.14% vs. 49.14%), t(39) = 2.87, p = .007, d = 0.45.
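The reported effect sizes can be cross-checked against the test statistics. The following is a reader's check rather than a description of the authors' computation: it uses the standard identity for partial eta squared, and it assumes that d for the within-subject follow-up tests corresponds to t divided by the square root of the number of subjects per group (n = 40), an assumption that reproduces the reported values for the main effect of item type and for the speakers' follow-up test.

```latex
\[
\eta_p^2 = \frac{F \cdot df_1}{F \cdot df_1 + df_2}
         = \frac{14.66 \cdot 1}{14.66 \cdot 1 + 78}
         \approx .16
\]
\[
d = \frac{t}{\sqrt{n}} = \frac{2.58}{\sqrt{40}} \approx 0.41
\]
```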

Retrieval-induced enhancement

A 2x2 ANOVA with the factors of item type (Rp+, Nrp) and role during practice (speaker, listener) revealed a significant main effect of item type, F(1,78) = 129.78, MSE = 0.02, p < .001, \({\eta _{p}^{2}} = .63\), reflecting higher recall rates for Rp+ than Nrp items (83.22% vs. 58.76%). The main effect of role during practice was not significant, F(1,78) < 1.0, p = .714, \({\eta _{p}^{2}} = .002\), suggesting that recall did not differ between speakers and listeners (70.47% vs. 71.52%). The ANOVA also returned a non-significant interaction of the two factors, F(1,78) < 1.0, p = .426, \({\eta _{p}^{2}} = .008\), indicating that retrieval-induced enhancement did not differ between speakers and listeners. Follow-up t tests confirmed that retrieval-induced enhancement was indeed present in both speakers (83.56% vs. 57.38%), t(39) = 8.45, p < .001, d = 1.34, and listeners (82.87% vs. 60.14%), t(39) = 7.65, p < .001, d = 1.21.

Discussion

The results of Experiment 1 replicate prior work that reported WI-RIF and SS-RIF on a category-cued recall test (e.g., Coman & Hirst, 2015; Cuc et al., 2007; Stone et al., 2013). Additionally, both forgetting of Rp- items and enhancement of Rp+ items were comparable in size across speakers and listeners, which is per se consistent with the view that similar cognitive mechanisms may mediate WI-RIF and SS-RIF. Category-cued recall tests come with two problems, however. First, output order at test is not controlled on such tests, leaving the possibility that forgetting of the (weaker) Rp- items arises due to output interference, i.e., because the (stronger) Rp+ items are recalled first at test (e.g., Smith, 1971; Tulving & Arbuckle, 1963). Second, blocking can play a major role for RIF on such tests. As a consequence, results relying on such tests are largely silent regarding the degree to which inhibition may have contributed to the forgetting, and whether this contribution differed between speakers and listeners. Experiment 2 therefore revisited the issue, using a cued-recall test with item-specific cues.

Experiment 2

Experiment 2 provided the studied items’ category labels plus the items’ unique initial letters as item-specific retrieval cues at test. Such item-specific cues make it possible to control the output order of the single items and thus to prevent the prior recall of the (stronger) Rp+ items at test from inducing RIF for the (weaker) Rp- items; Rp- items were therefore tested prior to Rp+ items. Moreover, such item-specific cues can decrease the relative contribution of blocking to RIF and increase the relative contribution of inhibition (e.g., Anderson et al., 1994; Bäuml, 1998). If WI-RIF in speakers and SS-RIF in listeners are caused by the same cognitive mechanisms and these mechanisms contribute to about the same degree to the forgetting in the two participant groups, then the results of Experiment 2 should replicate those of Experiment 1, with similar amounts of RIF in speakers and listeners. In contrast, if SS-RIF in listeners is primarily caused by blocking and inhibition contributes to WI-RIF only, then RIF may be reduced in listeners relative to speakers.

Method

Participants

As in Experiment 1, the sample comprised 80 subjects, tested in pairs. Mean age was 23.15 years (SD = 3.89); 29 subjects were male, 51 female.

Material

The same item material was applied as in Experiment 1.

Design

The experiment had a 2 × 4 mixed-factorial design. As in Experiment 1, the factor role during retrieval practice (speaker, listener) was manipulated between-subjects. The factor item type (Rp+, Rp-, Nrp+, Nrp-) was manipulated within-subjects. Selective retrieval in the intermediate practice phase again created practiced categories (containing Rp+ and Rp- items) as well as unpracticed control categories (containing Nrp items). In Experiment 2, due to the adapted final-test format, it was possible to further divide Nrp items into Nrp+ items (serving as control items for Rp+ items) and Nrp- items (serving as control items for Rp- items; see below for details).

Procedure

The procedure was identical to the one applied in Experiment 1, the only difference being the final-test format. In Experiment 2, the final test was a category-plus-initial-letter-cued recall test, i.e., subjects did not only receive category names as cues but also the items’ initial letters. All items from one semantic category were tested back to back, but sequence of items within categories was controlled. Each item’s initial letter was presented together with the name of the category, for 8 s each. For practiced categories, Rp- items were always tested first, in random order, to reduce potential output interference effects; subsequently, Rp+ items were tested in the same way. To ensure adequate baselines for (tested first) Rp- and (tested second) Rp+ items, Nrp categories were tested in the same manner. The first half of items tested in each Nrp category was used as baseline for Rp- items and labeled Nrp- items; the second half of tested items was used as baseline for Rp+ items and labeled Nrp+ items.
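The within-category test order can be sketched as follows. This is a minimal illustration, not the authors' software: the function name is hypothetical, the exemplars other than daffodil and tulip are invented placeholders, and the sketch assumes that the split of an Nrp category into its first and second half is determined by a random shuffle, which the description above does not specify.

```python
import random


def category_test_order(items, practiced_items, is_practiced_category, rng):
    """Return [(item, label), ...] in test order for one category.

    Practiced categories: Rp- items first (random order), then Rp+ items.
    Control categories:   first tested half is labeled Nrp-, second half Nrp+.
    """
    if is_practiced_category:
        first = [item for item in items if item not in practiced_items]
        second = [item for item in items if item in practiced_items]
        rng.shuffle(first)
        rng.shuffle(second)
        labels = ("Rp-", "Rp+")
    else:
        # Assumption: halves of a control category are formed by shuffling.
        shuffled = list(items)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        first, second = shuffled[:half], shuffled[half:]
        labels = ("Nrp-", "Nrp+")
    return [(item, labels[0]) for item in first] + [
        (item, labels[1]) for item in second
    ]


# Placeholder category with six exemplars, three of which were practiced.
items = ["daffodil", "tulip", "rose", "lily", "orchid", "violet"]
practiced = ["tulip", "rose", "lily"]

practiced_order = category_test_order(items, practiced, True, random.Random(0))
control_order = category_test_order(items, [], False, random.Random(0))
```

Testing the control categories in the same two-step manner keeps test position comparable between each practiced item type and its baseline.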

Results

Success rates during retrieval-practice cycles

Speakers’ practice performance was again excellent, with 88.95% (SD = 13.07) of Rp+ items being correctly recalled on the first retrieval-practice cycle, and 89.57% (SD = 14.00) on the second retrieval-practice cycle, t(39) < 1.0, p = .462, d = 0.12.

Final test performance

Figure 2 shows mean recall performance on the final test, separately for speakers and listeners.

Fig. 2. Mean recall in Experiment 2, plotted separately for speakers and listeners. Panel a contrasts Rp- and Nrp- items, whereas panel b contrasts Rp+ and Nrp+ items. Error bars represent ± 1 standard error

Retrieval-induced forgetting

A 2x2 ANOVA with the factors of item type (Rp-, Nrp-) and role during practice (speaker, listener) detected a significant main effect of item type, F(1,78) = 21.40, MSE = 0.02, p < .001, \({\eta _{p}^{2}} = .22\), but no significant main effect of role during practice, F(1,78) < 1.0, p = .741, \({\eta _{p}^{2}} = .001\), and also no significant interaction between the two factors, F(1,78) < 1.0, p = .823, \({\eta _{p}^{2}} = .001\). Recall did not differ between speakers and listeners (72.16% vs. 71.03%), but was higher for Nrp- than Rp- items (76.48% vs. 66.70%). Follow-up t tests confirmed significant retrieval-induced forgetting for both speakers (77.27% vs. 67.02%), t(39) = 3.64, p = .001, d = 0.58, and listeners (75.69% vs. 66.38%), t(39) = 2.95, p = .005, d = 0.47.

Retrieval-induced enhancement

A 2x2 ANOVA with the factors of item type (Rp+, Nrp+) and role during practice (speaker, listener) revealed a significant main effect of item type, F(1,78) = 17.21, MSE = 0.03, p < .001, \({\eta _{p}^{2}} = .18\), reflecting higher recall rates for Rp+ than Nrp+ items (83.55% vs. 71.90%). The main effect of role during practice was not significant, F(1,78) < 1.0, p = .606, \({\eta _{p}^{2}} = .003\), suggesting that recall did not differ between speakers and listeners (77.00% vs. 78.46%). The interaction of item type and role during practice was also not significant, F(1,78) < 1.0, p = .944, \({\eta _{p}^{2}} < .001\), which indicates that the size of the enhancement effect was not affected by role during practice. Follow-up t tests confirmed significant retrieval-induced enhancement in both speakers (82.93% vs. 71.08%), t(39) = 3.03, p = .004, d = 0.48, and listeners (84.18% vs. 72.73%), t(39) = 2.84, p = .007, d = 0.45.

Discussion

The results of Experiment 2 show intact WI-RIF and SS-RIF with controlled output order at test, indicating that the forgetting of the Rp- items is not just a consequence of output interference arising from the fact that the (stronger) Rp+ items are recalled first. Moreover, the results provide a first demonstration that (i) SS-RIF in listeners can also arise when item-specific cues are presented at test, and (ii) SS-RIF can be equivalent in size to WI-RIF in speakers under such conditions. On the basis of the two-factor account of RIF, this demonstration is consistent with the view that SS-RIF and WI-RIF are caused by the same cognitive mechanisms and inhibition may not only contribute to RIF in speakers, but also in listeners. Still, although items’ unique initial letters served as item-specific retrieval cues in this experiment, these letters constitute relatively weak item-specific cues, which do not always reduce or eliminate blocking processes. In fact, whereas some RIF studies provided evidence that such cues can eliminate blocking (e.g., Anderson et al., 1994; Bäuml, 1998), other RIF studies reported contrasting results (e.g., Raaijmakers & Jakab, 2012; Rupprecht & Bäuml, 2016; Verde, 2013). It was therefore the goal of Experiments 3 and 4 to use stronger item-specific cues at test in order to enable a stronger test of the question of whether inhibition also contributes to SS-RIF in listeners.

Experiment 3

There is a good amount of evidence in the literature that strength-based interference effects (as they may occur in the retrieval-practice task but also in other experimental tasks, like the list-strength effect) can be present in recall but do not arise in item recognition (Ratcliff et al., 1990; Rupprecht & Bäuml, 2016; Shiffrin et al., 1990), a finding well captured by models of recognition memory (e.g., Dennis & Humphreys, 2001; Shiffrin & Steyvers, 1997). Experiment 3 therefore employed a yes/no recognition test to reduce blocking at test and thus to examine whether inhibition does not only play a role for WI-RIF in speakers, but also for SS-RIF in listeners. The basic experiment was similar to Experiments 1 and 2, but at test, subjects were presented with all items from the study phase, intermixed with new items from the same semantic categories. The subjects’ task was to judge for each single item whether it was an old item from the study phase or a new item not studied earlier.

Based on prior work that showed RIF in individuals on such item recognition tests (e.g., Gómez-Ariza et al., 2005; Hicks & Starns, 2004; Spitzer & Bäuml, 2007) we expected WI-RIF in speakers in this experiment. For listeners, expectations depend on whether inhibition is assumed to play a critical role for SS-RIF. If this is the case, then WI-RIF and SS-RIF should both be present and be largely equivalent in size. In contrast, if inhibition plays a reduced role in listeners, then RIF should be present in speakers, but be reduced or eliminated in listeners.

Method

Participants

Sample sizes in Experiments 3 and 4 were determined on the basis of prior recognition experiments on RIF from our lab (e.g., Rupprecht & Bäuml, 2016, 2017; Spitzer & Bäuml, 2007, 2009). Ninety-six subjects participated in pairs. Mean age was 21.32 years (SD = 3.05); 16 subjects were male, 80 female.

Material

Item material again consisted of exemplars from semantic categories, but we used new semantic categories in Experiment 3. Moreover, the number of items per category was doubled and 12 exemplars were selected from each of 8 semantic categories. These materials were divided into two sets of items, with each set containing 6 exemplars per category. The items of one of the two sets were studied and practiced in the experiment, whereas the items of the other set served as lures during the recognition test. Sets of items were counterbalanced across participants and were equally often used as study and lure items.

Design

The experiment had the same 2 × 4 mixed-factorial design as Experiment 2, with the between-participants factor of role during retrieval practice (speaker, listener) and the within-participants factor of item type (Rp+, Rp-, Nrp+, Nrp-).

Procedure

The procedures for study and retrieval-practice phases were identical to those applied in Experiments 1 and 2. After retrieval practice, however, subjects were asked to engage in unrelated distractor tasks for 8 min; this delay was added to prevent ceiling effects that might emerge on an immediate recognition test.

At test, all initially studied items were presented one at a time and intermixed with the same number of lure items. To control output interference, the first half of the test only comprised Rp- items and half of the Nrp items (called Nrp- items, because they were tested intermixed with Rp- items and corresponding “new” lure items). The second half of the test then comprised the remaining Rp+ items and the other half of the Nrp items (called Nrp+ items, because they were tested intermixed with Rp+ items and corresponding lures). Within each half of the test, we again used blocked randomization such that old (or new) items from the same semantic category could appear back to back at most two times in succession. For each single item, subjects were asked to judge whether it was old (i.e., studied in the experiment) or new (i.e., not previously encountered in the experiment); subjects entered their responses via the computer keyboard. The test was subject-paced, i.e., the next item appeared on the screen as soon as subjects entered a response. Mean test duration was 137.48 s (SD = 30.43).

Results

Success rates during retrieval practice

Speakers’ performance was excellent on both practice cycles, with 89.24% (SD = 11.91) of Rp+ items being recalled on the first retrieval-practice cycle and 90.45% (SD = 12.98) on the second retrieval-practice cycle, t(47) = 1.55, p = .128, d = 0.22.

Final test performance

Table 1 shows hits and false alarms, separately for the different item types and for speakers and listeners. Initial analyses indicated that false alarm rates were not significantly different for Rp- vs. Nrp- items or for Rp+ vs. Nrp+ items; all ts(95) ≤ 1.95, all ps ≥ .054, all ds ≤ 0.20. For each single item type, false alarm rates were also comparable between speakers and listeners; all ts(94) ≤ 1.85, all ps ≥ .067, all ds ≤ 0.38. Figure 3 shows mean corrected hit rates (i.e., hits - false alarms) for all item types.
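As a minimal illustration of the corrected hit rate used here (hits minus false alarms), the following sketch computes the measure from old/new responses. The function name and the example responses are hypothetical and are not taken from the data.

```python
def corrected_hit_rate(old_item_responses, lure_responses):
    """Hit rate minus false alarm rate.

    Each argument is a list of booleans, where True means the subject
    responded 'old' to that item.
    """
    hit_rate = sum(old_item_responses) / len(old_item_responses)
    false_alarm_rate = sum(lure_responses) / len(lure_responses)
    return hit_rate - false_alarm_rate

# Hypothetical subject: 5 of 6 studied items and 1 of 6 lures called 'old'.
studied = [True, True, True, True, True, False]
lures = [True, False, False, False, False, False]
print(round(corrected_hit_rate(studied, lures), 3))  # 0.667
```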

Table 1 Hits and false alarms in Experiment 3
Fig. 3 Mean corrected hit rates (hits - false alarms) in Experiment 3, plotted separately for speakers and listeners. Panel a contrasts Rp- and Nrp- items, whereas panel b contrasts Rp+ and Nrp+ items. Error bars represent ± 1 standard error

Retrieval-induced forgetting

A 2 × 2 ANOVA with the factors of item type (Rp-, Nrp-) and role during practice (speaker, listener) revealed only a significant main effect of item type, F(1,94) = 15.14, MSE = 0.02, p < .001, \({\eta _{p}^{2}} = .14\), which suggests that recognition was generally reduced for Rp- compared with Nrp- items (52.68% vs. 61.27%). There was no significant main effect of role, F(1,94) = 2.28, MSE = 0.06, p = .134, \({\eta _{p}^{2}} = .02\), and also no significant interaction, F(1,94) = 1.68, MSE = 0.02, p = .198, \({\eta _{p}^{2}} = .02\), suggesting that the difference between item types was not affected by role during practice. Follow-up tests, however, confirmed significantly reduced recognition of Rp- vs. Nrp- items only in speakers (53.85% vs. 65.31%), t(47) = 4.05, p < .001, d = 0.58. The numerical pattern in listeners indicated forgetting, too (51.50% vs. 57.23%), but statistically this difference did not reach significance, t(47) = 1.69, p = .098, d = 0.24.
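The partial eta squared values reported with the F tests can be recovered from the F statistic and its degrees of freedom through the standard identity \(\eta_p^2 = F \cdot df_1 / (F \cdot df_1 + df_2)\). The sketch below applies this identity to the three F values of this analysis; the function name is hypothetical, and the code is a consistency check rather than the authors' analysis script.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# F values reported for item type, role, and their interaction.
for label, f_value in [("item type", 15.14), ("role", 2.28), ("interaction", 1.68)]:
    print(label, round(partial_eta_squared(f_value, 1, 94), 2))
# item type 0.14
# role 0.02
# interaction 0.02
```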

Retrieval-induced enhancement

A 2 × 2 ANOVA with the factors of item type (Rp+, Nrp+) and role during practice (speaker, listener) showed a significant main effect of item type, F(1,94) = 176.61, MSE = 0.02, p < .001, \({\eta _{p}^{2}} = .65\), reflecting higher recognition accuracy for Rp+ than Nrp+ items (83.48% vs. 56.16%). The ANOVA also returned a significant main effect of role, F(1,94) = 4.88, MSE = 0.04, p = .030, \({\eta _{p}^{2}} = .05\), which was accompanied by a significant interaction with item type, F(1,94) = 7.63, MSE = 0.02, p = .007, \({\eta _{p}^{2}} = .08\), suggesting that the recognition difference between speakers and listeners depended on item type. Follow-up tests showed that recognition of Rp+ items did not differ between speakers and listeners (83.69% vs. 83.27%), t(94) < 1.00, p = .877, d = 0.03, but Nrp+ recognition was higher in speakers than in listeners (62.04% vs. 50.27%), t(94) = 2.90, p = .005, d = 0.59. Consistently, recognition was enhanced for Rp+ relative to Nrp+ items in speakers (83.69% vs. 62.04%), t(47) = 7.05, p < .001, d = 1.02, but even more so in listeners (83.27% vs. 50.27%), t(47) = 12.07, p < .001, d = 1.74.
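The effect sizes d reported with the follow-up t tests are consistent with the usual conversions from t statistics: d = t / √n for within-group (paired) comparisons and d = t · √(1/n1 + 1/n2) for between-group comparisons. The following sketch assumes these conversions and 48 subjects per role; how d was actually computed is not stated in the text, and the function names are hypothetical.

```python
from math import sqrt

def d_from_paired_t(t_value, n):
    """Cohen's d for a within-group comparison (d_z = t / sqrt(n))."""
    return t_value / sqrt(n)

def d_from_independent_t(t_value, n1, n2):
    """Cohen's d for a between-group comparison of two independent groups."""
    return t_value * sqrt(1 / n1 + 1 / n2)

print(round(d_from_paired_t(7.05, 48), 2))           # 1.02 (speakers, Rp+ vs. Nrp+)
print(round(d_from_paired_t(12.07, 48), 2))          # 1.74 (listeners, Rp+ vs. Nrp+)
print(round(d_from_independent_t(2.90, 48, 48), 2))  # 0.59 (Nrp+, speakers vs. listeners)
```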

Discussion

On the one hand, the ANOVA results replicate those of Experiments 1 and 2 by showing RIF that did not differ statistically between speakers and listeners. On the other hand, the results deviate somewhat from those of the two previous experiments, because the effect size of RIF reported for listeners in this experiment was only half the size of that for speakers, as is also reflected by the fact that significant RIF in single comparisons arose in speakers but not listeners. Sample sizes for Experiment 3 were determined on the basis of prior work from our lab, but a sensitivity analysis (with 48 subjects per condition, an alpha level of .05 and power of .80) indicates that the smallest detectable effect size for within-group comparisons was approximately d = .41. Because smaller effects were therefore difficult to detect, this limited sensitivity may explain why the single comparisons in this experiment found RIF to be significant for speakers but not for listeners.
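The order of magnitude of the sensitivity analysis can be checked with a normal approximation to the power of a two-sided within-group test, d ≈ (z_{1-α/2} + z_{power}) / √n. This is a rough sketch, not the procedure used for the reported value: a computation based on the noncentral t distribution yields a slightly larger threshold, in line with the d = .41 stated above. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_dz(n, alpha=0.05, power=0.80):
    """Smallest within-group effect size detectable with the given power,
    using a normal approximation and a two-sided test (an assumption)."""
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / sqrt(n)

print(round(min_detectable_dz(48), 2))  # 0.4
```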

Retrieval-induced enhancement was stronger in listeners than speakers, which makes it particularly unlikely that the numerical difference in amount of RIF was related to differences in blocking processes. Rather, on the basis of the view that mainly inhibition contributes to RIF in item recognition, the pattern of numerically reduced RIF in listeners may suggest that the contribution of inhibition is somewhat reduced in listeners relative to speakers. Because this pattern is present numerically but not statistically in Experiment 3, the goal of Experiment 4 was to reexamine the issue, this time by means of an item recognition test that applies confidence ratings and enables ROC analyses. ROC analysis can overcome problems that may be present with yes/no recognition testing (e.g., Macmillan & Creelman, 2004; Wixted, 2007b; see also below) and thus may provide more accurate information regarding differences between WI-RIF in speakers and SS-RIF in listeners.

Experiment 4

Experiment 4 was largely identical to Experiment 3, the only difference being that the applied item recognition test was based on confidence ratings instead of old/new judgments. At test, old items from the study phase were again presented intermixed with new items from the same semantic categories, and subjects provided a rating for each item reflecting how confident they were that the item was old or new. On the basis of prior work on individual RIF that used the same type of memory test (e.g., Rupprecht & Bäuml, 2016, 2017; Spitzer & Bäuml, 2007, 2009), we again expected WI-RIF in speakers. Depending on the reading of the results of Experiment 3, one may expect a similar SS-RIF effect or a reduced SS-RIF effect in listeners.

Method

Participants

Ninety-six subjects participated in pairs. Mean age was 21.95 years (SD = 2.62); 38 subjects were male, 58 female.

Material

We used the same semantic categories as in Experiments 1 and 2, but as in Experiment 3, 12 exemplars were chosen from each of the 8 semantic categories. Materials were divided into two sets of items, with each set containing 6 exemplars per category. Sets of items were counterbalanced across participants and equally often used as study and lure items.

Design

The experiment again followed a 2 × 4 mixed-factorial design with the between-participants factor of role during retrieval practice (speaker, listener) and the within-participants factor of item type (Rp+, Rp-, Nrp+, Nrp-).

Procedure

The procedures for study, retrieval-practice, and distractor phases were identical to those applied in Experiment 3. At test, all initially studied items were presented one at a time and intermixed with the same number of lure items. In Experiment 4, a schematic rating scale was depicted below each item, and subjects were asked to rate for each single item how confident they were that the item had previously been studied (old) or not (new); the rating scale ranged from 1 to 6 (1=definitely old; 6=definitely new). Participants entered their answers via the computer keyboard. Before starting the test, subjects were instructed to try to make use of the whole range of the rating scale. The test was completely subject-paced, i.e., as soon as subjects entered a response, the next item was shown on the screen. Mean test duration was 209.26 sec (SD = 47.98).

We again controlled sequence of items at test. Unbeknownst to participants, the first half of the test only comprised Rp- items and half of the Nrp items (called Nrp- items), tested intermixed with corresponding lures. The second half of the test comprised the remaining Rp+ items and the other half of the Nrp items (called Nrp+ items), also mixed with lures. Within each half of the test, we again used blocked randomization such that items from the same semantic category could appear back to back at most two times in succession.

ROC curves and statistical analysis

Proportion of initially studied items correctly recognized as old (i.e., hits) and proportion of lure items incorrectly recognized as old (i.e., false alarms) were cumulated across the rating scale response options, beginning with the most confident response option (definitely old, “1”). This approach makes it possible to plot ROC curves, relating hits and false alarms across variations in response criteria (i.e., the propensity to make a positive recognition response; e.g., Macmillan & Creelman, 2004; Parks & Yonelinas, 2008). With the present 6-point scale, hit and false alarm rates under five different response criteria arose. We applied the unequal-variance signal detection model to analyze the ROC data (e.g., Dunn, 2004; Wixted, 2007a), because the assumption of unequal variance for the distributions of old and new items can accommodate the typically asymmetrical shape of the ROC. According to this model, recognition responses are made on the basis of a single source of information, namely the items’ general memory strength, which may reflect the additive or nonadditive combination of familiarity and recollection codes (e.g., Kelley & Wixted, 2001; Wixted & Stretch, 2004). Using this model, discriminability of old relative to new items can be derived from the distance between the means of the underlying strength distributions of those old and new items (da). When applied to 5-point ROC data, the model results in seven free parameters (discriminability da, variance of the distribution of old items σ, and five criterion points c1-c5); when testing the model’s goodness of fit, three degrees of freedom are left. To estimate model parameters, we used maximum-likelihood methods, which allow statistical testing (for further details on the general analysis approach, see Rupprecht & Bäuml, 2016, 2017).
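The cumulation of ratings and the model's predicted ROC can be sketched as follows. The sketch assumes a common parameterization of the unequal-variance signal detection model, with lure strength distributed as N(0, 1) and old-item strength as N(μ, σ), and the usual definition da = √2 · μ / √(1 + σ²); the paper's own parameterization may differ in detail. All function names and rating counts are hypothetical. Five hit rates and five false alarm rates give ten data points, which against seven free parameters presumably accounts for the three degrees of freedom mentioned above.

```python
from math import sqrt
from statistics import NormalDist

def cumulative_rates(rating_counts):
    """Cumulate rating counts from 1 (definitely old) to 6 (definitely new).

    Returns the proportion of 'old' responses under each of the five
    criteria (ratings <= 1, <= 2, ..., <= 5).
    """
    total = sum(rating_counts)
    rates, running = [], 0
    for count in rating_counts[:-1]:
        running += count
        rates.append(running / total)
    return rates

def uvsd_predicted_roc(mu_old, sigma_old, criteria):
    """Predicted hit and false alarm rates; lures are assumed to be N(0, 1).

    Criteria should be listed from strictest to most lenient so that the
    predicted rates increase like the cumulated ratings.
    """
    phi = NormalDist().cdf
    hits = [phi((mu_old - c) / sigma_old) for c in criteria]
    false_alarms = [phi(-c) for c in criteria]
    return hits, false_alarms

def d_a(mu_old, sigma_old):
    """Discriminability index of the unequal-variance model."""
    return sqrt(2) * mu_old / sqrt(1 + sigma_old ** 2)

# Hypothetical rating counts for 48 old items and 48 lures.
hit_rates = cumulative_rates([30, 8, 4, 2, 2, 2])
fa_rates = cumulative_rates([2, 3, 5, 8, 10, 20])
print([round(r, 3) for r in hit_rates])  # [0.625, 0.792, 0.875, 0.917, 0.958]
print([round(r, 3) for r in fa_rates])   # [0.042, 0.104, 0.208, 0.375, 0.583]
```

Fitting the model would then mean choosing μ, σ, and the five criteria so that the predicted rates match the observed ones as closely as possible under a maximum-likelihood criterion; that optimization step is omitted here.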

Results

Success rates during retrieval practice

Speakers’ performance was similar on the two practice cycles, with 91.85% (SD = 10.95) of Rp+ items being recalled on the first retrieval-practice cycle and 92.80% (SD = 10.06) on the second retrieval-practice cycle, t(47) = 1.07, p = .291, d = 0.15.

Recognition test: overall recognition performance

To see if the general numerical recognition pattern was consistent with that observed in Experiment 3, we determined overall hit and false alarm rates by collapsing subjects’ responses across the three most confident response levels of the rating scale (i.e., rating levels 1-3) and then computed corrected hits (hits - false alarms) separately for the single item types. Corrected hits were reduced for Rp- relative to Nrp- items, and this reduction was larger for speakers (47.4% vs. 54.6%) than listeners (59.0% vs. 61.7%). Moreover, corrected hits were enhanced for Rp+ relative to Nrp+ items, for both speakers (78.0% vs. 56.1%) and listeners (82.4% vs. 63.9%). Yet, because an analysis of corrected hits assumes that the underlying ROC function is linear (e.g., Wixted, 2007b), and because ROC functions are typically curvilinear and asymmetric (see below), our main analysis relied on a signal detection approach, which can account for the curvilinear and asymmetric form of the ROC.

Recognition test: unequal-variance signal detection model

Figure 4 shows the ROCs for Rp- vs. Nrp- items and for Rp+ vs. Nrp+ items, separately for speakers and listeners. In the first step, we examined if the unequal-variance signal detection model described the data sufficiently for each item type and in both speakers and listeners. Table 2 shows the statistics of goodness-of-fit and maximum-likelihood estimates of the model’s parameters da and σ. The model provided a good fit to the data for all item types, χ²s(3) ≤ 1.21, ps ≥ .752. The variance of the old items’ distribution, as estimated by the parameter σ, was larger than 1.0 for both speakers and listeners, χ²s(1) ≥ 12.29, ps < .001, indicating that the model’s assumption of unequal variances for old and new items improved the description of the data significantly. For Rp- and Nrp- items, σ was constant across item types, χ²s(1) ≤ 0.23, ps ≥ .634. For Rp+ and Nrp+ items, it varied with item type for speakers, χ²(1) = 5.71, p = .017, though not listeners, χ²(1) = 2.89, p = .089. Placement of the five response criteria never varied across item types, χ²s(4) ≤ 2.69, ps ≥ .611.

Fig. 4 ROC curves based on cumulative hit and false alarm rates in Experiment 4, plotted separately for speakers (panels a and c) and listeners (panels b and d). Panels a and b show ROC curves for Rp- and Nrp- items, whereas panels c and d show ROC curves for Rp+ and Nrp+ items. The lines between data points reflect theoretical ROCs predicted by the unequal-variance signal detection model

Table 2 Unequal-variance signal detection model for Experiment 4

In the second step, we evaluated retrieval-induced forgetting and enhancement and tested potential differences in discriminability da across item types, separately for speakers and listeners. For speakers, Rp- compared with Nrp- items showed reduced discriminability da (1.53 vs. 1.85), χ²(1) = 29.62, p < .001, reflecting RIF of Rp- items. For listeners, discriminability da also differed between Rp- and Nrp- items (1.87 vs. 2.06), χ²(1) = 4.40, p = .036. Therefore, SS-RIF was present, although the reduction in discriminability for Rp- items was smaller in listeners than speakers, χ²(1) = 6.11, p = .013. Rp+ items showed enhanced discriminability da relative to Nrp+ items for both speakers and listeners, χ²s(1) ≥ 35.60, ps ≤ .001. This retrieval-induced enhancement was larger in speakers (5.80 vs. 1.98) than listeners (3.06 vs. 2.37), χ²(1) = 8.00, p = .005.
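For tests with one degree of freedom, the p values follow directly from the chi-square statistic, because a chi-square variable with one degree of freedom is the square of a standard normal variable, so that p = erfc(√(χ²/2)). The sketch below applies this identity to three of the statistics reported above; it assumes only that the reported values are chi-square distributed with one degree of freedom under the null hypothesis, and the function name is hypothetical.

```python
from math import erfc, sqrt

def chi2_p_value_df1(chi2_stat):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    return erfc(sqrt(chi2_stat / 2))

# Statistics reported for SS-RIF in listeners and for the two
# speaker-listener comparisons.
for stat in (4.40, 6.11, 8.00):
    print(stat, round(chi2_p_value_df1(stat), 3))
# 4.4 0.036
# 6.11 0.013
# 8.0 0.005
```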

Discussion

The results of the experiment show significant RIF for both speakers and listeners, but with a larger RIF effect for speakers than listeners, which is numerically consistent with the results of Experiment 3. On the basis of the view that RIF in item recognition reflects mainly inhibition, this finding suggests that inhibition contributes not only to WI-RIF but also to SS-RIF, although its contribution seems to be reduced in listeners relative to speakers. Retrieval-induced enhancement of Rp+ items was larger for listeners in Experiment 3, but larger for speakers in Experiment 4. On its own, Experiment 4 could thus be seen as consistent with an alternative blocking explanation, which may suggest a larger RIF effect when the enhancement effect is increased. However, as emphasized above, blocking does not seem to contribute much to RIF in item recognition and there is also no evidence that the size of the enhancement effect is correlated with the size of the RIF effect in item recognition (see Rupprecht & Bäuml, 2016, 2017; see also Murayama et al., 2014). Blocking thus should not have mediated the results of this experiment.

General discussion

The present results concerning WI-RIF in speakers replicate typical findings from the literature on RIF in individuals, confirming that RIF is present in category-cued recall, initial-letter-cued recall, and item recognition. The results concerning SS-RIF in listeners also replicate prior work by showing that SS-RIF is present in category-cued recall (Experiment 1). However, they extend the prior work by showing that SS-RIF can also arise on initial-letter-cued recall (Experiment 2) and item recognition (Experiments 3 and 4). RIF was similar in size for speakers and listeners when category or initial-letter cues were applied at test (Experiments 1 and 2), but the results obtained on item recognition tests indicated a reduced size of SS-RIF in listeners relative to WI-RIF in speakers. In Experiment 3, this reduction was present numerically only; in Experiment 4, it was also present statistically.

According to the two-factor account of RIF, there are mainly two mechanisms that contribute to RIF, inhibition and blocking. Inhibition is assumed to be involved irrespective of which final test is applied, whereas the additional contribution of blocking is assumed to arise mainly when item-specific retrieval cues are absent at test and to be reduced or even eliminated when such cues are present. On the basis of this account, the results of the present experiments suggest that inhibition and blocking are involved in both WI-RIF in speakers and SS-RIF in listeners. While this equivalence holds qualitatively, the results also suggest subtle quantitative differences between the two participant groups, with inhibition contributing more in speakers and less in listeners.

At least two factors could be responsible for why the involvement of inhibitory processes is reduced in listeners. First, the additional completion of a monitoring task may create a dual-task situation for listeners. For individuals, a secondary task during retrieval practice has been shown to take up cognitive resources, thus reducing the involvement of inhibition and resulting in reduced RIF (see Ortega et al., 2012; Román et al., 2009). The same reasoning can also be applied to explain why the involvement of inhibition is reduced in SS-RIF in listeners. This reasoning is not in conflict with the finding that retrieval practice still entailed enhancement for Rp+ items in listeners. Indeed, prior work on the positive effects of retrieval practice in both the RIF literature and the testing-effect literature has shown that the benefits of retrieval practice can still be reaped when secondary tasks are introduced (Buchin & Mulligan, 2017; Mulligan & Picklesimer, 2016; Ortega et al., 2012; Román et al., 2009).

Second, the monitoring task may also have prompted listeners to engage in a recognition task rather than in more effortful retrieval practice. Because the involvement of inhibition is assumed to depend on interference (Anderson, 2003) and because strength-based interference is reduced in item recognition (Ratcliff et al., 1990; Rupprecht & Bäuml, 2016), this could explain why inhibition is less involved in SS-RIF than WI-RIF. Using the same basic speaker-listener setup as applied in the present experiments, with listeners engaging in accuracy monitoring, a recent study examined non-selective retrieval practice and resulting testing effects in a social context (Abel & Roediger, 2018). Across two-day delays, this study found regular benefits of retrieval practice relative to restudy in speakers, but only reduced benefits in listeners; equivalent testing effects only emerged when accountability was enhanced and listeners were asked to monitor their own retrieval (instead of the speaker’s retrieval). Although these findings for testing effects arose on the basis of a more demanding memory task, they are generally consistent with the idea that monitoring a speaker’s response for accuracy might prompt recognition judgments rather than effortful retrieval practice in listeners.

Clearly, more work is needed to better understand and tease apart the exact processes involved during retrieval practice in speakers and listeners. Although both factors described above may have contributed to the present findings, their importance may vary within and across individuals. For instance, subjects may differ (generally or on specific trials) in whether they manage or even try to retrieve the correct answer themselves before being exposed to the speaker’s responses. Alternatively, it may be possible to passively wait for the speaker’s response and then provide an accuracy rating on the basis of a quick recognition judgment. The retrieval practice plus monitoring task as used in both the present and prior studies does not enable an investigation of what type of retrieval listeners engage in, or of whether judging the speaker’s accuracy on rating scales takes up cognitive resources and creates a dual-task situation for listeners. Consequently, the present findings must remain silent on the extent to which the two factors contributed to the results. Moreover, it also remains unclear if similar findings would be obtained if free-flowing conversations were used instead of experimenter-controlled retrieval-practice phases (e.g., Cuc et al., 2007). For instance, listeners might be more motivated to engage in effortful retrieval along with speakers under such more naturalistic conditions. Future studies, potentially even with novel experimental setups, are necessary to examine these issues.

On the basis of the present proposal that inhibition contributes more to RIF in speakers than listeners, the question arises of why RIF was not also larger in speakers than listeners in Experiments 1 and 2. In fact, if speakers and listeners differed in degree of inhibition but were equivalent in degree of blocking, then speakers should show larger RIF than listeners regardless of testing format. Following a finding by Kliegl and Bäuml (2016), who reported reduced intralist interference after retrieval practice relative to restudy, Rupprecht and Bäuml (2016) recently argued that retrieval practice may reduce possible blocking effects at test. If so, and if speakers engaged in effortful retrieval practice but listeners in less effortful recognition (see above), the contribution of blocking to RIF might be reduced in speakers relative to listeners. In such a case, both inhibition and blocking may contribute to RIF in speakers and listeners, but inhibition may contribute more to RIF in speakers than listeners, and blocking may contribute more to RIF in listeners than speakers, which is consistent with the present results. Clearly, more work is needed to examine this proposal in depth.

The conclusion arising on the basis of the present experiments is that differences between WI-RIF and SS-RIF are not qualitative, but mainly quantitative in nature, i.e., both effects arise due to blocking and inhibition, but the two mechanisms differ in relative contributions between speakers and listeners. This perspective is also consistent with other recent work on selective retrieval. Although many studies on RIF in individuals demonstrate that selective retrieval of some studied items can cause forgetting of the nonretrieved items, some other recent studies indicate that, under certain circumstances, selective retrieval can also facilitate recall of the nonretrieved information (for reviews, see Bäuml, 2019; Bäuml et al., 2017). Corresponding studies often applied lists of unrelated items as study material and induced a contextual change between study and retrieval practice, for instance, by manipulating delay between study and selective retrieval. Typically, RIF was present when delay was short, but recall facilitation arose when delay was long. This facilitation was explained by context retrieval, assuming that the initial learning context becomes less accessible with delay and that selective retrieval then reactivates the study context, thereby facilitating recall of the other information (see also Polyn & Kahana, 2008). Corresponding facilitation effects were also observed in social settings, such that the delayed selective retrieval of speakers also improved recall of the listeners (Abel & Bäuml, 2015; for related work on collaborating groups, see Abel & Bäuml, 2017). Similar to the present findings, these results suggest that, qualitatively, the cognitive mechanisms active during selective retrieval are equivalent in speakers and listeners.

In sum, the results of the present study show that selective retrieval carried out by speakers can induce forgetting in listeners over a variety of memory tests. For test formats without item-specific cues, like category-cued recall, the forgetting can be equivalent in size between speakers and listeners, whereas for test formats with strong item-specific cues, like item recognition, the forgetting in listeners can be reduced. On the basis of the two-factor account of RIF, these findings suggest that the involvement of inhibitory processes in RIF is reduced in listeners relative to speakers, even though inhibitory processes still seem to contribute to the observed forgetting.

Open practices statement

The data and materials for all experiments are available on the Open Science Framework (https://osf.io/y9q37/). None of the experiments were preregistered.