Introduction

Retrieval practice is a learning activity that involves recalling information from memory. Compared with restudying the material, retrieval practice enhances retention, a finding known as the testing effect. Meta-analytic reviews have shown that this effect is robust and of medium to large size for learning outcomes (e.g. Pan and Rickard 2018, Cohen’s d = 0.40; Rowland 2014, between-subject designs Hedges’ g = 0.69; within-subject designs Hedges’ g = 0.43).

The effects of retrieval practice are among the findings from cognitive psychology that can be readily applied in different educational contexts, such as school settings (e.g. Butler and Roediger 2007; Pyc et al. 2014), university settings (Carpenter et al. 2016), or online learning (Davis et al. 2016; for recent reviews, see Adesope et al. 2017; Carpenter 2012; Dunlosky et al. 2013; Pan and Rickard 2018; Rowland 2014).

Within this huge body of research, different types of retrieval tasks are used when applying or analyzing the effects of retrieval practice (e.g. Chan and McDermott 2007; Karpicke and Aue 2015; Zaromb and Roediger 2010). However, there has been little research on which types of retrieval are best applied in educational contexts. A teacher may wonder if she should use more general tasks such as “What do you remember from the text about trees?” (holistic retrieval) or more specific recall tasks such as “What impact does the shape of leaves have on trees?” (targeted retrieval).

The current literature might suggest that the use of different types of retrieval tasks makes little difference (Rowland 2014). However, the present study provides empirical findings that the type of retrieval task does matter, and we demonstrate under which circumstances which task type is perhaps best. Such findings help instructors choose the best-suited type of retrieval task based on their educational goals. Moreover, they provide insights into the mechanisms involved in retrieval practice.

More specifically, we analyzed the effects of retrieval on different educationally-relevant aspects. We investigated the direct effects of retrieval on learning different types of information and the indirect effects on factors that likely impact further learning such as metacognitive calibration (correctness of metacognition) and motivation (here: situational interest and self-efficacy, Arnold and McDermott 2013).

Direct effects of retrieval practice

Regarding direct effects of retrieval practice, there is ample evidence for differences when using different retrieval-practice tasks. However, the vast majority of studies focus on comparing recognition-oriented tasks to recall-oriented tasks; they have usually found learning to be optimal when the final learning assessment employs the same kind of tasks as the test-for-learning (recognition—recognition, recall—recall; e.g. Duchastel and Nungester 1982). When the learning goal is recognition-oriented, learners will profit most from recognition-oriented tasks during retrieval practice. When the recall of materials is the goal, learners should profit most from recall tasks during retrieval practice. From a theoretical point of view, these findings are in line with the transfer-appropriate processing view of retrieval practice (Morris et al. 1977; Veltre et al. 2016).

We wondered whether there are further task-specific differences beyond the distinction between recognition and recall. We were particularly interested in the recall of learning contents, as recall is necessary in most application situations. Theoretically speaking, even recall tasks that differ in specificity (e.g. short-answer tasks versus free-recall tasks) should trigger differences in the learning process and thereby lead to different outcomes.

There are multiple theories about how retrieval-practice effects occur (for an overview see Karpicke et al. 2014). One of the most prominent is the elaborative retrieval theory (e.g. Carpenter 2009). From this theory’s perspective, different retrieval practice tasks should make a difference in retention even when both tasks are recall tasks. The elaborative retrieval theory attributes the benefit of recalling learning materials to two processes: Firstly, the recall of a specific piece of knowledge activates the memory traces of that specific piece of knowledge. Activating this memory trace strengthens the connections between the concepts related to that specific memory trace. These strengthened connections ultimately lead to better retrieval of that memory trace. Secondly, the mental effort invested in recall—induced by search processes in memory—spreads activation to other pieces of knowledge that are connected to the retrieved piece of knowledge; the memory traces of those connected knowledge pieces and their association with the targeted piece of knowledge are thereby strengthened. This process of spreading activation eventually also leads to better retention of the targeted knowledge pieces and of the connected knowledge pieces. Such spreading activation can even lead to better learning performance on knowledge pieces that were not recalled during retrieval practice (retrieval-induced facilitation; Chan et al. 2006).

When learning by retrieval practice, the two mechanisms of strengthening and spreading activation should lead to a particular activation pattern that depends on the recall task’s specificity. If a recall task is specific and requires merely a targeted piece of information, for example, from a text (e.g. short-answer task), this task should lead to much greater activation of that specific, targeted information piece, as compared to the activation of other information pieces from the text. Such targeted retrieval should consequently lead to better retention of the content sections that were targeted by specific recall tasks. In comparison, if a recall task is unspecific and asks for a variety of information pieces (e.g. free-recall task), there should be less of an activation advantage for one specific piece of information; a variety of information pieces may be activated (Anderson and Reder 1999). In other words, the degree of activation generally available must probably be shared between more information pieces.

Few studies have investigated the effects of different types of recall as retrieval practice tasks. Glover (1989), for example, compared restudy, cued recall, free-recall, and recognition; he identified differences between the tasks and found free-recall to be most beneficial for learning. However, Glover compared only free-recall with all other conditions combined. He did not specifically compare short-answer tasks and free-recall tasks. In a more recent study, comparisons of short-answer and free-recall retrieval-practice tasks revealed no task-specific differences (Endres and Renkl 2015). A more structured recall process that divided answers by specific prompts revealed no significant differences compared to free-recall retrieval practice (Smith et al. 2016). Taken together, these studies identified no differences in learning between open recall tasks and more specific recall tasks.

Few studies have directly compared different types of recall tasks (as mentioned above, most other studies investigated differences between recognition-oriented tasks and recall tasks, e.g., Kang et al. 2007; Rawson and Zamary 2019; Smith and Karpicke 2014). A meta-analysis can, however, also compare the effect sizes of studies using different types of recall tasks. Rowland’s meta-analysis (2014) showed that differences in the specificity of tasks do not lead to significant differences as long as the recall rate during retrieval practice is high. The effect sizes of both task types were similar (short-answer, referred to as cued recall in the meta-analysis, Hedges’ g = 0.72; free recall, Hedges’ g = 0.81).

One drawback of the aforementioned meta-analysis and of previous studies is that the learning was assessed in a rather coarse manner. Usually, only general learning outcomes were tested; different information pieces from the text (e.g. whether an information piece was targeted in the recall task or not) were not differentiated. Such coarse assessments of learning outcomes do not enable us to test more detailed theoretical assumptions of the effect of different types of recall tasks, although such differences might still be relevant when designing learning arrangements including retrieval practice.

Our study differentiated various content sections when assessing learning outcomes to gain both practically and theoretically relevant insights. If activation is the relevant mechanism predicting retrieval practice effects, then there should be clear differences when either free-recall (holistic retrieval) or short-answer tasks (targeted retrieval) are used. We formulated the following hypotheses:

Targeted retrieval hypothesis: Targeted retrieval via specific short-answer tasks improves the learning of the targeted sub-content more than holistic retrieval by free-recall tasks.

Holistic retrieval hypothesis: Holistic retrieval induced by an unspecific free-recall task improves the learning of various sub-contents more than targeted retrieval by specific short-answer tasks.

Indirect effects of retrieval practice

Different recall tasks may not just lead to different learning outcomes; they can also influence other factors relevant to (future) learning. Influences on such factors are referred to as indirect effects of retrieval practice (Roediger et al. 2011). These effects are especially relevant when implementing retrieval practice in applied settings, particularly in self-regulated learning arrangements. In such settings, students can decide whether to continue studying, which content to restudy, and how much effort to invest. Although the direct effects of retrieval-based learning can have a substantial impact, the benefits of retrieval-based learning as a whole can only be fully understood when potential remedial learning activities after first recall are also taken into account. Hence, for practical purposes, it is particularly important to know about both the direct and indirect effects of various types of recall tasks.

A key issue regarding indirect effects is the kind of information the learners obtain about their previous success on a recall task. This information influences how accurately the learners monitor their own state of knowledge and knowledge gaps. Such meta-knowledge is important, as it influences learners’ decisions about whether to start relearning and how much mental effort to invest in it. Learners’ efforts to close their perceived knowledge gaps are termed regulation. It is in particular the learners’ perception of the amount of knowledge they have already acquired that determines whether they will invest more or less time and effort in restudy, that is, the remedial learning of specific contents (Nelson and Narens 1990; Nückles et al. 2009).

Learners’ monitoring of knowledge states and gaps can be influenced by many information sources. One classic source is feedback from the teacher who evaluated the learner’s performance on a test or a recall task. Beyond this direct feedback, learners can also monitor themselves, for example, by relying on inherent feedback when working on (recall) tasks (Arnold and McDermott 2013). The fluency of recall and perception of knowledge gaps are crucial factors in this context (Koriat 2012).

Let us reconsider specific and unspecific recall tasks. If a task demands a holistic activation of contents, for example, in a free-recall task, the inherent feedback the learner receives about knowledge gaps is relatively sparse. Although learners may miss some important information, they might not notice, because they can write about other aspects in response to the recall task. The process of recall might be fluent, and the answer provided may appear correct and coherent. As a result, learners get the impression that their answers were good even though significant pieces of knowledge might be missing (Koriat 2012).

If a task targets a very specific answer, for example, a short-answer task, the inherent feedback learners get while monitoring is much higher. For example, there may be several short-answer tasks on the same content section as for a free-recall task. These short-answer tasks require specific, detailed answers, and learners might become aware of their inability to provide such answers. They thus come to realize that they have knowledge gaps. Moreover, the process of remembering while working on a short-answer task might not be fluent, and the subjective probability of having provided a correct answer might be relatively low. Overall, the inherent feedback the learners obtain from specific tasks is higher, and it provides them with a reasonable idea of their “objective” knowledge state (Koriat 2012).

The more feedback a task provides to learners, the more accurate their meta-knowledge should be, that is, they are better able to calibrate (Alexander 2013). Short-answer tasks provide more intrinsic feedback than free-recall tasks. Hence, we formulated the following hypothesis:

Calibration hypothesis: Targeted retrieval via specific short-answer tasks increases calibration more than holistic retrieval via an unspecific free-recall task.

We also assume that differences in the success learners perceive while recalling can influence motivation. Self-efficacy and situational interest may be influenced especially by the type of recall task. Self-efficacy builds mainly on our previous perceptions doing a certain activity (Bandura 1997). When learners have perceived success, they also expect to succeed in the future with similar tasks (Schunk 1985). Returning to the task-specific effects in feedback from the paragraphs above: the inherent feedback in recall tasks might also exert influence on perceived success. Hence, free-recall tasks—with their low level of intrinsic feedback and relatively fluent recall as compared to short-answer tasks—should lead to a stronger perception of success. After perceiving such success, learners should have higher self-efficacy expectations after a free-recall task.

Situational interest is also affected by perceived success. As with self-efficacy, learners are more interested in areas in which they perform well (Hidi and Renninger 2006). The concepts of self-efficacy and situational interest are closely interconnected, according to some theories (e.g. Schiefele 1990). Hence, free-recall may lead to a stronger experience of success than short-answer tasks and thus should also trigger greater situational interest. On the other hand, there are perspectives predicting greater situational interest after a short-answer task. The knowledge-deprivation hypothesis predicts that a perceived lack of knowledge leads to higher situational interest. Given the assumption that short-answer tasks entail a high level of intrinsic feedback on knowledge gaps, such tasks should trigger stronger situational interest (Rotgans and Schmidt 2011, 2014).

We formulated the following hypotheses:

Self-efficacy hypothesis: Holistic retrieval via unspecific free-recall increases self-efficacy more than targeted retrieval via specific short-answer tasks.

Situational interest hypothesis: The type of recall task influences situational interest (two-sided hypothesis).

Method

Sample and design

We used the statistical power analysis software G*Power (Faul et al. 2007) to estimate the minimum sample size for our within-subjects design. The software estimated that 34 participants would be needed to detect a statistically significant difference for the assumed medium effect size (Cohen's d = 0.5, α = 0.05, power = 80%). Our study enrolled more participants than this minimum, so it was adequately powered to detect effects of this size.
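As a rough cross-check of the reported sample size, the following sketch approximates the required number of participants for a two-sided paired comparison. It assumes that d = 0.5 refers to standardized difference scores and that the planned test was a paired t-test; neither detail is stated above. It uses a normal approximation with a small-sample correction rather than the exact noncentral t computation that G*Power performs, and the function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def approx_paired_sample_size(d, alpha=0.05, power=0.80):
    """Approximate n for a two-sided paired t-test (normal approximation).

    Assumption: d is the standardized mean of the difference scores.
    The term z_alpha**2 / 2 is a common small-sample correction; this is
    not the exact noncentral-t calculation that G*Power performs.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    n = ((z_alpha + z_power) / d) ** 2 + z_alpha ** 2 / 2
    return ceil(n)


print(approx_paired_sample_size(0.5))  # 34 under these assumptions
```

Under these assumptions the approximation arrives at the same figure of 34 participants.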

The 54 university students (age: M = 22.50, SD = 5.44) participating in this study were majoring in different subjects. Participants were given course credit for participation. All were aware they were taking part in research. The experimenter informed each participant about the possibility of quitting the experiment with no repercussions or drawbacks at any time. All participants provided informed consent and permitted us to use their collected data anonymously for publication.

We applied a within-subjects design with the factor “type of recall task”. The factor consisted of a holistic retrieval condition by unspecific free-recall and a targeted retrieval condition by specific short-answer. As dependent variables, we assessed learning outcomes determined by a posttest including free-recall tasks and short-answer tasks (direct effects) as well as metacognition, self-efficacy, and situational interest (indirect effects).

Materials

Texts

We drafted two texts (Text 1: 2427 words; Text 2: 2607 words) dealing with two different contents (coffee and sugar). Each text consisted of four sections similar in length, structure, and complexity. To understand a specific paragraph, learners did not need to understand the previous paragraphs. There were no references between the paragraphs. Each section contained three important pieces of interconnected information (see Appendix A). This separation into textual sections enabled us to detect fine-grained differences in learning outcomes.

The texts were assessed for their readability, intelligibility, and reading time (M = 20.6 min, SD = 4.07, max. 25 min) in a pilot study. Twelve students read both texts and rated the readability and intelligibility of both texts on four rating items (see Appendix B). The texts did not differ in any of the four items (all ps > 0.7). After reading and rating, the texts were returned to the students, and they were asked to mark those paragraphs that needed improvement. We improved the marked paragraphs. The original texts were in German (see Appendix C for English translations; please note that some characteristics of the texts might have changed in translation). If you are interested in the original materials, please contact us.

Recall tasks

We constructed short-answer tasks and free-recall tasks. The short-answer tasks were used to induce the activation of specific targeted information. We constructed 12 specific short-answer tasks for every text, with three questions targeting each of the four sections. Each question assessed one aspect required to understand the text (e.g. “Which characteristics of a coffee plant should be considered for agricultural purposes?”, see Appendix D). For the short-answer tasks, participants were asked to give an answer consisting of 2 to 4 sentences. A text under each answer box prompted learners to provide an answer of comparable length by means of a character counter (200–400 characters). The characters were counted while participants typed, and the count was displayed under the text box. The color of the count turned from red to green once 200 characters had been reached and turned red again once 400 characters had been reached. Nevertheless, the participants could provide shorter or longer answers.

We used all short-answer tasks in the posttest. The provision of tasks in the learning phase depended on the condition. We scored each answer to a short-answer task on a scale ranging from 0 to 3 (with possible partial credit of 0.5). The maximum score was 36 points per text. The scales’ consistency was acceptable for complex learning items (Schmitt 1996; sugar Cronbach’s α = 0.65; coffee Cronbach’s α = 0.59). Two individual scorers rated the learners’ answers. We double-coded 37% of the data (20 participants). Interrater reliability was high (ICC = 0.95).
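The internal-consistency values reported for the scales in this section are Cronbach’s α coefficients. The sketch below shows the standard formula on a participants-by-items score matrix; the function name and the toy data are hypothetical and are not the study’s data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (participants) by items.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("At least two items are required.")
    # Sample variance of each item across participants.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the participants' total scores.
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Toy data: four participants, two items scored on the 0-3 scale plus 1.
toy_scores = [[1, 2], [2, 1], [3, 4], [4, 3]]
print(round(cronbach_alpha(toy_scores), 2))  # 0.75
```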

Free-recall tasks were used to induce the holistic activation of different learning contents. The free-recall task asked learners to recall all the aspects in the text they could remember (e.g. “What do you remember about the coffee text?”). For the free-recall task, participants were asked to provide an answer consisting of 6 to 12 sentences, that is, three times the number of sentences required by a single short-answer task. A text under each answer box prompted learners to provide an answer of comparable length by means of a character counter (600–1200 characters). The color of the count turned from red to green once 600 characters had been reached and turned red again once 1200 characters had been reached. The participants could nevertheless provide shorter or longer answers.
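The counter behavior described for both task types can be sketched as a single function. The handling of the exact boundary values is an assumption (green from the lower limit up to, but not including, the upper limit), and the function name is hypothetical; the original software is not described in more detail.

```python
def counter_color(n_chars, lower, upper):
    """Return the display color of the character counter.

    Assumption: the count is green from `lower` up to, but not including,
    `upper`, and red otherwise. The exact boundary handling is not
    specified in the text.
    """
    return "green" if lower <= n_chars < upper else "red"


# Short-answer task (200-400 characters) and free-recall task (600-1200).
print(counter_color(250, 200, 400))    # green
print(counter_color(1300, 600, 1200))  # red
```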

The same tasks were used to assess learners’ knowledge in the posttest. The maximum score was, as with the short-answer tasks, 36 points per text. The learners' answers were scored by two individual raters. We double-coded the answers of 20 participants, which corresponds to 37% of the data. Interrater reliability was high (ICC = 0.92).

Mental effort

To assess subjective mental effort, we asked participants after each recall task type how much effort they had invested in answering the task(s) (Paas 1992; Sweller et al. 2011). Participants indicated their mental effort using a scroll bar (range: 1 [= low] to 9 [= high]).

Calibration

To assess the meta-knowledge for calculating calibration, we used two judgment-of-learning items (JOLs). Once the participants had worked on each text’s recall tasks, we asked them to indicate how correct they thought their answers were and how probable it was that they would remember their answers next week. Participants indicated their meta-knowledge using a scroll bar (range: 0% [= low] to 100% [= high]). As both scores correlated substantially, we aggregated them into one score (Cronbach’s α = 0.741).

For the meta-knowledge score, we considered the aggregated JOLs as a predicted score for the posttest (percentage of maximum posttest score). We compared this predicted score to the (objective) knowledge score in the delayed assessment and calculated a discrepancy score using the absolute value of discrepancy (Schraw 2009).
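The discrepancy score described above can be illustrated as follows. The sketch assumes that the two JOLs are aggregated by averaging and that the posttest score is expressed as a percentage of the 36-point maximum; the function and parameter names are hypothetical.

```python
def calibration_discrepancy(jol_correct, jol_remember, posttest_score, max_score=36.0):
    """Absolute discrepancy between predicted and actual performance (in %).

    Assumption: the two judgments of learning (0-100%) are averaged into
    one predicted score; the text only states that they were aggregated.
    """
    predicted_percent = (jol_correct + jol_remember) / 2
    actual_percent = 100 * posttest_score / max_score
    return abs(predicted_percent - actual_percent)


# A learner predicting 70% and 50% who scores 18 of 36 points:
print(calibration_discrepancy(70, 50, 18))  # 10.0
```

Lower values indicate better calibration; a score of 0 would mean that the aggregated judgment matched the posttest performance exactly.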

Self-efficacy

We assessed self-efficacy with five self-rating items after participants had worked on each text’s recall tasks. The five items assessed learners’ anticipated ability to perform similar tasks in different situational and social contexts (e.g., whether they could explain the topic to a friend or answer similar questions in a potential exam). In formulating our items, we followed Bandura's (2006) guidelines for constructing self-efficacy scales. Participants indicated their self-efficacy using a scroll bar (range: 0% [= low] to 100% [= high]; Cronbach’s α = 0.904).

Situational interest

We assessed situational interest by relying on three self-rating items after working on each text’s recall tasks (Schiefele 1990). The items asked whether the participants felt that the content presented was interesting, entertaining, or boring. Participants stated their situational interest using a five-point rating scale (range: 1 [= low] to 5 [= high]). The answers to the boredom item were reversed (Cronbach's α = 0.827).
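Reverse-scoring the boredom item can be sketched as below. Averaging the three items into one scale score is an assumption, as the text does not state how the items were combined, and the function name is hypothetical.

```python
def situational_interest(interesting, entertaining, boring, scale_min=1, scale_max=5):
    """Scale score from three ratings, with the boredom item reversed.

    Assumption: the scale score is the mean of the three items.
    """
    for rating in (interesting, entertaining, boring):
        if not scale_min <= rating <= scale_max:
            raise ValueError("Rating outside the response scale.")
    # On a 1-5 scale, reversing maps 1 -> 5, 2 -> 4, ..., 5 -> 1.
    boring_reversed = scale_min + scale_max - boring
    return (interesting + entertaining + boring_reversed) / 3


print(situational_interest(5, 5, 1))  # 5.0
```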

Procedure

Our experiment consisted of two computer-based sessions that entailed four phases. The first session comprised the learning phase, the intervention phase (recall tasks: targeted retrieval by short-answer or holistic retrieval by free-recall), and the immediate-assessment phase (metacognition, self-efficacy, and situational interest). The second session comprised the delayed-assessment phase (short-answer and free-recall posttests on learning outcomes; delay: 7 days).

In the first session, participants started with the learning phase in which they studied two expository texts, each text introduced with the same sentence: “Please read the following texts carefully. There will be a knowledge test.” (translated from German by the first author). Once both texts had been read, the intervention phase began. Targeted activation and holistic activation were implemented via different recall tasks (targeted retrieval by short-answer tasks and holistic retrieval by free-recall tasks). In our within-subjects design, the two task types were randomly assigned to the two texts to control for (unexpected) text effects. The sequences of the texts and tasks were randomized. No feedback was provided after the tasks. After each task type, we assessed participants’ mental effort to control for (unexpected) task differences. Furthermore, we assessed situational interest, metacognition, and self-efficacy after participants had worked on each text’s recall tasks.
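One way to implement such a random assignment is sketched below. The software actually used is not described, so the function, its parameters, and the seeding are assumptions made for illustration.

```python
import random


def assign_conditions(rng, texts=("coffee", "sugar"),
                      tasks=("short-answer", "free-recall")):
    """Randomly pair each text with one task type and randomize the order.

    Returns a list of (text, task type) tuples in presentation order.
    """
    shuffled_tasks = list(tasks)
    rng.shuffle(shuffled_tasks)
    assignment = dict(zip(texts, shuffled_tasks))
    presentation_order = list(texts)
    rng.shuffle(presentation_order)
    return [(text, assignment[text]) for text in presentation_order]


# One hypothetical participant; a seed makes the plan reproducible.
print(assign_conditions(random.Random(1)))
```

Because both lists are shuffled independently, every participant receives both task types, each on a different text.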

After a one-week delay, learners returned for the second session in which we assessed learning outcomes via a free-recall posttest and short-answer posttest for both texts. Participants had to answer the free-recall items in both texts first. After answering the free-recall items, participants had to answer twelve short-answer items in each text. These 24 items included three items that the participants had answered before.

The delayed assessment of learning outcomes is standard in investigations of retrieval practice and testing effects (see Rowland 2014). The tasks’ order with regard to the two texts was randomized. With regard to task types, the free-recall tasks always came first, followed by the short-answer tasks.

Results

We used an alpha level of 0.05 and relied on two-sided testing for all statistical analyses. We determined η² as an effect size index. The values 0.01, 0.06, and 0.14 were considered as small, medium, and large effect sizes, respectively. For certain analyses, we employed Bayesian analysis (Bayes Factor; BF) to confirm null hypotheses. We used a predefined, uninformative JZS prior with a multivariate Cauchy distribution (Rouder et al. 2012). This conservative prior makes no predictions about the effect to be found. We used this conservative prior because we wanted to ensure that evidence for a null effect would be conclusive. With this prior, the probability that the Bayes Factor reveals insufficient evidence is higher than the probability of a false confirmation of the null hypothesis. We interpreted the Bayes Factor as did Wetzels et al. (2011).
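For readers who want to check an effect size against its F statistic, a minimal sketch follows. It assumes that the reported effect size is the partial form, which can be computed from F and its degrees of freedom; the text does not state which variant was used, and the function name is hypothetical.

```python
def eta_squared_from_f(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F statistic.

    eta_p^2 = (F * df_effect) / (F * df_effect + df_error)
    """
    return f_value * df_effect / (f_value * df_effect + df_error)


# F [1, 53] = 12.868, as reported for the targeted retrieval hypothesis.
print(round(eta_squared_from_f(12.868, 1, 53), 3))  # 0.195
```

With the benchmarks given above, this value falls in the large range.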

Prior analyses

As a basis for analyzing task-specificity effects, we examined whether the mental effort would be similar in both conditions. There were no differences in mental effort ratings between the two conditions, F (1, 54) = 0.286, p = 0.596. In addition to our frequentist analysis, we conducted a Bayesian repeated measures ANOVA. The Bayes analysis revealed substantial evidence favoring the null hypothesis (BF01 = 4.313). This Bayes factor can be interpreted as follows: the data are 4.313 times more likely under the null hypothesis of no difference in mental effort ratings between experimental conditions than under the alternative hypothesis. This expected result is crucial for testing effects resulting from differential activation patterns induced by different task types, and it enabled us to interpret our data without assuming any mental effort difference between task types (Endres and Renkl 2015; Pyc and Rawson 2009).
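To make the reading of such a Bayes factor concrete, the sketch below maps a value onto verbal evidence categories and converts it into a posterior probability. The category thresholds (3, 10, 30, 100) follow the Jeffreys-type scheme commonly associated with Wetzels et al. (2011) and are stated here as an assumption, as are the equal prior odds and the function names.

```python
def evidence_label(bf):
    """Verbal label for a Bayes factor >= 1 in favor of one hypothesis.

    Assumption: thresholds at 3, 10, 30, and 100 (Jeffreys-type scheme).
    """
    if bf < 1:
        raise ValueError("Invert the Bayes factor so that it is >= 1.")
    if bf == 1:
        return "no evidence"
    for upper, label in ((3, "anecdotal"), (10, "substantial"),
                         (30, "strong"), (100, "very strong")):
        if bf < upper:
            return label
    return "decisive"


def posterior_null_probability(bf01, prior_null=0.5):
    """Posterior probability of the null hypothesis given BF01."""
    prior_odds = prior_null / (1 - prior_null)
    posterior_odds = bf01 * prior_odds
    return posterior_odds / (1 + posterior_odds)


print(evidence_label(4.313))                        # substantial
print(round(posterior_null_probability(4.313), 3))  # 0.812
```

A Bayes factor reported as BF10 below 1 can be inverted first (BF01 = 1 / BF10) before it is passed to `evidence_label`.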

Direct effects

To assess the direct effects, we checked for differences in the one-week-delayed assessment phase. Table 1 provides the descriptive statistics of performance variables in both conditions. No participant was unable to retrieve any information, and we observed no differences between the targeted retrieval and holistic retrieval conditions on overall performance (short-answer posttest: F [1, 54] = 0.458, p = 0.502; free-recall posttest: F [1, 54] = 0.892, p = 0.349). In addition to the frequentist analysis, we conducted a Bayesian repeated measures ANOVA. The Bayes analysis revealed substantial evidence favoring the null hypothesis (short-answer posttest: BF01 = 4.091, free-recall posttest: BF01 = 3.364). These Bayes factors can be interpreted as follows: the data are 4.091 and 3.364 times more likely, respectively, under the null hypothesis of no difference in learners’ performance between experimental conditions than under the alternative hypothesis. On the overall performance level, our study thus revealed no task-specific performance differences, in line with Rowland’s (2014) meta-analysis.

Table 1 Means and standard deviations (in brackets) of performance variables in both conditions

On the more detailed level of content sections, however, we identified several differences between task types. We analyzed those according to our hypotheses:

Targeted retrieval hypothesis

To test the targeted retrieval hypothesis (“Targeted retrieval via specific short-answer tasks improves the learning of the targeted sub-content more than holistic retrieval by free-recall tasks”), we compared the two conditions’ learning outcomes with respect to knowledge pieces specifically addressed in the short-answer tasks. We expected that the targeted retrieval by specific recall (in contrast to holistic retrieval via free-recall tasks) would improve the performance in the targeted information section in both posttest types.

Targeted retrieval improved the learning outcomes of targeted content sections more than holistic retrieval in both posttest types (short-answer posttest: targeted retrieval in targeted retrieval condition, M = 3.63, SD = 1.76, targeted retrieval in holistic retrieval condition, M = 2.82, SD = 0.95, F [1, 53] = 12.868, p < 0.001, η² = 0.195; free-recall posttest: targeted retrieval in targeted retrieval condition, M = 2.93, SD = 1.69, targeted retrieval in holistic retrieval condition, M = 2.45, SD = 1.09, F [1, 53] = 4.031, p = 0.04980, η² = 0.071; see Table 1). We also observed this difference when aggregating both posttest types (F [1, 52] = 6.644, p = 0.003, η² = 0.204). The interaction between task type for practice and task type in the posttest was not statistically significant (F [1, 53] = 1.621, p = 0.208, η² = 0.030). Hence, we found no evidence for a transfer-appropriate processing effect.

In summary, we confirmed the targeted retrieval hypothesis. In both posttest types, the specifically recalled content sections were recalled better than those not specifically recalled. As the amount of mental effort and overall posttest performance did not differ between conditions, this difference is likely due to different activation patterns. The specific activation of one section strengthened this section more than did the activation of multiple sections.

Holistic retrieval hypothesis

To test our holistic retrieval hypothesis (“Holistic retrieval induced by an unspecific free-recall task improves the learning of various sub-contents more than targeted retrieval by specific short-answer tasks”), we compared the learning outcomes between conditions with respect to knowledge pieces not targeted by the short-answer tasks. We expected that holistic activation via the free-recall task would lead to better performance on these knowledge pieces in both posttest types than the short-answer tasks, which did not target them.

Holistic retrieval improved learning outcomes of several sections more than targeted retrieval did (free-recall posttest: targeted retrieval in holistic retrieval condition, M = 2.45, SD = 1.09; non-targeted information in targeted retrieval condition, M = 2.10, SD = 1.17, see Table 1). This effect was, however, only apparent in the free-recall posttest (F [1, 53] = 4.841, p = 0.032, η² = 0.084). It was not revealed in the short-answer posttest (short-answer posttest: targeted retrieval in holistic retrieval condition, M = 2.82, SD = 0.95; non-targeted information in targeted retrieval condition, M = 2.67, SD = 1.26; F [1, 53] = 1.289, p = 0.261, η² = 0.024, overall: F [1, 52] = 2.381, p = 0.102, η² = 0.084). Following up on this non-significant finding, we additionally conducted a Bayesian repeated measures ANOVA to obtain evidence for the null hypothesis. The Bayes analysis yielded no conclusive evidence favoring either hypothesis (BF10 = 0.359). The interaction between type of practice task and posttest task type was not statistically significant (F [1, 53] = 1.655, p = 0.204, η² = 0.030). Hence, there was no transfer-appropriate processing effect.

In summary, we confirmed the holistic retrieval hypothesis in the free-recall posttest. We could not confirm this hypothesis in the short-answer posttest.

Indirect effects

Calibration hypothesis

To test our calibration hypothesis (“Targeted retrieval via specific short-answer tasks increases calibration more than does holistic retrieval via an unspecific free-recall task”), we compared the discrepancy score between the targeted retrieval and holistic retrieval conditions. Table 2 provides descriptive statistics of the indirect-effects variables in both conditions. Holistic retrieval led to higher JOL ratings than targeted retrieval (F [1, 53] = 11.470, p = 0.001, η² = 0.179). We calculated the discrepancy between the posttest scores that learners predicted in their JOLs and their actual posttest scores. Targeted retrieval improved the accuracy of JOLs more than holistic retrieval did (short-answer posttest: F [1, 53] = 11.470, p = 0.001, η² = 0.178; free-recall posttest: F [1, 53] = 11.470, p = 0.001, η² = 0.178; overall: F [1, 53] = 11.470, p = 0.001, η² = 0.178).

Table 2 Means and standard deviations (in brackets) of secondary effect variables in both conditions

In summary, we confirmed the calibration hypothesis. Learners’ calibration was more accurate after short-answer tasks than after a free-recall task.
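The discrepancy score used for the calibration analysis can be illustrated with a short sketch. It assumes that predicted scores (from the JOLs) and actual posttest scores are expressed on the same scale and that the discrepancy is computed as predicted minus actual; the exact scoring procedure is not spelled out in this section. The function name and the data are hypothetical.

```python
from statistics import mean


def calibration_scores(predicted, actual):
    """Return (mean signed bias, mean absolute discrepancy) for paired scores.

    A positive bias indicates overconfidence (predicted > actual).
    A smaller absolute discrepancy indicates more accurate calibration.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    diffs = [p - a for p, a in zip(predicted, actual)]
    return mean(diffs), mean(abs(d) for d in diffs)


# Hypothetical learners: predicted vs. actual posttest performance in percent.
jol_predicted = [80.0, 70.0, 90.0, 60.0]
posttest_actual = [60.0, 65.0, 70.0, 65.0]

bias, abs_discrepancy = calibration_scores(jol_predicted, posttest_actual)
print(f"bias = {bias:.1f}, absolute discrepancy = {abs_discrepancy:.1f}")
```

For these made-up values, the signed bias is 10 points (overconfidence on average) and the mean absolute discrepancy is 12.5 points. Comparing such scores between conditions is one way to express which task type leaves learners better calibrated.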

Self-efficacy hypothesis

To test our self-efficacy hypothesis (“Holistic retrieval via unspecific free-recall increases self-efficacy more than targeted retrieval via specific short-answer tasks”), we compared the aggregated self-ratings in the targeted retrieval and holistic retrieval conditions. Holistic retrieval raised self-efficacy more than targeted retrieval (F [1, 53] = 4.602, p = 0.037, η² = 0.080). To better understand the mechanisms underlying this effect, we conducted a within-participant mediation analysis. We tested whether the effect of task type on self-efficacy was mediated by the participants’ performance. We applied the MEMORE tool to calculate within-participant mediations (Montoya and Hayes 2017). This model revealed an indirect effect of condition on self-efficacy via participants’ performance, B = 31.58, 95% CI [9.81, 62.55]. The nonsignificant c’ path indicates a complete mediation (i.e. the effect is entirely attributable to the mediation; see Fig. 1).

Fig. 1

Effect of task specificity on self-efficacy via participants’ performance. All Bs represent unstandardized regression coefficients obtained through bootstrapping using 5000 resamples. The range in brackets represents the 95% CI of the indirect effect

In summary, we confirmed our self-efficacy hypothesis. Learners’ self-efficacy was significantly higher in the holistic retrieval condition than in the targeted retrieval condition. More intensely perceived success predicted greater self-efficacy after recall.
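To make the logic of the mediation analysis reported above more concrete, the following sketch estimates an indirect effect for a two-condition within-participant design and attaches a percentile bootstrap confidence interval. It follows the general approach described by Montoya and Hayes (2017) but is not the MEMORE macro itself. All function names and the simulated data are hypothetical, and the true-effect values in the simulation are arbitrary assumptions chosen for illustration.

```python
import random
from statistics import mean


def _cross_products(x, y):
    """Sum of centered cross-products (an unnormalized covariance)."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


def indirect_effect(m1, m2, y1, y2):
    """Indirect effect a*b with mediator m and outcome y in conditions 1 and 2."""
    m_diff = [b - a for a, b in zip(m1, m2)]
    y_diff = [b - a for a, b in zip(y1, y2)]
    m_avg = [(a + b) / 2 for a, b in zip(m1, m2)]
    a_path = mean(m_diff)  # condition effect on the mediator
    # b path: slope of y_diff on m_diff, controlling for the mediator average.
    # Centering m_avg would change only the intercept (c'), not this slope.
    s11 = _cross_products(m_diff, m_diff)
    s22 = _cross_products(m_avg, m_avg)
    s12 = _cross_products(m_diff, m_avg)
    s1y = _cross_products(m_diff, y_diff)
    s2y = _cross_products(m_avg, y_diff)
    b_path = (s1y * s22 - s2y * s12) / (s11 * s22 - s12 ** 2)
    return a_path * b_path


def bootstrap_ci(m1, m2, y1, y2, n_boot=5000, seed=1):
    """95% percentile bootstrap CI of the indirect effect (resampling participants)."""
    rng = random.Random(seed)
    n = len(m1)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        resampled = [[values[i] for i in idx] for values in (m1, m2, y1, y2)]
        try:
            estimates.append(indirect_effect(*resampled))
        except ZeroDivisionError:
            continue  # skip degenerate resamples without variance
    estimates.sort()
    lower = estimates[int(0.025 * len(estimates))]
    upper = estimates[int(0.975 * len(estimates)) - 1]
    return lower, upper


# Simulated data for 54 hypothetical participants (not the study's data).
sim = random.Random(42)
perf_targeted = [sim.gauss(5, 1) for _ in range(54)]
perf_holistic = [m + 1 + sim.gauss(0, 0.5) for m in perf_targeted]
rating_targeted = [0.5 * m + sim.gauss(0, 0.3) for m in perf_targeted]
rating_holistic = [0.5 * m + sim.gauss(0, 0.3) for m in perf_holistic]

estimate = indirect_effect(perf_targeted, perf_holistic, rating_targeted, rating_holistic)
ci_low, ci_high = bootstrap_ci(perf_targeted, perf_holistic, rating_targeted, rating_holistic)
print(f"indirect effect = {estimate:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

If the bootstrap interval excludes zero, the indirect path via performance is taken to be reliable. This is the same decision rule that underlies the confidence intervals reported for the indirect effects in this section.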

Situational interest hypothesis

To test our situational interest hypothesis (“The type of recall task influences situational interest”), we compared interest in the topics associated with the targeted retrieval condition and the holistic retrieval condition, respectively. Holistic retrieval increased situational interest more than targeted retrieval (F [1, 53] = 13.928, p < 0.001, η² = 0.208). We conducted a regression analysis to better understand the mechanisms underlying this effect. Our theoretical argumentation allows for two possibilities. If perceived success and situational interest develop in parallel and support each other, a more intense experience of success would lead to deeper situational interest. In that case, the differences in recall in the test-for-learning should predict the situational interest differences with a positive b weight. A negative b weight would support the thirst-for-knowledge theory; in this case, the experience of knowledge gaps (negative experience of success) leads to greater situational interest. Specifically, we conducted a regression analysis using individual recall differences as an indicator of “perceived success” to predict situational interest. Test-for-learning performance differences significantly predicted situational interest differences, b = 0.33, t (52) = 2.56, p = 0.014; the better the performance, the greater the situational interest. The test-for-learning performance also explained a significant proportion of the variance in situational interest, R² = 0.11, F (1, 52) = 6.53, p = 0.014. We also conducted a within-participant mediation analysis, testing whether the effect of task type on situational interest was mediated by the participants’ performance (MEMORE, Montoya and Hayes 2017). The model revealed an indirect effect of condition on situational interest via the participants’ performance, B = 0.83, 95% CI [0.20, 1.77]. The nonsignificant c’ path indicates a complete mediation (i.e., the effect is entirely attributable to the mediation; see Fig. 2).

Fig. 2

Effect of task specificity on situational interest via participants’ performance. All Bs represent unstandardized regression coefficients obtained through bootstrapping using 5000 resamples. The range in brackets represents the 95% CI of the indirect effect

In summary, the task type did in fact influence situational interest. Learners exhibited deeper situational interest after a free-recall task. A stronger experience of success predicted greater situational interest. In conjunction with low inherent feedback, such enhanced situational interest after having performed a task successfully supports the theory of the parallel development of experiencing success and situational interest (e.g. Hidi and Renninger 2006).

Discussion

Our study makes the following contributions to the available literature: (1) The present findings confirm the assumption that task type matters when employing recall tasks for retrieval practice—a novel finding in this research field. Our study’s setup has enabled us to demonstrate such novel evidence on the effects of specific and unspecific recall tasks. (2) We did not detect a mere better-or-worse pattern of results. Instead, our findings highlight the relevance of educational goals when implementing retrieval practice. Previous studies on retrieval practice barely addressed different educational goals. (3) Regarding the direct effects of retrieval practice, targeted retrieval (via specific short-answer tasks) led to greater retention of targeted information from the learning contents; holistic retrieval (via unspecific free-recall tasks) led to better retention of further information from the learning contents. (4) Regarding indirect effects, targeted retrieval (via specific short-answer tasks) improved metacognitive calibration; holistic retrieval (via unspecific free-recall tasks) increased self-efficacy and situational interest. Note that a strength of the present study is that we investigated both direct and indirect effects in a differentiated manner within a single study, thus enabling insights into the interdependency of direct and indirect task-type effects.

Theoretical implications

Regarding direct effects, our findings provide evidence of the mechanisms involved in retrieval-practice effects. In particular, the process of spreading activation seems to play a role with our semantically related learning materials. Although our hypotheses have not yet been supported by empirical studies or meta-analyses (see also the discussion in Rowland 2014), our findings suggest that there are systematic differences between tasks when the appropriate analytic grain size is chosen, that is, when learning contents and outcomes are analyzed in greater detail.

In our study, the distinctive activation patterns of knowledge we assumed (specific and unspecific) led to specific learning effects. We suggest that this specific learning effect is attributable to spreading activation as a relevant mechanism in retrieval practice. As long as the topics contain meaningful cues (a factor we can assume in complex learning), spreading activation seems to have, at the very least, greater explanatory value than the transfer-appropriate processing perspective when considering differences between various types of recall task.

Our direct effect hypotheses were derived from the elaborative retrieval theory (Carpenter 2009). Spreading activation plays a crucial role in this theory. Although we did not design our study to (dis)confirm theories, our findings and the current literature suggest that the transfer-appropriate processing perspective applies when different memory processes such as recognition and recall are in the foreground; however, theories including spreading activation are better able to explain the effects of different recall tasks. Follow-up studies should test this assumption.

Regarding indirect effects, it is the situational-interest effect in particular that is theoretically relevant. In our introduction, we presented two perspectives of how performance may influence situational interest (high performance may enhance or lower interest). Our subsequent mediation analysis indicated that performance perceived as better leads to higher interest (see Fig. 2). This finding does not support the knowledge-deprivation hypothesis. However, there are major differences between our study and those supporting the knowledge-deprivation hypothesis (Rotgans and Schmidt 2014, 2017). In our case, the learners read both texts relevant to a later task performance. In studies on the knowledge-deprivation hypothesis, the topics addressed in the tasks were completely new to one group of learners. Consequently, the difference in knowledge deprivation was greater in those studies. A second difference is the type of task carried out in knowledge-deprivation studies and in our (and other) retrieval-practice studies. In knowledge-deprivation studies (e.g. Rotgans and Schmidt 2014), the tasks were more like challenging puzzles than recall tasks. Students were given background information (a short introductory text) before they had to make a guess or work out a solution. Students in those studies were probably interested in whether their solution was right. The knowledge-deprivation hypothesis may have explanatory value for those task types, but less so for recall tasks.

Another interpretation of our effect of task specificity on situational interest may be in line with the self-determination theory. A free-recall task could also lead to higher situational interest because participants can choose the topic they would like to recall. Such autonomy should trigger stronger situational interest (Ryan and Deci 2000), an assumption already validated (Linnenbrink-Garcia et al. 2013). Although this assumption is theoretically plausible, our study provides no evidence supporting it. Our mediation analysis revealed that the task effect on situational interest was fully mediated by performance. However, as we cannot entirely exclude a self-determination interpretation of the effect on interest, perceived autonomy while working on different recall tasks should be investigated in future studies (cf. Vansteenkiste et al. 2004).

Practical implications

In this study, we learned that different recall tasks actually make a difference. We found that using specific tasks fosters the retention of those particular targeted content sections that had been included in these questions. On the other hand, unspecific tasks foster the learning of content sections that were not directly retrieved. Hence, when teachers decide on a task type for retrieval practice, they should also take the nature of the learning contents into account. If the contents to be learned include a few, very central content units, it would be preferable to use specific tasks such as short-answer tasks. Such material may present a few specific concepts together with several examples that facilitate understanding but are not themselves relevant. A text about three crucial factors of climate change, which also includes illustrations for better understanding, could serve as a corresponding example.

If the contents to be learned include a wider range of important information, it would be preferable to use unspecific tasks such as free-recall tasks. For example, in a text on the Gulf Stream and different factors affecting it, there are many different and important points worth remembering.

If the contents to be learned include some very important information, but overall understanding also depends on learning a broad set of information, a teacher might apply both formats. For example, students might first work on free-recall tasks to deepen their overall understanding. Then, they could answer short-answer questions to consolidate the most important key information. This mix of specific and unspecific recall tasks may explain why mixed-format tests were associated with higher weighted mean effect sizes in a recent meta-analysis (Adesope et al. 2017).

With respect to indirect effects, our findings suggest that more specific task types (e.g. short-answer tasks) help learners become aware of their knowledge gaps, leading to more accurate metacognition. Free-recall tasks have the advantage of increasing self-efficacy and situational interest. Again, as in the case of direct learning outcomes, teachers should select a specific task type according to the learning unit’s most important educational goals. If teachers want their students to develop accurate metacognitive judgments and to acquire a precise understanding of what they must still learn, for example, in self-regulated learning settings, they should select specific tasks that give the learners substantial feedback about their actual knowledge state. When teachers aim to motivate their students and make them feel self-efficacious, for example, in a problem-based learning unit, they should use tasks that provide learners with less feedback.

Limitations

Order of posttest tasks

In our experimental design, we used the same order of posttest tasks for all participants: first the free-recall posttest and then the short-answer posttest. We chose this task sequence because free-recall and short-answer tasks address different levels of memory accessibility. Free-recall tasks assess easily accessible knowledge. In addition to highly accessible knowledge, short-answer tasks also assess less accessible knowledge by providing a specific cue for recall. This cue makes recall easier and allows us to assess harder-to-recall knowledge as well. We were interested in both types of accessibility. We stuck to the free-recall, then short-answer order of tasks to ensure that our short-answer tasks did not cue certain contents in the free-recall answers. We expected no carryover effects from the free-recall to short-answer tasks.

In this context, it is essential to distinguish three cases. Firstly, a participant does not retrieve specific pieces of information in the free-recall task that were associated with the correct answer to a later short-answer task. In this case, there is no plausible carryover effect of activation. Secondly, a participant does retrieve a specific piece of information in the free-recall task that is the correct answer to a later short-answer task. As retrieving that piece of information is easier when responding to a short-answer task, that individual is more likely to retrieve the correct answer to the short-answer task with or without the preceding free-recall. As the free-recall pre-activates this piece of information in the present case, that individual may have provided a faster answer to the short-answer task (which our study, however, did not measure). Thirdly, a participant retrieves specific pieces of information in the free-recall task that were associated with the correct answer to a later short-answer task. In this case, the activation of the associated knowledge in the free-recall task could have led to an activation of information corresponding to a subsequent correct answer via spreading activation. This activation could later have led to better recall of the correct answer while working on specific tasks (retrieval-induced facilitation effect, Chan 2006). However, studies directly addressing the issue of retrieval-induced facilitation without a delay (Chan 2010) found no evidence for a benefit of non-directly activated but associated material in an immediate posttest. The effect of retrieval-induced facilitation seems to develop only in conjunction with a delay. As there was no delay in our posttest between the free-recall and short-answer tasks, there was very likely no effect of retrieving pieces of information in free-recall on retrieving associated information when answering short-answer tasks later.

Limitations of our findings

We confirmed most of our hypotheses. However, the holistic retrieval hypothesis was confirmed only with the free-recall posttest, not with the short-answer posttest. There are several possible interpretations of these inconsistent findings. The first is that the free-recall posttest is more sensitive to differences because the short-answer tasks provide learners with retrieval cues. These retrieval cues may eliminate differences between conditions and could be responsible for the lack of evidence in the short-answer posttest. In this interpretation, the free-recall task would be the more informative one. A second interpretation is that the lack of evidence for the holistic retrieval hypothesis is due to transfer-appropriate processing. As mentioned above, transfer-appropriate processing effects apply when different cognitive processes are involved (e.g. Veltre et al. 2016). Transfer-appropriate processing would assume that, in addition to the activation patterns, the relation between an intervention task and assessment task had an influence on our assessment. However, the interactions between the intervention and assessment tasks were not significant in any of our analyses in terms of direct effects. Hence, a transfer-appropriate processing interpretation is implausible in this case. We, therefore, reject this second alternative interpretation.

Further studies

Our findings suggest that a more detailed analysis of learning outcomes can establish a fruitful basis for drawing practical and theoretical implications. Our study’s careful and detailed examination of learning outcomes has revealed evidence that is potentially relevant in further research.

Our study suggests that the task type selected for retrieval practice should depend on the nature of the learning content: a few central issues or many relevant information units. To further confirm this recommendation, this feature of learning content should be varied experimentally. For example, one set of learning material could cover the three types of cognitive load distinguished in cognitive load theory (Sweller et al. 2011) and explain them via several examples. In such a case with few (three) central ideas, specific questions should be preferred. A second set of learning material could explain cognitive load theory as a whole, with important details and instructional design effects. In the latter case, unspecific free-recall might be preferable.

Another interesting direction in current research on retrieval practice effects is the use of posttests given after a longer delay. In our study, we provided the posttest after a one-week delay. In other studies (e.g. Rummer et al. 2017), the effects of retrieval practice interventions interacted with the length of the delay. Further studies could examine whether there are also interactions between delay and the effects of a task’s specificity.

With respect to indirect effects, a limitation of this study is that we assessed calibration, self-efficacy, and situational interest only after retrieval practice; we did not investigate the influence of these variables on later remedial learning. Future studies should close this gap.

Conclusion

Overall, the type of recall task in retrieval practice makes a difference in learning. Teachers and instructional designers should be aware of task-specific direct and indirect effects. They should choose recall tasks that correspond to their main educational goals in a specific lesson. Matching the type of recall task to corresponding educational goals is necessary with respect to both learning outcomes (direct effects) and metacognition and motivation (indirect effects).