A wealth of research has shown that practice involving retrieval of target information from memory (i.e., retrieval practice) is beneficial for subsequent retention (for reviews, see Rawson & Dunlosky, 2011; Roediger & Butler, 2011). Of course, the effectiveness of retrieval practice depends on a number of factors. For example, although failed retrieval attempts may show modest memorial benefits (e.g., Kornell, Hays, & Bjork, 2009), retrieval practice is particularly efficacious when retrieval attempts during encoding are successful (e.g., Karpicke & Roediger, 2007; Pyc & Rawson, 2007, 2011). Furthermore, the memorial benefits of successful retrievals depend critically on the quantity and timing of those successful retrievals (Pyc & Rawson, 2009).

Although retrieval practice has been shown to yield large improvements in memory under appropriate experimentally devised conditions, in many learning situations (e.g., a student studying for an exam), the scheduling of retrieval practice is largely in the hands of the learner. Thus, the efficacy of retrieval practice for enhancing learning can only be as good as individuals’ self-regulated use of retrieval practice. Therefore, it is important to understand the extent to which individuals’ judgments of learning are sensitive to factors that influence the efficacy of retrieval practice. Accordingly, the present research examined the extent to which individuals’ judgments are sensitive to the quantity and timing of successful retrievals during practice.

Below, we provide a brief review of the particular retrieval practice effects that are relevant for the present experiments. We then describe components of self-regulated learning, with particular emphasis on metacognitive monitoring, the component of greatest interest here. Finally, we report two experiments evaluating the sensitivity of judgments of learning to factors that influence the efficacy of successful retrieval practice.

Efficacy of retrieval practice

Many studies have established that retrieval practice is beneficial for memory. Retrieving information from memory during practice promotes memory to a greater extent than do other strategies, such as restudying (e.g., Cull, 2000; Karpicke & Roediger, 2007, 2008). Important for present purposes, previous research has shown that the quantity and timing of practice influence the memorial benefits of retrieval practice.

Concerning the quantity of practice, research has shown greater memorial benefits when individuals engage in more versus less retrieval during practice (e.g., Allen, Mahler, & Estes, 1969; Wheeler & Roediger, 1992). Concerning the timing of practice, a wealth of previous research has demonstrated greater memorial benefits when items are practiced with a longer versus shorter lag between practice trials with items (e.g., Cepeda, Pashler, Vul, Wixted, & Rohrer, 2006; Cull, 2000; Landauer & Bjork, 1978; Pashler, Zarrow, & Triplett, 2003; Pyc & Rawson, 2009). However, almost all of this previous research has manipulated the quantity and timing of trials during practice. In contrast, the present research involved manipulating the quantity and timing of correct retrievals during practice. When students self-regulate their own learning using retrieval practice, they presumably do not (and should not) simply engage in a fixed number of practice trials for each item. Rather, students should self-test until they can correctly recall items multiple times during encoding (e.g., Pyc & Rawson, 2009).

What influence do the quantity and timing of correct retrievals have on final test performance? Recent research has shown greater memorial benefits for items correctly retrieved more versus fewer times during practice and for items that are correctly retrieved after longer versus shorter lags during retrieval practice (Pyc & Rawson, 2009). Pyc and Rawson (2009) presented participants with foreign language paired associates for an initial study trial and then test–restudy practice trials until items reached a preassigned criterion level of performance (1, 3, 5, 6, 7, 8, or 10 correct retrievals) during practice. Items were practiced with either a short or a long lag between practice trials. After a delay, participants completed a final cued recall test for all items. Across two experiments, performance increased as the number of correct retrievals during practice increased (see also Nelson, Leonesio, Shimamura, Landwehr, & Narens, 1982; Vaughn & Rawson, 2011). Additionally, performance was higher for items with a longer lag versus shorter lag between correct retrievals during practice. Thus, the benefits of successful retrievals depend critically on the quantity and timing of those successful retrievals.

Theories of self-regulated learning and metacognitive monitoring

Although researchers have identified various retrieval practice schedules that are particularly beneficial for memory (i.e., schedules with multiple correct retrievals that take place after long lags), the impact of successful retrieval practice for promoting learning hinges critically on individuals’ using the most effective retrieval practice schedules when self-regulating their study. Self-regulated learning includes two central components, monitoring and control (e.g., Greene & Azevedo, 2007; Nelson & Narens, 1990; Winne & Hadwin, 1998). Monitoring involves evaluating how well information has been learned and/or the likelihood that information will be remembered in the future. Control involves decisions about what to study, when to study, and how to study. The primary assumption of models of self-regulated learning is that monitoring informs control decisions, which in turn influence learning (e.g., Ariel, Dunlosky, & Bailey, 2009; Dunlosky & Metcalfe, 2009; Nelson & Narens, 1990; Winne & Hadwin, 1998). Consistent with this basic assumption, research has shown that more accurate versus less accurate monitoring during study leads to higher levels of test performance (e.g., Dunlosky & Rawson, in press; Rawson, O’Neil, & Dunlosky, 2011; Thiede, 1999; Thiede, Anderson, & Therriault, 2003).

Because monitoring accuracy is critically important for effective control and later test performance, we focus on this aspect of self-regulated learning in the present experiments. To examine the extent to which individuals accurately monitor their learning during retrieval practice, we evaluated the extent to which judgments of learning (JOLs) made after correct retrievals are sensitive to factors (i.e., lag and criterion level) that moderate the effects of successful retrieval.

What factors influence JOLs? Koriat’s (1997) cue-utilization framework states that JOLs are inferential, in that individuals do not have direct access to their own memory states and, thus, must use heuristics to assess the likelihood of being able to later recall information. That is, JOLs are not based on an evaluation of the memory strength of an item but, instead, are based on one or more cues that individuals use to infer the state of their memory.

What types of cues are used to make JOLs? According to the cue-utilization framework, three classes of cues can influence JOLs: intrinsic, extrinsic, and mnemonic. Intrinsic cues are based on characteristics inherent to items, which may make them easier or more difficult to learn (e.g., abstract vs. concrete). Extrinsic cues are based on learning conditions (e.g., number of trials) or the encoding task an individual engages in (e.g., interactive imagery). Mnemonic cues are based on aspects of an individual’s own subjective experiences during task performance (e.g., retrieval fluency), which may provide the individual with information that is predictive of how well an item has been learned, as well as the likelihood that the item will be recalled at a later time. To foreshadow, extrinsic and mnemonic cues are of greatest interest here.

Sensitivity of JOLs to effects of correct retrievals

With the goal of the present research in mind (i.e., to evaluate the sensitivity of JOLs to the quantity and timing of successful retrievals during practice), to what extent can previous research provide information about the kinds of cues that learners use to make JOLs after correct retrievals?

A wealth of previous research has evaluated the sensitivity of JOLs to the quantity and timing of practice, but these previous studies are different in important ways from the present research. For example, previous research has shown greater JOL accuracy as the quantity of practice increases (e.g., Mazzoni, Cornoldi, & Marchitelli, 1990; Meeter & Nelson, 2003; Zechmeister & Shaughnessy, 1980). However, much of this previous research has involved study trials only, rather than retrieval practice. Furthermore, prior research involving retrieval practice has manipulated the number of practice trials, rather than manipulating the number of correct retrievals.

Likewise, previous research has examined JOL accuracy as a function of timing of practice. JOL magnitudes are often greater with less versus more time between practice trials, whereas performance is usually lower with less versus more time between practice trials (e.g., Kornell, 2009; Zechmeister & Shaughnessy, 1980). However, the available research either has again involved only study trials or has manipulated the timing of practice trials rather than the timing of correct retrievals. Furthermore, much of the work showing JOL magnitude differences as a function of timing has compared massed versus spaced practice (i.e., no spacing vs. some spacing between practice trials with items), rather than short versus long lags.

Why are these differences important? First, given that previous research has shown differences in JOL accuracy for study versus retrieval practice (e.g., Karpicke, 2009; Kornell & Son, 2009; Mazzoni & Nelson, 1995, Experiment 2; Shaughnessy & Zechmeister, 1992), the sensitivity of JOLs to effects of the quantity and timing of practice in previous studies involving study trials only may differ from the sensitivity of JOLs to these factors under conditions involving retrieval practice (e.g., because the mnemonic cue of retrieval fluency is available under conditions of retrieval practice, but not under conditions of study only). Second, implementing a fixed number of practice trials for each item yields differences in learning status for various items. That is, some items may be correctly recalled during practice, whereas others may not be correctly recalled. Because retrieval status (i.e., correct vs. incorrect) is a powerful cue for making judgments (Nelson & Dunlosky, 1991), differences in retrieval status for individual items exert a strong influence on JOLs made during practice with a fixed number of trials. In contrast, when all items are learned to a given criterion, individuals cannot use retrieval status as a cue for making judgments. Third, a similar logic applies to studies manipulating the lag between trials, rather than the lag between correct retrievals, in that retrieval status will differ as a function of lag in the former case, but not in the latter. In sum, the sensitivity of JOLs to the quantity and timing of correct retrievals may differ from patterns observed in previous research to the extent that the available cues differ for conditions of criterion versus noncriterion learning.

Importantly, here we are interested in students’ judgments of learning when all items are successfully retrieved, for reasons described above. However, to our knowledge, only one prior study has examined JOLs during criterion learning (i.e., when all items are practiced until correctly recalled). Karpicke (2009) reported that JOLs were greater for items that were correctly recalled three times versus one time during practice (see Footnote 1). No prior research has evaluated the relationship between lag and JOLs when items are learned to a criterion level of performance, nor has prior research examined JOLs when both lag and criterion level are manipulated.

However, on the basis of the kinds of cues that Koriat’s (1997) cue-utilization framework assumes people use when making JOLs, we outline a number of possible outcomes. On the basis of the definition provided by the cue-utilization framework, criterion level is an extrinsic cue. If individuals have accurate beliefs regarding criterion level, JOLs will increase as criterion level increases. Of course, even if participants have accurate beliefs, it is possible that they may not use these beliefs when making JOLs (e.g., Koriat, Bjork, Sheffer, & Bar, 2004), so one might not see a relationship between criterion level and JOL. It could also be the case that individuals do not have any beliefs about criterion level, in which case JOLs will not differ for various criterion levels. (We do not consider the highly implausible possibility that individuals would believe that an increase in criterion level would lead to a decrease in memory.)

Although criterion level is an extrinsic cue, it also influences the mnemonic cue of retrieval fluency. For example, metacognitive research has shown that in various tasks, JOLs increase as response latencies decrease (e.g., Benjamin, Bjork, & Schwartz, 1998). Importantly, previous research on retrieval practice has shown that retrieval latencies decrease as the number of correct retrievals during practice increases (e.g., Pyc & Rawson, 2009). Therefore, if JOLs during retrieval practice are based on the mnemonic cue of retrieval fluency, JOLs are predicted to increase as criterion level increases.

Lag is also an extrinsic cue by definition. If individuals have accurate beliefs about lag, JOLs will be higher for items that are correctly retrieved after longer versus shorter lags. Again, even if participants have accurate beliefs, it does not ensure that they will use these beliefs when making JOLs (Koriat et al., 2004), in which case JOLs may not differ for longer versus shorter lags. JOLs also may not be related to lag if individuals do not have any beliefs about the effects of lag. Finally, if individuals have inaccurate beliefs about lag (and incorporate those beliefs when making JOLs), JOLs will be higher for shorter versus longer lags.

The extrinsic cue of lag also influences the mnemonic cue of retrieval fluency. Previous research has shown that retrieval latencies during retrieval practice are lower for items retrieved after shorter versus longer lags (e.g., Pyc & Rawson, 2009). If JOLs during retrieval practice are based on the mnemonic cue of retrieval fluency, JOLs will be higher for items that are correctly retrieved after shorter versus longer lags.

The present experiments were designed to evaluate two questions. First, are JOLs sensitive to the effects of criterion level and/or the lag between correct retrievals on final test performance? Second, what cues are used to make JOLs for criterion level and lag? In two experiments, participants learned foreign language paired associates via retrieval practice with restudy until items reached an assigned criterion level of performance (one, three, six, or nine correct retrievals). Items were practiced with either a short lag or a long lag between trials. After the last correct retrieval for each item, participants predicted the likelihood of retrieving that item on the final test. If JOLs are based on the extrinsic cue of criterion level and/or on the mnemonic cue of retrieval fluency, JOLs will increase as criterion level increases. For lag, several outcomes are plausible, depending on the extent to which the extrinsic cue of lag complements or competes with the mnemonic cue of retrieval fluency.

Experiment 1

Method

Participants and design

Forty-one Kent State University undergraduates participated in return for course credit. Criterion level (one, three, six, or nine correct retrievals during practice) was a within-participants manipulation. Lag (short vs. long) was a between-participants manipulation, with 22 participants in the short-lag group and 19 in the long-lag group.

Materials

Items included 48 Swahili–English translation word pairs previously normed for item difficulty (Nelson & Dunlosky, 1994). Twelve word pairs were assigned to each of four lists, with an equivalent range of item difficulty in each list. Within each list, three items were randomly assigned to each criterion level (randomized anew for each participant).
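The assignment scheme just described can be sketched as follows. This is a minimal illustration rather than the software used in the experiment: the item labels are placeholders, the function name `assign_items` is hypothetical, and the balancing of item difficulty across lists is assumed to have been done beforehand.

```python
import random

CRITERION_LEVELS = (1, 3, 6, 9)

def assign_items(lists, rng):
    """Map each item to a criterion level, with equal numbers per level in every list."""
    assignment = {}
    for items in lists:
        per_level = len(items) // len(CRITERION_LEVELS)  # 12 // 4 = 3 items per level
        shuffled = list(items)
        rng.shuffle(shuffled)
        for i, level in enumerate(CRITERION_LEVELS):
            for item in shuffled[i * per_level:(i + 1) * per_level]:
                assignment[item] = level
    return assignment

# Placeholder labels standing in for the 48 normed Swahili-English pairs.
lists = [[f"list{n}_item{i}" for i in range(12)] for n in range(4)]
# A different seed per participant gives the "randomized anew" behavior.
assignment = assign_items(lists, random.Random(0))
```

Shuffling within each list before slicing keeps the three-items-per-level constraint while letting the particular items at each level vary across participants.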

Procedure

All task instructions and items were presented via computer. All items first received an initial study trial, followed by blocks of test–restudy practice trials until items reached their assigned criterion level of performance. For initial study trials, the cue (Swahili word) and target (English translation) appeared on the computer screen for 10 s. For test trials, the cue appeared on the computer screen, and participants had 8 s to type the correct target answer. If an item was retrieved before 8 s had elapsed, participants could press a key to submit their response. Items that were not correctly retrieved received a 4-s restudy trial with the cue and target before participants moved on to the next to-be-learned item. Items that were correctly retrieved did not receive a restudy trial before participants moved on to the next item.

The computer tracked the number of times each item was correctly retrieved during practice. Items continued to receive test–restudy practice trials until they reached their assigned criterion level of performance (one, three, six, or nine correct retrievals). After items reached their criterion level of performance, they were dropped from further test–restudy practice. If an item had not reached its criterion level of performance on a given trial, it was placed at the end of the list of to-be-learned items. Participants were not aware of the specific criterion level for each item but were told that items would be practiced until they reached an “acceptable level of performance.”
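The dropout logic described above can be expressed as a simple queue. The sketch below is an assumption-laden illustration, not the experimental program: `practice_to_criterion` and `fails_first_test` are hypothetical names, and the scoring function is a stand-in for a participant's typed response.

```python
from collections import deque

def practice_to_criterion(criteria, is_correct):
    """Run test-restudy practice until every item reaches its criterion.

    criteria: dict mapping item -> required number of correct retrievals.
    is_correct: callable (item, trial_number) -> bool, standing in for the
        scoring of the participant's response on a test trial.
    Returns the trial log and the order in which items were dropped.
    """
    queue = deque(criteria)
    correct = {item: 0 for item in criteria}
    log, dropped = [], []
    trial = 0
    while queue:
        item = queue.popleft()
        trial += 1
        ok = is_correct(item, trial)
        log.append((trial, item, ok))
        if ok:
            correct[item] += 1
        # (when ok is False, the 4-s restudy trial would be shown here)
        if correct[item] >= criteria[item]:
            dropped.append(item)   # the item leaves practice at this point
        else:
            queue.append(item)     # back to the end of the to-be-learned list
    return log, dropped

# Hypothetical participant who fails the first test of each item, then always succeeds.
seen = set()
def fails_first_test(item, trial):
    if item in seen:
        return True
    seen.add(item)
    return False

log, dropped = practice_to_criterion({"a": 1, "b": 3}, fails_first_test)
```

Because the criterion counts correct retrievals rather than trials, the number of trials per item varies with performance. In this toy run, item "b" needs four test trials to accumulate three correct retrievals.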

For the short-lag group, the 12 items from one list were each presented for an initial study trial. After all items in the list had an initial study trial, items received test–restudy practice trials until they were correctly retrieved to their predetermined criterion level. When all items in one list had been practiced to criterion, items from a second list were presented for initial study and test–restudy practice trials, and so on until items from each of the four lists had been learned. Order of list presentation was counterbalanced across participants.

For the long-lag group, the four lists of 12 items were combined into one list. All items were presented for an initial study trial. After initial study, items received test–restudy practice trials until items were correctly retrieved to their predetermined criterion level.
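The difference between the two lag groups comes down to how many items are practiced together. The sketch below illustrates this with placeholder item labels and a hypothetical function name, `build_practice_blocks`. It assumes that items cycle through a block in order, so that on the first pass the number of other items between two trials with the same item is at most the block size minus one.

```python
def build_practice_blocks(lists, lag):
    """Return the blocks of items that are studied and practiced together."""
    if lag == "short":
        return [list(items) for items in lists]                # one block per 12-item list
    if lag == "long":
        return [[item for items in lists for item in items]]   # one combined 48-item block
    raise ValueError(f"unknown lag condition: {lag!r}")

lists = [[f"list{n}_item{i}" for i in range(12)] for n in range(4)]
short_blocks = build_practice_blocks(lists, "short")
long_blocks = build_practice_blocks(lists, "long")

# Upper bound on intervening items during the first pass (assumption: queue order).
max_first_pass_lag = {
    "short": len(short_blocks[0]) - 1,
    "long": len(long_blocks[0]) - 1,
}
```

As items reach criterion and drop out, the effective lag within a block shrinks, so these values are upper bounds rather than fixed lags.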

Immediately after a given item was correctly recalled to its criterion level of performance (i.e., one, three, six, or nine correct retrievals), participants made a JOL for that item. For the JOL trial, participants were asked the following: “For the item you just saw, how likely do you think it is that you will be able to correctly recall the ENGLISH translation when you are shown the SWAHILI word on the final test 7 days from now?” Participants were asked to type in a response, using any number from 0 to 100 (in which 0 = 0% likelihood of recalling in 7 days and 100 = 100% likelihood of correctly recalling in 7 days). Thus, participants made 48 JOLs, one for each item immediately after the item reached its criterion level of performance during practice.

During the second session 1 week later, participants completed a computer-administered, self-paced cued recall final test for all 48 word pairs.

Results and discussion

Final test performance

The mean percentage of items correctly recalled on the final test as a function of criterion level and lag is presented in Fig. 1. Results of a 2 (lag) × 4 (criterion level) mixed factor analysis of variance (ANOVA) showed a significant main effect of criterion level, with final test performance significantly increasing as the number of correct retrievals during practice increased, F(3, 117) = 41.67, MSE = .02, p < .001. The main effect of lag was also significant, with final test performance significantly higher in the long-lag group than in the short-lag group, F(1, 39) = 37.74, MSE = .07, p < .001. The interaction was also significant, indicating a greater difference in performance for the lag groups as criterion level increased, F(3, 117) = 6.47, MSE = .02, p < .001.
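As a consistency check on the reported degrees of freedom, the standard formulas for a mixed design with one between-participants and one within-participants factor can be computed directly. The function name `mixed_anova_df` is hypothetical, and the sketch assumes no sphericity correction was applied.

```python
def mixed_anova_df(n_participants, n_groups, n_levels):
    """Degrees of freedom for a groups (between) x levels (within) mixed ANOVA."""
    between_error = n_participants - n_groups
    within_effect = n_levels - 1
    within_error = within_effect * between_error
    return {
        "between": (n_groups - 1, between_error),
        "within": (within_effect, within_error),
        "interaction": ((n_groups - 1) * within_effect, within_error),
    }

# 22 short-lag + 19 long-lag participants, 2 lag groups, 4 criterion levels.
df = mixed_anova_df(n_participants=22 + 19, n_groups=2, n_levels=4)
```

With 41 participants this gives (1, 39) for lag and (3, 117) for both criterion level and the interaction, matching the F statistics reported above.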

Fig. 1. Mean percentage of items correctly recalled on the final test as a function of criterion level and lag, Experiment 1. Error bars represent standard errors.

Judgments of learning

As was expected on the basis of findings from prior research, higher criterion levels and longer lags between correct retrievals improved final test performance. More important for present purposes, to what extent were JOLs sensitive to the effects of criterion level and lag on final test performance? Mean JOL values at each criterion level for each lag group are presented in Fig. 2. Results of a 2 (lag) × 4 (criterion level) mixed factor ANOVA showed a significant main effect of criterion level, with mean JOL values increasing as the number of correct retrievals during practice increased, F(3, 117) = 44.16, MSE = 162.78, p < .001. Thus, JOLs show correct directional sensitivity to the effects of criterion level on final test performance.

Fig. 2. Mean JOL values on criterion trial during practice as a function of criterion level and lag, Experiment 1. Error bars represent standard errors.

In contrast, the main effect of lag was not significant, F(1, 39) = 2.26, MSE = 2,942.23, p = .141. JOLs did not accurately reflect the effects of lag on final test performance. In fact, the numerical trend was in the opposite direction (t-tests showed a significant difference between short-lag and long-lag JOLs for criterion level 1, t(39) = 2.61, p = .01, as well as a trend for criterion level 3, t(39) = 1.81, p = .08). The interaction term was not significant, F(3, 117) = 2.12, MSE = 162.78, p = .102. Thus, although performance differences between lag groups increased as criterion level increased, JOL differences did not show this same pattern.

In sum, JOLs showed correct directional sensitivity to the effects of criterion level but did not show correct directional sensitivity to the effects of lag between correct retrievals. To what extent did the mnemonic cue of retrieval fluency influence JOLs? To measure retrieval fluency, we examined first keypress latency for all correct retrieval trials in session 1. First keypress latency was defined as the amount of time between onset of the Swahili cue and a participant’s first keypress in the response box. For each participant, we calculated the mean first keypress latency for the nth correct retrieval during practice, with n = 1–9 correct retrievals across criterion level conditions. To provide the most stable estimates of first keypress latency, we collapsed across criterion level for this analysis (e.g., all 48 items were correctly recalled once and thus contributed to this mean, the 36 items assigned to criterion levels 3–9 were each correctly recalled a second and third time and thus contributed to these means, and so on; outcomes were highly similar when analyses were conducted only on the basis of items assigned to criterion 9). Figure 3 shows mean first keypress latency (in seconds) as a function of the nth correct retrieval during practice. Results of a 2 (lag) × 9 (nth correct retrieval) mixed factor ANOVA revealed a significant main effect of lag, with shorter latencies for the short-lag group than for the long-lag group, F(1, 39) = 8.56, MSE = .50, p = .006. The main effect of nth correct retrieval during practice was also significant, with latencies decreasing as the number of correct retrievals during practice increased, F(8, 312) = 213.57, MSE = .04, p < .001. The interaction was also significant, F(8, 312) = 7.83, MSE = .04, p < .001.
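The collapsing step described above can be sketched as follows. The record layout, the function names, and the latencies are hypothetical and serve only to show the computation. The helper `items_contributing` reproduces the item counts mentioned in the text (48 items for the first correct retrieval, 36 for the second and third, and so on).

```python
from collections import defaultdict

def mean_latency_by_nth(correct_trials):
    """Mean first keypress latency per nth correct retrieval for one participant.

    correct_trials: iterable of (item, nth_correct, latency_s) records.
    Items from every criterion level are pooled.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for _item, nth, latency in correct_trials:
        sums[nth] += latency
        counts[nth] += 1
    return {nth: sums[nth] / counts[nth] for nth in sorted(sums)}

def items_contributing(nth, items_per_level=12, levels=(1, 3, 6, 9)):
    """Number of items that can contribute a latency for the nth correct retrieval."""
    return sum(items_per_level for level in levels if level >= nth)

# Hypothetical latencies (seconds): item "x" at criterion 1, item "y" at criterion 3.
trials = [("x", 1, 2.0), ("y", 1, 3.0), ("y", 2, 1.5), ("y", 3, 1.0)]
means = mean_latency_by_nth(trials)
counts = [items_contributing(n) for n in range(1, 10)]
```

The shrinking item counts at higher values of n are why pooling across criterion levels gives more stable estimates for early retrievals than for late ones.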

Fig. 3. Mean first keypress latency (in seconds) as a function of the nth correct retrieval during practice for each lag group, Experiment 1. Error bars represent standard errors.

These results support the possibility that the mnemonic cue of retrieval fluency influenced JOLs during criterion learning. However, at least for criterion level, the extrinsic cue may also have influenced JOLs. Given that both mnemonic and extrinsic cues may influence JOLs, we examined the extent to which criterion level and retrieval fluency uniquely influence JOLs by conducting a series of hierarchical linear models (HLMs; see Footnote 2). We also examined the extent to which two other cues may have influenced JOLs. Specifically, we included the intrinsic cue of normative item difficulty (from Nelson & Dunlosky, 1994) and the mnemonic cue of number of trials involving retrieval failure prior to the first correct recall during practice for each item. The first model assessed the relationship between criterion level and JOLs. Results showed that JOLs significantly increased as criterion level increased, t(1926) = 7.20, p < .001. The second model assessed the relationship between retrieval fluency (first keypress latency) and JOLs. Results showed that JOLs significantly increased as first keypress latencies decreased, t(1926) = 7.03, p < .001. The third and fourth models assessed the relationship between normative item difficulty and JOLs and between number of retrieval failures and JOLs, respectively. Results showed no significant relationship between either of these variables and JOLs, ps > .05.

Given the significant relationships between criterion level and JOLs and retrieval fluency and JOLs, the fifth model examined the extent to which each of these variables influenced JOLs when the other variable was controlled for. Results showed that both criterion level and first keypress latency were significantly related to JOLs, t(1925) = 6.52, p < .001, and t(1925) = 2.16, p = .03, respectively. Taken together, these analyses suggest that both the factors of criterion level and retrieval fluency influenced JOLs during retrieval practice.
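For intuition about what an item-level analysis nested within participants estimates, the sketch below computes a pooled within-participant slope of JOLs on a single predictor. This is explicitly not the HLM procedure used here: it is a simpler stand-in (participant-mean centering with one predictor), the function name `within_participant_slope` is hypothetical, and the data are invented for illustration.

```python
from collections import defaultdict

def within_participant_slope(rows):
    """Pooled within-participant OLS slope of JOL on one predictor.

    rows: iterable of (participant_id, predictor, jol).
    Each participant's own means are removed first, so differences in overall
    JOL level between participants do not affect the slope.
    """
    by_participant = defaultdict(list)
    for pid, x, y in rows:
        by_participant[pid].append((x, y))
    sxy = sxx = 0.0
    for obs in by_participant.values():
        mean_x = sum(x for x, _ in obs) / len(obs)
        mean_y = sum(y for _, y in obs) / len(obs)
        for x, y in obs:
            sxy += (x - mean_x) * (y - mean_y)
            sxx += (x - mean_x) ** 2
    return sxy / sxx

# Invented data: two participants with different baselines but the same
# 5-point increase in JOL per additional unit of criterion level.
rows = [
    ("p1", 1, 10.0), ("p1", 3, 20.0), ("p1", 9, 50.0),
    ("p2", 1, 45.0), ("p2", 3, 55.0), ("p2", 9, 85.0),
]
slope = within_participant_slope(rows)
```

Centering within participant is what lets item-level JOLs from all participants be analyzed together without individual differences in overall confidence masquerading as an effect of the predictor.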

Experiment 2

The results of Experiment 1 demonstrated that JOLs show correct directional sensitivity to the effects of criterion level on final test performance: Both final test performance and JOLs increased as criterion level increased. Furthermore, HLM analyses indicated a relationship between the extrinsic cue of criterion level and JOLs above and beyond the influence of criterion level on the mnemonic cue of retrieval fluency. Presumably, the extrinsic cue reflects a metacognitive belief about the effects of criterion level on final test performance. However, Karpicke (2009) reported results suggesting that learners may not have appropriate metacognitive beliefs regarding criterion level. Of interest here, after items were learned to criterion during practice, participants were asked to make aggregate judgments, in which they judged the number of items they would remember on a final test 1 week later. Results showed that aggregate judgments did not differ for a group of participants who terminated practice after one correct recall versus participants who completed two additional practice trials, suggesting that participants may not understand the memorial benefits of increasing criterion levels. Thus, one goal of Experiment 2 was to provide further evidence that participants have correct metacognitive beliefs about the effects of criterion level on final test performance.

In contrast to the criterion level results, JOLs did not show correct directional sensitivity to the effects of lag between correct retrievals on final test performance. Final test performance was higher for the long-lag versus short-lag group, whereas JOLs did not statistically differ (and were even numerically lower) for the long-lag versus short-lag group. The design of Experiment 1 precluded us from examining the relationship between lag and JOLs using HLM analyses, as we did for criterion level, because lag was a between-participants manipulation. Therefore, in Experiment 2, lag was manipulated within participants. Additionally, to further diagnose why JOLs did not show correct directional sensitivity to the effects of lag on final test performance, Experiment 2 evaluated metacognitive beliefs about the effects of lag. One possibility is that participants have correct metacognitive beliefs about the effects of lag on final test performance, but the salient mnemonic cue of retrieval fluency overrides the extrinsic cue of lag. Another possibility is that participants do not have beliefs or have incorrect beliefs about the effects of lag. To measure participants’ metacognitive beliefs about the effects of criterion level and lag on final test performance, in addition to making item-specific JOLs, participants in Experiment 2 also made aggregate judgments. In contrast to item-specific JOLs, aggregate judgments are global predictions about performance, in which participants make overall judgments about the number of items within each level of lag and criterion that they believe they will later recall.

The results of Experiment 1 are consistent with the idea that participants have correct beliefs about the effects of criterion level on final test performance, and thus we predicted that aggregate judgments would be greater for higher versus lower criterion levels. In contrast, the pattern of results for lag will be more revealing because a number of outcomes are plausible. If participants have correct beliefs about the effects of lag on final test performance, aggregate judgments will be greater for longer versus shorter lags. If participants do not have beliefs about the effects of lag on final test performance, aggregate judgments will not differ for longer versus shorter lags. Finally, if participants have incorrect beliefs about the effects of lag on final test performance, aggregate judgments will be greater for shorter versus longer lags.

Method

Participants and design

Sixty-seven Kent State University undergraduates participated in return for course credit. Criterion level (one, three, or nine correct retrievals per item) and lag (short vs. long) were within-participants manipulations. Aggregate judgment (preacquisition versus no preacquisition judgment) was a between-participants manipulation, with 36 and 31 participants in each group, respectively.

Materials

Items included 72 Swahili–English translation word pairs. Thirty-six word pairs were assigned to each of two lists, with an equivalent range of item difficulty within each list. List assignment to lag was counterbalanced across participants. Twelve items within each list were randomly assigned to each criterion level (randomized anew for each participant).

Procedure

All participants were told that they would be learning foreign language word pairs and would receive test–restudy practice with items until they reached an acceptable level of performance. Participants in the preacquisition aggregate judgment group then received detailed instructions about lag and criterion level (see the Appendix for complete instructions). In brief, we described these variables in relation to studying with flashcards, a common study strategy reported by undergraduates (Kornell & Bjork, 2008). Participants then predicted how many short-lag and long-lag items they would be able to remember on the final test. Next, they predicted how many of the criterion 1, 3, and 9 items they would be able to remember. These preacquisition aggregate judgments were included as a measure of prior knowledge regarding lag and criterion level.

All participants then began the main experimental task. As in Experiment 1, items were each presented for a 10-s initial study trial. After initial study, items received test–restudy practice until they were correctly retrieved to their criterion level of performance. Once an item reached criterion, participants made a JOL for the item, and then the item was dropped from the list. For short-lag items, participants learned 12 items (4 items from each criterion level) in each of three separate blocks of practice. For long-lag items, participants learned all items in one block of practice. After all items in one lag condition had been learned to criterion, participants had initial study and test–restudy practice with items from the second lag condition. Order of presentation of short-lag and long-lag items was counterbalanced across participants.
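The block structure for the two lag conditions can be sketched as below. Item labels are placeholders, `build_blocks` and `placeholder_list` are hypothetical names, and the shuffling of presentation order within a block is an assumption, since the order within blocks is not specified here.

```python
import random

CRITERION_LEVELS = (1, 3, 9)

def build_blocks(items_by_level, n_blocks, rng):
    """Split one lag condition's items into blocks with equal items per criterion level."""
    blocks = [[] for _ in range(n_blocks)]
    for level in CRITERION_LEVELS:
        items = list(items_by_level[level])
        rng.shuffle(items)
        per_block = len(items) // n_blocks
        for b in range(n_blocks):
            chunk = items[b * per_block:(b + 1) * per_block]
            blocks[b].extend((item, level) for item in chunk)
    for block in blocks:
        rng.shuffle(block)  # assumed: presentation order randomized within a block
    return blocks

def placeholder_list(tag):
    """Twelve placeholder items per criterion level, i.e., one 36-item list."""
    return {level: [f"{tag}_c{level}_{i}" for i in range(12)] for level in CRITERION_LEVELS}

rng = random.Random(0)
short_blocks = build_blocks(placeholder_list("short"), n_blocks=3, rng=rng)
long_blocks = build_blocks(placeholder_list("long"), n_blocks=1, rng=rng)
```

Under this arrangement, the short-lag condition yields three 12-item blocks with four items per criterion level, and the long-lag condition yields a single 36-item block.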

After all items had been learned to criterion, all participants made postacquisition aggregate judgments for each level of lag and criterion level. Prior to making judgments, all participants read detailed instructions regarding lag and criterion level manipulations (see the Appendix). After making aggregate judgments, participants were dismissed and reminded to return 1 week later for the final test. The final test was a participant-paced cued recall test. After the final test, all participants made posttest aggregate judgments. As with previous aggregate judgments, participants received instructions regarding lag and criterion level prior to making judgments (see the Appendix).

Results and discussion

No significant differences emerged in any measure as a function of aggregate judgment group (preacquisition vs. no preacquisition), so we collapsed across this variable for all further analyses.

Final test performance

The mean percentage of items correctly recalled on the final test as a function of criterion level and lag is presented in Fig. 4. Results of a 2 (lag) × 3 (criterion level) repeated measures ANOVA showed significant main effects of criterion level and lag, as well as a significant interaction, F(2, 132) = 23.62, MSE = .01, p < .001, F(1, 66) = 90.98, MSE = .02, p < .001, and F(2, 132) = 6.56, MSE = .01, p = .002, respectively. Once again, final test performance significantly increased as the number of correct retrievals during practice increased and was significantly higher in the long-lag versus short-lag condition. The interaction again showed that the difference between lag groups was greater as criterion level increased.

Fig. 4

Mean percentage of items correctly recalled on the final test as a function of criterion level and lag condition, Experiment 2. Error bars represent standard errors

Judgments of learning

As in Experiment 1, we evaluated the extent to which JOLs are sensitive to the effects of lag and criterion level on final test performance (see Fig. 5). Results of a 2 (lag) × 3 (criterion level) repeated measures ANOVA showed significant main effects of criterion level and lag, as well as a significant interaction, F(2, 132) = 148.49, MSE = 257.16, p < .001, F(1, 66) = 10.81, MSE = 412.96, p = .002, and F(2, 132) = 3.78, MSE = 57.34, p = .025, respectively. Concerning criterion level, as the number of correct retrievals during practice increased, mean JOL values increased, replicating results from Experiment 1. Concerning lag, JOLs were significantly higher for short-lag versus long-lag items. Thus, JOLs showed incorrect directional sensitivity to the effects of lag on final test performance.

Fig. 5

Mean JOL values on criterion trial during practice as a function of criterion level and lag condition, Experiment 2. Error bars represent standard errors

Concerning the correct directional sensitivity of JOLs to the effects of criterion level on final test performance, to what extent might this relationship reflect metacognitive beliefs about the benefits of more versus fewer correct retrievals during practice? For preacquisition aggregate judgments (leftmost bars in Fig. 6), a repeated measures ANOVA revealed a significant main effect of criterion level, F(2, 70) = 37.64, MSE = 11.06, p < .001. Aggregate judgments increased as a function of criterion level, indicating that participants did have accurate prior metacognitive beliefs about the effects of criterion level on final test performance. The same pattern obtained for postacquisition and posttest aggregate judgments (middle and rightmost bars in Fig. 6), F(2, 130) = 62.33, MSE = 11.07, p < .001, and F(2, 106) = 45.16, MSE = 3.83, p < .001, respectively. Note that although these outcomes establish that participants have appropriate metacognitive beliefs about the effect of criterion level on performance, they do not establish that this knowledge about criterion level influenced JOLs. However, they do confirm an important precondition by establishing that this extrinsic cue is available for use, which provides additional evidence converging with our interpretation of the outcomes of the HLM analyses (reported below).

Fig. 6

Mean proportion of criterion level 1, 3, and 9 items that participants predicted they would recall as a function of timing of aggregate judgment, Experiment 2. Error bars represent standard errors

Concerning the incorrect directional sensitivity of JOLs to effects of lag on final test performance, participants may have had correct beliefs about lag effects that were overridden by the salient cue of retrieval fluency. Alternatively, participants may have had incorrect or no beliefs about lag effects. Examination of the pattern of aggregate judgments reported in Fig. 7 supports the latter possibility. For preacquisition aggregate judgments (leftmost bars in Fig. 7), no significant differences emerged for short-lag versus long-lag items, F < 1. In fact, judgments were almost identical, indicating that participants had no prior beliefs about the memorial benefits of longer versus shorter lags. Postacquisition aggregate judgments also did not differ for long versus short lags (middle bars in Fig. 7), F < 1. Of course, at this point, participants had not yet experienced the memorial benefits of using a longer lag to learn items. After the final test, however, aggregate judgments were still similar for the two lag conditions (rightmost bars in Fig. 7), F < 1. Note that although posttest judgments were similar for lag conditions, comparison of posttest judgments with earlier aggregate judgments indicates that participants were learning from experience. Posttest aggregate judgments were lower than preacquisition and postacquisition judgments (ts > 4.82, ps < .001). Thus, the absence of a lag effect was unlikely to have been due to participants’ perseverating on prior judgment magnitudes.

Fig. 7

Mean proportion of short-lag and long-lag items that participants predicted they would recall as a function of timing of aggregate judgment, Experiment 2. Error bars represent standard errors

JOLs showed correct directional sensitivity to criterion level but showed incorrect directional sensitivity to lag. To what extent were these patterns due to an influence of the mnemonic cue of retrieval fluency on JOLs? To measure retrieval fluency, we examined first keypress latency for all correct retrieval trials in session 1, as in Experiment 1. Figure 8 shows mean first keypress latency (in seconds) as a function of the nth correct retrieval during practice. Results of a 2 (lag) × 9 (nth correct retrieval) repeated measures ANOVA revealed a significant main effect of lag, F(1, 66) = 125.64, MSE = .17, p < .001. First keypress latencies were significantly shorter for the short-lag versus long-lag condition. Results also revealed a significant main effect of the nth correct retrieval, as well as a significant interaction, F(8, 528) = 377.71, MSE = .07, p < .001, and F(8, 528) = 30.58, MSE = .04, p < .001, respectively. As in Experiment 1, first keypress latencies significantly decreased as the number of correct retrievals during practice increased.

Fig. 8

Mean first keypress latency (in seconds) as a function of the nth correct retrieval during practice for each lag condition, Experiment 2. Error bars represent standard errors
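As a reading aid, the degrees of freedom reported for the repeated measures ANOVAs above can be checked with a short calculation. The sketch below assumes a fully within-participants design with no sphericity correction, in which an effect has df equal to the product of (levels − 1) across its factors and an error df equal to that value times (n − 1). The function names are hypothetical, and the participant counts are inferred from the reported df rather than taken from the Method section.

```python
from math import prod


def rm_anova_df(effect_levels, n_participants):
    """Return (df_effect, df_error) for a within-participants effect.

    effect_levels: number of levels of each factor involved in the effect,
    e.g. (2,) for lag, (3,) for criterion level, (2, 3) for the interaction.
    """
    df_effect = prod(k - 1 for k in effect_levels)
    df_error = df_effect * (n_participants - 1)
    return df_effect, df_error


def implied_n(df_effect, df_error):
    """Recover the number of participants implied by a reported F(df1, df2)."""
    n_minus_1, remainder = divmod(df_error, df_effect)
    if remainder:
        raise ValueError("df pair is not consistent with this design")
    return n_minus_1 + 1


# Reported df for the 2 x 3 and 2 x 9 analyses imply the same sample size.
n_main = implied_n(1, 66)
lag_df = rm_anova_df((2,), n_main)
criterion_df = rm_anova_df((3,), n_main)
interaction_df = rm_anova_df((2, 3), n_main)
nth_retrieval_df = rm_anova_df((9,), n_main)

# Aggregate judgments were analyzed with fewer participants.
n_pre, n_post, n_posttest = implied_n(2, 70), implied_n(2, 130), implied_n(2, 106)
```

Under these assumptions, the smaller error df for the preacquisition, postacquisition, and posttest aggregate judgments simply reflect fewer participants contributing to those analyses (e.g., only one group made preacquisition judgments).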

To examine the extent to which lag, criterion level, retrieval fluency, normative item difficulty, and number of retrieval failures before the first correct recall during practice influenced JOLs, we again conducted HLM analyses. The first model assessed the relationship between lag and JOLs. Consistent with results reported above, the relationship between lag and JOLs was not significant, t(4744) = 0.91, p = .363. The second model assessed the relationship between criterion level and JOLs. As in Experiment 1, the relationship between criterion level and JOLs was significant, with JOLs increasing as criterion level increased, t(4744) = 10.63, p < .001. The third model assessed the relationship between first keypress latency and JOLs. As in Experiment 1, the relationship between first keypress latency and JOLs was significant, with JOLs increasing as first keypress latency decreased, t(4744) = 9.38, p < .001. The fourth model assessed the relationship between normative item difficulty and JOLs and showed a significant relationship, with higher JOLs for normatively easier versus more difficult items, t(4744) = 2.67, p = .004. The fifth model assessed the relationship between number of incorrect retrievals during encoding and JOLs and showed a nonsignificant relationship, p > .05.

Of greatest interest, we ran a sixth model with all variables that were significantly related to JOLs to assess the extent to which each variable uniquely influenced JOLs. Results showed that criterion level, first keypress latency, and normative item difficulty were all significantly related to JOLs, t(4742) = 26.16, p < .001, t(4742) = 6.63, p < .001, and t(4742) = 2.43, p = .015, respectively. These results suggest that the extrinsic cue of criterion level, the intrinsic cue of normative item difficulty, and the mnemonic cue of retrieval fluency each uniquely influenced JOLs. Why did normative item difficulty influence JOLs in the present experiment, but not in Experiment 1? Because lag was manipulated within participants, Experiment 2 included more items than did Experiment 1, which led to the inclusion of more difficult items. Results from the HLM suggest that having a larger range of item difficulty may have provided participants with another cue for making JOLs. Most important, however, we replicated the results from Experiment 1, with both criterion level and first keypress latency influencing JOLs.
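For readers less familiar with HLM, the sixth model can be written out as a two-level model. The exact specification is not given in the text, so the form below is one plausible reading rather than the model as fitted: it assumes item-level JOLs (level 1, indexed by $i$) nested within participants (level 2, indexed by $j$), a random intercept, and fixed slopes. All symbol names are introduced here for illustration.

```latex
% Level 1: JOL for item i made by participant j
\mathrm{JOL}_{ij} = \beta_{0j}
  + \beta_{1j}\,\mathrm{Criterion}_{ij}
  + \beta_{2j}\,\mathrm{Latency}_{ij}
  + \beta_{3j}\,\mathrm{Difficulty}_{ij}
  + r_{ij}

% Level 2: participant-level equations (random intercept, fixed slopes assumed)
\beta_{0j} = \gamma_{00} + u_{0j}, \qquad
\beta_{kj} = \gamma_{k0} \quad (k = 1, 2, 3)
```

On this reading, each reported t statistic tests whether the corresponding fixed effect $\gamma_{k0}$ differs from zero while the other predictors are held constant, which is the sense in which each cue "uniquely" influenced JOLs.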

General discussion

The present experiments evaluated two questions. First, are JOLs made after correct retrievals during practice sensitive to the effects of quantity and timing of these correct retrievals on final test performance? Second, which cues are used to make JOLs during criterion learning? Concerning the quantity of correct retrievals, JOLs showed correct directional sensitivity to the effects of criterion level on final test performance: Both performance and JOLs increased as the number of correct retrievals during practice increased. In contrast, concerning the timing of correct retrievals, JOLs did not show correct directional sensitivity to the effects of lag on final test performance: Performance was greater for items correctly retrieved after longer versus shorter lags, but JOLs were not, with numerical trends (Experiment 1) or significant differences (Experiment 2) in the opposite direction. In relation to the second question, results from both Experiments 1 and 2 showed that the mnemonic cue of retrieval fluency and the extrinsic cue of criterion level influenced JOLs. Additionally, the intrinsic cue of item difficulty influenced JOLs in Experiment 2, when items had a wider range of difficulty. However, the extrinsic cue of lag did not influence JOLs, nor did the cue of number of failed retrieval attempts during practice.

Given that both JOLs and aggregate judgments showed correct directional sensitivity to the effects of criterion level on final test performance, additional research exploring the extent to which individuals use their metacognitive knowledge about the memorial benefits of increasing criterion levels to control self-regulated retrieval practice will be informative. For example, Kornell and Bjork (2008) had participants learn items for a later retention test and allowed some participants to drop items from further practice during learning. Results showed that a majority of items were dropped from practice after one correct recall. This result is somewhat troubling, given the substantial gains in final test performance after an item has been correctly recalled more than one time during practice.

Why did Kornell and Bjork’s (2008) participants drop items after only one correct recall during practice when participants in the present study demonstrated metacognitive knowledge about the memorial benefits of more versus fewer correct retrievals during learning? One possibility is that Kornell and Bjork’s participants were being strategic on the basis of the time constraints imposed in that study. Specifically, participants were given only 10 min to learn as many items as they could. Participants may have discontinued practice with items after they could correctly recall them one time so that they could focus the remainder of their limited study time on items that had not yet been correctly recalled. If participants were given the goal of learning items for a later retention test and also were given unlimited time to learn the items, it is possible that their self-regulated decisions would more closely resemble their judgments for criterion level in the present experiments, with participants deciding to practice items until they are correctly recalled multiple times before dropping them from practice. Nonetheless, these results leave open the possibility that participants may not effectively self-regulate practice, even though results from the present experiments demonstrate that participants have metacognitive knowledge regarding criterion level effects.

Although the results reported here consistently demonstrated that metacognitive judgments showed correct directional sensitivity to the effects of criterion level on later performance, the results from Karpicke (2009) showed a different pattern. To revisit, after items were learned to criterion during practice, participants were asked to make aggregate judgments. Results showed that judgments did not differ for individuals who terminated practice after one correct recall versus those who completed two additional practice trials. What might explain the inconsistency between the present findings and those of Karpicke? One possibility is that differences in experimental design influenced metacognitive judgments. Criterion level was a within-participants manipulation in the present experiments, whereas practice schedule was a between-participants manipulation in Karpicke’s study. Previous research has shown that the extent to which individuals incorporate metacognitive beliefs into their metacognitive judgments can depend on the extent to which the encoding conditions elicit attention to a given variable (e.g., Koriat et al., 2004). Because within-participants manipulations allow participants to experience different levels of a variable (e.g., criterion level), they may be more likely to consider their beliefs about the effects of that variable when making judgments than in a between-participants design. Although additional research will be needed to explore this possibility further, this account does provide a plausible reconciliation of the apparent inconsistency between the present outcomes and those of Karpicke.

Future research evaluating the sensitivity of JOLs to the effects of criterion level on final test performance would be beneficial not only for understanding the self-regulated decisions individuals make when they have unlimited time to learn items, but also for providing insight into the pattern of diminishing returns observed for criterion level. Previous research has shown that performance increases as criterion level increases, but the incremental benefit to final test performance decreases as criterion level increases (e.g., Figs. 1 and 4; see also Pyc & Rawson, 2009; Vaughn & Rawson, 2011). Although JOLs in the present experiments showed correct directional sensitivity to the effects of criterion level on final test performance in that they increased as criterion level increased, they did not appropriately reflect the diminishing returns of increasing criterion level on final test performance (e.g., the curvilinear pattern of final test performance in Fig. 4 vs. the linear pattern of JOLs in Fig. 5). What does this suggest about the potential basis for JOLs, given that they did not properly reflect the pattern of diminishing returns for final test performance? On one hand, one might think that the pattern reflects the influence of ease of processing during retrieval (given that latencies decrease as criterion level increases). However, the finding that criterion level still influenced JOLs even after controlling for retrieval latency in the HLM analyses weighs against this account. Another possibility suggested by the aggregate judgments is that students have incorrect beliefs about this particular feature of criterion level effects. Future research could evaluate why JOLs are not sensitive to the diminishing returns of increasing criterion level.

In the present experiments, JOLs did not show correct directional sensitivity to the effects of lag on final test performance. This lack of sensitivity to lag is unfortunate, given the substantial effects of lag on performance. Note that final test performance following just one correct recall at a long lag was as good as (Experiment 1) or even better than (Experiment 2) nine correct recalls at a short lag. Students are likely not fully capitalizing on the benefits of testing by not appreciating the influence of lag on retention. This possibility is further bolstered by complementary results reported by Kornell (2009). Across a series of experiments, Kornell evaluated the effectiveness of a fixed number of practice trials administered with either a short or a long lag. Of interest here, participants made aggregate judgments after studying items during practice. Results showed that although final test performance was greater for items practiced with longer versus shorter lags, aggregate judgments were higher for items practiced with shorter versus longer lags. Taken together, these results demonstrate that participants do not understand the memorial benefits of longer versus shorter lags during learning. These results are somewhat troubling, given that the lag effect is one of the most robust findings in the memory literature and has the potential to greatly impact student learning and scholarship.

Although JOLs were of primary interest in the present experiments, in Experiment 2 we also included aggregate judgments in order to more directly evaluate beliefs about criterion level and lag. These aggregate judgments may be useful for further investigating beliefs about other factors in future research. For example, aggregate judgments may shed light on the extent to which individuals are aware of the memorial benefits of difficult retrievals during encoding. In the present experiments, retrieval fluency consistently had an influence on judgments (i.e., judgments increased as fluency increased, as indexed by shorter first keypress latencies). In keeping with Koriat’s (1997) cue-utilization framework, we classified retrieval fluency as a mnemonic cue. However, some research has shown that the relationship between fluency and JOLs may be theory driven (i.e., based on beliefs; e.g., Matvey, Dunlosky, & Guttentag, 2001). Future research could evaluate what people believe about the effect of fluency on performance by including aggregate judgments about retrieval fluency.

The present experiments extend beyond prior metacognitive research by evaluating the sensitivity of JOLs to the effects of the quantity and timing of successful retrievals during criterion learning (as opposed to prior research involving study only and/or fixed amounts of practice trials) on final test performance. The present results indicate that JOLs made after correct retrievals during practice show correct directional sensitivity to the effects of criterion level on final test performance but do not show correct directional sensitivity to the effects of lag on final test performance. Given the important implications for student learning and scholarship, one goal of future research should be to evaluate ways to improve the sensitivity of JOLs to the effects of lag on final test performance. Additionally, future retrieval practice research should evaluate the extent to which individuals’ self-regulated decisions are related to their JOLs and beliefs.