The underconfidence-with-practice effect in action memory: The contribution of retrieval practice to metacognitive monitoring

When making memory predictions (judgments of learning; JOLs), people typically underestimate the recall gain across multiple study–test cycles, a bias termed the underconfidence-with-practice (UWP) effect. This effect is usually studied with verbal materials, but little is known about how people repeatedly learn and monitor their own actions and to what extent retrieval practice via interim tests influences the progression of JOLs across cycles. Using action phrases (e.g., squeeze the lemon) as learning material, we demonstrated the UWP effect after both verbal and enactive encoding, although we did not observe first-cycle overconfidence. As predicted, participants exhibited underconfidence in Cycles 2 and 3, that is, an error of calibration. However, people’s resolution of JOLs (i.e., ability to discriminate recalled from unrecalled items) increased across study–test cycles. Importantly, JOLs for study–test (relative to study–study) items increased faster across cycles, suggesting that repeated study–test practice not only produces underconfidence across cycles but also reduces underconfidence relative to study–study practice. We discuss these findings in terms of current explanations of the underconfidence-with-practice effect.

Metacognitive monitoring is central to human learning because it guides people's control of their study behavior (for reviews, see Bjork et al., 2013; Rhodes, 2016; Rhodes & Castel, 2009; Schwartz & Jemstedt, 2021; Soderstrom & Bjork, 2014). However, miscalibrated metacognitive monitoring results in ineffective control (Kornell & Bjork, 2008; Soderstrom & Bjork, 2014). Thus, advancing our understanding of metacognitive monitoring and its role in control is important for both theory and application. In this paper, we direct our attention to a well-studied metacognitive judgment, judgments of learning (JOLs). Although people are often accurate in predicting subsequent memory performance (Dunlosky & Nelson, 1992; Koriat, 1997), they are prone to systematic biases such as the underconfidence-with-practice effect (UWP; Koriat et al., 2002; see also Ariel & Dunlosky, 2011; Finn & Metcalfe, 2007, 2008; Hanczakowski et al., 2013; Koriat & Bjork, 2006; Rast & Zimprich, 2009; Scheck & Nelson, 2005; Tauber & Rhodes, 2012). The UWP effect refers to the observation that people tend to shift from overconfidence (mean JOL magnitude is higher than mean recall performance) to underconfidence (mean JOL magnitude is lower than mean recall performance) in subsequent learning cycles (e.g., Koriat et al., 2002). In addition to this shift in calibration (also called absolute accuracy; i.e., the correspondence between JOL magnitude and the overall recall level), the UWP effect is associated with an increase in resolution of JOLs with repeated study-test practice (Ariel & Dunlosky, 2011; Koriat, 1997; Koriat et al., 2002; Koriat & Bjork, 2006). Resolution (also known as relative accuracy) refers to people's ability to discriminate between the items that will and will not be recalled at a later occasion. Resolution is typically measured by the within-person JOL-recall gamma correlation (Nelson, 1984; but see Benjamin & Diaz, 2008).
In the present study, we pursued three main research aims. First, we examined whether the UWP effect generalizes to action phrases (i.e., phrases including one verb and one noun, such as squeeze the lemon or break the pencil) that are either verbally encoded or enactively encoded (i.e., by acting out the actions described in the to-be-remembered verb-noun phrases). Such knowledge is important because it contributes to our understanding of how learning can be effectively managed, especially because people typically monitor the learning of their own actions. Although the UWP effect is robust across experimental manipulations (for more details, see Finn & Metcalfe, 2007; Koriat et al., 2002), there is little research on how well we monitor learning of actions, in particular across multiple study and test phases. Our second and more important aim was to elucidate the role of retrieval versus study practice in the UWP effect, which also allowed us to evaluate predictions of the memory-for-past-test (MPT) account, the anchoring-and-adjustment account, and the mnemonic-debiasing account. With our third aim, we examined the mnemonic benefits of and metacognitive sensitivity to enactive versus verbal encoding across three learning cycles.

Theoretical Accounts of the UWP Effect
Several accounts have been proposed to explain the basis of the UWP effect. According to the mnemonic-debiasing account (Koriat & Bjork, 2006), the UWP effect occurs as a function of a foresight bias, which is alleviated by self-testing during practice. The foresight bias refers to people's tendency to make inflated memory predictions regarding the recallability of information that is present during learning (e.g., both the cue and the target) but absent during a subsequent test (Koriat & Bjork, 2005). Pertinent to this study,

The Relative Contribution of Retrieval versus Restudy Practice to the Underconfidence-With-Practice Effect
Retrieval practice has various beneficial effects on memory, specifically on long-term retention (Kubik et al., 2018, 2020; Roediger & Karpicke, 2006a; for comprehensive overviews, see McDermott, 2021; Roediger & Karpicke, 2006a; see also Kubik, Gaschler, et al., 2021). For current purposes, we focus on the indirect and metacognitive benefits of retrieval practice. The finding that taking an interim test enhances the efficiency of subsequent encoding of previously learned information is termed the indirect testing effect (i.e., test-potentiated learning; Izawa, 1966; see also Soderstrom & Bjork, 2014). For example, more items are newly retrieved from pre- to post-test when the number of tests prior to restudy is increased (Arnold & McDermott, 2013; see also Kubik et al., 2015; Tempel & Kubik, 2017). Beyond enhancing subsequent restudy of information, interim tests improve people's ability to accurately predict their own future learning (see Koriat et al., 2002). This may occur because retrieval practice exposes gaps in one's own knowledge and sensitizes people to mnemonic cues (such as ease of learning and retrieval fluency) that are diagnostic for metacognitive monitoring and control of learning (Karpicke, 2009; Mitchum et al., 2016; Roediger & Karpicke, 2006b; Soderstrom & Bjork, 2014; Yang, Potts, et al., 2017; Yang, Sun, et al., 2017).
Given the current widespread interest in the benefits of interim tests, it is surprising that the relative contribution of study versus test experience to the UWP effect has been investigated so rarely. Repeated study-test practice usually leads to a shift from over- to underconfidence with succeeding cycles, but it is less clear whether this bias derives from the study or the test phases. Typically, repeated practice is instantiated as a repeated study-test condition (e.g., STSTST), but it is rarely compared to repeated restudy practice (e.g., SSSSSS) with regard to metacognitive judgments (but see Karpicke, 2009; Koriat & Bjork, 2006). In accordance with the mnemonic-debiasing account, interim tests, but not studying, provide information regarding the retrieval success and retrieval fluency of items. The test-related provision of and sensitization to these diagnostic mnemonic cues provide a basis for learners to update their memory predictions (Koriat & Bjork, 2006). Thus, rather than fostering underconfidence, test experience should reduce this metacognitive bias relative to repeated restudy; that is, test experience should lead to less underconfidence and increased resolution. In contrast, following from the MPT account (Finn & Metcalfe, 2007), one may predict underconfidence based on people's tendency to rely on MPT information across two cycles: that is, they continue to base their JOLs on the past test, when available, and not on new learning in the next study phase of the subsequent study-test cycle. However, JOL magnitude should increase to the extent that past test performance also increased relative to an earlier study-test cycle. In this work, we aim to address the relative contribution of study versus test experience to the progression of JOL magnitude across multiple learning cycles and to test the predictions of the above-mentioned accounts of the UWP effect.

Monitoring the Progression of the Enactment Effect across Multiple Learning Cycles
Memory is presumably biased largely toward remembering and monitoring action-relevant information (Heuer et al., 2020). As such, a great deal of research has been devoted to memory for actions (Roediger & Zaromb, 2010). However, little is known about metacognition for the learning of action-related information as it progresses across several study-test cycles (but see Koriat et al., 2002). This knowledge is important for an understanding of effective management of learning, especially because of the claim that people sometimes have difficulties in monitoring their own actions.
Action memory research has largely focused on the enactment effect, the finding that motorically performed action phrases (i.e., enactive encoding) are later remembered better than phrases that are merely read (i.e., verbal encoding; for comprehensive reviews, see Engelkamp, 2001; Roediger & Zaromb, 2010; Steffens et al., 2015; for seminal studies, see Cohen, 1981; Engelkamp & Krumnacker, 1980; Saltz & Donnenwerth-Nolan, 1981). Researchers in this area widely agree that enactment promotes item-specific processing, that is, incidentally focusing attention on the individual features of the action phrase (Kubik, Obermeyer, et al., 2014; Li & Wang, 2018; Seiler & Engelkamp, 2003; Steffens et al., 2015), including the verb, the noun, and their association (Koriat, 1995; Steffens et al., 2006, 2009). This contrasts with relational processing, which creates associations across items (e.g., linking squeeze the lemon with pick up the fork). This enhanced item-specific processing increases memory performance, leading to the enactment effect. However, limited research has focused on the learning of action phrases across multiple cycles of learning (Koriat & Pearlman-Avnion, 2003; Koriat et al., 1998; Kubik, Söderlund, et al., 2014).
More importantly for the present study, the degree to which people accurately monitor their own actions has been examined in only a few studies, most of which employed a single study phase. According to this research, resolution (i.e., relative accuracy) of people's memory predictions is impaired by enactment (Cohen, 1983, 1988; Cohen et al., 1991; Koriat et al., 1991). These results, however, should be interpreted with caution because either no control conditions were provided (Cohen et al., 1991), or single words were used as control items that were presented for shorter durations than the enacted phrases.
With this background, it is of theoretical and practical importance to understand how people monitor the degree and progression of their own actions over several learning cycles. To our knowledge, only Koriat et al. (2002) have investigated the degree and progression of predicted and actual learning across multiple study-test cycles using learning material that included action phrases in an enactment condition. In the initial study phase of each cycle, participants had to learn 30 paired associates, each consisting of a Tumai verb (from an imaginary language; in essence, a nonsense phrase) and a randomly paired Hebrew action phrase denoting its meaning. They were instructed to act out the target action phrase (e.g., smell the flower) and then say it aloud. Similarly, during the tests, participants were presented with Tumai verbs and were asked to recall the corresponding action phrases by both performing them and saying them aloud. A clear UWP effect was demonstrated; that is, participants showed initial overconfidence in their ability to remember the action phrases but became underconfident across subsequent study-test cycles. This research is an important starting point for more systematic investigations of how people monitor the degree and progression of their own actions over several learning occasions, using typical action phrases and comparing verbal and enactive encoding conditions.

The Aims of the Study
First, our aim was to demonstrate the UWP effect with action phrases and to generalize it to performed actions. Based on all three theoretical accounts of the UWP effect, we expected to demonstrate a metacognitive bias in terms of the shift toward underconfidence as learning progresses for both verbal and enactive encoding. The MPT account specifically predicts that "forgotten but then recalled" items exhibit most prominently the underconfidence as these items are recalled but have low JOLs; recalled items on Cycle 1 should show no or less underconfidence; however, some underconfidence may occur because of JOLs' "downward variance from 100%" for recalled items (Finn & Metcalfe, 2007). In contrast, the anchoring-and-adjustment account explains underconfidence for both previously recalled and unrecalled items as JOLs are insufficiently adjusted from a low anchor to recall performance.
Furthermore, as a second characteristic of the UWP effect, we tested the prediction that resolution increases across cycles for verbal and enactive encoding. Based on the MPT and mnemonic-debiasing accounts, we made this prediction on the assumption that past-test information (MPT account) or diagnostic mnemonic cues in general (mnemonic-debiasing account) accumulate with repeated study-test practice, with both types of cue being predictive of later recall performance. Furthermore, based on the MPT account, past-test information drives JOLs' magnitude. Consequently, past-test correlations (i.e., the correlation of JOLs and recall performance in the previous learning cycle) on Learning Cycles 2 and 3 should be higher than the corresponding resolution scores (i.e., the correlation of JOLs and recall performance in the current learning cycle). The anchoring-and-adjustment account does not make any predictions pertaining to resolution and past-test correlations.
Second, we sought to elucidate the role of tests in the UWP effect by comparing the effect of retrieval versus restudy experience on JOL magnitude. We tested the prediction that retrieval experience enhances JOLs' magnitude relative to restudy experience across learning cycles. This prediction is consistent with the mnemonic-debiasing account, which assumes that retrieval experience, relative to restudy experience, provides more diagnostic mnemonic cues that more closely predict the increasing recall performance across cycles. However, the MPT account cannot explain the progress of JOL magnitude in study-study practice, as no past-test information is available. Furthermore, it predicts that the underconfidence largely stems from the "forgotten and then recalled" items, and not from previously recalled items. Based on the anchoring-and-adjustment account, we predict that the adjustment up from the initial anchor is larger for study-test than for study-study practice: with retrieval experience as a salient cue, learners may be more likely to overcome the stability bias in terms of discounting new learning. Consequently, retrieval experience should reduce rather than produce the underconfidence-with-practice effect relative to restudy experience, provided that study-study and study-test practice lead to similar levels of recall performance.
Third, we investigated actual cued-recall performance and memory predictions (JOLs) for enactive and verbal encoding across multiple learning cycles. Based on the notion that enactment elicits item-specific information, we predicted the enactment effect to remain stable across the entire learning phase. If learners use this item-specific information as a mnemonic cue for making their JOLs, they should also be sensitive to the mnemonic benefits of enactment across the learning session (see also Castel et al., 2013). Similar to prior research, we expected learners to have a generally impaired ability to monitor self-performed actions in terms of resolution (e.g., Cohen et al., 1991; Koriat et al., 1991).

Participants
A sample size of 60 participants was pre-determined for this study, with n = 30 participants to be randomly assigned to each of the two encoding groups. This sample-size estimation was based on prior research (Kubik, Söderlund, et al., 2014; Kubik et al., 2018, Exp. 3) without any a priori power calculation.
In actuality, we individually tested 61 Stockholm undergraduate students (M [SD] age, 24.34 [5.56]; 43 females), and data from 59 participants were included in the final analysis. Two participants were excluded for various reasons. One participant did not follow the instructions, and the data were not recorded for one participant due to a technical error. Participants from this convenience sample were all native Swedish speakers and participated voluntarily without compensation or in return for course credits or movie vouchers.

Materials
Thirty-six Swedish action phrases (e.g., squeeze the lemon) were selected from the normative study of Molander and Arar (1998). The action phrases were two-to-four words long, were composed of one verb and one noun, and did not specify body parts as objects (e.g., scratch the ear).

Design
A mixed-factorial 2 × 2 × 3 design was used, with encoding type (enactive vs. verbal) being a between-subjects variable and study type (study-study vs. study-test) and cycle (1, 2, vs. 3) being within-subject variables. The primary dependent measures were recall performance in the interim tests (measured as proportion correctly recalled targets) and JOL magnitude, collected in Cycles 1-3 of the learning session. JOL accuracy was assessed using both calibration and resolution measures.

Procedure
The experimental procedure consisted of a learning session that included three study-study or study-test cycles of learning action phrases (see Fig. 1). A final test session of verb-cued recall followed after 5 min and after 1 week. The results of the final test session are not reported here as they are not relevant to the present article on the UWP effect. This experimental procedure was run with E-Prime 2.0 professional software (Psychological Software Tools, Pittsburgh, PA; Schneider et al., 2002).
At the beginning of the learning session, we told participants that they would learn 36 action phrases in three cycles, each including two phases. Half of the action phrases were studied in both phases in each of the three cycles (i.e., SS SS SS) and are called study-study items; the other half were studied in the first phase and tested in the second phase in each of the three cycles (i.e., ST ST ST) and are called study-test items. Action phrases were presented for 7 s on the computer screen, during which time they were to be read or acted upon, and were followed by a 1-s interstimulus interval. Depending on the encoding group, participants either read the action phrases in the first phase of each cycle (i.e., verbal encoding group) or enacted the action phrases (i.e., enactive encoding group). More specifically, participants in the enactive encoding group pantomimed the actions described in the to-be-remembered verb-noun phrases (i.e., motorically performed them) without any action-related object at hand. Experimenters monitored the learning to ensure that participants were in fact pantomiming the actions.
Importantly, during the initial study phase of each of the three cycles, item-based JOLs were collected: participants were asked how confident (0-100%) they were that they would remember the noun after several minutes if cued with the verb. During testing, the verb of a previously studied action phrase (e.g., squeeze) was displayed as a retrieval cue, one at a time, for 7 s, or until learners pressed the ENTER key to indicate that they remembered the respective target noun (e.g., the lemon). Learners were permitted a maximum of 10 s to type their responses on a computer keyboard. The items were randomly presented for each participant and for each study/test phase in each learning cycle. The study phases were separated from each other by a 30-s-long arithmetic filler task (i.e., evaluating the correctness of mathematical equations with varying difficulty; e.g., 16 × 16 = 254) to eliminate primary memory effects (Glanzer & Cunitz, 1966).

Scoring and Data Analyses
Participants' responses were scored as correct if the original noun was entered on the keyboard. Two measures of metacognitive accuracy were used. Calibration (i.e., absolute accuracy) of JOLs assesses how over- or underconfident learners are when predicting their own memory. It was calculated by subtracting recall performance (proportion correct) from the mean item-based JOL for each participant in the study-test conditions. Resolution (i.e., relative accuracy) assesses the extent to which learners can discriminate between items that will or will not be recalled on a later retrieval occasion. It was calculated with the nonparametric Goodman-Kruskal within-participants gamma correlation between actual and predicted recall performance in the study-test conditions (Nelson, 1984). Past-test correlations were calculated as the nonparametric Goodman-Kruskal within-participants gamma correlations between actual recall performance in the prior learning cycle and the predicted recall performance in the current learning cycle in the study-test condition.
Fig. 1 The experimental procedure of this study. Participants passed through a learning session and received two additional final cued-recall tests, which were not considered in the present article on the underconfidence-with-practice effect. The critical learning session consisted of three learning cycles, each including an initial study phase with subsequent item-based, cue-target judgments of learning (i.e., JOLs) and an ensuing study versus interim test phase. Critically, half of the action phrases were studied in both phases in each of the three cycles (i.e., study-study items), and half of them were studied in the first phase and tested in the second phase in each of the three cycles (i.e., study-test items). Depending on the encoding type, participants either read the action phrases in the first phase of each cycle (i.e., verbal encoding group) or enacted the action phrases (i.e., enactive encoding group)
An alpha level of 0.05 was used. Adjusted, less-biased estimates for the population effect sizes were reported for analyses of variance (ANOVA; generalized omega squared [ω²], Olejnik & Algina, 2003). In cases when the assumption of sphericity was violated, the reported values were calculated using the Huynh-Feldt correction. For planned comparisons between specific conditions or experimental groups, contrast analyses or Student t-tests were reported, using Cohen's d. If the assumptions of normality and/or homoscedasticity were violated, we reported equivalent nonparametric statistics, such as the Wilcoxon signed-ranks test with the rank-biserial correlation (r_rb). The dataset analyzed in the current study is publicly available at: https://osf.io/b4w26/.
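The calibration score defined above (mean item-based JOL minus proportion recalled) can be illustrated with a minimal Python sketch. The function name, the rescaling of 0-100% JOLs to proportions, and the example values are assumptions made for illustration only; they are not taken from the authors' analysis scripts or data.

```python
from statistics import mean


def calibration(jols_percent, recalled):
    """Calibration (absolute accuracy) for one participant and one cycle.

    jols_percent: item-based JOLs on the 0-100% scale.
    recalled: 1 if the item was recalled on the ensuing test, else 0.
    Returns mean JOL (rescaled to 0-1) minus proportion recalled, so
    positive values indicate overconfidence and negative values
    indicate underconfidence.
    """
    if not jols_percent or len(jols_percent) != len(recalled):
        raise ValueError("need one JOL and one recall score per item")
    mean_jol = mean(jols_percent) / 100
    proportion_recalled = mean([int(r) for r in recalled])
    return mean_jol - proportion_recalled


# Hypothetical study-test items for a single participant.
cycle1 = calibration([60, 40, 80, 20], [1, 0, 1, 0])
cycle2 = calibration([60, 50, 80, 30], [1, 1, 1, 0])
print(cycle1, cycle2)
```

In this made-up example, JOLs rise less than recall from the first to the second cycle, so the score moves from zero to a negative value, which is the direction of change that the UWP effect describes.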
Anchoring
To investigate the potential effects of anchoring and adjustment, we analyzed mean JOLs for Cycles 1 and 2 as a function of whether the items were correctly recalled or not during Cycle 1. We used JOLs for nonrecalled items in Cycle 1 as an estimate for the anchor that learners set prior to the experiment, likely indicating theory-based beliefs about the learning tasks. Table 1 illustrates mean JOLs, recall performance, and calibration as a function of study type and recall status (recalled vs. nonrecalled) on the preceding Cycle (n-1).
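The grouping step behind this analysis can be sketched as follows: mean JOLs in Cycles 1 and 2 are split by whether each item was recalled in Cycle 1, and the Cycle 1 mean for nonrecalled items serves as the anchor estimate. This is only an illustrative sketch; the function name, the dictionary layout, and the example values are hypothetical and do not reproduce the reported data.

```python
from statistics import mean


def mean_jols_by_cycle1_recall(jols_c1, jols_c2, recalled_c1):
    """Mean JOLs in Cycles 1 and 2, split by Cycle 1 recall status.

    All three lists are item-aligned for one participant.
    Returns None for a cell that contains no items.
    """
    result = {}
    for status, label in ((1, "recalled"), (0, "nonrecalled")):
        idx = [i for i, r in enumerate(recalled_c1) if int(r) == status]
        result[label] = {
            "cycle1": mean([jols_c1[i] for i in idx]) if idx else None,
            "cycle2": mean([jols_c2[i] for i in idx]) if idx else None,
        }
    return result


# Hypothetical JOLs (0-100%) for four study-test items.
table = mean_jols_by_cycle1_recall(
    jols_c1=[50, 40, 60, 30],
    jols_c2=[80, 20, 90, 25],
    recalled_c1=[1, 0, 1, 0],
)
anchor_estimate = table["nonrecalled"]["cycle1"]  # Cycle 1 JOLs, nonrecalled items
print(table, anchor_estimate)
```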

Fig. 3 Calibration (Panel A), resolution (Panel B), and past-test correlations (Panel C) for study-test items, as a function of cycle (1/2/3) and encoding type (verbal/enactive). Error bars represent standard errors (SEs) of the mean.
phrases), cycle (JOLs increased across cycles), and encoding type (relatively higher JOLs for enacted phrases), as well as several significant two-way interactions between the three factors (ps < 0.001). Specifically, JOLs were higher for items that were recalled on Cycle 1 compared to those which were not recalled, as indicated by a main effect of recall status, F(1, 57) = 154.56, p < 0.001, ω² = 0.300, and JOLs increased from Cycle 1 to Cycle 2, as indicated by a main effect of cycle, F(1, 57) = 28.07, p < 0.001, ω² = 0.074. Importantly, JOL magnitude increased from Cycle 1 to Cycle 2 for previously recalled items, whereas it decreased for nonrecalled items, as indicated by the Recall Status × Cycle interaction, F(1, 57) = 90.67, p < 0.001, ω² = 0.107. In addition, learners gave higher JOLs to enacted action phrases than to read action phrases, as indicated by a main effect of encoding type, F(1, 57) = 7.65, p = 0.008, ω² = 0.054. However, JOL magnitude increased from Cycle 1 to Cycle 2 to the same degree for enactive and verbal encoding, as indicated by a nonsignificant Encoding Type × Cycle interaction, F(1, 57) = 1.32, p = 0.256, ω² < 0.001. Critically, a significant Encoding Type × Recall Status interaction was revealed, F(1, 57) = 6.82, p = 0.012, ω² = 0.016, indicating that the predicted enactment effect was larger for nonrecalled items than recalled items in Cycle 1 (see Table 1). Thus, the encoding-related anchor difference was even larger than the JOL difference for recalled items on Cycle 1. No significant Encoding Type × Recall Status × Cycle interaction was observed, F(1, 57) = 2.55, p = 0.116, ω² = 0.002.

Resolution
We calculated resolution for the study-test items as the mean within-participant gamma correlation between JOL magnitude of a given study phase and recall performance of the ensuing test phase within the same learning cycle.
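For readers unfamiliar with the measure, the Goodman-Kruskal gamma for one participant can be sketched in a few lines of Python. The sketch uses the standard definition, (C − D) / (C + D), over concordant and discordant item pairs with ties ignored; the function name and example values are illustrative assumptions, not the authors' code or data.

```python
from itertools import combinations


def goodman_kruskal_gamma(jols, recalled):
    """Within-participant gamma between JOLs and recall (0/1).

    A pair of items is concordant when the item with the higher JOL is
    also the recalled one, and discordant when it is the nonrecalled
    one. Pairs tied on JOL or on recall are ignored. Returns None when
    gamma is undefined (e.g., all items recalled or none recalled).
    """
    concordant = discordant = 0
    for (j1, r1), (j2, r2) in combinations(list(zip(jols, recalled)), 2):
        product = (j1 - j2) * (int(r1) - int(r2))
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        return None
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical JOLs (0-100%) and recall outcomes for four items.
print(goodman_kruskal_gamma([80, 20, 60, 40], [1, 0, 1, 0]))
print(goodman_kruskal_gamma([80, 20, 60, 40], [0, 1, 1, 0]))
```

Because gamma is undefined for participants with no variability in recall or JOLs, such cases have to be handled explicitly (here by returning None) before averaging across participants.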

Past-test correlations
We calculated past-test correlations as the mean within-participant gamma correlation between JOL magnitude of a given study phase and recall performance of the preceding test phase. Figure 3
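The difference between a past-test correlation and a resolution score lies only in which test the JOLs are paired with, as the following sketch illustrates. It assumes item-aligned lists for one participant; the helper name and all values are hypothetical and chosen purely to show the computation.

```python
from itertools import combinations


def gamma(x, y):
    """Goodman-Kruskal gamma; returns None if no untied pairs exist."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(x, y)), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total = concordant + discordant
    return None if total == 0 else (concordant - discordant) / total


# Hypothetical data for one participant (four study-test items).
recall_cycle1 = [1, 0, 1, 0]    # test phase of Cycle 1
jols_cycle2 = [90, 30, 70, 20]  # JOLs from the study phase of Cycle 2
recall_cycle2 = [1, 1, 0, 0]    # test phase of Cycle 2

# Past-test correlation: Cycle 2 JOLs vs. recall on the preceding test.
past_test = gamma(jols_cycle2, recall_cycle1)
# Resolution: Cycle 2 JOLs vs. recall on the ensuing test of the same cycle.
resolution = gamma(jols_cycle2, recall_cycle2)
print(past_test, resolution)
```

In this made-up case the JOLs track the previous test more closely than the upcoming one, which is the pattern the MPT account predicts.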

General Discussion
The present study had three major aims. First, we made predictions from competing theories and then tested the generality of the UWP effect by using verbally and enactively encoded action phrases. Second, we explored the specific role that retrieval practice versus restudy practice plays in the progression of JOLs' magnitude, with the results suggesting that repeated study-test practice not only produces underconfidence across cycles but also reduces underconfidence relative to study-study practice. Third, we examined the mnemonic benefits of and metacognitive sensitivity to enactive versus verbal encoding across three learning cycles. We predicted that the enactment effect would occur across learning cycles and that people would be sensitive to this recall benefit in terms of JOL magnitude, but we also predicted that enactment would decrease JOLs' resolution.

The Underconfidence-With-Practice-Effect in Memory for Actions
The UWP effect refers to the shift from over-to underconfidence across learning cycles. This pattern was only partially shown using simple action phrases, that is, we showed underconfidence in Cycles 2 and 3. More specifically, participants shifted their memory predictions toward underconfidence, from both Cycle 1 to Cycle 2, and from Cycle 1 to Cycle 3 when action phrases were verbally and enactively encoded during the study phases. Thus, the tendency to underestimate memory performance with practice generalizes to action events, replicating prior research using paired associates and action phrases (Koriat et al., 2002), single words (e.g., Koriat et al., 2002) as well as word pairs (e.g., Finn & Metcalfe, 2007, 2008Hanczakowski et al., 2013, Exp. 1;Serra & Dunlosky, 2005;Tauber & Rhodes, 2012). Thus, this generalizability suggests that the UWP is a general feature of learning. The shift toward underconfidence is consistent with both the MPT account, embarking on learners' tendency to underestimate the amount of new learning (Finn & Metcalfe, 2007, 2008, and the anchoring-and-adjustment account, referring to the incomplete adjustment up from a prior psychological anchor as learning progresses Scheck & Nelson, 2005). The mnemonic debiasing account (Koriat & Bjork, 2005 cannot account for the current data because it cannot explain why underconfidence on the second cycle occurs. However, the mnemonic debiasing account can explain how the provision of mnemonic cues in terms of recall success or ease can enhance JOL calibration toward underconfidence and thereby decrease the overconfidence in Cycle 1. Additional analyses revealed that JOLs shifted toward more accurate calibration from Cycle 2 to Cycle 3. This result pattern was equivalent for verbal and enactive encoding of action phrases. This result is consistent with other findings of the UWP literature, which also show little or no underconfidence in later cycles (e.g. Hanczakowski et al., 2013, Exp. 4). 
However, such a finding is rather inconsistent with the majority of research reporting that the magnitude of underconfidence stays the same from Cycle 2 onwards (see also Hanczakowski et al., 2013;Exp. 1;Koriat et al., 2002). This shift toward more accurate calibration from Cycle 2 to Cycle 3 cannot easily be accommodated with the MPT account because past-test information of study-test items should not reduce underconfidence but actually produce underconfidence (see also . However, as a posthoc explanation, we argue that the recall gain between Cycles 2 and 3 was relatively small compared to the large recall gain from Cycle 1 to Cycle 2, and that JOLs reflect in terms of memory for past information. This assumption is supported by strong negative associations between calibration shifts and recall gains between Cycles 1 and 2 as well as Cycles 2 and 3. The mnemonic-debiasing account and anchoring-and-adjustment account can explain the general pattern of the UWP effect leading to JOLs' underconfidence as learning progresses, but not the specific finding of a shift toward more accurate calibration. Future studies should systematically compare the UWP effect as a function of learning materials, number of cycles, and, importantly, item characteristics. Notably, some of the present results departed from typical pattern seen in the UWP effect. Only a slight and nonsignificant overconfidence in Cycle 1 was exhibited with action-phrase materials for both verbal and enactive encoding. This finding was not predicted and is somewhat surprising, but a lack of initial overconfidence in the UWP pattern has been reported in other studies (e.g., Hanczakowski et al., 2013, Exp. 1;Koriat, 1997;Exp. 1;Koriat et al., 2002). Nonetheless, UWP studies typically revealed JOLs' overconfidence in Cycle 1 (e.g., Koriat et al., 2002). Many factors may contribute to this finding. 
One factor may be that participants set a more accurate anchor point, possibly because of the increased familiarity of action-related concepts in everyday life relative to paired associates. As a result, the typical overconfidence bias in Cycle 1 was reduced. Another factor relates to the idea that item characteristics such as the backward association strength from the target word (e.g., cheese) toward the cue word (e.g., gouda) moderates the occurrence of the overconfidence effect. Prior research with paired associates (e.g., gouda-cheese) showed that a rather low backward association strength leads to a well-calibrated JOL-recall correspondence (Koriat & Bjork, 2005, similar to the present results with action phrases. However, high backward association strength leads to inflated JOL and an illusion of knowledge during study. In JOLs, in which both the cue and target being are available, participants discount that only the cue will be available at the time of the test. Thus, at the time of test, the backward association is absent and cannot support the learner (Koriat & Bjork, 2005; for a similar explanation for the overconfidence in retrospective judgements, see Juslin et al., 2000). In the current study, the selected set of action phrases in this study may have on average a low backward association strength from the noun target (e.g., lemon) to the verb cue (e.g., squeeze) because the target noun is also associated with many alternative actions (e.g., to eat, to smell, to pickle, to suck, to slice, to peel a lemon). Future studies can explore this possibility by assessing the verb-noun association strength and examining the size of the overconfidence as a function of it. Thus, the occurrence of overconfidence during any test may hinge on numerous methodological factors, whereas the shift pattern of JOLs toward underconfidence is a more pervasive and critical feature of the UWP effect (cf. Koriat et al., 2002).
A second feature of the UWP effect is the increased resolution with repeated study-test practice (Koriat et al., 2002; see also Ariel & Dunlosky, 2011). In our study, we found that JOLs' resolution increased across cycles, specifically between Cycles 1 and 2, and the past-test correlations were higher than the corresponding resolution scores for both encoding types. Both findings accord with the MPT account that the past-test information for each item partially drives the resolution increases of JOLs (see Ariel & Dunlosky, 2011; Finn & Metcalfe, 2007, 2008). Similarly, the pattern of results is consistent with the mnemonic-debiasing account; it assumes that study-test practice provides generally more diagnostic mnemonic cues on an item-by-item basis, such as retrieval experience and fluency, and in turn increases the participants' resolution scores (Koriat & Bjork, 2006). In contrast, the anchoring-and-adjustment account does not address these findings.
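To make the two accuracy measures discussed here concrete, the following is a minimal sketch of how calibration and resolution could be computed for a single participant in a single cycle. It is illustrative only and not the analysis code of the present study: the function names, the 0-100 JOL scale, the sign convention (mean JOL minus percent recalled, so that negative values indicate underconfidence), and the example data are all assumptions. Resolution is computed as the Goodman-Kruskal gamma correlation between JOLs and recall.

```python
from itertools import combinations


def calibration(jols, recalled):
    """Signed calibration score: mean JOL minus percent recalled.

    Assumes JOLs on a 0-100 scale and recall coded as 0/1.
    Negative values indicate underconfidence, positive values overconfidence.
    """
    mean_jol = sum(jols) / len(jols)
    percent_recalled = 100 * sum(recalled) / len(recalled)
    return mean_jol - percent_recalled


def gamma(jols, recalled):
    """Goodman-Kruskal gamma between JOLs and recall for one participant.

    Counts concordant and discordant item pairs; pairs tied on either
    variable are ignored. Returns None when no untied pairs exist.
    """
    concordant = discordant = 0
    for (j1, r1), (j2, r2) in combinations(zip(jols, recalled), 2):
        product = (j1 - j2) * (r1 - r2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if concordant + discordant == 0:
        return None
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical data for four items (not taken from the study).
example_jols = [80, 60, 40, 20]
example_recalled = [1, 1, 0, 1]

print(calibration(example_jols, example_recalled))  # mean JOL 50 vs. 75% recall
print(gamma(example_jols, example_recalled))
```

In this hypothetical example, the participant is underconfident (calibration below zero) while still showing positive resolution, which illustrates that the two measures can dissociate.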

The Relative Contribution of Retrieval versus Restudy Practice to the Underconfidence-With-Practice Effect
Several findings related to retrieval practice contribute to the understanding of the UWP effect. Typically, researchers report that repeated study-test practice leads to underconfidence as learning progresses, but few studies have compared the interim-test condition to repeated study-study practice (but see Karpicke, 2009). In our study, we demonstrated that JOL magnitude for action phrases did not increase between Cycles 1 and 2 with repeated study-study practice, which is consistent with prior work (Karpicke, 2009). Furthermore, JOL magnitude increased faster across cycles when interim tests were provided; that is, JOLs had practically the same magnitude in Cycle 1, but were significantly higher for study-test (relative to study-study) items in Cycles 2 and 3. This suggests that study-study items would produce even more underconfidence, provided that they lead to a similar recall performance as study-test items, as previously shown for action phrases (Kubik et al., 2015, 2016) and also shown in the final immediate test of the present study (ps > 0.10; reported in Kubik, Soderstrom, et al., 2021). Together, these results are consonant with prior findings (see Karpicke, 2009).
From the perspective of the mnemonic-debiasing account, interim test experience provides mnemonic cues, such as retrieval success or retrieval fluency, and sensitizes learners to them across the learning session. These cues are more diagnostic of recall. As a result, JOLs for study-test items should better reflect the recall gains across the learning session, and thereby increase faster than JOLs for study-study items. More specifically, with repeated study-study practice, people continue to base JOLs on internal cues (e.g., perceived intrinsic difficulty of items), whereas with study-test practice they shift successively to the more diagnostic mnemonic cues of encoding fluency (e.g., measured by self-paced study time; Karpicke, 2009; Koriat, 1997; Koriat & Bjork, 2006) and retrieval fluency (e.g., Koriat & Ma'ayan, 2005) as learning progresses (Koriat & Bjork, 2006).
The anchoring-and-adjustment and the MPT accounts do not provide any specific mechanism to accommodate the finding that JOLs increase more moderately across cycles for repeated study-study compared to study-test items. Nonetheless, the anchoring-and-adjustment account is relevant in that we argue that participants adjust their JOLs more effectively from the anchor point based on the salient cue of retrieval experience. That is, participants match the item-specific JOLs to their previous recall performance. However, as study-study items lack this test experience, participants make rather small adjustments.
The MPT account does not explain underconfidence for repeated study-study practice because learners cannot use the memory of a past test as a heuristic to determine their JOLs. Furthermore, past-test information was primarily hypothesized to produce underconfidence in immediate JOLs, rather than to reduce it. Against the predictions of the MPT account, the data suggest that past-test information is not critical for this feature of the UWP effect (Koriat et al., 2002) or is even detrimental to it.
Another finding relevant to the theoretical discussion of the MPT account is that we observed, within the set of study-test items, an increased underconfidence bias for previously recalled relative to nonrecalled items and, in Cycle 3, no underconfidence at all for items not recalled in Cycle 2, supporting previous research. Taken together with prior findings of a similarly sized UWP effect for previously recalled and unrecalled items (Koriat et al., 2002), this pattern of results is inconsistent with the notion that items' past-test information accounts for the UWP effect. Although some level of underconfidence for recalled items may result from JOLs' downward variance from 100% (Finn & Metcalfe, 2007), the MPT account has difficulties explaining the present finding of increased underconfidence for recalled items. The MPT account predicts the opposite pattern of results, such that "forgotten and then recalled items" should disproportionately contribute to the underconfidence, whereas JOLs for previously recalled items should be more accurate, as has been previously reported in a single study (Finn & Metcalfe, 2007). The observed UWP findings can be better explained by the anchoring-and-adjustment account, such that people insufficiently adjust from a psychological anchor point to match recall performance, and this matching occurs independently of items' prior recall status. However, because we did not experimentally manipulate the anchor point, the evidence for the anchoring-and-adjustment account in this study is correlational. Nonetheless, previous research has successfully manipulated participants' initial anchor, or their metacognitive judgments in general, without affecting recall performance by providing different cover stories (e.g., the task is easy vs. difficult), or by framing JOLs in terms of forgetting versus remembering information over a delay (England et al., 2017).
These manipulations likely change the usage (or selection) of cues or the scale of JOLs (i.e., how confidence is translated into a numeric JOL response). There is further experimental evidence for anchoring-and-adjustment effects in research on decision making, which shows how participants use experimenter- versus self-generated anchor points as the starting point from which they adjust their judgments (see Epley & Gilovich, 2001; Tversky & Kahneman, 1974).
Taken together, the UWP effect cannot be explained by any single account. Based on the data, it is likely that the UWP arises from a complex pattern in terms of JOL magnitudes, calibration, calibration shifts as well as resolution scores across learning cycles. Thus, the UWP effect is likely determined by several underlying mechanisms that act in concert.

Monitoring the Progression of the Enactment Effect across Multiple Learning Cycles
As predicted, enacting action phrases during encoding led to better cued-recall performance than verbal encoding. More importantly, we showed that the enactment effect remained constant throughout the learning session. This finding is consistent with the notion that motor enactment relies on item-specific processing of the whole action phrase, in particular, on the binding of verb and noun within action phrases (see Kormi-Nouri, 1995; Kormi-Nouri & Nilsson, 2001). It also accords with previous findings showing that with free recall, which depends on item-relational processing, the mnemonic effects of enactment are less reliable (Earles & Kersten, 2000; Knopf, 1995; Steffens, 1999) and even decrease further across multiple study-test cycles (Koriat & Pearlman-Avnion, 2003). Successful free-recall performance requires not only item-specific, but also item-relational information that is less strongly fostered by enactment (e.g., Steffens et al., 2006, 2009). Notably, we observed smaller error bars for enacted compared to nonenacted phrases, which is likely also the result of the specific study instruction that we gave participants, whereas verbal encoding allows a greater variety of learning strategies to be selected between and within participants.
Despite the remarkable body of action memory research, there has not been any systematic evaluation of how accurately learners monitor their learning and memory for actions.
To this end, the present results make a novel contribution, showing that participants are sensitive to the memory-enhancing effects of enactment and that participants predict that the enactment effect will grow with repeated study-test practice. The finding that JOLs reflect the enactment effect is quite remarkable because extrinsic cues, which include encoding manipulations, are generally discounted in JOLs, especially when manipulated between subjects (Koriat, 1997). Typically, JOLs reflect encoding effects only in within-subject designs (e.g., interactive imagery, Begg et al., 1989; levels-of-processing, Shaw & Craik, 1989; verbal production, Castel et al., 2013). Thus, the present results add to previous research by suggesting that even in between-subject designs, JOLs can be sensitive to different encoding manipulations (e.g., generated vs. read words: Begg et al., 1991; Mazzoni & Nelson, 1995). They also contribute to the literature because prior studies did not systematically investigate to what extent learners are sensitive to the enactment effect. An exploratory inspection of the descriptive results of prior studies suggests that learners provided at least numerically higher JOLs for enacted phrases, particularly when encoding type was manipulated within subjects (Cohen, 1988, Experiment 2), but also when it was manipulated between subjects (Cohen, 1988, Experiment 1; Cohen et al., 1991). Note, however, that some of these studies had methodological shortcomings. For example, enacted action phrases were often compared to words that were simply read (Cohen et al., 1991), and they were in part presented for shorter amounts of time (Cohen, 1988; Cohen et al., 1991). Presumably, enactment as a specific encoding type (i.e., motoric performance) draws one's attention to the items' individual characteristics and increases their distinctiveness, which enhances recall for these enacted items.
Thus, although encoding type was manipulated between participants, learners acknowledged its memorial benefits, as they probably based JOLs partially on item-specific information as mnemonic cues that are diagnostic of future recall (for a similar suggestion, see Castel et al., 2013, explaining the metacognitive sensitivity to the production effect, i.e., the mnemonic benefit of saying words aloud versus silent encoding).
Furthermore, there has been little systematic research on anchoring effects in action memory until our study. The present study reveals that the predicted enactment effect in Cycle 1, as reflected in JOL magnitude, was even larger for items that were unrecalled. This finding suggests that learners have a theory-based belief that enactment is a more powerful encoding technique than verbal encoding. However, it will require further research to examine how beliefs are utilized when participants make JOLs (cf. Mueller et al., 2013), and how they interact with experience-based cues (cf. England et al., 2017). As a consequence, in the current study, JOLs reflected the mnemonic benefit of enactment and thereby calibration scores between recall and JOLs did not differ between the two encoding types. In contrast to JOL magnitudes and calibration scores, the present results showed that enactive encoding hampered JOLs' resolution across cycles, compared to verbal encoding (e.g., Cohen, 1983, 1988; Cohen et al., 1991). They extend prior results showing that resolution for future free-recall performance is fairly accurate for words, but not for enacted action phrases. We speculate that if all enacted action phrases become highly distinctive as a result of such item-specific processing, then resolution, which relies on cross-item comparisons, might suffer. For verbally encoded action phrases, on the other hand, the amount of distinctiveness between action phrases varies more, and thus the distinctiveness of one action phrase relative to another should be more salient, providing learners with more diagnostic information upon which their JOLs can be based. Thus, although enactment bolsters memory performance, it is associated with poor metacognitive accuracy in terms of resolution.

Practical Applications
These data have implications for practical aspects of learning. There has been considerable research that documents how important retrieval practice is for efficient learning in educational settings (cf. Roediger & Karpicke, 2006b; see also Kubik, Gaschler, et al., 2021). Equally important for learning in school and elsewhere is being able to use metacognitive resources to study the items that require further study (see, e.g., Efklides, 2014; Zhao & Linderholm, 2011). This study adds a new finding: When people use retrieval practice across study cycles, they not only boost the efficiency of their learning, but also increase the magnitude of their JOLs over the progression of learning and thereby reduce the amount of underconfidence that occurs. One implication is that, with less underconfidence, efficient learners can direct their attention to new learning, rather than repeat items that they have already mastered. Accurate self-evaluations are critical for effective self-regulated learning (cf. Dunlosky & Rawson, 2012; Dunlosky & Thiede, 2013; see also Roelle et al., 2017), and this is true regardless of whether the learning is verbal or motoric. This metacognitive benefit of retrieval practice may be useful in learning physical trades or skills, in multimedia learning (see Eitel, 2016), as well as in more academic learning in general.

Concluding Comments
Study strategies, such as retrieval practice, restudy practice, and enactment, not only enhance memory performance but also affect memory predictions across repeated learning occasions. This has implications for the theoretical understanding of how to effectively regulate one's own learning, as well as educational implications. It is upon such metacognitive monitoring that people base their decisions to continue or stop studying and upon which items to focus (Nelson & Leonesio, 1988). Any systematic dissociation between subjective and objective learning curves can be detrimental to learning because ineffective study strategies or study time allocation may result. Future research is encouraged to investigate over- and underconfidence in self-evaluations and their impact on self-regulated learning and academic achievement in real-world educational contexts (such as example-based and multimedia learning scenarios; cf. Eitel, 2016; Roelle et al., 2017).

Contributions
Veit Kubik was the main contributor to the study conception and study design. Material preparation, data collection and analysis were performed by Veit Kubik. The first draft of the manuscript was written by Veit Kubik, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding
Open Access funding enabled and organized by Projekt DEAL.