In Experiment 1, we attempted to manipulate MSE and subsequent recall performance by having participants complete a fluid intelligence task in either ascending or descending order of difficulty before completing a VDR task. Prior research has explored the effect of different item-difficulty orders on test anxiety (Smouse and Munz 1968), performance (Klosner and Gellman 1973), and confidence (Weinstein and Roediger III 2010); however, research has yet to investigate the effect of different item-difficulty orders on MSE. Although most research reports no relationship between question order and performance (see Hauck et al. 2017 for a review), we expected participants completing the questions in ascending order, compared with descending order, to show elevated MSE as a result of better early task performance serving as an anchor for MSE judgments. Furthermore, we expected increased fluid intelligence to relate to elevated selectivity for high-value words, task scores, and metacognitive accuracy.
Method
Participants
After exclusions, participants were 73 undergraduate students (age: M = 20.19, SD = 1.93) recruited from the University of California Los Angeles Human Subjects Pool who received course credit for their participation. The experiment was conducted online and lasted approximately 30 min. Participants were excluded from analysis if they admitted to cheating (e.g., writing down answers) in a post-task questionnaire (they were told they would still receive credit if they cheated). One participant was excluded from the ascending order group and two from the descending order group.
Materials and procedure
Raven’s progressive matrices (RPM)
Participants first completed the Raven’s Progressive Matrices (RPM; Raven 1938), a non-verbal test of abstract reasoning and fluid intelligence (e.g., Jarosz et al. 2019; Staff et al. 2014). The task is composed of 12 problems of varying difficulty, each of which presents participants with a pattern that has a piece missing. Participants were instructed to select the option (out of eight choices) that correctly completes the pattern and then indicate their confidence in the accuracy of each response (from 0 to 100, with 0 being not confident and 100 being very confident). The task was self-paced with no time limit, and question difficulty was determined by Raven (1938) based on indices of participants’ performance, such as mean response latency and accuracy rate. Participants were randomly assigned to complete the RPM with questions sequenced in either ascending or descending order of difficulty, and fluid intelligence scores were calculated as the proportion correct.
Memory self-efficacy questionnaire (MSEQ)
Participants next completed the Memory Self-Efficacy Questionnaire for Items to assess participants’ perception of their general memory abilities (MSEQ-I; Berry et al. 2013). The questionnaire consisted of 14 questions in which participants provided judgments of their ability to remember a list of items. They rated how confident they were that they could achieve a certain level of performance by selecting percentage responses ranging from 0 to 100% (in 10% increments). MSE scores were calculated by averaging participants’ confidence ratings across all questions. See Appendix for the full MSEQ-I.
Value-directed remembering (VDR)
After completing the RPM and the MSEQ-I, participants completed a value-directed remembering task. In this task, participants were presented with a series of to-be-remembered words with each word paired with an associated value between 1 and 12, indicating how much the word was “worth” (e.g., table: 5, toast: 12, plum: 7). Each point value was used only once within each list and the order of the point values within lists was randomized. The stimulus words were presented for 3 s each, were nouns that contained between four and seven letters, and had an everyday occurrence rate of at least 30 times per million (Thorndike and Lorge 1944). Participants were told that their score would be the sum of the associated values of the words they recalled (e.g., 5 + 12 + 7 = 24) and that they should try to maximize their score.
After each word was presented, participants made a judgment of learning (JOL). Participants answered with a number between 0 and 100, with 0 meaning they definitely would not remember the word and 100 meaning they definitely would remember the word. Participants were given as much time as they needed to make their judgments. After the presentation of all 12 word-number pairs in each of the eight lists, participants were given a 20-s free recall test in which they had to recall as many words as they could from the list (they did not need to recall the point values). Immediately following the recall period, participants were informed of their score for that list but were not given feedback about specific items.
In addition to their point scores, participants were scored for efficiency via a selectivity index, calculated as (actual score − chance score) / (ideal score − chance score). The ideal score is the sum of the highest values for the particular number of words recalled. For example, if a participant remembered 3 words, ideally those words would be paired with the three highest values (i.e., 12, 11, and 10, for an ideal score of 33). Chance scores reflect no attention to value and were calculated as the product of the average point value and the number of recalled words; in our example, the chance score would be 6.5 multiplied by 3, or 19.5. A participant who recalled only words paired with the highest values would receive a selectivity score of 1, a participant who recalled only words paired with the lowest values would receive a selectivity score of −1, and scores close to 0 indicate that recall was not sensitive to point values (see Castel et al. 2002 for more details).
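For illustration, this scoring can be implemented in a few lines of Python; the function name and the handling of empty recall are our own assumptions, not part of the study’s materials:

```python
def selectivity_index(values_recalled, all_values=range(1, 13)):
    """Selectivity index (Castel et al. 2002): (actual - chance) / (ideal - chance)."""
    n = len(values_recalled)
    if n == 0:
        return 0.0  # no recall: treated here as chance-level (an assumption)
    actual = sum(values_recalled)
    chance = n * sum(all_values) / len(all_values)     # mean value x words recalled
    ideal = sum(sorted(all_values, reverse=True)[:n])  # n highest values
    return (actual - chance) / (ideal - chance)

# Worked example from the text: recalling the words worth 5, 12, and 7 points
print(selectivity_index([5, 12, 7]))  # (24 - 19.5) / (33 - 19.5) ≈ 0.33
```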
Encoding strategies
At the end of the VDR task, participants were asked to report which strategies (if any) they had used. Specifically, they were given a list from which they chose among reading each word as it appeared, repeating the words as much as possible, using sentences to link the words together, developing mental images of the words, grouping the words in a meaningful way, or utilizing some other strategy (participants could select some, none, or all of the strategies). Prior research has indicated that effective encoding strategies, such as imagery, sentence generation, and grouping, lead to better recall performance, whereas less effective strategies involve passive reading and rote repetition (Hertzog et al. 1998; Richardson 1998; Unsworth 2016). Accordingly, we coded self-reported encoding strategies by their level of effectiveness, differentiating less effective strategies from strategies that support deeper levels of processing, and computed an effective strategies variable: the proportion of the effective strategies (using sentences to link the words together, developing mental images of the words, and grouping the words in a meaningful way) that each participant reported using.
Results
Although we initially hypothesized that manipulating the order of questions of the Raven’s Progressive Matrices would result in differences in MSE and VDR, there were no differences of interest as a function of question order, consistent with prior work on item-difficulty order and performance (Hauck et al. 2017). Thus, we collapsed results across conditions for all subsequent analyses to investigate the relationships between selectivity for valuable information, metacognitive accuracy, strategy use, and fluid intelligence. Correlations between variables of interest can be seen in Table 1.
Table 1 Pearson (r) correlations between the primary variables of interest (collapsed across conditions) in Experiment 1

In our examination of VDR performance, we treated these data as hierarchical or clustered (i.e., multilevel), with items nested within individual participants. This approach helps to control for variation between individual participants and accounts for the non-independence of observations (i.e., 8 lists completed by the same individual are not independent observations). Multilevel approaches can also account for an unequal number of observations across groups and participants and allow for both categorical and continuous predictor variables (Bolger and Laurenceau 2013; Gelman and Hill 2007; Jaeger 2008; Kenny et al. 1998; McElreath 2016). Thus, we used multilevel models (MLMs; sometimes also called mixed-effects or hierarchical models) in the present study. For all linear models, we used restricted maximum likelihood estimation (REML) to estimate coefficients, which is robust to small sample sizes at level 2 (i.e., the participant level in the current study; see McNeish 2017).
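As an illustration of this modeling approach, a random-intercept linear MLM with REML estimation can be fit with the statsmodels library in Python. The data below are simulated and all column names are assumptions, so this is a sketch of the approach rather than our analysis code:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one selectivity score per list (1-8)
# per participant; column names and values are illustrative assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(73), 8),
    "list_num": np.tile(np.arange(1, 9), 73),
})
df["selectivity"] = 0.03 * df["list_num"] + rng.normal(0, 0.2, len(df))

# A random intercept per participant accounts for the non-independence of
# the eight lists completed by the same individual; REML is the estimator.
model = smf.mixedlm("selectivity ~ list_num", data=df, groups=df["subject"])
result = model.fit(reml=True)
print(result.summary())
```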
Because memory performance at the item level was binary (i.e., correct or incorrect), we conducted a logistic MLM to assess performance. As a result, the regression coefficients are given in logit units, or the log odds of being correct. We report exponential betas (eB), which give the coefficient as an odds ratio (i.e., the odds of being correct divided by the odds of being incorrect). Thus, eB can be interpreted as the extent to which the odds of being correct changed, with values greater than 1 representing an increased likelihood of being correct, values less than 1 representing a decreased likelihood of being correct, and a value of 1 indicating no change. Also, we report 95% confidence intervals for eB because odds ratios are nonlinear and their confidence intervals can be asymmetric. Lastly, centering of predictor variables in MLMs can be important for various reasons, and there are typically two ways to center item-level (level-1) predictors: around the grand mean (i.e., the average value across all observations), which we refer to as grand mean centering (GMC), or around the cluster mean (i.e., each level-2 unit’s average value), which we refer to as cluster-based centering (CBC). Because level-2 variables have only one source of variance, they are always grand mean-centered. Cluster-based centering is important for isolating level-1 from level-2 effects, which becomes particularly important for interaction effects (Enders and Tofighi 2007; Ryu 2015). Thus, in all MLMs with an interaction term, we use CBC for level-1 predictors.
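The two centering schemes amount to simple column transformations. A brief pandas sketch (with assumed column names and fabricated values) makes the distinction concrete:

```python
import pandas as pd

# Hypothetical item-level JOLs for two participants (column names assumed).
df = pd.DataFrame({
    "subject": [1, 1, 2, 2],
    "jol":     [40, 60, 70, 90],
})

# Grand mean centering (GMC): subtract the mean across all observations (65).
df["jol_gmc"] = df["jol"] - df["jol"].mean()

# Cluster-based centering (CBC): subtract each participant's own mean (50, 80),
# isolating within-person (level-1) variation from between-person differences.
df["jol_cbc"] = df["jol"] - df.groupby("subject")["jol"].transform("mean")

print(df)  # jol_gmc: -25, -5, 5, 25; jol_cbc: -10, 10, -10, 10
```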
To first examine the proportion of words recalled (M = .50, SD = .13) across the task, a logistic MLM with accuracy (level 1) modeled as a function of list (level 1, GMC) revealed that list significantly predicted accuracy [eB = 1.08, CI: 1.06–1.10, z = 6.90, p < .001] such that task experience resulted in greater recall. Similarly, a linear MLM with selectivity scores (level 2; M = .17, SD = .24) modeled as a function of list showed that list predicted selectivity [b = .03, CI: .02–.03, t(6804) = 13.32, p < .001] such that participants became more selective with increased task experience.
To determine if participants strategically organized their recall, we computed a Pearson correlation for each participant between each item’s output position (with larger numbers meaning later output) and its value. A strong negative correlation would indicate that participants recalled high-value items before low-value items, and a positive correlation would indicate the recall of low-value items before high-value items. While these correlations (M = −.08, SD = .23) differed from 0 [one sample t-test: t(72) = −2.87, p = .005, d = −.34], a repeated measures ANOVA (8 levels) did not reveal a main effect of list [F(7, 469) = 1.71, p = .104, η2 = .03]. Thus, participants generally recalled high-value items before low-value items, and this did not change with task experience.
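For concreteness, the per-participant output-position correlation can be computed with scipy; the recall sequence below is fabricated for demonstration:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical single-list recall for one participant: the i-th recalled
# word's output position and its point value (data are illustrative).
output_position = np.array([1, 2, 3, 4])   # larger = recalled later
value = np.array([12, 10, 11, 3])          # point value of each recalled word

r, p = pearsonr(output_position, value)
print(r)  # negative r: high-value words tended to be recalled first
```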
Most measures of monitoring, such as JOLs, are assessed as a probability or percentage likelihood (the same scale as the probability of recall), allowing for measures of absolute and relative accuracy (see Higham et al. 2016; Rhodes 2016). Absolute accuracy (i.e., calibration) is the overall relationship between judgment and performance and is calculated as the difference between mean judgments and the percentage of items recalled. With perfect calibration, participants’ scores would be zero, indicating a direct correspondence between prediction and recall. Results revealed that participants were well calibrated such that calibration (M = .76, SD = 18.77) did not differ from 0 [one sample t-test: t(72) = .35, p = .731, d = .04]. Additionally, a repeated measures ANOVA (8 levels) on calibration revealed a main effect of list such that participants were initially overconfident but calibration improved with task experience [Mauchly’s W = .17, p < .001, Huynh-Feldt corrected results: F(4.75, 341.89) = 16.78, p < .001, η2 = .19].
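A minimal sketch of the calibration computation for a single participant (with fabricated judgments and outcomes) follows:

```python
import numpy as np

# Hypothetical JOLs (0-100) and recall outcomes (1 = recalled) for one participant.
jols = np.array([80, 60, 40, 90])
recalled = np.array([1, 0, 1, 1])

# Calibration: mean judgment minus percentage of items recalled.
# Positive values indicate overconfidence; negative values, underconfidence.
calibration = jols.mean() - 100 * recalled.mean()
print(calibration)  # 67.5 - 75.0 = -7.5 (slight underconfidence)
```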
Relative accuracy (i.e., resolution) is the degree to which judgments discriminate between items that are or are not remembered and is often measured by Gamma correlations between each item’s JOL and recall for each participant (see Masson and Rotello 2009 for alternative approaches). A perfect correlation between judgment and performance would exemplify the ability to distinguish between what will and will not be remembered; the individual remembers what they say they will remember. We computed Gamma correlations for each participant, and these correlations (M = .38, SD = .34) differed from 0 [one sample t-test: t(72) = 9.66, p < .001, d = 1.13], indicating that participants’ JOLs were relatively accurate. However, a repeated measures ANOVA (8 levels) on resolution did not reveal a main effect of list [F(7, 378) = 2.48, p = .017, η2 = .04], indicating that relative accuracy did not change with task experience.
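Because Goodman-Kruskal gamma is not built into common Python statistics packages, a small implementation based on concordant and discordant pairs might look like the following (an illustrative sketch, not the study’s analysis code):

```python
from itertools import combinations

def goodman_kruskal_gamma(judgments, outcomes):
    """Goodman-Kruskal gamma: (concordant - discordant) / (concordant + discordant)."""
    concordant = discordant = 0
    for (j1, o1), (j2, o2) in combinations(zip(judgments, outcomes), 2):
        if j1 == j2 or o1 == o2:
            continue  # tied pairs are ignored
        if (j1 - j2) * (o1 - o2) > 0:
            concordant += 1
        else:
            discordant += 1
    if concordant + discordant == 0:
        return float("nan")  # undefined when all pairs are tied
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical JOLs and recall outcomes (0/1) for one participant (data assumed):
print(goodman_kruskal_gamma([80, 60, 40, 90], [1, 0, 0, 1]))  # 1.0
```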
To further examine whether higher JOLs were related to better accuracy, we ran a logistic MLM with item-level accuracy modeled as a function of JOLs (GMC), controlling for value and list, which showed that JOLs were a significant predictor of later accuracy [eB = 1.02, CI: 1.016–1.020, z = 17.89, p < .001]. In other words, for each one-point increase in an item’s JOL, the odds of remembering the word were expected to increase by a factor of 1.02. Thus, participants better remembered items after giving higher JOLs, suggesting they were generally metacognitively accurate.
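Because odds ratios are multiplicative, the per-point coefficient compounds across larger JOL differences; a back-of-the-envelope illustration (not a reported analysis):

```python
# An odds ratio of 1.02 per one-point JOL increase compounds multiplicatively:
# a 50-point JOL difference multiplies the odds of recall by 1.02 ** 50.
odds_ratio_per_point = 1.02
print(odds_ratio_per_point ** 50)  # ~2.69, i.e., roughly 2.7 times the odds
```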
In addition to metacognitive accuracy, we were also interested in whether participants were metacognitively aware of their selectivity. To assess the influence of value on participants’ metacognitive judgments, we ran an MLM with item-level JOLs modeled as a function of the item’s value (GMC), controlling for list (GMC). The analysis revealed that point value was a significant positive predictor of JOLs [b = 1.85, CI: 1.67–2.02, t(6896) = 20.74, p < .001] such that a one-point increase in an item’s value was predicted to result in a 1.85-point increase in JOL, controlling for list position. Thus, participants were metacognitively aware of value effects on memory performance.
Next, to examine the relationship between fluid intelligence (M = .58, SD = .27) and VDR scores (the sum of the values of recalled words; M = 41.94, SD = 10.62), we used an MLM to examine the predictive value of RPM scores on average VDR scores, taking into account both the number of words recalled and their point values. In this model, both RPM and average VDR scores were level-2 variables (i.e., collected at the participant level), and RPM performance was centered around the grand mean (GMC). Results revealed that RPM score was a significant predictor of VDR scores [b = 11.29, CI: 2.39–20.20, t(71) = 2.49, p = .015], indicating that higher RPM scores predicted higher VDR scores.
Lastly, to examine differences in metacognitive accuracy as a function of fluid intelligence, we used a logistic MLM (Murayama et al. 2014). We modeled item-level accuracy as a function of item-level JOLs, participant-level (i.e., level-2) RPM score, and the interaction between JOLs and RPM score. To accurately represent the unique within-person effects, it is important to center item-level variables around the cluster means (i.e., each participant’s JOLs are centered around their personal average JOL), rather than around the grand mean (Enders and Tofighi 2007). Thus, JOLs were centered around participant means, while RPM scores were centered around the grand mean. Again, accuracy was modeled logistically (0 = incorrect, 1 = correct), so exponential betas (eB) are reported.
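As a schematic of this specification, the sketch below simulates item-level data, applies CBC to JOLs and GMC to RPM scores, and fits a random-intercept logistic model. Statsmodels offers mixed-effects logistic regression only as a Bayesian mixed GLM, and all column names and simulated coefficients are illustrative assumptions rather than our reported estimates:

```python
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Hypothetical item-level data: "correct" (0/1), "jol" (0-100), "rpm"
# (proportion correct), and "subject" (all names are assumptions).
rng = np.random.default_rng(1)
n_sub, n_items = 73, 96
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_sub), n_items),
    "jol": rng.integers(0, 101, n_sub * n_items).astype(float),
})
df["rpm"] = np.repeat(rng.uniform(0, 1, n_sub), n_items)

# CBC for the level-1 predictor, GMC for the level-2 predictor.
df["jol_cbc"] = df["jol"] - df.groupby("subject")["jol"].transform("mean")
df["rpm_gmc"] = df["rpm"] - df["rpm"].mean()

# Simulate outcomes with a positive JOL slope and a JOL x RPM interaction.
logit = -1 + 0.02 * df["jol_cbc"] + 0.4 * df["rpm_gmc"] \
        + 0.01 * df["jol_cbc"] * df["rpm_gmc"]
df["correct"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Random intercept per participant, fit by variational Bayes.
model = BinomialBayesMixedGLM.from_formula(
    "correct ~ jol_cbc * rpm_gmc", {"subject": "0 + C(subject)"}, df)
result = model.fit_vb()
print(result.summary())
```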
Results revealed that JOLs were a significant predictor of accuracy for those who had average fluid intelligence scores [eB = 1.02, CI: 1.018–1.022, z = 19.40, p < .001], but fluid intelligence scores were not a significant predictor of accuracy for items at participants’ mean JOL [eB = 1.50, CI: .90–2.48, z = 1.56, p = .119]. Of most interest, the interaction between JOLs and fluid intelligence was significant [eB = 1.01, CI: 1.01–1.02, z = 3.50, p < .001], indicating that an increase in fluid intelligence was expected to enhance the relationship between JOLs and recall. Specifically, this coefficient indicates that the relationship between JOLs and later recall (i.e., the slope of the regression line) is expected to increase by an odds ratio of 1.01 given a one-unit increase in RPM score (see Fig. 1).
We probed this interaction to determine the relationship between JOLs and later recall at three values of RPM scores: the mean and one standard deviation above and below the mean RPM score. Importantly, these values do not necessarily represent groups of participants, but rather the expected relationship between JOLs and recall performance for an example participant with a particular score on the RPM task. This revealed that JOLs were a significant predictor of accuracy for those who were one standard deviation below the mean [eB = 1.02, CI: 1.01–1.02, z = 11.71, p < .001], at the mean [eB = 1.02, CI: 1.018–1.022, z = 19.40, p < .001], and one standard deviation above the mean on the RPM task [eB = 1.02, CI: 1.02–1.03, z = 16.10, p < .001], suggesting that most participants were metacognitively accurate, but those with higher RPM scores were more so. In other words, participants with higher fluid intelligence had a stronger relationship between JOLs and recall than those with lower fluid intelligence, suggesting they may be more metacognitively accurate.
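The simple-slopes logic can be reconstructed from the reported coefficients; the sketch below converts the reported odds ratios back to logits and evaluates the conditional JOL slope at the three RPM values (the coefficient values are taken from the text, but the computation itself is our illustration):

```python
import numpy as np

# Logit-scale coefficients recovered from the reported odds ratios.
b_jol = np.log(1.02)   # JOL slope at the mean RPM score (eB = 1.02)
b_int = np.log(1.01)   # JOL x RPM interaction (eB = 1.01 per unit of RPM)
rpm_sd = 0.27          # SD of RPM proportion-correct scores (Experiment 1)

# Conditional (simple) JOL slope at -1 SD, the mean, and +1 SD of RPM.
for label, rpm_c in [("-1 SD", -rpm_sd), ("mean", 0.0), ("+1 SD", rpm_sd)]:
    simple_slope = b_jol + b_int * rpm_c
    print(label, round(float(np.exp(simple_slope)), 4))  # back to an odds ratio
```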
Discussion
In Experiment 1, we manipulated the order of questions on the RPM to investigate potential differences in MSE and VDR as a consequence of completing easier or more difficult questions first. We expected that participants who completed the RPM in ascending order of difficulty would report higher MSE as a result of better early task performance serving as an anchor for MSE judgments. However, results revealed no differences in MSE or any other variables of interest as a function of question order, so we collapsed the remaining analyses across conditions. Results revealed that the proportion of words recalled, selectivity for high-value words, and task scores (the sum of the values of recalled words) all increased with task experience. In terms of the organization of participants’ recall, participants tended to recall high-value items before low-value items, and this did not change with task experience. Furthermore, participants were generally metacognitively accurate, in terms of both absolute and relative accuracy, and were metacognitively aware of their selectivity for valuable items, and this awareness increased as a function of list.
The results observed in Experiment 1 are consistent with previous research on the influence of item value on selectivity, suggesting that people tend to focus more on high-value information and less on low-value information to maximize gains (Ariel et al. 2009; Castel 2008; Castel et al. 2009; Castel et al. 2002; Castel et al. 2007). Additionally, we demonstrated that increased fluid intelligence predicts greater VDR task scores, and participants with higher fluid intelligence had a stronger relationship between JOLs and recall than those with lower fluid intelligence, indicating that they were more metacognitively accurate (see Fig. 1). Thus, not only were participants with greater fluid intelligence more selective, but they were also more accurate in their metacognitive judgments.
Fluid intelligence positively correlated with MSE such that participants who performed better on the RPM task reported greater belief in their memory abilities. Since past successes and failures are known to influence MSE (Bandura 1977), heightened fluid intelligence may be related to greater feelings of mastery after success on previous and/or similar tasks, leading to higher self-efficacy judgments. However, MSE did not correlate with any other variables of interest, supporting accounts that MSE may not directly relate to performance (see Beaudoin and Desrichard 2011).
Fluid intelligence positively correlated with selectivity scores such that higher fluid intelligence was associated with better recall of high-value items relative to less valuable items. However, fluid intelligence was not associated with the number of words recalled but was positively related to task scores (the sum of the values paired with recalled words). Moreover, fluid intelligence negatively correlated with the output-position measure of retrieval organization such that participants with higher fluid intelligence demonstrated a greater tendency to recall valuable items before less valuable items. Thus, the positive relationships between fluid intelligence, selectivity, and task scores (but no relationship with total recall), together with the tendency to prioritize recall of high-value items, indicate that better reasoners make more strategic use of cognitive resources to optimize VDR performance. Finally, reported effective strategy use did not relate to task performance or fluid intelligence, contrary to previous findings (e.g., Ariel et al. 2015; Hennessee et al. 2019), suggesting that utilizing effective encoding strategies may not be necessary for good performance.