When people solve problems, they often have a sense of how confident they are about their accuracy (Ackerman & Thompson, 2017; Desender et al., 2018; Koriat & Adiv, 2016; Peters et al., 2019). A primary research goal in the study of metacognition is to decipher the bases of metacognitive monitoring judgments (i.e., confidence judgments; Dunlosky & Thiede, 2013; Koriat & Adiv, 2016). In general, people tend to report greater confidence when they are correct compared to when they are incorrect (Nelson & Fyfe, 2019; Rinne & Mazzocco, 2014; Rivers et al., 2020; Wall et al., 2016). This alignment between performance and confidence reflects accurate metacognitive monitoring. The accuracy of monitoring judgments is important because people can, and often do, use these judgments to make decisions such as whether to share their opinions, seek help, check their work for errors, or submit their work for evaluation (e.g., Dunlosky & Metcalfe, 2009; Dunlosky & Rawson, 2012; Nelson & Fyfe, 2019; Wall et al., 2016). However, people’s judgments sometimes misalign with their performance (Dunning et al., 2003; Kruger & Dunning, 1999). This misalignment could be especially problematic in applied contexts such as health decision making. If people are unaware of their own lack of understanding of health metrics, including probabilities relating to risk, they may make decisions based on faulty interpretations of numerical health information. For example, if people mistakenly believe with high confidence that they have an insignificant risk of disease, they might be more willing to engage in risky behaviors or less likely to engage in preventive health behaviors. Thus, it is important to evaluate which factors influence monitoring judgments during problem solving involving health statistics. The current study examined predictors of adults’ item-level monitoring judgments in the context of health-related math problems about COVID-19 (hereafter referred to as problems; see Fig. 2).

Why are judgments sometimes misaligned with performance?

Misalignment between people’s metacognitive judgments and accuracy occurs because a wide variety of cues inform monitoring judgments. According to cue-utilization theory (Koriat, 1997; Koriat & Levy-Sadot, 1999), monitoring judgments can be informed by information-based and experience-based cues. Information-based cues include beliefs and knowledge about the self (e.g., Händel et al., 2020), about cognition in general (e.g., Rivers et al., 2020), or about the task at hand (e.g., Mueller & Dunlosky, 2017). Experience-based cues arise from people’s experiences during the task, such as how quickly they solved a problem (e.g., Desender & Sasanguie, 2021; Koriat et al., 2008; Leonesio & Nelson, 1990; Sanchez & Dunning, 2020), their familiarity with the problem features (Fitzsimmons & Thompson, 2021; Fitzsimmons et al., 2020a; Reder & Ritter, 1992), or whether they received feedback or instruction on the problem (Labuhn et al., 2010). The accuracy of judgments depends on the validity of the cues that are used (Koriat, 1997; see also Ackerman & Thompson, 2017).

In the current study (pre-registered on OSF: https://osf.io/vxm8d), we examined which information-based and experience-based cues influenced monitoring judgments using data from a publicly available data set (Thompson et al., 2021). In Thompson and colleagues’ study, adults solved a series of problems and, after each one, answered the question, “How confident are you in your decision about which disease is the most fatal from 0% = I am not confident at all, to 100% = I am totally confident?” Our interdisciplinary team examined which information- and experience-based cues influenced these monitoring judgments. Specifically, we: (a) explored whether information-based cues, such as math self-efficacy and math anxiety, impacted monitoring judgments, (b) examined how experience-based cues, such as worked-example training, impacted monitoring judgments, and (c) assessed whether the magnitude of monitoring judgments was higher when people answered questions correctly versus incorrectly. Next, we discuss relevant literature pertaining to each of these information-based and experience-based cues.

Information-based cues

Math self-efficacy

One type of information-based cue that can influence monitoring judgments during math tasks is an individual’s math self-efficacy (Efklides, 2006; Händel et al., 2020; Stankov et al., 2012). Math self-efficacy is defined in the current study as one’s beliefs about one’s competence for completing specific tasks (Bandura, 1982; Lee, 2009; Pajares, 1996). In general, people tend to be less confident in their math ability than in other domains, such as reading (Dowker et al., 2016; Punaro & Reeve, 2012; Wigfield & Meece, 1988), and it is common for people to endorse the idea “I am not a math person” (Dowker et al., 2016; Miller-Cotto & Lewis, 2020; Nolen et al., 2014; Peters, 2020; Peters et al., 2019). Note that math self-efficacy is generally lower among women (Ashcraft & Ridley, 2005; Dowker et al., 2016; Hembree, 1990; Morony et al., 2013; Pajares & Miller, 1994) and may contribute to the underrepresentation of women in math-heavy careers (Huang et al., 2019); thus, we also examined gender in our models.

Problem-solvers’ beliefs about their own math ability are related to both task-specific performance (Pajares & Miller, 1994) and academic math performance (Ahmed et al., 2012). Because these constructs are related to math performance, people may draw on these math beliefs when making monitoring judgments. That is, information (e.g., attitudes and prior knowledge) and experiences (e.g., feelings elicited in the moment) are sources of inferential information that influence judgments (cf. cue-utilization theory, Koriat, 1997). In the current study, we considered math self-efficacy to be an information-based cue, shaped primarily by prior experiences with math, that could influence item-by-item monitoring judgments. Given that the influence of math perceptions on monitoring judgments may be underestimated in the literature (Händel et al., 2020), and that many people report low confidence in their math ability (Ashcraft & Ridley, 2005; Barroso et al., 2020; Dowker et al., 2016; Gough, 1954; Hembree, 1990), we tested whether math self-efficacy accounted for unique variance in monitoring judgments, even after accounting for performance accuracy.

In the current study, we adopted Peters et al.’s (2019) approach to evaluating math self-efficacy by analyzing the first four items of the Subjective Numeracy Scale (SNS: Fagerlin et al., 2007; Peters, 2020). These items from the SNS assess participants’ confidence in their math skills pertaining to fractions and percentages (e.g., “How good are you at working with fractions?”). In our view, these SNS questions are better characterized as a measure of participants’ perceptions of their math ability (i.e., math self-efficacy), specifically for rational numbers (i.e., ratios such as fractions and percentages), rather than math more generally. This view of these SNS items is supported by strong correlations between the items and objective measures of numeracy–how good people are at working with probability and math concepts (Lipkus et al., 2001; Peters, 2020; Schwartz et al., 1997). Moreover, these SNS items are also strongly correlated with health decision-making measures (Låg et al., 2014; Peters, 2020; Peters et al., 2019; Waters et al., 2018), because health statistics are commonly presented as rational numbers (e.g., the ratio of people who experience side effects of a medication relative to all the people who take the medication). We expected that math self-efficacy (operationalized as responses to the first four SNS items) would contribute to, but not completely overlap with, monitoring judgments. That is, higher math self-efficacy would be positively related to monitoring judgments. Other individual differences, such as math and trait anxiety, may also contribute to monitoring judgments.

Math anxiety and trait anxiety

Math anxiety, another information-based cue, is defined as a fear of or apprehension about mathematics (Ashcraft, 2002). When individuals who are math anxious find themselves in math-intensive situations, their anxiety may influence their monitoring judgments. That is, their apprehension about math in general may lead them to indicate that they are less confident in their answer, regardless of whether they are accurate or whether other available cues indicate they should have higher confidence. Because math anxiety is an information-based cue of which people have at least some awareness, higher levels of math anxiety might lead participants to report lower monitoring judgments (Desender & Sasanguie, 2021; Jain & Dowson, 2009).

In the current study, we tested the unique relations between monitoring judgments, math anxiety, and trait anxiety. There is an ongoing debate about whether math anxiety is a construct distinct from other forms of anxiety (e.g., trait anxiety), given the parallels between math anxiety and generalized anxiety. For example, similar to generalized anxiety (Cresswell et al., 2010; Ooi et al., 2015), math anxiety may be modeled by teachers and parents who exhibit high math anxiety (Beilock et al., 2010; Ramirez & Beilock, 2011). Meta-analytic reviews (Barroso et al., 2020; Hembree, 1990; Ma, 1999; Namkung et al., 2019; Zhang et al., 2019) support the idea that mathematics anxiety is distinct from general anxiety. Indeed, some researchers consider mathematics to be a particularly common context for heightened anxiety, which can have functional implications similar to specific phobias (Ashcraft, 2019; Ashcraft & Ridley, 2005; Cipora et al., 2019; Dreger & Aiken, 1957; Gough, 1954; Núñez-Peña et al., 2014).

We anticipated that participants’ math anxiety would be more closely (and negatively) related to participants’ monitoring judgments about their math performance than trait anxiety. In Thompson et al. (2021), math anxiety predicted problem accuracy even when trait anxiety was also included in the model. This indicated that participants who were less likely to be accurate on their problem-solving performance were not just more anxious in general. Thompson and colleagues collected trait anxiety data as a critical variable for predicting downstream effects of COVID-19 worry and risk perceptions; trait anxiety was included in the current study primarily to contribute to the literature regarding whether math anxiety is distinct from other forms of anxiety (e.g., Ashcraft, 2019; Ashcraft & Ridley, 2005; Hembree, 1990; Ma, 1999; Namkung et al., 2019).

Experience-based cues: worked-example training and problem-solving accuracy

In addition to the information-based cues described above, people may also rely on experience-based cues – cues based on the actual experience of completing the task at hand – when making monitoring judgments. Thus, we also tested whether experience-based cues (problem-solving accuracy and the worked-example content of the educational intervention) related to adults’ item-level monitoring judgments.

Worked-example training

To address participants’ tendency towards whole number bias (Thompson et al., in press), Thompson et al. (2021) randomly assigned participants to an educational intervention or control condition. In the intervention condition, participants saw a worked example demonstrating, step-by-step, how to solve the target problems. This training was effective: post-training problem solving was more accurate in the experimental condition than in the control condition. Given that the intervention improved problem-solving accuracy, and problem-solving accuracy often influences monitoring judgments, we anticipated that the intervention would also result in higher monitoring judgments compared to the control condition (see Fig. 1). We examined whether the worked example influenced participants’ monitoring judgments while controlling for their problem-solving accuracy. In this way, we tested whether the experience of training, regardless of its effect on problem-solving accuracy, influences monitoring judgments. Without feedback or practice solving problems, instruction may artificially inflate the magnitude of monitoring judgments in problem solving. That is, it might lead to misalignment between confidence and performance. For instance, Baars et al. (2017) found that solving practice problems after a worked example led to greater monitoring accuracy than when worked examples were not followed by practice problems. However, mounting evidence indicates that even when performance accuracy on a math task is improved with training, monitoring judgments may not improve in tandem (Fitzsimmons & Thompson, 2021; Fitzsimmons et al., 2021).

Fig. 1 Proposed Theory of Change

Problem-solving accuracy

Problem-solving accuracy (i.e., whether a participant answered a problem correctly or not) often has a strong influence on monitoring judgments (e.g., Händel et al., 2020; Wall et al., 2016). As discussed, confidence tends to be greater when people are accurate as compared to inaccurate on the task at hand. However, in the current study, participants solved problems on which people often demonstrate misconceptions. One such misconception is whole number bias, in which individuals misapply whole-number knowledge when reasoning about ratios, such as fractions (Ni & Zhou, 2005). For example, someone may say that the flu is more dangerous than COVID-19 because 22,000 people had died from the flu and only 9,318 had died from COVID-19 (as of March 2020, when the data were collected). Even though 22,000 is greater than 9,318, when one takes the ratio of the numerator to the denominator into consideration, the case-fatality rate of COVID-19 (9,318/227,743 = 0.041 = 4.1%) was higher than that of the flu (22,000/36,000,000 = 0.00061 ≈ 0.06%). Given that people often focus on just the number of deaths and not the number of deaths relative to the number of infections, people may believe, with high confidence, that the flu is more fatal than COVID-19. That is, they may be both ‘unskilled and unaware’ (Dunning et al., 2003; Kruger & Dunning, 1999; Pennycook et al., 2017). Problem accuracy was not a direct cue in the current study because participants did not receive feedback on their accuracy; however, it is possible that participants’ accuracy affected their task-related metacognitive experiences, such as feelings of ease and difficulty. Thus, we evaluated whether participants who answered problems accurately also reported higher monitoring judgments than participants who answered inaccurately.
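Written out, the comparison that whole number bias obscures is a comparison of two ratios rather than two raw counts (using the March 2020 figures cited above):

\[ \text{CFR}_{\text{COVID-19}} = \frac{9{,}318}{227{,}743} \approx 0.041 = 4.1\%, \qquad \text{CFR}_{\text{flu}} = \frac{22{,}000}{36{,}000{,}000} \approx 0.0006 \approx 0.06\%. \]

The disease with the smaller numerator (death count) therefore had the much larger case-fatality rate.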

Other individual differences included in statistical models

In addition to the information- and experience-based cues described above, we also included several relevant individual difference variables in our analyses. Specifically, we wanted to account for individual differences in objective math skills to control for pretest math ability, explore how individual differences in math skills are related to monitoring judgments, and account for potential gender differences in monitoring judgments.

Objective math skills: Number-line estimation and Berlin numeracy

We focused on measures from both the math cognition literature (i.e., number-line estimation performance) and the health decision-making literature (i.e., Berlin numeracy) because of the interdisciplinary nature of the research questions (Thompson et al., in press).

The number-line estimation task taps numerical-magnitude knowledge (e.g., Siegler, 2016; Siegler & Booth, 2004; Siegler et al., 2011), which is a critical component of correctly solving the problems in the current study. Magnitude understanding is at the core of math achievement (Schneider et al., 2018; Siegler, 2016), and more precise estimates are predictive of greater math skills (e.g., Fazio et al., 2014; Siegler & Thompson, 2014; Siegler et al., 2011). Thus, number-line estimation precision served as a proxy for objective math ability in the current study. We also included a measure of objective numeracy commonly used in the decision-making literature: the one-item Berlin numeracy task (Cokely et al., 2012). This measure assesses people’s ability to reason with and compute numerical information. Thus, our two measures of objective math skills accounted for numerical-magnitude estimation and calculation skills, both of which are critical for completing the target problems.

Gender

It was necessary to include gender in our statistical models as a covariate because gender was not equally distributed across experimental conditions in Thompson et al. (2021). However, we were also interested in investigating the effect of gender on monitoring judgments in the domain of COVID-19-related math problem solving given that previous studies have reported gender differences in metacognitive judgments in non-health math contexts (Devine et al., 2012; Dowker et al., 2016; Else-Quest et al., 2010; Hembree, 1990; Rivers et al., 2020; Spencer et al., 2016). Specifically, women and girls tend to judge their performance with less confidence than men and boys, even when accounting for their objective performance on such tasks (Morony et al., 2013; Rivers et al., 2020). Given these gender differences in math and monitoring judgments, combined with findings that women report being more math anxious than men (e.g., Devine et al., 2012), and have more negative math attitudes than men (Sidney et al., 2021), we expected that women would make lower monitoring judgments than men in their problem solving, even when accounting for performance accuracy on the item.

Current study

The current study expands the literature in several important and novel ways. First, our analyses, which assessed whether math self-efficacy was associated with monitoring judgments, represented a novel approach to uniting the distinct literatures on health decision making and metacognition. Second, Rivers et al. (2020) found that men were more precise and confident in their number-line estimates than were women; the current study tests whether that finding extends to the applied domain of health decision making, specifically with problems pertaining to COVID-19. Third, it was unclear to our team how monitoring judgments would be affected by the worked-example educational intervention. Participants who engaged with the worked example, which was designed to improve relational understanding of rational numbers, might increase their monitoring judgments if the training helped them gain confidence for the task at hand. However, it is also possible that the worked example highlighted for participants just how incorrect they were when trying to solve the problems, thus decreasing the magnitude of their subsequent monitoring judgments. The current study tracks how monitoring judgments evolved after participants engaged in a worked-example intervention, thus extending the literature on the relations between cues used to make monitoring judgments and interventions in an applied context.

Research questions and hypotheses

Our pre-registered analyses are broken up into directional hypotheses (H1 and H2) and exploratory, non-directional research questions (RQ1 and RQ2). We:

  • (H1) Predicted that, because the intervention led to higher performance accuracy, post-intervention monitoring judgments would also be higher in the intervention group than in the control group.

  • (H2) Predicted that math anxiety would explain more variance in posttest monitoring judgments than trait anxiety, controlling for gender, pretest monitoring judgment, and experimental condition. Because participants completed a variety of math tasks in the current study, we expected a domain-specific construct, math anxiety, to be more strongly related to monitoring judgments than a domain-general measure of trait anxiety.

  • (RQ1) Non-directionally explored which individual differences were associated with monitoring judgments. We pre-registered plans to explore several individual differences (e.g., gender, math anxiety, pretest monitoring judgment, experimental condition), and we subsequently added math self-efficacy (measured with a selection of items from the SNS) to our analytic plan for the reasons indicated above.

  • (RQ2) Non-directionally explored whether participants who correctly answered the problems also reported higher monitoring judgments than participants who answered incorrectly, regardless of experimental condition.

Method

The dataset we used for the present study involved several measures related to general cognitive abilities, affect, math cognition, perception of COVID-19 susceptibility and severity, health literacy, and more. In this paper, we discuss only the measures relevant to the current hypotheses (for a full description of the measures and the full study flow, see Thompson et al., 2021).

Participants

The parent study, from which these data were collected, was approved by the Kent State University IRB; all participants provided online consent for their participation, and their participation was voluntary. Data were collected from March 24 to April 9, 2020 through Qualtrics panels. As indicated in the pre-registration of the original project (https://osf.io/9hc7d), the authors planned to sample 1,200 people in the baseline survey to get 10 days of daily diaries from at least 625 people. The secondary data analyses reported in this paper were drawn from the baseline survey only, as there were no math-related or monitoring questions answered by participants in the daily diary portion of the study.

After exclusions, the final sample for our secondary data analyses included 1,177 participants. We excluded 120 participants for having incomplete data so that we could compare included participants on all analyses. Seventy-five percent of participants self-identified as White (10.33% identified as Black or African American, 3.93% identified as Hispanic or Latino, 3.86% identified as Asian, 0.62% identified as American Indian or Alaska Native, 0.23% identified as Native Hawaiian or Pacific Islander, and the remaining 6.32% identified as multiple races or ethnicities, “other,” or did not report), 46% identified as male, 41% reported being employed for wages, and 70% reported having between some college experience and a graduate degree. The average reported age of participants was 46.9 years (SD = 17.34 years; range: 18–85 years). As noted in the parent study, there were some differences between those participants who were included and excluded from analyses: those excluded were younger and more likely to identify as White or female or to be students or self-employed. They were also less likely to be retired or employed for wages; they reported lower incomes, had taken fewer math courses, and were more likely to answer an objective numeracy question (Cokely et al., 2012) and the baseline problem-solving question incorrectly. See the original paper (Thompson et al., 2021) and the original project’s pre-registration (https://osf.io/9hc7d) for full demographic information and data cleaning procedures.

Experimental design and procedure

Participants completed a pretest problem in a generic Disease A vs. Disease B context (see Fig. 2). Then, participants were randomly assigned to one of two conditions. In the educational intervention condition, participants completed a step-by-step worked example of the correct procedures for solving the problems: accurately identifying the problem as a comparison of two ratios and then correctly transforming the ratios to make them manageable to compare. This worked example was designed to eliminate common mathematical errors. Participants in the business-as-usual control condition saw the relevant statistics – number of deaths and number of infected individuals – but were not shown how to calculate case-fatality rates or how to consider the relation between these numbers.

Fig. 2 Overview of the four problems. Note: All problems were forced choice. We have bolded the correct responses here for readers. Participants made their monitoring judgments on a slider with endpoints of 0% = I am not confident at all and 100% = I am totally confident

After the intervention, all participants completed three more problems (see Fig. 2) and reported the strategy that they used to solve each problem. Strategy reports are useful because they provide unique, convergent evidence about how participants solved the problems (see Alibali & Sidney, 2015; Sidney et al., 2018). Each problem had only three possible answers, so random responding would result in an average of 33% accuracy. Because of the real possibility of getting a problem right for the wrong reason–or getting a problem wrong for the right reason–strategy reports provide a more complete picture of participants’ cognitive processes (Fazio et al., 2017; Fitzsimmons et al., 2020b; Reder, 1987; Sidney et al., 2018, 2021; Siegler & Thompson, 2014; Siegler et al., 2011).

Immediately after each problem, participants reported their strategy use, and then provided a monitoring judgment by rating their confidence using a slider between 0 and 100% to answer the question “How confident are you in your decision regarding the fatality rates? 0% = I am not confident at all, to 100% = I am totally confident?” Each of these monitoring judgments reflected participants’ confidence in the accuracy of their answer to the problem they had just completed.

Participants completed the pretest problem, the math anxiety scale, and the number line estimation tasks prior to the intervention. After the intervention, participants completed the remaining problems and the trait anxiety scale. There were a number of other measures included in the parent project (e.g., risk perceptions related to COVID-19) that are not central to the current hypotheses (see Appendix 3 in Thompson et al., 2021).

Materials

Measures relevant to the current analyses are described in the order they were completed by participants.

Pretest problem

As shown in Fig. 2, participants compared health statistics for two hypothetical diseases and chose which of the two was more fatal. Disease A (analogous to flu statistics at the time) included a bigger numerator and a bigger denominator as compared to Disease B (analogous to COVID-19 statistics at the time), even though the magnitude of the risk was larger for Disease B.

Pretest subjective and objective tasks

Participants rated their math attitudes and math anxiety prior to completing measures of their objective math skills because prior research (e.g., Sidney et al., 2021) suggested that completing difficult fraction tasks first results in more negative attitudes about math in general and about fractions, whole numbers, and percentages specifically. The order of the measures within these two blocks was randomized for all participants.

Math anxiety

Participants rated their overall math anxiety on the Single-Item Math Anxiety Scale (SIMA; Ashcraft, 2002; Núñez-Peña et al., 2014) and their math anxiety about specific types of numbers (e.g., fractions) across four items on Likert-like scales ranging from 1 = “Not anxious” to 10 = “Very anxious.” We calculated a math anxiety index by averaging scores across the five items.

Trait anxiety

We included the validated 20-item trait anxiety scale from Spielberger et al. (1970). A sample item is: “I worry too much over something that really doesn’t matter.” Participants made their ratings on four-point Likert-like scales that ranged from “Almost never” to “Almost always.” We aggregated the ratings by summing across all 20 items (yielding a possible range of 20–80, with higher scores representing higher reported anxiety).

Math attitudes

Participants completed the 20-item Math Attitudes Questionnaire (MAQ: adapted from Sidney et al., 2021) regarding their attitudes about math in general, as well as their specific attitudes toward whole numbers, fractions, and percentages. We calculated a math-attitudes index by averaging scores across the 20 items. We had planned to include the MAQ in our regression models, but this variable was strongly related to the SNS scale, and including both predictors would have led to issues of multicollinearity. Because the SNS subscale accounted for more variance in metacognitive monitoring judgments, only the SNS was retained in the reported model. Additionally, given that Peters et al. (2019) considered the first four items of the SNS a measure of “confidence,” we explored whether it accounted for unique variance in item-level monitoring judgments. We also included the MAQ in the correlation matrix to indicate how it correlated with the other tasks in the current study.

Math self-efficacy

Peters et al. (2019) conceptualized “confidence” in math as the first four items on the Subjective Numeracy Scale (SNS), an eight-item measure of people’s subjective preferences for and comfort with math (Fagerlin et al., 2007). In the introduction, we argued that this conceptualization of confidence is not consistent with the way the construct is handled in the domain of metacognition. Rather, we argued that the first four items of the SNS should be considered a proxy for participants’ math self-efficacy with rational numbers because the questions specifically address perceived ability with fractions and percentages: (a) How good are you at working with fractions? (b) How good are you at working with percentages? (c) How good are you at calculating a 15% tip? and (d) How good are you at figuring out how much a shirt will cost if it is 25% off? Similar to the self-perceived ability subscale of the MAQ, we adopted this subscale of the SNS to examine individual differences in participants’ pretest monitoring judgments.

We did not preregister plans to include the SNS in our statistical models. However, as our project developed, we realized that this subscale may more accurately reflect math self-efficacy and preferences pertaining specifically to rational numbers. Thus, we deviated from our pre-registered analysis plan to include this additional, theoretically valuable variable in our models (see results section for more details).

Magnitude knowledge

We measured number-magnitude knowledge with number-line estimation tasks for fractions, whole-number frequencies, and percentages. Number-line estimation is an ideal proxy for overall mathematics skill because it is quick (Fazio et al., 2017) and easy to administer. Performance on each trial was measured as percentage of absolute error (i.e., PAE; Siegler & Booth, 2004). PAE is calculated by taking the absolute value of the difference between the person’s estimate and the to-be-estimated number and dividing by the scale of the number line (|person’s estimate − to-be-estimated number| / scale of estimates). We averaged PAE across trials within each range and then calculated an average across ranges, such that higher PAE scores indicated greater estimation error (worse performance).
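For concreteness, the PAE computation can be sketched in a few lines of code. This is a minimal illustration of the formula above, not the authors’ analysis script; the trial values and variable names are hypothetical:

import statistics

def percent_absolute_error(estimate, target, scale):
    # PAE for one trial: |estimate - target| / scale of the number line, as a percentage
    return abs(estimate - target) / scale * 100

# Example: two hypothetical trials on a 0-1 fraction number line (estimate, target, scale)
trials = [(0.31, 0.25, 1.0), (0.60, 0.50, 1.0)]
pae_per_trial = [percent_absolute_error(e, t, s) for e, t, s in trials]  # [6.0, 10.0]
mean_pae = statistics.mean(pae_per_trial)  # 8.0; higher values = less precise estimates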

Objective numeracy

We operationalized objective numeracy–or the ability to “run the numbers” correctly (Lipkus et al., 2001; Peters, 2020; Peters et al., 2019; Schwartz et al., 1997)–by adopting Cokely et al.’s (2012) 1-item, free-response version of the Berlin Numeracy Test. This measure asks, “Imagine we are throwing a 5-sided die 50 times. On average, out of these 50 throws how many times would this 5-sided die show an odd number (1, 3, or 5)?” Participants’ answers were coded as either correct (i.e., 30 times) or incorrect. Including this measure, commonly used in the health decision-making literature, in our models allowed us to account for possible differences in general numerical ability.
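The keyed response follows from a simple expected-value calculation: three of the five faces (1, 3, and 5) are odd, so the expected count of odd outcomes is

\[ 50 \times \frac{3}{5} = 30 \text{ throws on average.} \]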

Educational intervention vs. business-as-usual control

Participants were randomly assigned to one of two experimental conditions: the educational intervention designed to combat whole number bias errors (problem-solving errors in which people attend only to the individual components of a ratio and ignore its relational nature; Ni & Zhou, 2005) or the business-as-usual control condition (Fig. 3). The intervention included an analogy to a familiar context–an apple orchard–designed to help participants understand the problems conceptually. That is, the intervention included a worked example in which a familiar context (apples rotting at different rates in two apple orchards) illustrated the procedural steps that could be followed to calculate a case-fatality rate, emphasizing the use of number lines and relational reasoning. Then, participants were asked to draw an analogy from the apple orchard worked example to a worked example in a health context, which showed how to calculate and compare the case-fatality rates for COVID-19 and the flu. See Thompson et al. (2021) for details. For the purposes of the secondary data analyses in the current study, we verified that pretest item-level monitoring judgments did not significantly differ between the experimental and control groups (p = 0.592).

Fig. 3 Order in Which Participants Completed Tasks for Both Experimental and Control Groups. Note. Order of measures was randomized for the block including math anxiety, math self-efficacy, and math attitudes. Order was also randomized for the objective math measures, including number-line estimation and Berlin Numeracy

Results

Data analytic plan

We preregistered hypotheses on OSF (https://osf.io/vxm8d/?view_only=ca422f16682947d98a801e42f4d94b90). First, we computed zero-order correlations to assess the relations between pretest and post-intervention item-level metacognitive judgments and other relevant variables (e.g., math anxiety and self-perceived math ability). We then ran linear regressions to identify the factors associated with item-level metacognitive judgments (RQ1), to test whether performance accuracy, regardless of condition, was associated with item-level metacognitive judgments (RQ2), and to test whether math anxiety was more strongly associated with item-level metacognitive judgments than trait anxiety (H2). Finally, to test hypothesis 1, in which we predicted higher item-level metacognitive judgments in the intervention vs. control group, we conducted ANCOVAs.
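To make the model structure concrete, a minimal sketch of an RQ1-style regression and an H1-style ANCOVA follows. This is an illustration under stated assumptions, not the authors’ analysis code; the data file and column names are hypothetical:

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("baseline_survey.csv")  # hypothetical file of item-level data

# RQ1-style regression predicting the monitoring judgment for one posttest problem
rq1_model = smf.ols(
    "judgment_post1 ~ C(condition) + judgment_pre + accuracy_pre + accuracy_post1"
    " + C(gender) + pae + self_efficacy + numeracy + math_anxiety + trait_anxiety",
    data=df,
).fit()

# H1-style ANCOVA: condition effect on posttest judgments, controlling for gender and
# pretest judgments (an ANCOVA is an OLS model with a categorical factor plus covariates)
h1_model = smf.ols(
    "judgment_post1 ~ C(condition) + C(gender) + judgment_pre", data=df
).fit()

print(rq1_model.summary())
print(h1_model.summary())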

Correlations among study variables

As anticipated, many of the variables in this study were correlated because they tapped similar underlying constructs (see Table 1 for descriptive statistics and correlations). For example, number-line estimation PAE and objective numeracy were strongly correlated (i.e., objective math skills), as were subjective numeracy, math anxiety, and math attitudes (i.e., affective factors); item-level metacognitive judgments before and after the intervention (i.e., awareness of performance on the problems); and math anxiety and trait anxiety (i.e., anxiety more broadly). Measures of magnitude understanding (PAE) and affective reactions to math (math attitudes and math anxiety) were moderately related (absolute value of r’s ≥ 0.32, p’s < 0.001). As seen in Table 1, trait anxiety and math anxiety were also moderately correlated (r = 0.31, p < 0.001).

Table 1 Descriptive Statistics and Correlations for Study Variables

Monitoring judgments from pretest and posttest were all strongly related to one another (r’s = 0.48 to 0.70, p’s < 0.001). That is, those who rated their item-level metacognition high at pretest also rated their item-level metacognition high at posttest across all items. To foreshadow our subsequent results, self-perceived math ability (operationalized as the first four items on the SNS) was moderately associated with monitoring judgments across problems, even when accounting for our other predictors.

Primary models

Variables associated with monitoring judgments

To explore the factors that predicted item-level metacognitive judgments (RQ1), we conducted linear regressions predicting item-level judgments separately for each of the four problems. We included condition (except at pretest), pretest monitoring judgments (except at pretest), participants’ pretest problem-solving accuracy (except at pretest), gender, PAE, problem-solving accuracy for the current problem, math self-efficacy, objective numeracy, math anxiety, and trait anxiety in our models. As indicated in the note for Table 1, we did not include the full MAQ or the MAQ self-perceived math ability subscale in our models because math self-efficacy (i.e., the SNS self-perceived math ability subscale), the full MAQ, and the MAQ self-perceived math ability subscale were highly correlated (all r’s > 0.78). We were particularly interested in how the SNS subscale was associated with item-level metacognitive judgments; thus, we retained this variable in our models instead of the MAQ.

As seen in Table 2, when accounting for individual differences in pretest monitoring judgments, the only variable that was a significant predictor of monitoring judgments on all four problems was the four-item SNS self-perceived math ability subscale (β’s ≥ 0.12, t’s ≥ 4.10, p’s < 0.001). Problem-solving accuracy on the current problem (except at pretest) was also significantly associated with monitoring judgments, as was gender. See Appendices A and B for the full models.

Table 2 Linear Regression Standardized Beta Coefficients for Predictors of Monitoring Judgments

Differences in monitoring judgments by experimental condition

We hypothesized that post-intervention monitoring judgments would be higher in the intervention group than in the control group, controlling only for pretest monitoring judgments and gender (H1). This hypothesis stemmed from the fact that participants were more accurate at solving problems (both posttest problems 1 and 2) in the intervention group than in the control group (Thompson et al., 2021); thus, we anticipated increased monitoring judgments as well.

To test this hypothesis, we conducted ANCOVAs comparing the effects of experimental condition on monitoring judgments (dummy coded as intervention vs. control) while controlling for gender (dummy coded as male vs. not male) and pretest monitoring judgments. Consistent with hypothesis 1, participants’ monitoring judgments were higher in the intervention group than in the control group on all post-intervention problems (see Table 3). On posttest problem 1, the intervention group reported higher monitoring judgments (M = 83.89, SE = 0.84) than the control group (M = 80.88, SE = 0.85), F(1, 1173) = 6.30, p = 0.012, partial η2 = 0.01. Similarly, on posttest problem 2, the intervention group (M = 82.11, SE = 0.83) reported higher monitoring judgments than the control group (M = 79.67, SE = 0.85), F(1, 1173) = 4.19, p = 0.041, partial η2 = 0.004. Finally, on posttest problem 3, the intervention group (M = 80.40, SE = 0.89) again reported higher monitoring judgments than the control group (M = 77.45, SE = 0.91), F(1, 1173) = 5.37, p = 0.021, partial η2 = 0.01.

Table 3 Observed means, Standard deviations, and Analysis of Covariance for Hypothesis 1 (Comparing Differences in Monitoring Judgments by Experimental Condition)

Problem accuracy and monitoring judgments

To test our second research question (RQ2), we ran separate hierarchical linear regressions to provide convergent evidence regarding whether people who answered the problems correctly also reported higher monitoring judgments, regardless of experimental condition. Problem accuracy was dummy coded as correct or incorrect. Participants who were accurate reported higher monitoring judgments on the pretest problem, β = 0.08, t = 2.63, p = 0.009, though this effect was no longer significant when gender was added to the model, β = 0.05, t = 1.71, p = 0.087. However, males reported higher monitoring judgments than females, β = 0.19, t = 6.43, p < 0.001.

Then, we included accuracy on each respective problem in block 1, and gender and pretest monitoring judgments in block 2, to predict monitoring judgments on each of the post-intervention problems in separate regression models. All six models exhibited good fit (block 1: all F’s > 19.90, p’s < 0.001, R2’s > 0.02; block 2: all F’s > 143.10, p’s < 0.001, R2’s > 0.26), and most of the individual predictors were related to participants’ monitoring judgments; the only exception was that gender was not a significant predictor for posttest problem 2, β = 0.04, t = 1.42, p = 0.156.

To summarize, confidence (i.e., the magnitude of monitoring judgments) was higher for those who were accurate on problems relative to those who were inaccurate, suggesting some level of metacognitive awareness at the group level. Means, standard deviations, and independent-samples t-tests illustrating mean differences in monitoring judgments by problem accuracy are shown in Table 4. Note that the difference between accurate and inaccurate responders on monitoring judgments was smallest at pretest, t(1175) = 2.81, p < 0.01, d = 0.16, compared to posttest d’s between 0.30 and 0.42 (see Table 4). The larger differences in monitoring judgments between accurate and inaccurate responders appear to be driven by lower confidence in inaccurate responses rather than higher confidence in accurate responses. Given that participants did not receive feedback on their performance, it seems likely that characteristics of the problems elicited experience-based cues, such as perceptions of difficulty, that influenced their monitoring judgments. For example, posttest problem 3 included exceptionally large numbers (e.g., 14 billion), did not include a contingency table, and had the smallest difference in magnitude of infection rate between countries (see Fig. 2 for problem features). These features may have served as cues that influenced participants’ monitoring judgments, even though Thompson et al. (2021) reported that accuracy on posttest problem 3 was higher than expected, potentially because of news sources reporting on Italy’s and China’s infection rates. In fact, monitoring judgments were lowest on this final problem, regardless of accuracy (Table 4).

Table 4 Observed means, Standard Deviations, and t-tests for Research Question 2 (Comparing Monitoring Judgments for Accurate vs. Inaccurate Problem-Solvers)

Math anxiety, trait anxiety, and monitoring judgments

Finally, we hypothesized that math anxiety would explain more variance in posttest monitoring judgments than trait anxiety, controlling for gender, pretest monitoring judgments, and experimental condition (H2). We based this prediction on prior research that suggested math anxiety might play a unique role above and beyond general trait anxiety when examining negative affective reactions in a specific math domain (Ashcraft & Ridley, 2005; Barroso et al., 2020; Dreger & Aiken, 1957; Gough, 1954; Hembree, 1990; Zhang et al., 2019).

We regressed posttest monitoring judgments (for each individual problem) onto gender, experimental condition, pretest monitoring judgments, and trait anxiety in the first step. We then evaluated the variance explained by math anxiety and by trait anxiety above and beyond all other predictors in the model for each of the four problems. The ΔR2 values in Table 5 reflect the additional variance explained when an individual predictor is added to the model containing all other variables. Across all three post-intervention problems, adding math anxiety to the model improved model fit (all ΔR2 values = 0.01; all ΔF p-values < 0.01).
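The logic of this hierarchical step can be sketched as a nested-model comparison: fit the model with and without math anxiety, then test the improvement in fit. As before, this is an illustration with hypothetical column names (reusing the data frame from the earlier sketch), not the authors’ code:

import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Step 1: all predictors except math anxiety
base = smf.ols(
    "judgment_post1 ~ C(gender) + C(condition) + judgment_pre + trait_anxiety",
    data=df,
).fit()

# Step 2: the same model plus math anxiety
full = smf.ols(
    "judgment_post1 ~ C(gender) + C(condition) + judgment_pre + trait_anxiety"
    " + math_anxiety",
    data=df,
).fit()

delta_r2 = full.rsquared - base.rsquared  # additional variance explained (Delta R^2)
f_change = anova_lm(base, full)           # F-test of the improvement in model fit
print(delta_r2)
print(f_change)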

Table 5 Hierarchical Linear Regression Standardized Beta Coefficients for Hypothesis 2 (Unique Variance in Monitoring Judgments Accounted for by Math Anxiety)

As shown in Table 5, math anxiety remained a significant predictor of monitoring judgments (β’s between -0.08 and -0.10; all p’s < 0.01) above and beyond trait anxiety. However, trait anxiety was a significant predictor above and beyond math anxiety only for the third posttest problem, ΔF(1, 1191) = 10.64, ΔR2 = 0.007, p = 0.001. It is an open question why trait anxiety was a stronger predictor for post-intervention problem 3 than for the other problems. One possible explanation is that there was something unique about problem 3. The problems in this study were complicated and involved real-world scenarios. For example, problem 3 compared two countries that were discussed extensively in the news at the time, and thus accuracy on the problem may have been high even if participants did not engage deeply with the mathematics involved. See Table 5 for the standardized beta coefficients and Appendix 3 for the full models.

Discussion

In this secondary data analysis, we investigated the types of information-based cues (i.e., math self-efficacy, math anxiety, trait anxiety) and experience-based cues (i.e., the worked-example intervention) that related to adults’ monitoring judgments about their performance on problems (Koriat, 1997; Koriat & Levy-Sadot, 1999).

First, several individual differences accounted for variance in metacognitive monitoring judgments: Males reported higher monitoring judgments than females, accurate responders reported higher monitoring judgments than inaccurate responders, and participants who made more precise estimates of numerical magnitudes reported higher monitoring judgments than participants who made less precise estimates of numerical magnitudes. Further exploratory analyses (see Supplemental File) revealed that math anxiety mediated the effect of gender on pretest monitoring judgments. Second, consistent with our first hypothesis, participants in the intervention condition reported higher monitoring judgments than participants in the control condition. Third, individual differences accounted for variance in item-level monitoring judgments: (a) Participants with higher math self-efficacy reported higher item-level monitoring judgments than participants with lower math self-efficacy, controlling for a number of other predictors as described in the introduction, (b) participants with lower math anxiety reported higher monitoring judgments than participants with higher math anxiety, and (c) participants with more positive math attitudes reported higher monitoring judgments than participants with less positive math attitudes (see the correlation matrix in Table 1). Finally, consistent with prior research (e.g., Ashcraft, 2019; Ashcraft & Ridley, 2005; Hembree, 1990), participants’ math anxiety appeared to be separable from participants’ trait anxiety.

Significance of the present study

Our data indicate that when people solve math problems, they bring a variety of individual differences to the experimental context, which can contribute to differences in their monitoring judgments, regardless of their accuracy on the problem (Koriat, 2011). For example, on two of the three post-intervention problems, participants’ gender was associated with their monitoring judgments even when the other predictors were in the model (see Table 2). Furthermore, across all problems, ratings of math attitudes and math anxiety were significantly associated with monitoring judgments (all r’s ≥ 0.34, all p’s < 0.001). Subjective factors such as math attitudes and math anxiety may therefore serve as commonly used cues that inform monitoring judgments.

Furthermore, consistent with prior research, our data suggested that women report lower monitoring judgments than men when rating how sure they were that they had correctly answered math problems (Barroso et al., 2020; Devine et al., 2012; Hembree, 1990; Wigfield et al., 1991), and people with higher math anxiety reported lower monitoring judgments (Hembree, 1990; Morsanyi et al., 2014; Rolison et al., 2016). Women–who are on average less precise when estimating the magnitudes of numbers (Geary et al., 2020; Hutchison et al., 2019; Rivers et al., 2020; Thompson & Opfer, 2008) and rate themselves as less confident in their ability to estimate magnitudes (Rivers et al., 2020)–have higher math anxiety (Devine et al., 2012), lower numeracy (Weller et al., 2013), and more negative attitudes about rational numbers than men (Sidney et al., 2021). In the current study, women also provided lower monitoring judgments on their problem-solving performance than men. The reasons for these gender differences are up for debate (Halpern et al., 2007); however, gender differences in spatial abilities (Geary et al., 2020), which may stem from early gender differences in spatial experiences (e.g., spatial language and block play; Pruden et al., 2011), are one widely accepted predictor of gender differences in math.

Relevance for health care

Many people struggle to comprehend numerical health information (Peters et al., 2019; Waters et al., 2016). People may be likely to avoid numerical health information if they believe that they cannot interpret the information (Afifi & Weiner, 2004; Sweeny et al., 2010), if they have high math anxiety (Rolison et al., 2016), or if they have negative attitudes toward the numbers (Sidney et al., 2021). People’s level of certainty in their interpretations of numerical health information, their comprehension of health statistics, and their own perceived disease susceptibility can influence their medical decision-making (Desender et al., 2018; Peters et al., 2019; Taber & Klein, 2016). If people are not confident in their ability to accurately interpret quantitative health information, they may turn to other types of information, such as affective factors and personal values, to make health decisions, which could have downstream implications for uptake of preventive health recommendations (e.g., social distancing, wearing a mask, getting vaccinated). Conversely, if people are inappropriately confident while lacking adequate understanding of numeric health information, they might avoid healthy behaviors, engage in unhealthy behaviors for loved ones and themselves, and be less likely to listen to opposing data. One real-world consequence of monitoring judgments is that they affect when people offer opinions (Dunlosky & Metcalfe, 2009): high monitoring judgments paired with inaccurate understanding of numerical health information could contribute to the spread of misinformation. Note that COVID-19 was the domain in which we tested our hypotheses due to its international relevance; however, we do not believe that our findings are strictly limited to the context of COVID-19, and they should be applicable to adults’ metacognitive monitoring of any health statistics. It is also important to note that, while the current study offers some data on the relation between monitoring judgments and several other factors, some of the reported effect sizes are small. Small effect sizes can often be indicative of real, important effects (Funder & Ozer, 2019), especially when the effects translate to large numbers of people, as with an international pandemic. Regardless of effect size, it is important to thoughtfully consider the appropriate reach of recommendations and implications (see Robinson et al., 2013; Robinson & Levine, 2019).

Future research might assess whether monitoring judgments also play a role in conveying incorrect health information to others. There is a vast amount of misinformation regarding COVID-19, exacerbated by easy access to social media (Frenkel et al., 2020; Pennycook et al., 2020; Russonello, 2020). It is possible that a person who confidently believes that COVID-19 is no worse than the flu (or similar minimizing beliefs) could be more likely (compared to a person who is not confident in this belief) to perpetuate this false belief by sharing the information with friends and family. More than two years after the WHO declared COVID-19 a pandemic, misinformation continues to spread, and accurate information about COVID-19 can be hard to find.

Relevance for education

There are important educational implications regarding the role of gender and math anxiety (Ashcraft & Ridley, 2005; Baloģlu, 2004; Barroso et al., 2020; Beilock et al., 2010; Devine et al., 2012; Dowker et al., 2016; Hembree, 1990; Ma, 1999; Tomasetto, 2019). In the current study, we explored how gender and math anxiety were related to monitoring judgments in health-related math problem solving. With a broader lens, we also examined how math self-efficacy for rational numbers (i.e., the first four items of the SNS) was associated with math anxiety and monitoring judgments. It remains an open question whether math self-efficacy is a domain-general trait that could affect participants’ perceptions of their ability in a wide variety of contexts, ranging from simple math-related decisions, such as choosing an appropriate tip, to more complex math-related decisions, such as evaluating health care options.

Future research could assess whether other math interventions, which have been successful in academic settings, might also lead to higher monitoring judgments in health and other domains. Our data suggest that primary targets for future interventions could be negative affective reactions to math, such as poor math attitudes and high math anxiety (cf. Jamieson et al., 2010; Park et al., 2014; Supekar et al., 2015). Decreasing these negative affective reactions may increase people’s monitoring judgments and may be beneficial in both educational and health contexts.

Limitations and future directions

One limitation of the present study is inherent to any secondary data analysis: we were only able to investigate the measures included in the publicly available data set. Future studies might include several additional measures, such as various measures of math self-efficacy, math self-concept, and math anxiety. For example, math anxiety as a construct has been measured in many ways (see Cipora et al., 2019 for a review). Some inventories focus on formal academic learning and testing contexts (Carey et al., 2017; Hopko et al., 2003; Jameson, 2013; Yánez-Marquina et al., 2017), others focus on affective reactions to math such as worry (Bai, 2011; Harari et al., 2013; Wigfield & Meece, 1988), and still others focus on everyday math contexts (Hunt et al., 2011; Yánez-Marquina & Villardón-Gallego, 2017). Math anxiety in different contexts was not the focus of the current study; future research could establish whether health-related statistics make adults feel more or less math anxious than math encountered in academic contexts. Just as math perceptions (e.g., math self-efficacy, math self-concept, and math anxiety) often overlap and are presented in different ways, terminology related to confidence is discussed differently across domains (Stankov et al., 2015). Notably, domains with substantial overlap (e.g., health decision making, math cognition, and metacognition) share the term–confidence–yet operationalize and interpret it differently (see Thompson et al., in press for a more in-depth discussion of this topic).

A related limitation, which could guide future research, is the high level of overlap among math anxiety, trait anxiety, math attitudes (especially the efficacy subscale), and math self-efficacy. Correlations among these variables were moderate to extremely high, suggesting that the constructs overlap substantially. It is possible that the significant correlations between these measures are largely due to the nature of self-report surveys: participants might not be able to differentiate among the subtly different items assessing math anxiety, math attitudes, and their perceptions of their own math ability (i.e., math self-efficacy). While self-report surveys are by far the most common way to assess math anxiety (Cipora et al., 2019), other methodologies could be used in the future, such as physiological measures and administering state, as opposed to trait, measures. Additionally, participants may have reacted to completing the math anxiety and math attitude items near the beginning of the experiment. Order effects should regularly be a consideration for any experiment involving many different measures. However, previous research on math attitudes suggested that completing math tasks first would induce lower overall math attitudes; thus, Thompson et al. (2021) did not counterbalance task order (see Sidney et al., 2021 for a discussion of order effects in this domain).

Finally, several other interesting open research questions remain. For example, are mathematical monitoring judgments in general strongly related to monitoring judgments across different rational number types (e.g., ratios, fractions, decimals, percentages)? Similar to attitudes about math (e.g., Sidney et al., 2021), are gender differences in math anxiety and monitoring judgments differentiated by the type of mathematics being done? Because some less diagnostic subjective cues, such as math self-efficacy, are stronger predictors of pretest monitoring judgments than more diagnostic, direct-access cues, such as problem-solving accuracy, it is a worthwhile endeavor for future research to investigate ways to de-bias people’s tendency to focus on less diagnostic cues (Fitzsimmons et al., 2020b). Future research should investigate how aware people are of using specific cues as they make math-related health decisions, given that a common assumption of cue-utilization theory (e.g., Koriat & Adiv, 2016) is that cues shaping subjective confidence judgments (i.e., monitoring judgments) can operate below conscious awareness. Metacognitive experiences (Efklides, 2006) are complicated inferential processes that use a variety of cues to help guide decision-making and problem solving. Follow-up studies should investigate if people’s beliefs about what drives their monitoring judgments align with the data presented in the current study.

Conclusions

The current study involved a secondary analysis of Thompson et al.’s (2021) data. In the original project, a national panel of adults learned how to calculate and compare COVID-19 and flu case-fatality rates. In this secondary data analysis, we examined the factors that predicted people’s monitoring judgments (i.e., confidence) pertaining to their performance on problems involving COVID-19. Monitoring judgments are only as useful as the cues from which they are derived (Ackerman & Thompson, 2017; Koriat, 1997); thus, developing a more rigorous understanding of which cues inform math-related decision making in health contexts warrants future research. Our data suggest that people use both experience-based and information-based cues when making monitoring judgments while solving health-related math problems. We argue that both math self-efficacy and monitoring judgments have relevant implications for downstream effects in both health and education domains.