Introduction

Confidence, Certainty, and Correctness

The present study concerns the interplay of students’ self-reported confidence and certainty with response correctness in educational assessment. The study relies on the following foundational definitions of confidence, certainty, and correctness. Confidence refers to an individual’s sureness that they can provide a correct response to an assessment item pertaining to a given set of clearly defined criteria. For instance, in the academic environment, a student may be confident that they could provide a correct response to a question pertaining to a course learning objective (LO) if asked to do so. Certainty, on the other hand, refers to an individual’s sureness that they have provided a correct response to a given assessment item. Certainty differs from confidence in that certainty is measured after a response has been provided and reflects past performance, whereas confidence is measured before an assessment is administered and reflects anticipated performance.

Correctness refers to the evaluation of a response for trueness based on accepted standards and field expertise. For example, when asked to name the largest ocean on Earth, the correct answer is, objectively and without debate, the Pacific Ocean. While many fields and assessments award no credit for an incorrect answer, some fields, such as mathematics, consider the relationship between correctness and earned credit in a broader context. The process of answering a question in mathematics is often evaluated separately from the final response, and it is therefore standard practice to award partial credit for correct procedural work despite an ultimately incorrect response. In this regard, correctness is not defined solely by the final response to an assessment item.

Confidence, certainty, and correctness relate to the foundational principles of assessing knowledge. By definition, knowledge is information that is true and justified (Hunt, 2003). Customary academic assessments rely on the correctness of student responses to evaluate the trueness of knowledge, but relatively few evaluate metacognitive elements such as confidence and certainty to justify that trueness. The most studied model evaluates self-reported certainty to justify response correctness and accurately distinguish complete, partial, absent, and flawed knowledge (Gardner-Medwin & Gahan, 2003; Ghadermarzi et al., 2015; Hunt, 2003; Snow, 2019), but confidence has received less attention. As confidence has the potential to influence an individual’s capacity to act and make decisions throughout the cognitive process, it may influence tendencies toward certain behaviors and performance outcomes. For example, a previous study by Veenman et al. (2006) suggests that consistently experiencing high confidence or certainty alongside incorrect responses may prevent children from properly amending their perceptions of their own knowledge for future use.

Confidence, certainty, and correctness, hereinafter referred to collectively as the “3Cs,” have not previously been assessed simultaneously to strategically identify problematic performance behaviors and assessment item deficiencies. Methods for quantifying the accuracy of confidence and certainty against correctness, especially for process-based questions (PBQs) invoking problem-solving tasks, have also not been studied but may help students and STEM educators identify areas of frequent metacognitive misalignment. This paper discusses the implementation and efficacy of a novel 3C assessment method (the “The 3C Assessment Method Structure” section) in an undergraduate linear algebra course and presents the usefulness of an accuracy index (the “Accuracy Indices and Metacognitive Accuracy Plots” section) used to measure students’ accuracy in their confidence and certainty. The results and discussion in this paper (the “Results” and “Discussion” sections) present two practical utilities of the information produced by this 3C assessment method: identifying problematic student behaviors and informing teaching and assessment improvements.

Metacognition and Mathematical Problem-Solving

Academic assessment in STEM education often requires students to use and demonstrate problem-solving techniques to arrive at answers. The terms problem and problem-solving have been used with varying definitions throughout the literature on mathematics and mathematics education (Schoenfeld, 2016). In accordance with the first of Webster’s (1979) definitions as presented by Schoenfeld (2016), the following discussion takes the term problem to mean an item in mathematics which requires individuals to perform a particular task. The term problem-solving refers to the process of performing that task. Problems which arise in an academic environment are often routine in nature, with a known strategy (or collection of strategies) for producing a correct response. However, the domain of mathematics often involves tasks for which an individual may not yet be aware of such a prescribed algorithm; the definitions presented above further pertain to these less routine tasks. While Schoenfeld notes that one intent of problem-solving is “to learn standard techniques in particular domains,” the linear algebra course examined in this study also treats problem-solving as a means for students to demonstrate their understanding of such techniques during academic assessment (Schoenfeld, 1983, 2016).

Polya (1957) describes four phases of the problem-solving process: understanding, planning, carrying out the plan, and looking back. However, engagement in problem-solving tasks invokes multiple levels of cognitive processing. Kitchener (1983) proposed a three-tier model of cognitive processing consisting of cognition, metacognition, and epistemic cognition. The present study focuses on the first two tiers of this model, with cognition referring to tasks such as “computing, memorizing, reading, perceiving, acquiring language, etc.,” and metacognition referring to “the processes which are invoked to monitor cognitive progress when an individual is engaged in … cognitive tasks” (Kitchener, 1983). As indicated by Garofalo & Lester (1985), Polya’s early model of problem-solving does not directly incorporate metacognition. For instance, providing a correct response does not necessarily imply that an individual possesses knowledge pertaining to that task (e.g., a student guessing the correct answer from a list of options). Addressing this notion, Hunt describes how knowledge requires the truth of information to be justified with metacognitive belief (Hunt, 2003). This description of knowledge as multidimensional accords with Ayer’s earlier account, which states that an individual possesses knowledge when their belief is true, they are certain that their belief is true, and their certainty is justified (Ayer, 1958; Hunt, 2003). In reference to problem-solving, this view of knowledge not only addresses cognitive success but also observes, via self-monitoring of cognitive performance, a higher tier of cognitive processing: metacognition. Adapting Polya’s four-phase model, Garofalo & Lester (1985) proposed a cognitive-metacognitive framework for problem-solving which includes elements of metacognition occurring alongside cognitive tasks throughout each phase of the problem-solving process.

A key aspect of metacognition involves monitoring cognitive progress throughout a cognitive task, including both successes and failures (Flavell, 1979; Garofalo & Lester, 1985; Kitchener, 1983; Schneider & Artelt, 2010). This aspect of metacognition is referenced in the literature under several names, such as self-monitoring, ease-of-learning judgments, judgments of learning, feeling-of-knowing judgments, or perceived self-efficacy, each with subtle distinctions (Bandura, 1977; Schneider & Artelt, 2010). Garofalo & Lester’s (1985) cognitive-metacognitive framework for problem-solving consists of four main phases (orientation, organization, execution, and verification) and highlights, as subitems within these phases, both cognitive and metacognitive activities invoked during the process of performing a problem-solving task. In the orientation phase of this framework, Garofalo & Lester (1985) identify confidence with “assessment of familiarity with task” and “assessment of difficulty and chances of success,” which occur prior to the planning or execution of the problem-solving task and support the working definition in the present study. Furthermore, Garofalo & Lester (1985) identify certainty with “evaluation of orientation and organization” and “evaluation of execution” during the verification phase, which occurs at the end of the problem-solving process and is again consistent with the use of the term in the present study. In these regards, Garofalo and Lester’s cognitive-metacognitive framework for problem-solving illustrates how confidence and certainty occur at critically different moments. In further agreement with Garofalo and Lester, Ayer’s criteria for possessing knowledge necessitate self-monitoring during the verification phase via self-reflection on performance (i.e., one’s level of certainty in their answer). Alternatively, self-monitoring during the orientation phase, which occurs prior to the performance of cognitive tasks, pertains to confidence as a preemptive anticipation of one’s acquisition of certainty.

Assessing Metacognitive Elements Alongside Correctness

As introduced above, a comprehensive assessment of knowledge requires evaluation of both information correctness and its justification. Assessment in education often evaluates knowledge by a single factor, correctness, allowing only for the distinction between correct and incorrect responses. However, the inclusion of response justification (i.e., self-reported certainty) allows for more refined classifications of knowledge. Absent knowledge is demonstrated when a student provides a response, correct or incorrect, using only guesswork, or does not provide a response at all. Conversely, when a student is certain that their response is correct and the response is actually correct, complete knowledge is justified; when the response is actually incorrect in the same scenario, flawed knowledge is indicated. Furthermore, a student is said to have partial knowledge when they possess some correct information about a given assessment item but are unable to identify the correct response (Coombs et al., 1956) with full certainty. Aiming to distinguish between absent, complete, flawed, and partial knowledge, methods that assess response justification alongside correctness have primarily been researched among multiple-choice questions (MCQs) and some end-product questions (EPQs), where traditional “all or none” scoring would otherwise be sufficient (Bush, 2015; Snow, 2019).

Bush presents a compilation of eight alternative methods for MCQ assessment which aim to assess partial knowledge (Bush, 2015). Most of these methods are designed to discourage guesswork by introducing marking schemes that penalize guesswork and flawed knowledge while rewarding students’ acknowledgement of gaps in their knowledge. For example, as a means of accomplishing these goals, one method allows students to select as many (multiple-choice) options as needed to be sure that the correct answer is included in their set of selected responses (Bush, 2015). While each of the methods presented by Bush demonstrates advantages and disadvantages, Bush suggests that “[in] mathematical calculation…, the notion of attaching a [certainty] level to one answer seems much more appropriate than having the freedom to select multiple answers” (Bush, 2001).

A method developed by Gardner-Medwin, often referred to as certainty-based assessment, collects response correctness in conjunction with student certainty that their response is correct (Gardner-Medwin, 1995). Evaluating post-item certainty in conjunction with response correctness via this method has been shown to accurately distinguish absent, partial, complete, and flawed knowledge among MCQs and EPQs and to be especially effective as grounds for remediation (Snow, 2019; Snow & Brown, 2021). Methods similar to Gardner-Medwin’s approach have also been implemented, with differences primarily in presentation and marking schemes (Farrell, 2006; Gardner-Medwin & Gahan, 2003; Ghadermarzi et al., 2015; Hunt, 2003).

While certainty-based assessment has been heavily researched in other disciplines, there has been relatively little research on the efficacy of these methods within mathematics education, especially at the undergraduate level. In one study, Foster emphasizes how confidence is an important factor which influences and is influenced by student competency (Foster, 2016). In his account, mathematical confidence represents a student’s belief that they can provide a correct response (Foster, 2016). Despite this observation, the study actually uses Gardner-Medwin’s methodology to collect student certainty, a student’s belief that they have provided a correct response. As previously discussed, confidence and certainty are two different elements of metacognition which occur at critically different moments during cognitive processing. Foster later acknowledges this discrepancy by stating that his methodology assesses students’ “procedural confidence” (certainty as defined in the “Confidence, Certainty, and Correctness” section) but does not necessarily observe a student’s “conceptual confidence” (confidence as defined in the same section). This example is one of many which demonstrate the differing interpretations and uses of the terms confidence and certainty in the literature, and how the intended meaning is often more pertinent than the word itself. Furthermore, terms like procedural confidence and conceptual confidence may be too specific to assessment items for which process-based supporting work is required. By connecting the intended meanings of these phrases to the framework outlined above, the terms confidence and certainty make the method proposed later in this paper more readily adaptable to other domains.

While certainty has been investigated at the undergraduate level and above in other disciplines, research regarding confidence and similar self-constructs has been active among school pupils. A study by Lee (2009) demonstrates that math self-efficacy (confidence as defined in this paper) among Programme for International Student Assessment (PISA) 2003 participant students (15 years of age) was independent of two other self-constructs, math self-concept and math anxiety. The PISA 2003 assessment collected participants’ mathematics self-efficacy using a four-point Likert-type scale similar to that used in this study (Lee, 2009). However, the PISA 2003 assessment did not simultaneously investigate student certainty in their provided responses.

Purpose of This Study

The present study expands upon previous literature by investigating the outcomes of simultaneous assessment of confidence and certainty alongside correctness for individual student responses. The interplay of confidence, certainty, and correctness is investigated among students at the undergraduate level, while previous research into these notions has primarily taken place among school pupils. Additionally, the presented accuracy indices are designed to apply to scenarios where partial credit is possible, particularly among PBQs, while similar metrics discussed in other studies pertain to full or zero credit marking of correctness, such as MCQs or EPQs.

The accuracy indices (“Accuracy Indices and Metacognitive Accuracy Plots” section) and metacognitive accuracy plots (“Metacognitive Accuracy per Student” section) developed for this study serve STEM education researchers as tools by which to empirically investigate research questions related to student performance and metacognition. The presented methods provide STEM educators with a practical method of multidimensional knowledge assessment which can be implemented within routine course structures to further enhance instruction and student learning.

The objective of this study is to implement the novel 3C assessment method in an undergraduate linear algebra course to more comprehensively assess student performance and competency in MCQs, EPQs, and PBQs. This study aims to identify metacognitive behavior among varying assessment item types and investigate the efficacy of this 3C method in distinguishing between metacognitive behaviors of students exhibiting similar cognitive performance. The use of metacognitive accuracy indices developed for this 3C method also provides new and valuable data for analyzing assessment item efficacy.

Methods

Undergraduate Linear Algebra Course Structure

The undergraduate linear algebra course from which the student population in the present study was drawn primarily teaches terminology, computations, and applications. Notably, this course differs from others like it in that examinations do not require students to engage in proof-writing tasks. The assessment item types on these examinations include MCQs, EPQs, and PBQs. PBQs are distinct from EPQs in that they require substantially more supporting work to earn full credit, while EPQs can often be answered by observation or simple arguments. The content covered in this course includes systems of linear equations, properties of matrices and matrix operations, vector spaces, linear transformations, and eigenvalues and eigenvectors. As this course builds upon previously covered information, later course LOs require students to perform tasks with competence in LOs covered earlier in the course.

Assessment Information and Data Inclusion

Assessment data was collected across four sections of the course, taught over two semesters by the same instructor, with minimal differences in course content. Data was collected from 103 total students across six routine chapter examinations per semester, each consisting of nine items, and a final exam consisting of fifteen items. These exams were administered in a face-to-face classroom environment. Student data was omitted if the student provided complete responses to 85% or fewer of the assessment items. This exclusion rule corresponded to an individual missing one full examination and more than one additional item, which may have occurred due to illegible marking of confidence/certainty.

Each exam contained an even distribution of MCQs, EPQs, and PBQs. Based on preset rubrics, partial credit was awarded to responses for both EPQs and PBQs when earned, while MCQs were scored with full or no credit based on selection of the correct answer choice. Self-reported confidence and certainty data had no impact on students’ scores or grades in the course. All data was blindly anonymized with alphanumeric codes prior to analysis, and students were provided an opportunity to opt out of having their data included in the study.

The 3C Assessment Method Structure

Prior to the start of each semester, a detailed list of LOs was reviewed by the lead instructor for proper criteria and verified by the course director and the authors of this paper. Each LO was written such that an active student in the course (one who attends class, attempts homework, etc.) should be equipped to understand it. Each assessment item was written to primarily assess a single LO. The complete list of LOs for each chapter was provided to students at the start of that chapter and reviewed in class prior to each chapter examination. Students were reminded throughout the semester that these LOs encompass the potential tasks subject to assessment on the examinations. Each assessment item was mapped to a single LO such that the mapped LO is the highest-level LO the item requires (i.e., the one the item was written to assess). In this study, these LOs provided the clear criteria against which student confidence could be assessed.

Immediately preceding each of the six chapter examinations, students were instructed to self-report their levels of confidence for correctly answering forthcoming exam questions pertaining to individual LOs. The prompt requesting students to provide their self-assessed confidence appeared as follows:

Select your level of confidence in that you can provide a correct response to a question regarding each of the following learning objectives.

  • 4—High confidence

  • 3—Somewhat high confidence

  • 2—Somewhat low confidence

  • 1—Low confidence

The list of pertinent LOs followed this prompt. This assignment was administered to students at the start of the examination period, before students had seen any assessment items. Students were required to complete this assignment before beginning the examination. For each occurrence, all students completed this assignment within 5 min.

Once the confidence self-assessments were completed, students were given an examination packet which contained the assessment items. A corresponding prompt (similarly structured to the confidence prompt) followed each item and instructed students to record their certainty in their response before moving on to the next item. An example of an assessment item and its certainty prompt is as follows:

Sample Question: Use a determinant to find the equation of the plane passing through (2, 1, 0), (3, 0, 1), (4, 0, 0) in \({\mathbb{R}}^{3}\). Write your answer in standard form, \(ax + by + cz = d\).

Select your level of certainty in that you have provided a correct response.

  • 4—High certainty

  • 3—Somewhat high certainty

  • 2—Somewhat low certainty

  • 1—Low certainty

Upon completion of the examination, responses to assessment items were graded for correctness, and student scores were assigned independently of reported confidence or certainty levels.

Accuracy Indices and Metacognitive Accuracy Plots

An individual student response to any given assessment item yields the following three values: \({c}_{1}\) = confidence level, \({c}_{2}\) = earned credit, and \({c}_{3}\) = certainty level. The variables \({c}_{1}\) and \({c}_{3}\) take values 1, 2, 3, or 4, while \({c}_{2}\) is viewed as a fraction of possible credit in the interval [0, 1]. To measure the metacognitive accuracy of a student response, two similarly defined formulas were developed to measure the alignment of confidence with credit (AIConf) and the alignment of certainty with credit (AICert). A fully correct student response (\({c}_{2}=1\)) yields an accuracy value of 1 when high confidence/certainty (level 3 or 4) is reported, or an accuracy value of −1 when accompanied by low confidence/certainty (level 1 or 2). Alternatively, an incorrect student response awarded no partial credit (\({c}_{2}=0\)) receives an accuracy value of −1 when high confidence/certainty (level 3 or 4) is reported, or an accuracy value of 1 when accompanied by level 1 or 2. The formulas for AIConf and AICert, as well as a graph of AIConf, are provided in Fig. 1.

Fig. 1

Accuracy indices for confidence and certainty. Note: Formulas for the accuracy index for confidence (AIConf) and accuracy index for certainty (AICert) are provided (a). The graph of AIConf against earned credit is shown for reported high and low confidence (b)

As partial credit is possible, the accuracy formula needed to measure not only full or zero credit but also all values in between. Based upon the standard structure of the exam grading rubrics and known tendencies of the grader, it was determined that earned scores of 0, 1, 2, or 3 out of 10 on a response represent a similar level of low performance. As such, low reported confidence/certainty was considered maximally accurate for scores of 0, 1, 2, or 3. This is the reason for the plateau on the interval [0, 1/3] in the graph of AIConf shown in Fig. 1b.

Accuracy values greater than or equal to 0.60 were classified as accurate, while values less than 0.25 were considered inaccurate. These thresholds were determined based upon the interpretation of specific scores. A report of high confidence/certainty was considered accurate if the response earned 9 or 10 points out of 10. However, a report of high confidence/certainty was considered inaccurate if the response earned 7 or fewer points out of 10, as missing three or more points would imply the student missed a critical aspect of the given assessment item. Similar distinctions can be described for reports of low confidence/certainty.
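The exact formulas appear in Fig. 1a and are not reproduced in the text; the following Python sketch is therefore only an illustrative reconstruction consistent with the properties described above (the plateau on [0, 1/3], values of ±1 at the extremes, and near-zero values for 6 or 7 points out of 10). The function name and the piecewise-linear form are assumptions, not the published formula.

```python
def response_accuracy(level, credit):
    """Illustrative per-response accuracy value (not the published formula).

    level  : reported confidence or certainty (1, 2, 3, or 4)
    credit : earned credit as a fraction of possible credit in [0, 1]
    Returns a value in [-1, 1]; +1 = fully aligned, -1 = fully misaligned.
    """
    # Scores of 0-3 out of 10 are treated as the same low level of performance,
    # producing the plateau on [0, 1/3] described in the text.
    effective = max(credit, 1 / 3)
    # Linear ramp from -1 (no meaningful credit) to +1 (full credit)
    # for a report of high confidence/certainty (level 3 or 4).
    high_alignment = -1 + 2 * (effective - 1 / 3) / (2 / 3)
    # A low report (level 1 or 2) is accurate exactly when a high report is not.
    return high_alignment if level >= 3 else -high_alignment
```

Under this sketch, a high report with 9/10 credit yields 0.7 (accurate), with 7/10 credit yields 0.1 (inaccurate), and with 0–3/10 credit yields −1, matching the interpretations above.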

To establish whether an individual student demonstrated consistently inaccurate reports of confidence/certainty, the general accuracy of each student’s metacognitive self-assessments was calculated. A weighted average of each student’s accuracy values was computed such that accuracy values of responses reporting confidence/certainty levels 1 and 4 were given three times the weight of those corresponding to levels 2 and 3. These weighted averages were defined as the accuracy index of confidence (AIConf) and the accuracy index of certainty (AICert). An accuracy index greater than or equal to 0.60 represents consistently accurate self-assessment of performance, while an accuracy index less than 0.25 represents consistently inaccurate self-assessment.
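The aggregation itself is a straightforward weighted mean. A minimal sketch is given below; the per-response accuracy values could come from the published formulas in Fig. 1 or from the illustrative helper sketched above, and the function name is an assumption.

```python
def accuracy_index(levels, accuracies):
    """Weighted mean of per-response accuracy values for one student (or one item).

    levels     : reported confidence (for AIConf) or certainty (for AICert), 1-4
    accuracies : per-response accuracy values in [-1, 1], one per level
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for level, acc in zip(levels, accuracies):
        # Extreme reports (levels 1 and 4) carry three times the weight
        # of the intermediate reports (levels 2 and 3).
        weight = 3.0 if level in (1, 4) else 1.0
        weighted_sum += weight * acc
        total_weight += weight
    return weighted_sum / total_weight
```

An index of at least 0.60 would then be read as consistently accurate self-assessment, and an index below 0.25 as consistently inaccurate.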

These accuracy indices were then used to plot AICert against AIConf where each point represents a student and is coded by shape and color to indicate their average score across all exams (Fig. 2). These metacognitive accuracy plots were used to identify individual students who exhibited outlying metacognitive behavior.

Fig. 2

Metacognitive accuracy plots for student analysis. Note: Data from Semester 1 (a) and Semester 2 (b) is shown. Each point represents a student with the color of each point indicating that student’s average score across all exams: purple circle = [0.9, 1.0] indicating an A letter grade, green square = [0.8, 0.9) indicating a B letter grade, blue triangle = [0.7, 0.8) indicating a C letter grade, orange diamond = [0.6, 0.7) indicating a D letter grade, red X = [0.0, 0.6) indicating an F letter grade. The \(x\)-axis is given by AIConf and the \(y\)-axis is given by AICert. The dotted lines indicate the boundaries (at 0.25 and 0.60) between critical regions of metacognitive accuracy
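For readers who wish to reproduce a plot of this kind, a minimal matplotlib sketch follows. The variable names are assumptions, and the single-marker styling is simplified; the published plots additionally encode grade bands (or DI bands) by marker shape and color.

```python
import matplotlib.pyplot as plt

def metacognitive_accuracy_plot(ai_conf, ai_cert):
    """Scatter AICert against AIConf with dotted boundaries at 0.25 and 0.60."""
    fig, ax = plt.subplots()
    ax.scatter(ai_conf, ai_cert)  # one point per student (or per item)
    for bound in (0.25, 0.60):
        # Boundaries between the critical regions of metacognitive accuracy.
        ax.axvline(bound, linestyle=":", color="gray")
        ax.axhline(bound, linestyle=":", color="gray")
    ax.set_xlabel("AIConf")
    ax.set_ylabel("AICert")
    ax.set_xlim(-1, 1)
    ax.set_ylim(-1, 1)
    return fig, ax
```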

Similar accuracy indices and metacognitive accuracy plots were produced for each individual assessment item (Fig. 3). AIConf and AICert for each item were computed using a weighted average across all responses to a given assessment item (rather than across a given student’s responses). The points representing assessment items on these plots were coded by color and shape to indicate the item’s discrimination index (DI) rather than average score. The DI was calculated as the difference between the average score earned by high performers (the top 25% of scorers on the exam) and the average score earned by low performers (the bottom 25% of scorers on the exam).

Fig. 3

Metacognitive accuracy plots for item analysis. Note: Data from Semester 1 (a) and Semester 2 (b) is shown. Each point represents an individual assessment item with the color of each point indicating DI associated with that item: purple circle = [0.8, 1.0], green square = [0.6, 0.8), blue triangle = [0.4, 0.6), orange diamond = [0.2, 0.4), red X = [− 1.0, 0.2). The \(x\)-axis is given by AIConf and the \(y\)-axis is given by AICert. The dotted lines indicate the boundaries (at 0.25 and 0.60) between critical regions of metacognitive accuracy
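The DI calculation described above can be sketched directly; the function and variable names below are illustrative assumptions, with item scores rescaled to fractions of possible credit.

```python
import numpy as np

def discrimination_index(item_scores, exam_totals):
    """DI = mean item score of high performers minus mean item score of low performers.

    item_scores : per-student scores on one item, as fractions of possible credit
    exam_totals : the same students' total scores on the exam
    """
    item_scores = np.asarray(item_scores, dtype=float)
    exam_totals = np.asarray(exam_totals, dtype=float)
    high = exam_totals >= np.percentile(exam_totals, 75)  # top 25% of exam scorers
    low = exam_totals <= np.percentile(exam_totals, 25)   # bottom 25% of exam scorers
    return item_scores[high].mean() - item_scores[low].mean()
```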

Results

Assessment Item Types

The assessments constructed for this study were tested for internal reliability using Cronbach’s alpha. During each semester, Cronbach’s alpha was between 0.60 and 0.85 for five of the six exams. The exam which demonstrated lower internal reliability produced a Cronbach’s alpha of 0.56 and 0.58 during Semesters 1 and 2, respectively. During each semester, AICert was higher than AIConf for each assessment item type (Table 1). AICert among all assessment items was higher than AIConf by 0.13 and 0.10 in Semester 1 (Table 1, part a) and Semester 2 (Table 1, part b), respectively. Average scores among MCQs were lower than scores among EPQs and PBQs, respectively, by 7.33% and 8.91% in Semester 1 and by 13.41% and 17.34% in Semester 2. In both semesters, the discrimination index was highest on average among MCQs and lowest among PBQs. The average difference between high and low performers’ scores on MCQs was 38.03% during Semester 1 and 32.45% during Semester 2, while for PBQs this difference was 30.92% during Semester 1 and 25.62% during Semester 2.
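For reference, these reliability values follow the standard Cronbach’s alpha computation over an exam’s item-score matrix; a minimal sketch, assuming rows are students and columns are items, is:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a students-by-items matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items on the exam
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item across students
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of students' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)
```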

Table 1 Means of 3C data

Accurate confidence (AIConf ≥ 0.60) was reported less frequently than accurate certainty (AICert ≥ 0.60) across assessment item types in both semesters by a mean of 5% (Table 2). Accurate confidence and certainty occurred simultaneously at similar frequencies across assessment item types in both semesters, as did simultaneously inaccurate confidence and certainty.

Table 2 Percentage frequencies of accurate and inaccurate metacognition

Metacognitive Accuracy per Student

Correlation of confidence levels with certainty levels was tested for each individual student using Spearman’s rank correlation coefficient. The coefficient was then used to produce a t statistic for a two-tailed Student’s t-test of the null hypothesis that confidence and certainty levels were uncorrelated. Using an alpha level of 0.05, confidence and certainty levels were found to be correlated among responses for 71% of students in Semester 1 and 65% of students in Semester 2. Among students whose confidence and certainty levels were not found to be correlated, 71% and 79% produced higher AICert than AIConf in Semester 1 and Semester 2, respectively. In each semester, the percentage of students for which AICert was considered accurate (≥ 0.60) was greater among students whose confidence and certainty were not found to be correlated than among those with correlated confidence and certainty. The percentage of students with correlated confidence and certainty levels who demonstrated highly accurate certainty was 23% in Semester 1 and 17% in Semester 2, while the percentage of students without correlated confidence and certainty levels who demonstrated highly accurate certainty was 36% in Semester 1 and 47% in Semester 2.
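A minimal sketch of this per-student test, assuming paired arrays of a student’s reported confidence and certainty levels (one pair per item) and using SciPy only for the t distribution:

```python
import numpy as np
from scipy import stats

def confidence_certainty_correlated(conf_levels, cert_levels, alpha=0.05):
    """Spearman's rho with a two-tailed t-test of the null of no correlation."""
    conf_levels = np.asarray(conf_levels, dtype=float)
    cert_levels = np.asarray(cert_levels, dtype=float)
    n = len(conf_levels)
    rho, _ = stats.spearmanr(conf_levels, cert_levels)
    # t statistic for Spearman's rank correlation with n - 2 degrees of freedom.
    t_stat = rho * np.sqrt((n - 2) / (1 - rho**2))
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    return rho, p_value, p_value < alpha
```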

Students in the same grade category did not necessarily occur in the same critical region of metacognitive accuracy. For example, students B1L29 (score = 88%, AIConf = 0.65, AICert = 0.74) and B1E35 (score = 89%, AIConf = 0.10, AICert = 0.44) both scored between 80 and 90% but occur in different critical regions of metacognitive accuracy (Fig. 2a).

Metacognitive Accuracy per Item

Correlation of confidence levels with certainty levels was tested for each individual assessment item using Spearman’s rank correlation coefficient. The coefficient was then used to produce a t statistic for a two-tailed Student’s t-test of the null hypothesis that confidence and certainty levels were uncorrelated. Using an alpha level of 0.05, confidence and certainty levels were found to be correlated among student responses for 73% of assessment items in Semester 1 and 87% in Semester 2. Among assessment items for which confidence and certainty were not found to be correlated, 75% produced higher AICert than AIConf in both semesters.

Assessment items with good DI did not necessarily exhibit accurate confidence or certainty. For example, assessment items B1-3.2 (DI = 0.92, AIConf = −0.11, AICert = −0.18) and A2-2.1 (DI = 0.86, AIConf = −0.15, AICert = 0.09) each exhibited DI ≥ 0.8 alongside inaccurate confidence and certainty reporting (Fig. 3a and b). Furthermore, assessment items with poor DI often also exhibited high AIConf and AICert, with a few exceptions, notably items B1-F.2 (DI = 0.11, AIConf = 0.19, AICert = 0.41) and B1-2.8 (DI = 0.09, AIConf = −0.15, AICert = −0.48) (Fig. 3a).

Discussion

Confidence and Certainty

Confidence and certainty add uniquely important information to assessments that measure response correctness. Considered separately, assessing confidence informs on students’ perceptions of prospective success pertaining to course learning objectives, while assessing certainty informs on students’ reflective perceptions of successful performance on individual items. As confidence levels are reported prior to seeing the specific item to which earned credit will be assigned, it should be expected that student certainty is more accurate than student confidence. Indeed, among all assessment item types, student certainty levels more accurately reflected earned credit than did their confidence levels (Table 1). This is likely because reported certainty is influenced by the specific task stipulated by an assessment item, whereas reported confidence levels reflect students’ perceptions of the more general LO.

Confidence and certainty collectively provide valuable insight regarding shifts in students’ perceived sureness throughout performing a cognitive task, especially from start to finish of a problem-solving task. Though confidence and certainty were frequently correlated, a considerable number of students during each semester (29% in Semester 1 and 35% in Semester 2) displayed confidence and certainty levels which were not found to be correlated. Among these students, most demonstrated an increase in accuracy of perceived sureness from before an exam to after providing a response to an assessment item. This suggests that some students may require training in metacognition to improve their accuracy of pre-assessment confidence. Such improvement should benefit students in identifying which areas of course content might require them to dedicate additional time for review, study, and practice.

Assessment Item Types

EPQs and PBQs distinguish themselves from MCQs in that students can earn partial credit on responses. Comparing the DI across item types, average DI was found to be highest for MCQs and lowest for PBQs (Table 1). This phenomenon is expected due to the nature of awarding partial credit among PBQs and not MCQs. Because partial credit is awarded to work shown in PBQs according to pre-determined grading rubrics, low performers tend to earn some partial credit for correct procedural efforts even when the generated response is incorrect. Furthermore, high performers may lose only minimal credit on a PBQ due to minor errors such as notational mistakes or slight miscalculations. As such, the potential difference between low performers’ and high performers’ scores is lessened, leading to lower discrimination indices. This was evidenced by the average difference between high and low performers’ scores on MCQs being higher in both semesters than that on PBQs. This does not necessarily imply that PBQs are less informative of student knowledge. Despite this lower metric, mathematics educators typically maintain that PBQs provide a clearer distinction between student knowledge of techniques and strategies in mathematics than MCQs by requiring students to employ those strategies and show their work/process to arrive at a final answer. However, the present study demonstrates how a traditional calculation of DI as a measure of item validity may not be fully accurate for PBQs. Including AIConf and AICert as considerations for item analysis alongside DI better informs judgments of the efficacy of these assessment items.

As demonstrated in Table 2, reported confidence and certainty were each more frequently accurate among MCQs than among EPQs or PBQs. Again, this may be expected since MCQs do not allow partial credit. Hence, for MCQs, AIConf and AICert report either 1 (complete accuracy) or −1 (complete inaccuracy), and as such, all responses to MCQs which did not produce AIConf/AICert values of −1 would count as accurate in this frequency calculation. This suggests that MCQs might provide less refined information with this method than other assessment item types such as EPQs or PBQs, but this requires further investigation.

Efficacy of the 3C Method in Student Analysis

Consideration of AIConf and AICert alongside average score distinguishes between students who demonstrate similar cognitive performance but display critically different metacognitive behaviors, whereas a correctness-only assessment may treat two students with similar average scores as possessing the same knowledge of the material. For example, consider students B1L29 and B1E35 as mentioned in the “Metacognitive Accuracy per Student” section. These students would be considered to possess the same level of knowledge of course content according to correctness-only assessment. However, student B1L29 exhibited accurate confidence and certainty, while B1E35 exhibited inaccurate confidence with midlevel accuracy of certainty. Student B1E35’s misalignment of confidence with earned credit was caused by reporting low confidence alongside largely correct responses (earned credit ≥ 0.8) on 38% of their responses. Moreover, among those responses, this student reported high and low certainty equally often. This indicates frequent under-confidence alongside a mix of under-certainty and accurate high certainty. In contrast, student B1L29 displayed accurate high confidence and certainty. Frequent under-confidence may imply that a student did not understand the meaning of the LOs, did not understand what sorts of assessment items might pertain to those LOs, or was unaware of their abilities pertaining to the tasks required to answer such assessment items. This misunderstanding may be attributed to low attendance, incomplete homework assignments, inactive participation in class, or simply lacking knowledge of relevant terminology. Furthermore, the strategies employed by an instructor to guide students B1L29 and B1E35 toward more effective learning should differ due to their critically different metacognitive behaviors.

Similar distinctions can be observed between any two students who have similar scores but occur in different regions. In practice, an instructor may find these characteristics even more informative than as presented above. Prior to viewing a metacognitive accuracy plot, an instructor might already be aware of certain tendencies of particular students based on routine course interactions (e.g., in-class participation and homework performance). Combining an instructor’s observed student behaviors with a student’s metacognitive accuracy would inform both instructor and student in selecting best-practice early intervention methods. For example, an instructor could meet with a student to discuss and identify the cause(s) of outlying metacognitive behavior, as well as provide that student with strategies for more effective learning which might improve the accuracy of their metacognitive performance.

Use of Metacognitive Accuracy Indices for Item Analysis

Accuracy indices for confidence and certainty provide valuable information for consideration alongside an assessment item’s DI. The DI, which measures how well an assessment item discriminates between high and low performers, is commonly used to justify and guide revision of assessment items. Often, when an assessment item produces a high DI, that item is not reviewed and is recommended for use on future assessments. However, as seen in Fig. 3, a high DI is not always accompanied by high AIConf or AICert. In the case of assessment items A2-2.1 and B1-3.2 (the “Metacognitive Accuracy per Item” section), a high DI corresponds with low AIConf and low AICert. Upon review, it was determined that these assessment items indeed required significant revision to be proper and effective. Alternatively, low DI was often accompanied by high AIConf and AICert. This phenomenon was typically caused by the assessment item being “too easy,” wherein most students answered correctly while exhibiting high confidence and high certainty. In a handful of cases, the opposite occurred: a “too difficult” item generating little credit, low confidence, and low certainty. In either case, the comprehensive information suggested that revision of the items was needed.

In general, AIConf and AICert can be used to justify a variety of changes to assessment items and their associated LOs. Specifically, low AIConf implies a disconnect between the assessment item and its LO. Modifications based on low AIConf might include adjustments to grading rubrics, increased specificity in the item’s directions, mapping of the item to a more applicable LO, rewording of LOs, or breaking broad LOs into multiple specific LOs. Low AICert generally pertained to the specific item itself rather than its relation to an LO. Modifications based on low AICert might include changes in assessment item type (e.g., MCQ to EPQ), reworded prompts, adjusted MCQ options, or changes to the mathematical objects referred to in the question (often affecting the perceived difficulty of an item).

Applications, Limitations, and Future Directions

The 3C assessment method described above could be applied or adapted to investigate the efficacy of various experimental course structures or advanced theoretical frameworks which involve assessing specific criteria. Some such course structures might include active learning practices (e.g., flipped classrooms), adapted hybrid structures (e.g., HyFlex models), remote or online presentation methods, or innovative methods of formative assessment (e.g., various remediation practices) — many of which are heavily used in STEM education.

As indicated in the “Accuracy Indices and Metacognitive Accuracy Plots” section, the formulas for AIConf and AICert do not distinguish between earned credit values between zero and one-third. This plateau was intentionally designed based upon the grading rubrics and acknowledged grader habits pertaining to the host course of this study. Specifically, earned credits of 0, 1, 2, or 3 out of 10 possible points were identified as representing similar levels of low cognitive performance. Moreover, earned credit of 6 or 7 points out of 10 (producing accuracy values near zero) was identified by the grader as representing cases in which a student provided substantial written work while omitting some critical required supporting work. For alternative grading rubrics or graders, adjustments to these formulas may be necessary. This is seen not as a limitation of the method but rather as an improvement in the adaptability of the method to additional grading schemes. For example, alternative grading rubrics may extend this plateau to one-half earned credit, while others may observe no plateau at all. The critical consideration for adapting this method to a different grading scheme is that the accuracy values correspond to meaningful interpretations of the alignment of confidence/certainty with earned partial credit.

In this study, four levels of confidence/certainty were given (low, somewhat low, somewhat high, and high). This differs from previously developed methods of certainty-based assessment, such as those of Gardner-Medwin and Hunt, which used three levels (low, moderate, and high) (Gardner-Medwin & Gahan, 2003; Hunt, 2003). In a pilot study prior to the collection of data presented in this paper, 3-level prompts were used but were found to produce less informative data than desired (Preheim et al., 2022). Although moderate confidence/certainty was the most commonly selected option, student interpretations of it varied significantly (Preheim et al., 2022). By extending to 4-level prompts, students are unable to report the convenient middle level and are instead encouraged to think more deeply about their metacognition (Snow et al., 2022). Having collected data using both methods, the authors of this study contend that student interpretations of 4-level prompts are more consistent than those of 3-level prompts, thus producing more reliable data for study.

It must be assumed that students provided honest reports of confidence and certainty levels and did not intentionally misreport their metacognition. To ensure that students had minimal motivation to report dishonestly, student scores for the course were not impacted by reported confidence and certainty levels. In a study by Hodds et al. (2014), metacognitive influences such as self-explanation were shown to improve student performance. Hence, simply asking students to provide self-assessments of their confidence and certainty, without direct influence on their scores, might have inherently affected student performance. Furthermore, assigning marks based on certainty to discourage dishonest reporting requires a penalty-based marking scheme (i.e., negative marks for high certainty when incorrect) to avoid scores being inflated by correct guesswork.

In this study, students were not given any report of their metacognitive accuracy index values, and the instructor did not directly approach students to discuss areas of metacognitive inaccuracy (though a handful of students did bring up particular instances of misaligned certainty during office hours). As such, students were not directly instructed on how to interpret or utilize this information to improve general metacognitive accuracy. This was intentional as the purpose of this study was to establish the efficacy of this assessment method in comprehensively assessing student knowledge, not to investigate strategies for improving metacognitive accuracy.

The internal reliability of the examinations given in this study was found to be acceptable or good for five of the six examinations, as indicated by the acceptability ranges of Cronbach’s alpha described by Ursachi et al. (2015). The acceptable (0.60–0.70) or lower (< 0.60) Cronbach’s alpha values seen among six of the twelve examinations across both semesters were likely due to each exam containing only nine assessment items. A low number of assessment items may violate the assumption of tau-equivalence and thus underestimate reliability (Tavakol & Dennick, 2011). The examinations administered in this study contained relatively few assessment items due to the time required for students to perform the tasks being assessed in linear algebra (e.g., performing Gaussian elimination). Moreover, the examination which produced a Cronbach’s alpha of less than 0.60 contained two assessment items which pertained to the same mathematical object. The grading rubric for the second of these two questions did not require the answer to the first to be correct. However, it is possible that student certainty in their response to the first question may have influenced their performance on the second. These factors may have led to lower Cronbach’s alpha values than desired.

The methods demonstrated in this study provide useful strategies for identifying student misconceptions and misunderstandings regarding course content within STEM education. In a study by Neidorf et al. (2019), specific incorrect responses from Trends in International Mathematics and Science Study (TIMSS) and TIMSS Advanced assessments across several countries, among school pupils at grades four, eight, and twelve, were analyzed to identify common misunderstandings in mathematics content related to linear equations. That analysis does not include confidence or certainty data, focusing instead on methods such as distractor analysis of MCQs and mapping incorrect response themes for EPQs and PBQs (Neidorf et al., 2019). A student exhibiting frequent misalignment of 3C responses may indicate the presence of underlying misunderstandings or misconceptions. Assessing confidence and certainty alongside correctness provides a method for detecting misunderstandings or misconceptions without requiring the identification of all possible types of misunderstandings or misconceptions which might occur for each item. Further investigation regarding the efficacy of 3C data in identifying specific misunderstandings is merited by the results of this study.

Practical implementation of this assessment method for widespread use in classrooms could benefit from improvements in the efficiency of data collection and analysis. Collecting, inputting, and analyzing 3C data manually requires a significant amount of time from the instructor. The development of software to collect and analyze 3C data from online assessments would greatly reduce this time requirement. As some disciplines, such as mathematics, rely heavily on handwritten student responses to PBQs, using machine-readable paper forms to collect confidence and certainty data may also reduce the time required for data input. A further consideration concerns the additional time spent by students responding to these prompts. In this study, during each exam, all students took less than 5 min to respond to all confidence prompts. Gardner-Medwin & Curtin (2007) specifically address this concern, noting that while they have no data regarding the amount of time, student responses to an evaluation study (Issroff & Gardner-Medwin, 1998) suggest students may use the additional time spent responding to certainty prompts to reconsider their provided answers. It should also be noted that even during correctness-only examinations, students would likely spend time reflecting upon the correctness of their provided responses (Gardner-Medwin & Curtin, 2007). Therefore, the additional time spent providing certainty levels is likely negligible. Administering electronic 3C assessments would allow for proper investigation of the time spent by students when providing confidence and certainty levels.

Conclusions

Improving student metacognition should be a featured goal for STEM educators, and having a tool by which to assess student metacognition is critical for the advancement of research toward improving students’ metacognitive accuracy. The accuracy indices for confidence and certainty developed for this study, as well as the metacognitive accuracy plots generated by this method, serve as important tools which can be utilized by education researchers, instructors, and students for improvement of learning, performance assessment, and growth intervention. Future directions of this research include the use of 3C assessment data to guide students toward more effective learning, the investigation of the efficacy of various course frameworks or experimental instructional designs, the investigation of metacognitive behaviors across various disciplines and levels of education, and the investigation of metacognitive behaviors among students in underrepresented populations or groups.

Based on data collected over two semesters, confidence and certainty are critically different elements of metacognition. Moreover, this 3C assessment method produces a more meaningful and more accurate assessment of student performance than correctness-only methods or even certainty-based methods which do not consider confidence. Consideration of AIConf and AICert contributes useful information to be reviewed alongside a DI during item analysis. Simultaneously collecting confidence and certainty alongside correctness distinguishes between students who exhibit similar cognitive performance but display critically different metacognitive behaviors.