1 Introduction

Assessing handwritten tasks is often a daunting endeavour in large-scale exams where multiple assessors are involved. Finding efficient ways to provide (consistent) feedback and reliable grading in these settings is not straightforward (Baird et al., 2004; Meadows & Billington, 2005; Morgan & Watson, 2002). Grading reliability is the degree to which a grade genuinely reflects the quality of a student’s assignment, meaning that aspects outside the assignment should not play any role; it is most often measured by letting multiple assessors rate the same assignment (Bloxham et al., 2016). Most exam designers try to ensure inter-rater reliability by pre-developing a solution key with grading criteria for assessors (Ahmed & Pollitt, 2011). Traditionally, these grading instructions also serve as a feedback tool, as students should be able to trace back the grades they obtained by comparing their solutions with the grading criteria (Price et al., 2012). Nevertheless, students often struggle to understand the language used in grading criteria (Cartney, 2010).

In an attempt to tackle the challenges of assessing handwritten mathematics tasks, we invented ‘checkbox grading’: a semi-automated assessment method (Moons et al., 2022) in which assessors use a software tool to grade and provide feedback on handwritten mathematics tasks. The method was studied during the final high school mathematics exam (grade 12) organised by the Flemish Exam Commission. It has the potential to make the grading process more efficient and more reliable. Moreover, the grading method results in atomic feedback, giving detailed insight into how grades were obtained. Atomic feedback consists of a hierarchical list of feedback items, with a separate feedback item for each independent error.

Why is a semi-automated method still necessary? Can all the difficulties caused by assessing handwritten tasks be avoided by replacing them with digital exams that can be assessed fully automatically (Kloosterman & Warren, 2014)? Many studies have concluded that some mathematical learning goals lend themselves well to digital assessment, while others do so much less. For example, Hoogland and Tout (2018) warn that digital mathematics tasks often focus on lower-order goals (e.g., procedural skills). They argue that handwritten tasks better assess vital higher-order goals (e.g., problem-solving skills). Bokhove and Drijvers (2010) point out that handwritten tasks allow students to express themselves more freely. Lemmo (2021) highlights substantial differences in students’ thinking processes when the same task is posed digitally or on paper. Together, these studies emphasise the continued significance of handwritten mathematics exam tasks. When designing a mathematics exam, it is best to decide for each task individually whether the digital or handwritten mode is appropriate, leading to exams that are a mixture of both (Threlfall et al., 2007).

This paper investigates the student perspective, specifically exploring whether students can effectively interpret the feedback provided through checkbox grading. The assessors’ usage and perspectives are discussed separately in another paper (Moons et al., 2023b), and a statistical method for measuring the grading reliability resulting from this approach is presented in Moons and Vandervieren (2023a).

1.1 Checkbox grading

1.1.1 Idea

Using checkbox grading, exam designers produce a grading scheme for each task consisting of different feedback items written in an atomic way (see Sect. 1.1.2.), anticipating the mistakes students may make in the given question. Next, students solve these exam tasks the classical way by writing on paper. Subsequently, the papers are scanned, and the assessors use the checkbox grading system to assess the solutions on a computer. When correcting a student’s solution, the assessors just select the feedback items (‘checkboxes’) that apply.

To allocate grades, exam designers can associate items with partial points to be added (green items in Fig. 1) or subtracted (red items in Fig. 1). It is also possible to associate items with a threshold (e.g., ‘if this feedback item is ticked, maximum 1 out of 2 points’). Items that do not change the grade but provide essential information for the continuation of the assessment have a blue checkbox (e.g., as a note to the assessors that some solutions are fine as well or as an indicator for the system to know how to proceed). The point-by-point list of atomic feedback items ultimately forms a series of implicit yes/no questions to determine the student’s grade. Dependencies between items can be set so that items can be shown, disabled, or changed whenever a previous item is ticked, implying that the assessors must follow the point-by-point list from top to bottom. This adaptive sequencing is the main characteristic of the approach, resembling a flow chart that automatically determines the grade. By ticking the checkboxes that are relevant to a student’s answer, we envision (1) a deep insight into how the grade was obtained for both the student (feedback) as well as the Flemish exam commission, and (2) a straightforward way to grade handwritten mathematics tasks when multiple assessors are involved.
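To make this adaptive, flow-chart-like grade computation concrete, the following sketch shows one possible way to represent a checkbox grading scheme in code. It is a minimal illustration under our own assumptions: the names (FeedbackItem, Kind, compute_grade) and the exact semantics are hypothetical, not the actual Moodle plug-in implementation described later.

```python
# Hypothetical sketch of a checkbox grading scheme and its grade computation.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Kind(Enum):
    ADD = "green"       # partial points added when ticked
    SUBTRACT = "red"    # partial points subtracted when ticked
    THRESHOLD = "cap"   # caps the task grade when ticked
    INFO = "blue"       # no grade effect; steers the assessment flow


@dataclass
class FeedbackItem:
    text: str
    kind: Kind
    points: float = 0.0              # amount added/subtracted, or the cap
    shown_if: Optional[str] = None   # child item: only selectable once the
                                     # named parent item is ticked (the
                                     # indentation in Fig. 1)


def compute_grade(items: dict[str, FeedbackItem], ticked: set[str],
                  max_points: float) -> float:
    """Walk the point-by-point list top to bottom and derive the grade."""
    grade, cap = 0.0, max_points
    for name, item in items.items():  # dicts preserve insertion order
        if name not in ticked:
            continue
        if item.shown_if is not None and item.shown_if not in ticked:
            continue  # dependency not satisfied: ignore the child item
        if item.kind is Kind.ADD:
            grade += item.points
        elif item.kind is Kind.SUBTRACT:
            grade -= item.points
        elif item.kind is Kind.THRESHOLD:
            cap = min(cap, item.points)  # e.g. 'maximum 1 out of 2 points'
        # Kind.INFO items carry no grade effect.
    return max(0.0, min(grade, cap))
```

In this reading of Fig. 1, the green items would be ADD items, the red ones SUBTRACT items, the blue ‘may the assessor continue?’ checkboxes INFO items, and the indentation would map to shown_if dependencies.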

Fig. 1 Checkbox grading scheme of exam task 4

An example of this approach is given in Fig. 1. The exam task consists of two sub-tasks. In the first sub-task, the student makes a sign error. As the item ‘Answer is completely correct’ is unticked, the system knows that a mistake happened; it therefore shows two additional blue checkboxes to decide whether the assessor can continue grading the task. The sign error was an anticipated mistake that caused a deviation from the solution key. While the student did not gain points on sub-task (a), the assessor might continue with the assessment but now has to check the solution individually, recalculating along with the student’s work. Any other mistake in sub-task (a) would have stopped the further assessment of the task. In sub-task (b), the student corrects the previous mistake but fails to provide the correct solution. As such, only the first item of sub-task (b) (‘The row echelon form is correct’) applies, leading to a total score of 1/3.5.

If a particular solution approach by a student is not covered in the available feedback items, an assessor could add additional checkboxes. Checkbox grading was developed as an advanced grading method plug-in for Moodle, an open-source e-learning platform (Gamage et al., 2022).

Finally, the name ‘checkbox grading’ was inspired by the bestseller ‘The Checklist Manifesto’ (Gawande, 2010), in which the author argues that using simple checklists in daily and professional life can make even very complex processes efficient, consistent and safe: “under conditions of complexity, not only are checklists a help, they are required for success” (p. 79).

1.1.2 Atomic feedback

The feedback in checkbox grading is called atomic feedback (Moons et al., 2022, 2024). Classic written feedback has traditionally consisted of long pieces of written text (Winstone et al., 2017). With its long sentences describing all the errors in a student’s work, classic written feedback is intrinsically not reusable, as it is too explicitly targeted toward specific students. To overcome this difficulty and maximise the reusability of feedback, one of the key ideas underlying the checkbox grading system is that it encourages exam designers to write atomic feedback by breaking it into separate feedback items. To do so, one must (1) identify the possible independent errors and (2) write a separate feedback item for each error, independent of the others (making them atomic). These atomic feedback items form a point-by-point list covering all items that might be relevant to a student’s solution. The list can be hierarchical to cluster items that belong together (see the indentation in Fig. 1). For checkbox grading, an additional criterion for being atomic holds: (3) a knowledgeable assessor must be able to determine unambiguously whether an item applies to a student’s answer. As such, each item implicitly represents a yes/no question. Related atomic feedback items and intermediate steps in a solution key can share the same colour to make their connection visually clear (see Fig. 1). An example of this decomposition is sketched below.
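As a concrete illustration of criteria (1)–(3), the sketch below decomposes one hypothetical piece of classic written feedback into a hierarchical list of atomic yes/no items; the wording and nesting are our own example, not taken from the exam.

```python
# Classic written feedback: one long, student-specific sentence (not reusable).
classic = ("You set up the equation correctly, but you made a sign error "
           "when moving the constant, and you forgot to check your solution "
           "against the domain.")

# The same content as atomic feedback: one independent, unambiguously
# checkable yes/no item per error, nested to cluster items that belong together.
atomic = [
    {"text": "Equation set up correctly", "children": []},
    {"text": "Sign error when moving the constant", "children": []},
    {"text": "Solution checked against the domain", "children": [
        {"text": "Domain determined correctly", "children": []},
    ]},
]
```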

2 Theoretical background

In this section, we prepare the ground for the study’s research questions (Sect. 2.3) on the students’ perspective on the received feedback, through a short literature review on feedback interventions (Sect. 2.1) and the study’s theoretical framework (Sect. 2.2).

2.1 Literature review

The efficacy of feedback interventions is ultimately determined by the degree to which students engage with the feedback content (Jonsson & Panadero, 2018). This engagement, termed ‘proactive recipience’, is predicated on students’ understanding of their feedback; without comprehension, feedback fails to facilitate improvement (Jonsson, 2013). In this research context, where ‘checkbox grading’ feedback is given after a high-stakes mathematics exam, we speak of assessment literacy: students’ ability to comprehend and utilise the grading process to evaluate their performance (Winstone et al., 2017). In this context, assessment literacy is a prerequisite for proactive recipience, which can be supported by (1) understanding the connection between assessment, learning, and expectations, (2) assessing their own and others’ performance based on specific criteria, (3) grasping the terminology and concepts used in feedback, and (4) becoming familiar with assessment methods and feedback practices (Price et al., 2012). Facilitating proactive recipience is especially important when the exam needs to be retaken. In addition, the provided checkbox grading feedback also aims to promote transparency so that students perceive the assessment as fair (Darabi Bazvand & Rasooli, 2022).

Several studies have investigated how engagement with grading criteria affects students’ assessment literacy. Students generally rate these interventions positively (Atkinson & Lim, 2013) and see their importance (Orsmond et al., 2002), and some studies have shown that such interventions can improve grades and self-reported awareness of learning objectives (Case, 2007). Engaging with grading criteria seems to help learners understand the assessment process and expectations (O’Donovan et al., 2004; Rust et al., 2003). However, not all learners respond positively to these interventions (Bloxham & West, 2007), and some struggle to understand the language used in the grading criteria (Cartney, 2010). Additionally, understanding the grading criteria does not automatically translate to better future work (Rust et al., 2003).

2.2 Theoretical framework

2.2.1 Revised student-feedback interaction model

Lipnevich et al. (2016) proposed a student-feedback interaction model (Fig. 2) that may be useful in considering the complexity of feedback and the factors that may affect student perceptions and subsequent action (or lack thereof). Later, Lipnevich and Smith (2022) revised the model to include a step-wise understanding of the feedback process. The model is based on several studies and meta-analyses on feedback and gives an overview of the factors that relate to how students respond to and interact with feedback. We use this revised model to frame our research. The model suggests that feedback is received in a context that can influence how important or familiar the students perceive it to be. The interaction process starts with the feedback message and the source that generated it. Feedback can vary in tone, length, specificity, and complexity, and the source’s trustworthiness plays an important role. Next, the model describes how the student receives the feedback and how it is processed: cognitively, affectively, and behaviourally. Three main questions describe this processing: Do I understand the feedback? How do I feel about the feedback? What am I going to do with the feedback? Answers to these questions provide the student with self-feedback (Panadero et al., 2019). The final step concerns the actions, outcomes, and growth that result from the feedback. The interaction model served as a guiding theoretical framework for our study, as explained in the following section.

Fig. 2 The revised student-feedback interaction model by Lipnevich and Smith (2022)

2.2.2 The student-feedback interaction model applied to our study

This study explores how students interpret and perceive feedback reports from checkbox grading. One approach to achieving this goal involved contrasting the checkbox grading feedback with various other delivery methods: classic written feedback, only communicating a grade, and the Flemish Exam Commission’s traditional grading procedure (see Fig. 5 for a comparison of these feedback types). The traditional grading scheme of task 2 is shown in Fig. 3. In the traditional grading process, the assessors apply the provided grading scheme of a task to the student’s solution and only communicate the grade they deduced. During a review appointment, students receive their exams, the traditional grading schemes, and their obtained grades. In doing so, students sometimes have to guess which criteria were applied to arrive at a particular grade (Price et al., 2012).

Fig. 3 Traditional grading scheme of exam task 2

To connect our conditions to the revised student-feedback interaction model: the context consists of students taking a high-stakes mathematics exam to graduate from Flemish secondary education, a stressful and relatively uncommon situation for most students. The source of the feedback is the Flemish Exam Commission. The feedback message is mainly presented as checkbox grading feedback, but some messages are also phrased using classic feedback approaches in order to compare them with checkbox grading feedback. We gathered most individual characteristics through a questionnaire, which also offered glimpses of the cognitive and affective processing. Semi-structured interviews were conducted to gain deeper insight into the cognitive processing. A blind spot in our study remains the behavioural processing and the resulting outcomes, as we did not follow the students who failed the exam on their second attempt.

2.3 Research questions

Now that we have established the theoretical and conceptual underpinnings of the study, we pose two research questions to guide our inquiry:

[RQ 1] How do students understand (cognitive processing) and perceive (affective processing) feedback reports from checkbox grading?

[RQ 2] How do students’ perceptions of receiving feedback from checkbox grading differ from perceptions of feedback from classical approaches (such as traditional grading, written feedback, or communicating grades)?

Note that our primary aim is to investigate how students interpret checkbox grading feedback (RQ1); the other, classical feedback approaches are included for comparison purposes (RQ2) but will not be investigated in depth.

3 Methods

The study was conducted with the Exam Commission of Flanders (the Dutch-speaking part of Belgium) for Secondary Education. Flanders is a region without any central exams (Bolondi et al., 2019): every secondary school decides autonomously on the assessment of students. Consequently, the Exam Commission does not organise national exams for all Flemish students. However, they organise large-scale exams for anyone who cannot graduate from the regular school system. In this way, students who pass all their exams with the Exam Commission can still obtain a secondary education diploma. Students participating in these exams prepare by themselves or with the support of a private tutor/school. The commission provides clear guidelines for students on the content of the exams, carries out the exams, and awards diplomas but does not provide any teaching activities or materials to students.

Ethical clearance for this study was obtained from the University of Antwerp Ethics Committee. The Committee approved the study design and the procedures for data management, consent, and protecting the participants’ privacy.

In the following sections, we describe the mathematics exam used in this study, the development of the questionnaire and interview protocols, the sample of participating students and the data analysis procedures.

3.1 Mathematics exam

The mathematics exam for this study was developed by the three mathematics exam designers of the Flemish Exam Commission (without any influence from the researchers) following their standard practice. The exam is part of the advanced mathematics track of Flemish secondary education's senior years (11th/12th grade). The exam is a mixture of digital and paper-and-pencil tasks: 46% of the exam grades are obtained with the digital part and 54% with paper-and-pencil tasks. In this study, only the feedback on the paper-and-pencil tasks is considered. These tasks vary considerably in points allocated based on the importance of the topic in the curriculum; 0.5 points was the smallest partial score. An overview of the content of the exam can be found in Table 1.

Table 1 Content of the mathematics exam, including the maximum, mean and standard deviation of the scores of the students who filled in the questionnaire

A timeline of the study can be found in Fig. 4. The study started once the exam designers had prepared the exam, by 1 October 2021. Next, their traditional solution key with grading instructions was turned into checkbox grading in close cooperation with the researchers. After the exam, the assessors (mostly mathematics teachers who do this as a side job for the Exam Commission) used the checkbox grading system to assess the exam. All the paper-based tasks of the exam, including the checkbox grading schemes, can be found in the electronic supplementary material.

Fig. 4 Timeline of the study

3.2 Instrument development

3.2.1 Questionnaire

The questionnaire was implemented in Qualtrics, consisted of four parts, and was developed based on the theoretical framework presented above (Lipnevich & Smith, 2022). A key aspect was to keep the completion time below 15 min to motivate students to answer truthfully until the end (Yan et al., 2011). The four parts of the questionnaire were:

  1. Individual characteristics and past experiences. The first part gathered some personal information about the students (age, study direction, reasons to opt for the Exam Commission, and number of previous exam attempts). We also asked how the students experienced the current feedback practices at the Flemish Exam Commission.

  2. Ranking exercise on the comprehensibility of the feedback types (RQ2). In the second part of the survey, students ranked four types of feedback from most comprehensible to least comprehensible by drag-and-drop. All feedback types dealt with the same exemplar task from a peer student, were content-wise equivalent, and resulted in the same grade; the only difference was their appearance. The four feedback types were checkbox grading, classic written feedback, only a grade, and traditional grading; they were adapted from Harks et al. (2014) and Koenka et al. (2019). This ranking question was repeated for two different exam tasks to avoid a dependency between the type of task and the preferred feedback. An example of one of the ranking questions can be found in Fig. 5.

  3. Quiz on understanding the feedback given to a fellow student (RQ1). In the third part, students saw the feedback report depicted in Fig. 1. They were asked to answer 10 short true/false questions about the feedback content, as shown in Table 3. The questions probed their understanding of the feedback and of the sequencing in the grading scheme. Students had to answer each question and could not return to previous questions, as later questions sometimes revealed the answer to a previous one.

  4. Personal checkbox grading feedback: students’ cognitive and affective processing (RQ1). In the last part, students received a link to access their personal checkbox feedback on exam tasks 1, 7 and 10, presented like the feedback report in Fig. 1. Based on Weaver (2006), we tried to measure how students perceived the personal checkbox grading feedback they received. The survey questions can be found in Fig. 6.

Fig. 5 Ranking exercise on the comprehensibility of different feedback types on the same solution of exam task 1

3.2.2 Interview protocol

The semi-structured student interviews took place online at most a week after completion of the survey. The interviews investigated the students’ understanding of their exam feedback. We used open questions and a think-aloud protocol (Gillham, 2005) to reveal their thinking while processing their feedback. One researcher prepared each interview by scanning the student’s exam and marking interesting solutions to discuss. The chosen exam tasks were usually (partially) incorrect, as these are the best triggers to see whether students understand what should be improved. Correct exam tasks were occasionally discussed with hypothetical supplementary interview questions (e.g., ‘What would have happened if your numerator had been wrong?’). When traditional grading was discussed, the students saw the traditional grading scheme of the task (Fig. 3) and their solution. When checkbox grading was discussed, they saw their complete feedback report (Fig. 1). The interview protocol contained two interview questions inspired by O’Donovan et al. (2004):

  1. Cognitive processing of traditional grading. I’m sharing my screen, showing your solution to exam task x and the traditional grading scheme. Can you determine the grade you should receive and explain your reasoning?

  2. Cognitive processing of checkbox grading. I’m showing you your feedback report on exam task x. Can you think aloud about how your grade was obtained? What was correct in your solution? What was wrong or missing?

Exam task x was always replaced with the task number the researcher had chosen in advance. During the interviews, exam task 2 was chosen for all students to investigate their interpretation of traditional grading, and three to four other tasks were chosen to investigate their interpretation of checkbox grading, as this was the main focus.

The researcher always let the students talk and intervened only (1) to remind students to think aloud, (2) when clarifications of their reasoning were necessary, or (3) to ask a follow-up question when a student made an incorrect interpretation. Follow-up questions were kept as open and non-corrective as possible. In the case of an incorrect interpretation, the researcher briefly summarised the student’s conclusion as a first follow-up question (e.g., ‘So you are saying that..?’). If a student did not correct themselves after hearing the researcher’s summary of their incorrect interpretation, one more follow-up question was asked, such as ‘But does that hold for your solution?’.

3.3 Participants

3.3.1 Sampling

When all assessors had finished their work (see Fig. 4), the questionnaire was sent to all students. As an incentive, students received personal checkbox grading feedback on three exam tasks during the questionnaire. Furthermore, upon completing the questionnaire, students were sent their final exam score, earlier than the official announcement. The questionnaire was closed two weeks after its release.

At the end of the questionnaire, students were asked if they would like to take part in an in-depth online interview of 45 min about the feedback they received on their exam. As an incentive to participate in an interview, they would receive their personal checkbox grading feedback on the whole exam (not just three tasks), eliminating the need for a traditional review appointment in Brussels.

3.3.2 Questionnaire

The questionnaire was filled in by 36 of the 60 students who took the exam. In total, 19 female students and 17 male students participated. They were, on average, 17.39 years old (SD = 1.46). All the students had advanced mathematics as part of the curriculum of their studies. The exam results of the 36 students who participated in the questionnaire can be found in Table 1. The exam results were, on average, relatively low. Indeed, many students attend an exam session mainly to find out what preparation they need for the following session. Exactly half of our participants took the exam for the first time, 14 for the second time, and 6 for the third time.

3.3.3 Interviews

Four of the 36 students who filled in the questionnaire agreed to be interviewed: Sasha (female, 17 years old, scored 19%), Jana (female, 17 years old, scored 42%), Tom (male, 17 years old, scored 60%) and Emile (male, 19 years old, scored 41%).

3.4 Data analysis procedures

3.4.1 Questionnaire

The questionnaire analysis mainly consisted of a descriptive analysis of the results. Additionally, average ranks were calculated for the ranking exercise (part 2), and a correlation test was run for the quiz on feedback understanding (part 3), as sketched below.
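As these two computations are small, the following hedged sketch shows how they could be carried out; the arrays, their values, and the variable names are illustrative assumptions, not the study’s actual data or code.

```python
# A minimal sketch of the two quantitative questionnaire analyses.
# All data below are illustrative placeholders, not the study's data.
import numpy as np
from scipy import stats

# Ranking exercise (part 2): one row per student, one column per feedback
# type; entries are ranks (1 = most comprehensible, 4 = least).
feedback_types = ["traditional", "checkbox", "written", "grade only"]
ranks = np.array([[1, 2, 3, 4],
                  [2, 1, 3, 4],
                  [1, 3, 2, 4],
                  [2, 1, 4, 3],
                  [1, 2, 3, 4]])
for ftype, mean_rank in zip(feedback_types, ranks.mean(axis=0)):
    print(f"{ftype}: average rank {mean_rank:.2f}")

# Quiz on feedback understanding (part 3): Pearson correlation between quiz
# and exam scores, with a Fisher-z 95% confidence interval.
quiz = np.array([0.70, 0.80, 0.60, 0.90, 0.55])
exam = np.array([0.40, 0.35, 0.50, 0.30, 0.45])
r, p = stats.pearsonr(quiz, exam)
z, se = np.arctanh(r), 1 / np.sqrt(len(quiz) - 3)
lo, hi = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)
print(f"r = {r:.2f}, p = {p:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```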

3.4.2 Interviews

In the preparatory stage, interviews were transcribed verbatim by the researchers. A straightforward ‘traffic light coding’ procedure was implemented for each exam task discussed during the interview, as shown in Table 2. Another researcher double-checked the coding.

Table 2 Traffic light coding procedure of the interviews. The x denotes the number of the exam task

4 Results

4.1 Cognitive and affective processing of checkbox grading feedback [RQ1]

The quiz aimed to ascertain students’ understanding of the feedback in Fig. 1. As ‘understanding the given feedback’ can be seen as a latent construct, we analysed the composite reliability (Brunner & Süß, 2005) of the 10 initial questions. Three questions were deleted to achieve an acceptable composite reliability of 0.72; these deleted questions appeared to be open to ambiguous interpretation. The 7 remaining questions and the results can be found in Table 3. It shows that, overall, students understood how checkbox grading was used in the exam they had just taken: on average, the students scored 72.6% (SD: 18.2%) on the quiz, much higher than the mean score of 37.5% on the actual exam (SD: 16.3%, see Table 1). A Pearson correlation coefficient computed between students’ quiz and exam scores showed no correlation between the two variables, r(34) = −0.02, p = 0.91, 95% CI [−0.34, 0.31].

Table 3 Quiz on the understanding of the checkbox grading feedback shown in Fig. 1
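Concerning the reliability analysis, composite reliability for a single-factor model can be computed from standardised factor loadings as \((\sum \lambda)^2 / \left((\sum \lambda)^2 + \sum (1 - \lambda^2)\right)\). The sketch below uses hypothetical loadings to illustrate the computation, as the paper reports only the resulting value of 0.72.

```python
# Composite reliability of the 7 retained quiz items, following the
# single-factor formulation (cf. Brunner & Süß, 2005). The standardised
# loadings are hypothetical, chosen only to illustrate the formula.
import numpy as np

loadings = np.array([0.55, 0.60, 0.48, 0.52, 0.58, 0.50, 0.45])
error_var = 1 - loadings**2                  # standardised error variances
cr = loadings.sum()**2 / (loadings.sum()**2 + error_var.sum())
print(f"composite reliability = {cr:.2f}")   # ~0.73 for these loadings
```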

Moreover, the results of the last part of the survey, in which students could access their personal checkbox grading feedback on three exam tasks, can be found in Fig. 6. The results on students’ understanding and affective processing indicate that they would greatly appreciate it if the Exam Commission adopted this approach. Students feel that they understand their feedback, learn from it, and see the connection with the grades they obtained.

Fig. 6 Overview of the students’ survey items corresponding to their checkbox feedback

These results are corroborated by the analysis of the interviews (see Table 4), which shows that students could independently draw correct conclusions for 11 of the 16 exam tasks discussed. Below we summarise the students’ interpretations of the feedback from the checkbox grading.

Table 4 Results of the traffic light analysis of the cognitive processing of checkbox grading

Sasha could interpret the checkbox grading feedback she received on 3 of the 4 tasks. She scored 1.5/2.5 on task 1 because she did not write down one intermediate step explaining her calculation, even though the instructions indicated that all intermediate steps should be shown since no calculator could be used (see Fig. 5). She worried when seeing the first item, which would lead to a zero score on task 1: ‘No intermediate steps provided’ (which did not apply to her solution). When the researcher asked whether the item applied to her solution (although the box was not checked), she insisted it did. She probably confused it with the only intermediate step missing from her solution, for which she was indeed penalised by 0.5 points, as the box ‘Correct calculation of the numerator with intermediate step’ was not checked. When asked why her final score for task 1 was 1.5/2.5 and not 2/2.5, she said the indentation of this checkbox was probably the reason. It was not: the indentation indicated a parent–child sequence in the grading scheme, meaning this (child) item could only be selected when the parent item was. The researcher then explained that the missed 0.5 points came from the last item, which an assessor could select but which only added +0.5 when everything else was fully correct. After this clarification, Sasha correctly interpreted the checkbox grading feedback on all questions that followed. Even hypothetical questions like ‘If the assessor had selected the item that you drew the segment line \([AB]\) correctly, would it have changed something in your score?’ (task 7, answer: no) were answered correctly. Moreover, she could independently say what she had to change in her solutions when retaking the exam. To conclude, Sasha showed a high assessment literacy with checkbox grading, which was somewhat surprising given her low final score on this second exam attempt (19%).

Jana could independently interpret her checkbox grading feedback on all three tasks discussed in the interview. She correctly inferred her result on task 3 (3/3.5) and could indicate that she had missed some of the keywords needed to obtain full marks. She repeatedly stressed how clear she found the checkbox grading scheme, for example, when discussing task 1 (see Fig. 5):

“For the first step, it is already very clearly stated that \(1-3i\) must be present in light blue colour, and this is also present in the solution key in light blue, so they are clearly connected. You immediately know which expressions are linked together. And it is really step-by-step: in this step, this must be present; in this step, you get these points. This and the link with the colours: very clever. This is so much more clear (than traditional grading, ed.) (…) If you got it wrong, you could say: here I was wrong, and my solution is not correct anymore because I made a mistake here.” (Jana).

Tom started the interview by stressing that he liked the checkbox grading much more than the traditional grading schemes. The discussion began with his correct solution to task 1 (2.5/2.5). On the hypothetical question of what his score would have been if the numerator had been wrong, after encouragement to read the entire checkbox grading scheme, he correctly concluded it would be 1.5/2.5, because the last item only adds +0.5 if everything else was correct. For task 4 (2/3.5), he immediately indicated that his solution was missing some elements. Interestingly, while the researcher was scrolling through his exam paper, Tom asked to discuss tasks 7 and 8. For task 7 (1/2.5), he said he could not correctly draw the point \(B\) as he had forgotten to bring a set square, implicitly indicating he understood the feedback. When the researcher asked what would have happened to the grade if he had drawn the segment, Tom replied he would have received +0.5 extra points. This interpretation was incorrect: even if the assessors had selected the item, it would not have changed the grade, as the item indicates that +0.5 is only awarded when points \(A\) and \(B\) are both drawn correctly. Hence, Tom struggled to understand the sequencing in the grading scheme. In task 8 (4/4.5), his solution was correct, but he lost half a point because he did not explain his reasoning. Tom said he did not understand why he lost half a point for this and made it clear that he disagreed with the grading criteria. When the researcher asked whether it is important to justify the steps in mathematical reasoning, Tom reluctantly agreed. This exchange was coded orange in Table 4, because it is likely that Tom understood the feedback after the interviewer’s remark, although he disagreed with it. Finally, in task 9 (2/2.5), the assessors had added a remark pointing to an inappropriate use of double arrows, which led to a half-point deduction. While Tom understood the mechanism behind checkbox grading, knowing that selecting such an additional remark affected his grade by −0.5, he said:

“Yes, but I do not understand it. I’ve read it, but I don’t understand it. Why is there a problem with the double arrows?” (Tom)

We could only conclude that this additional remark was too short for Tom; therefore, this task was coded in red in Table 4. Interestingly, this was the first time during the interviews that a lack of content knowledge was the cause of a lack of understanding of the feedback.

Finally, Emile, who only needed 2% extra to pass, could interpret the feedback on his task 3 (2.5/3.5) well but reacted emotionally to the feedback on sub-task 3(b):

“Ow ow ow ow. Oh, wait. Wait, I don’t have that?! Oooooh, I have written 0.013 and not 1.013. I had to add 1! Ooooooh nooooooo. Oooh, such a bummer! That would have been my needed 2%! That would have been my needed 2%!” (Emile)

When discussing sub-task 4(b) (2/3.5), he gave the solution \(\{-1, 4\}\), which is impossible as there are 5 unknowns. His solution was far from correct, and Emile could not link the unchecked items to his solution.

“Well, I really don’t know what I am doing there (with the solution set, ed.). I failed to solve that last part, and I also don’t know how you should have solved it.” (Emile)

While discussing his correct task 1 (2.5/2.5), Emile mentioned that he found this kind of feedback very clear because it is easy to see what contributed to the grade and how the feedback and solution are linked together due to the use of colours. Finally, for task 10 (0/4), he could correctly interpret the sequencing in the grading scheme.

4.2 Perceptions of feedback of checkbox grading compared to classical approaches [RQ2]

Regarding the second research question, students’ past experiences were surveyed in the first part of the questionnaire. Most students (52.8%) had already attended a review appointment after a previous mathematics exam, where they could compare their received grades with the traditional grading schemes. All of these students said they attended to be better prepared the following time. However, 57.9% of them indicated they had had problems understanding how grades were obtained using traditional grading.

To compare perceptions, we asked students to rank checkbox grading among other, more usual approaches to feedback (see Fig. 5) in terms of comprehensibility. The results of this ranking exercise on two different exam tasks can be found in Table 5. On average, the traditional grading schemes were preferred over checkbox grading, classic written feedback, and only a grade. Receiving only a grade was by far the least preferred option.

Table 5 Results of the ranking exercise on the comprehensibility of feedback types

The interview data do not fully confirm the higher comprehensibility of traditional grading over checkbox grading perceived in the questionnaire’s ranking exercise. The traffic light coding of traditional grading, displayed in Table 6, shows that only Sasha could independently link the traditional grading scheme with the grade she received on exam task 2. She answered \(-5b \cdot (\cos\left(\alpha - 5\right) + i\sin(\alpha - 5))\) on 2(a) and correctly inferred she must have failed the entire task. When the researcher asked for explanations, she could immediately identify what should have been included in her answer. Two students could not draw correct conclusions when comparing their solutions against the traditional grading scheme. Jana, who submitted the wrong answer \(-5[b \cdot (\cos\alpha + i\sin\alpha)] = -5b(\cos\alpha - 5 + i\sin\alpha - 5)\) and received 0/1.5, only noticed a superficial difference between her solution and the solution key but could neither link her mistake to the grading scheme nor suggest a grade:

“I don’t really know. Because, well, I wrote α’s without +180°, so instead of α + 180°, but they (the grading scheme, ed.) don’t tell anything about this.” (Jana)

Table 6 Results of the traffic light analysis of the cognitive processing of traditional grading

Tom, who wrongly answered \(-5b(\cos\alpha + i\sin\alpha)\) on 2(a) and received 0/1.5, also failed to interpret the traditional grading scheme. First, he guessed he scored 0.75/1.5 because he noticed he forgot to write \(+\pi\) in the argument of \(z_1\). When the researcher intervened and asked whether 0.75 was a possible outcome based on the grading scheme (it was not), he changed his answer to 0.5/1.5. When the researcher asked him to clarify his reasoning, he answered:

“I did not get zero because I have written a part of the solution. So yes, I would still say I received 0.5/1.5.” (Tom)

Therefore, Tom misinterpreted the first criterion, which gave a partial score of 0.5 for transforming \(-5\) to the correct polar form. Tom thought his grades were too low and believed he should have accrued some points anyway (“I have written a part of the solution”), as can also be seen in his interpretation of sub-task 2(b). Tom answered \(b \cdot e^{i\alpha} \cdot c \cdot e^{i3\beta} = bc(\cos\left(\alpha + 3\beta\right) + i\sin(\alpha + 3\beta))\), thereby using the Euler form of complex numbers, which is not part of the Flemish mathematics curriculum. The sub-task was graded 0.5/1. Asked to give a grade using the traditional grading scheme, Tom initially said he had full marks on the sub-task. When the researcher suggested this was not the case, he came to the correct conclusion that he had forgotten a third power in his modulus:

“It seems I forgot the third power. Nevertheless, I used a nicer method (Euler form, ed.), and I think that deserves a little bit more appreciation. (...) And for the first sub-task: I still don’t get why I got such a low score. I thought I would receive at least half a point.” (Tom)

Finally, Emile, who wrongly answered \(-5 \cdot b(\cos\alpha + i\sin\alpha)\) on sub-task 2(a) and was marked 0/1.5, immediately noticed he had forgotten a part of the argument of the polar form. He therefore inferred he would receive 0.5/1.5 from the first criterion. When the researcher asked how he had treated the \(-5\) in his solution, he corrected his answer and correctly concluded that the first criterion did not apply and that he got 0/1.5.

5 Conclusions and discussion

In this study, we investigated the cognitive and affective processing of checkbox grading feedback (RQ1) and compared the perceptions of checkbox grading feedback with those of classical feedback approaches (RQ2). To do so, we distributed a questionnaire among all 60 students who had participated in the mathematics exam at the Flemish Exam Commission, providing them access to sections of their checkbox grading feedback reports. Of the 60 students, 36 completed the questionnaire, and 4 of them agreed to semi-structured interviews. These sample sizes are a limitation of the study, especially concerning the interviews, as no saturation can be expected with 4 participants (Hennink & Kaiser, 2022).

Concerning the first research question, we used the quiz and self-reports administered during the questionnaire, together with the subsequent interviews. Regarding the self-reports, it is noteworthy that a substantial proportion of students were positive about checkbox grading: 97% of the students indicated the Flemish Exam Commission should adopt the approach, 91% concurred that the received feedback clearly indicates how the score was determined, 79% demonstrated an understanding of the relationship between the feedback and the score (which may also encompass the design choices made by the Flemish Exam Commission), 74% affirmed that they understood most of their feedback, and 77% indicated they would do better on future exams based on their feedback. It is important to acknowledge that self-reports, while illuminating, may not fully capture the intricate cognitive processes that students engage in. Nonetheless, these findings suggest that the sampled students, at a minimum, perceive a degree of understanding and effectiveness in their interaction with checkbox grading feedback.

The quiz results, with an average score of 72%, exceeded expectations, particularly in contrast to the exam scores (M: 37%). These results were surprising since the quiz questions were intentionally designed to be challenging, necessitating a deep cognitive engagement with three complex components: the checkbox grading items, the examination of a fellow student’s solution, and responding to quiz questions presented in the form of yes/no statements, all of which entailed substantial mathematical content. Furthermore, an interesting observation is the absence of a significant correlation between quiz performance and exam scores. This underscores the possibility that, despite the relatively low exam scores on average, a substantial proportion of the sampled students, including those with lower performance levels, could effectively decipher and interpret the checkbox grading feedback.

A consistent pattern emerged when connecting the outcomes of the quiz to the findings from the interviews. The four students could independently interpret their checkbox grading feedback for 11 of the 16 exam tasks discussed, a success rate of 69%. Three distinct types of interpretation difficulties were identified for the remaining five exam tasks, where an independent and accurate interpretation was not achieved. First, some students exhibited affective responses to the checkbox grading feedback that obstructed a correct interpretation (e.g., Sasha on task 1, Tom on tasks 7 and 8). Such reactions are well documented in the literature (Goetz et al., 2018; Koenka et al., 2021; Lipnevich & Smith, 2022), suggesting that difficulties in understanding the feedback due to an affective response may not be inherently related to checkbox grading and would possibly emerge in other feedback approaches too. The second interpretation challenge was insufficient content knowledge to comprehensively interpret the feedback (e.g., Tom on task 9, Emile on task 4). Notably, this was an issue in only two instances, which is remarkable considering that three of the four interviewed students did not pass their exams. The final interpretation challenge emerged when a student disagreed with the grading criteria (Tom, task 8).

For the second research question, about the perceptions of checkbox grading feedback compared to classic approaches, a ranking exercise of equal feedback content, but formulated in different ways, was used in the questionnaire. Moreover, exam task 2 was discussed during the interviews using the traditional grading to compare the cognitive processing between traditional and checkbox grading.

Regarding the ranking exercise, the traditional grading schemes of the Flemish Exam Commission were predominantly favoured over the other options: checkbox grading, classic written feedback, and only communicating a grade. Unsurprisingly, the latter option, simply communicating a grade, was the least preferred. However, the position of classic written feedback is intriguing. These feedback comments were characterised by concise, personalised explanations of students’ errors and omissions. Students only had to invest effort in understanding these comments, which were straightforward and self-explanatory, without being required to relate their solutions to grading guidelines or solution keys. It is worth noting that this approach was less favoured than checkbox grading and traditional grading, both of which involve the application of grading criteria. This observation aligns with the findings of Harks et al. (2014), who concluded that “Learners receiving process-oriented feedback (written feedback) for the first time might struggle to deduce evaluation criteria from the unfamiliar, copious feedback message, to memorise them, or to apply them while evaluating their learning processes and outcomes” (p. 283). Another potential explanation for this preference is students’ limited familiarity with receiving classic written feedback in the context of mathematics. Studies such as Knight (2003) have demonstrated that students in mathematics classes typically receive grades for their tests, with mistakes only infrequently being explicitly pointed out; explanations accompanying these grades are seldom provided. It is also possible that students failed to discern a clear link between the classic written feedback comments and the grades assigned to the intermediate steps, whereas traditional and checkbox grading distinctly illustrate how a grade is calculated.

A similar phenomenon may have influenced the preference for traditional grading over checkbox grading when comparing these two approaches. Students encountered checkbox grading reports for the first time during the ranking exercise, which might have led them to opt for what they were more accustomed to. Notably, ranking traditional grading first somewhat contradicts the fact that 57.9% of students who had previously participated in a review appointment (involving traditional grading schemes) indicated that they required assistance in connecting their grades to the grading criteria, while 91% of students agreed that checkbox grading clearly indicated how a score is derived.

During the interviews, only one student (Sasha) could independently deduce the grade she received when comparing her solution to the traditional grading scheme, which paints a more negative picture than the interview data on checkbox grading. Remarkably, all four interviewed students expressed their preference for checkbox grading without any prompting from the interviewer.

While our study was framed in a high-stakes mathematics exam at the Flemish Exam Commission, it is important to acknowledge the context-specific limitations of this research. Further research in larger-scale exams or real classroom settings with peer and self-feedback could provide valuable insights into the value of checkbox grading in other contexts. Additionally, taking the behavioural processing (‘What do students do with the received feedback?’, see Sect. 2.2) and the resulting outcomes into account in a follow-up study would provide valuable insights into the generalisability of our findings, as this was not investigated here.

In conclusion, this study explored students’ perspectives on checkbox grading. It is essential to exercise caution when drawing conclusions due to the sample size. Nonetheless, the findings indicate that students generally view checkbox grading positively: they would like the approach to be implemented and demonstrate a notable comprehension of the feedback provided through checkbox grading. This positive perception appears consistent across students, as no significant correlation was observed with their exam scores. When examining preferences for different feedback approaches, students exhibit an intuitive preference for feedback presented in grading guidelines, as opposed to written feedback or mere grade communication. They tend to favour traditional grading guidelines over checkbox grading; however, some limited evidence suggests that the traditional grading schemes may be more challenging to understand.