
1 Introduction

Learning complex skills such as writing or problem-solving in physics is not an easy endeavor for many students. To optimize students’ learning, showing examples can be helpful (To et al., 2021; Sadler, 1989). Viewing examples should enable students to gain a better understanding of what constitutes quality (Orsmond et al., 2002; Rust et al., 2003; Sadler, 1989, 2009) and can support students’ self-regulation (To et al., 2021). However, merely presenting examples is not sufficient: students should also engage with these examples to reach a deeper understanding of quality (Carless & Chan, 2017; Handley & Williams, 2011; Sadler, 1989; Tai et al., 2018). This raises the question of how students should ideally interact with examples to optimize their learning. A promising approach is setting up a peer assessment in which students assess each other’s work (Carless & Boud, 2018; Tai et al., 2018; To et al., 2021). Several peer assessment methods can be adopted to support students in judging their peers’ work. Using a predefined list of criteria to assess pieces of work one by one is the most commonly used method (Carless & Chan, 2017; Rust et al., 2003) because it results in reliable judgements and makes quality criteria explicit for students (Jonsson & Svingby, 2007; Panadero & Jonsson, 2013). However, research shows that learning gains are higher when people compare two examples rather than study a single example (e.g., Alfieri et al., 2013). This suggests that comparative judgement, in which students compare two pieces of work and choose the better one, might also be a valuable peer assessment method.

Only a limited number of studies have explicitly compared the effectiveness of using a criteria list and comparative judgement in the context of peer assessment. Jones and Wheadon (2015) examined the reliability and validity of the outcomes of both approaches to peer assessment but did not investigate their learning effects. The latter was done by Bouwer et al. (2018) and, more recently, by Stuulen et al. (2022). In the study by Bouwer and colleagues (2018), forty second-year students enrolled in the course International Trade English 2A of a bachelor’s program in Business Management assessed essays of their fellow students using either a criteria list or comparative judgement. The findings show that the assessment method influences the quantity and quality of the feedback: students in the comparative condition give more feedback overall and, when giving negative feedback, attend more often to higher-order aspects and less often to lower-order aspects than students in the criteria condition. Furthermore, peer assessment method also impacts students’ writing performance: students in the comparative condition outperform their peers in the criteria condition (Bouwer et al., 2018). This suggests that the comparative approach might be valuable in supporting students. However, Stuulen and colleagues (2022) find no effect of assessment condition on high school students’ writing performance in Dutch and the opposite effect regarding the quality of feedback: students in the criteria condition give more higher-order feedback than students in the comparative condition. This raises the question of the extent to which the learning effects that Bouwer et al. (2018) found in the context of writing English essays in higher education generalize to other contexts and other subjects. Investigating the generalizability of findings to other contexts and subjects requires conceptual replication (Hendrick, 1990; Schmidt, 2009).

Therefore, this study set out to conceptually replicate the study by Bouwer et al. (2018) in other contexts and for other subjects. More specifically, it investigated the effect of both peer assessment methods on (a) the quality of students’ peer feedback in the context of writing in French (secondary education) and scientific reporting of statistical results (university education) and (b) students’ performance. The latter was also examined in the context of problem-solving in physics (secondary education).

2 Theoretical Framework

This theoretical framework first explains what is meant by the quality of peer feedback in the context of writing. This is followed by a discussion of both peer assessment methods and their expected learning effects.

2.1 Quality of Peer Feedback

During peer assessment, students are often asked to provide feedback. This is expected to encourage students to process information deeply (Lundstrom & Baker, 2009; Topping, 2009) and stimulates them to critically assess the work of their peers and to formulate strengths and weaknesses with which their peers can improve their work (Nicol & Macfarlane-Dick, 2006). Hence, it is important that the feedback students give is of high quality (Patchan et al., 2016). Bouwer et al. (2018) conceptualize feedback quality in terms of the content and quantity of feedback. They define quantity as the number of unique aspects per essay that a student refers to in their feedback. For the content of feedback, a distinction is made between higher-order and lower-order feedback. In the context of writing, higher-order aspects relate to, for example, the content, structure, and style of the essay; lower-order aspects refer to, for example, spelling, grammar, length, and layout (Bouwer et al., 2018; Cumming et al., 2002; Lesterhuis et al., 2018). Feedback that focuses on higher-order aspects is preferred as it contributes more to improving the quality of a text than feedback on lower-order aspects (Bouwer et al., 2018; Patchan et al., 2016).

2.2 Peer Assessment Using Criteria

Assessing pieces of work one by one using a list of criteria requires students to break down the quality of a piece of work into several separate aspects (Weigle, 2002). These criteria make it transparent how a piece of work will be assessed and what the expectations are (Jonsson & Svingby, 2007; Panadero & Jonsson, 2013; Sadler, 1989). The assessor evaluates the work on each criterion separately, and the final grade is obtained by summing the criterion scores (Norton, 2004; Sadler, 2009).

It is expected that by scoring each other’s work based on criteria, students learn how high-quality pieces of work differ from works of lower quality. This increases students’ knowledge of text quality and makes criteria and standards concrete (Bloxham & Boyd, 2007; Handley & Williams, 2011; Orsmond et al., 2002; Rust et al., 2003). Furthermore, understanding quality criteria helps students in monitoring and evaluating their own progress and performance (Tai et al., 2018). This helps them in self-regulating their own learning and makes them less dependent on the teacher (Bloxham & Boyd, 2007). It is important that self-regulation and self-evaluation skills are developed as they have been shown to be a strong predictor of better writing performance (Boud, 2000; Zimmerman & Risemberg, 1997).

Although studies show that students can reliably assess the work of fellow students using a criteria list (Topping, 1998, 2009), this method has also received criticism. Some studies indicate that there is no guarantee that the use of criteria results in reliable and valid outcomes (Sadler, 1989, 2009; Weigle, 2002). Assessors are not always consistent in their judgements, and they often disagree (Schoonen et al., 1997); some assessors are stricter than others (Weigle, 2002). Furthermore, when evaluating text quality, assessors differ in their interpretation of the criteria (Eckes, 2008), and it is difficult to define all criteria in concrete terms (Chapelle et al., 2008). As a result, this approach may prevent students from reaching a full understanding of the overall quality of a piece of work. Students may have the tendency to consider only the predefined criteria, whereas other aspects may also be relevant for assessing quality (Bouwer et al., 2015). Moreover, this approach does not allow students to develop the skill of determining for themselves which criteria are relevant for a given task, a skill they need outside school where no predetermined criteria are available. Finally, when students perceive the criteria as demands from teachers, this may be associated with superficial learning and achievement (Bell et al., 2013; Torrance, 2007).

2.3 Peer Assessment Using Comparative Judgement

Comparative judgement asks students to compare two pieces of work and indicate which is better in terms of the skill under assessment (Pollitt, 2012a, 2012b). All students make several comparative judgements. These judgements are statistically modelled to create a rank-order that orders the pieces of work from low to high quality (Pollitt, 2012a, 2012b). Comparative judgement requires students to make a holistic judgement which implies that a student evaluates the pieces of work as a whole and directly arrives at an overall judgement (Pollitt, 2012a, 2012b; Sadler, 2009). In addition, comparative judgement gives students the opportunity to reflect on how they conceptualize the quality of a piece of work (Sadler, 2009; Williamson & Huot, 1992).

Evidence from research into learning from comparison underpins that learning gains are higher when comparing examples than when viewing examples separately. While comparing, students look for similarities and differences between two pieces of work, which makes different aspects of each piece salient (Alfieri et al., 2013; Gentner, 2010; Gentner & Markman, 1997). For example, in one comparison the content of an essay may stand out, while in another comparison spelling mistakes may be noticeable. In this way, students come to a better understanding of the important characteristics that a good piece of work must satisfy (Alfieri et al., 2013; Gentner, 2010; Pachur & Olsson, 2012), which in turn enables them to deliver work of higher quality (Orsmond et al., 2002; Sadler, 1989).

That students gain a better understanding of quality criteria through comparison has also been demonstrated in the context of peer assessment using comparative judgement (Bartholomew et al., 2019; Jones & Alcock, 2014; Seery et al., 2012). For example, the study by Seery et al. (2012) shows that comparative judgement has a positive influence on the development of higher-order thinking of student teachers who comparatively assessed design projects of their peers. Similarly, the study by Bartholomew et al. (2019) shows that students in secondary education gain a better understanding of the assignment’s criteria by making comparative judgements on the work of their peers.

In addition, comparative judgement can be expected to have an impact on the quality of students’ feedback, although the evidence regarding the direction of this effect is mixed. Students in the comparative condition of the study by Bouwer et al. (2018) provide more feedback overall than students in the criteria condition. This is also found by Stuulen et al. (2022), but only for positive feedback, while students in the criteria condition give more negative feedback. The content of the feedback also differs depending on the peer assessment method. The results of Bouwer et al. (2018) indicate that students in the comparative condition give more negative feedback on higher-order aspects and less on lower-order aspects of their peers’ texts than students in the criteria condition, whereas positive feedback does not differ between conditions. The reverse is found in the study by Stuulen et al. (2022): students in the criteria condition give more higher-order feedback than students in the comparative condition, and no differences in lower-order feedback are found.

Peer assessment using comparative judgement can also improve students’ performance (Bartholomew et al., 2019; Bouwer et al., 2018). In the study by Bartholomew et al. (2019), students who participated in a peer assessment via comparative judgement performed better than students who only discussed their work with peers. The study by Bouwer et al. (2018) likewise shows that the performance of students who comparatively judged essays is higher than that of students who assessed essays using a criteria list. However, Stuulen and colleagues (2022) find no difference in students’ performance after participating in a peer assessment exercise using either a criteria list or comparative judgement.

3 This Study

The current study conceptually replicated the study of Bouwer et al. (2018) on the learning effect of two peer assessment methods (use of criteria and comparative judgement). In doing so, it examined the extent to which the results of Bouwer et al. (2018) generalize to other contexts and subjects (Hendrick, 1990; Schmidt, 2009). For this purpose, three small-scale studies were set up in Flanders (Belgium). The first two studies were run in secondary education and focused on problem-solving in physics and writing in French. For the third study, data on scientific reporting of statistical results were collected in a pre-master program of a Flemish university.

In line with Bouwer et al. (2018), two research questions were answered. The first research question investigated the effect of the use of criteria and comparative judgement on the quality of the peer feedback that students provided. Based on the results of the original study (Bouwer et al., 2018), it was expected that students in the comparative condition would provide more feedback in general and focus more on higher-order aspects than students in the criteria condition. Furthermore, the latter students were expected to focus more on lower-order aspects.

The second research question examined the effect of both assessment methods on students’ performance. In line with Bouwer et al. (2018), students’ prior knowledge and self-efficacy were controlled for. Based on the results of the original study, it was expected that students in the comparative condition would perform better than students in the criteria condition.

4 Method

Three small-scale studies were set up to conceptually replicate the findings of Bouwer et al. (2018). This section describes the methodology underpinning each of these studies. First, an outline of the three samples is given. Then, the three phases of the research design are discussed, including a description of the instruments employed in each sample. Finally, the operationalization of the key variables and the analysis approach are detailed.

All code used to clean and prepare the data sets, run the analyses and report on the results can be found online. All data files, fitted models, tables and figures can be consulted at the Open Science Framework.

4.1 Samples

Sample A (physics) was collected in one secondary school in Flanders (Belgium). All pupils enrolled in the third grade (aged 14 or 15 years) of the study track “Sciences” or “Sports sciences” were asked to participate voluntarily in this study. After being informed about the study, 81 pupils gave their written consent for participation (response rate: 94%). However, three pupils were not able to complete all assignments due to medical reasons. Excluding these pupils left data of 78 participants available for analysis. Most pupils were enrolled in the “Sciences” track (68%), and 59% of the sample were boys (n = 46).

The sample on writing in French (sample B) was collected in the fourth grade of the same secondary school. The participants were 42 pupils within the “Human sciences” (n = 22) or “Latin” study track (n = 20). All participants gave their written consent for participation in the study (response rate: 100%). The group of participants was composed of 30 girls and 12 boys, all aged 15 or 16 years.

Sample C (scientific reporting) was collected in a statistics course of a pre-master program at a Flemish university (Belgium). Of the 27 students who completed the consent form (response rate: 26%), 26 students participated in one or more phases of the study. Most students were female (n = 18), with an average age of 37.8 years (SD = 8.81).

For the samples collected in secondary education, ethical approval was requested and granted. No ethical approval was required for sample C; nonetheless, the same ethical guidelines were followed in collecting the data.

4.2 Design and Instruments

The design of all studies replicated the set-up used in the study of Bouwer et al. (2018): a pre-test to capture students’ prior knowledge and self-efficacy, an intervention in which students were randomly allocated to either the criteria or the comparative condition, and a post-test to measure students’ performance. Since data collection took place during the COVID-19 pandemic, data were mainly collected online. Below, each phase is discussed briefly. For more information on the materials that were used, interested readers are referred to the Open Science Framework. Table 4.1 summarizes the essential information per intervention phase for each sample.

Table 4.1 Overview of the three phases of the experimental design for each sample. Aspects printed in bolded italics refer to differences in set-up across the samples

4.2.1 Pre-test

During the pre-test, students’ prior knowledge was mapped using one or more open questions. In sample A, students were presented with a math problem on the topic of speed (“How long does it take to cover a distance of 5.25 km at an average speed of 13.8 m/s?”). Answers were scored using eight criteria that were agreed upon by four domain experts. Students could score either 0 or 1 for each criterion. These criteria tapped into students’ procedural (e.g., “Only symbolic language used”) and conceptual knowledge regarding physics (e.g., “Correct identification of the physics concepts”). Internal consistency (α = 0.61) and inter-rater reliability (ICC = 0.90) were checked. Students’ prior knowledge in sample B was measured by asking them to write down as many features of a good, emotive text in French as they could think of. Two raters independently coded the number of features provided (ICC = 0.91). Students in sample C were given a test consisting of five open questions to measure their prior knowledge. Two questions tapped into students’ factual knowledge regarding t-tests, while the three other questions required students to interpret the output of a t-test. Rules to score the responses were developed and discussed. Responses were partly double coded by two researchers (ICCQ1 = 0.91, ICCQ2 = 1, ICCQ3 = 1, ICCQ4 = 0.94, ICCQ5 = 1).
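The reliability checks reported above can be carried out with standard R functions. The sketch below is a minimal illustration only, assuming the criterion scores and the double-coded responses are stored in data frames with hypothetical names (criteria_scores, double_coded); it is not the authors’ actual script.

```r
# Minimal sketch of the pre-test reliability checks (hypothetical object names)
library(psych)

# Internal consistency across the eight scoring criteria of sample A
# (pupils in rows, criteria in columns, scores 0/1)
alpha(criteria_scores)

# Inter-rater reliability for the double-coded responses:
# ICC() reports several ICC variants; choose the one matching the rating design
ICC(double_coded[, c("rater1", "rater2")])
```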

A survey was administered to measure students’ self-efficacy. For sample A, an adapted version of Usher and Pajares’ (2009) survey on four sources of self-efficacy for mathematics mapped students’ vicarious experience, mastery experience, social persuasion, and psychological state (24 items rated on a six-point scale). In sample B, an adapted version of the Bruning et al. (2013) survey on self-efficacy for writing was administered. The instrument consisted of 15 items that captured students’ self-efficacy for ideation, conventions and self-regulation of writing using a slider ranging from 0 to 100. To map students’ self-efficacy in sample C, 11 items were developed that measured students’ self-efficacy regarding the content to be reported, interpreting statistical results, scientific writing style and language use (slider ranging from 0 to 100). These dimensions were aligned with the dimensions of the criteria list that was used in the criteria condition (see Intervention).

4.2.2 Intervention

Respectively eight, five and six pieces of work of different quality were selected to be assessed during the intervention phase in samples A, B and C. These works were either constructed based on common mistakes of students (sample A), selected from the texts of the previous year (sample B), or selected from an authentic (optional) assignment that students had made during the statistics course (sample C). Examples were anonymized in all samples.

Because all peer assessments were set up online, students could be randomly assigned to the comparative or criteria condition (even within classes). Students in the criteria condition scored pieces of work using a predefined criteria list that was implemented in Qualtrics. The criteria list in sample A was constructed by experts: the same eight criteria that were developed to score students’ prior knowledge were rephrased into questions (e.g., “Does the pupil use only symbolic language?”) that students had to answer with either ‘yes’ or ‘no’. In the two other samples, the criteria list was adapted from the one used by Bouwer et al. (2018). To assess their peers’ emotive texts in French (sample B), students had to judge the vocabulary, spelling, grammar, syntax, and content of a text by awarding a maximum of four points per criterion. To aid students’ judgements, descriptions indicative of a good, mediocre, or poor performance were provided per criterion (e.g., descriptions for grammar: one or two grammatical errors—multiple grammatical errors—lots of grammatical errors). In sample C, structure and content, correct interpretation of results, scientific style, and language use had to be judged. Each aspect was further divided into sub-criteria that were rated on a five-point scale (0: not at all good, 5: very good). Students rated respectively eight (sample A), five (sample B) or six pieces of work (sample C). Students in samples B and C were also allowed to give open feedback on the strengths and weaknesses of each piece of work. As the criteria used to judge the physics problems were felt to leave no room for additional open feedback, this was not implemented in the criteria condition of sample A. Consequently, the data of sample A could not be used to examine the effect of assessment method on the quality of the feedback (RQ1).

Students in the comparative condition chose the better of two pieces of work presented side by side using Comproved (https://comproved.com/en/). Students were instructed to “Choose the most correct or complete solution” (sample A), “Choose the better text” (sample B), or “Choose the report that is overall of better quality” (sample C). They were also allowed to give open feedback regarding the strengths and weaknesses of each piece of work (see Fig. 4.1 for screenshots of the implementation in Comproved in sample A). Students in the comparative condition were also provided with the assessment criteria, but the criteria list was not discussed with the students. Students made respectively ten (sample A), five (sample B) or three comparative judgements (sample C).

Fig. 4.1
Screenshots of the peer assessment exercise in the comparative condition of sample A (top: comparative judgement, bottom: feedback). Translations are added as bold text

4.2.3 Post-test

In the final phase of the experiment, students’ performance was captured using a writing task (samples B and C) or by letting students solve two math problems in the context of physics (sample A). Students in samples A and B had only 50 minutes to perform the task, while no time restrictions were given to the students in sample C.

The two math problems concerned the topics of speed (“How much time does it take a cyclist to cover a distance of 17.3 km at an average speed of 6.2 m/s?”) and force (“Professor Jones has landed on an unknown planet. A mass of 500 g exerts a force of 17.6 N. What is the gravitational field strength on this planet?”). Students’ responses were scored using the same criteria as in the pre-test (α = 0.56, ICC = 0.82). In sample B, students had to write a short emotive text (between 120 and 150 words) in French describing a confidant in the family with whom they have a strong relationship. The texts were uploaded to Comproved and assessed by ten experts. Each expert made 32 comparative judgements, which resulted in a reliable rank-order of the texts (SSR = 0.80; see Verhavert et al., 2019 for more information on the SSR). Students in sample C were given a research question and the output of a t-test and were asked to write a report using that information. The resulting reports were comparatively judged by six experts. Each expert made about 60 comparisons, which resulted in a rank-order of acceptable reliability (SSR = 0.61).

4.3 Variables

4.3.1 Prior Knowledge, Self-efficacy, and Performance

Students’ scores on the prior knowledge tests were summed (in samples A and C). Scores could range between 0 and 8 (samples A and C) and from 0 onwards (sample B). Tables 4.2 and 4.3 provide an overview of the minimum and maximum scores, the average, and the standard deviation of prior knowledge. In sample A, seven students obtained the maximum score on the prior knowledge test (a score of 8). This was accounted for when examining the effect of both approaches on students’ performance (RQ2). Prior knowledge was standardized before analysis to facilitate comparison across samples.

Table 4.2 Range (Min, Max), mean (M) and standard deviation (SD) of prior knowledge, sources of self-efficacy and performance for sample A
Table 4.3 Range (Min, Max), mean (M) and standard deviation (SD) of prior knowledge, self-efficacy and performance for sample B and sample C

Exploratory factor analysis (EFA) with oblique rotation was used to examine the factor structure of the self-efficacy instruments. For sample A, three scales were retained tapping students’ mastery experience (4 items, α = 0.84), social persuasion (6 items, α = 0.86) and psychological state (6 items, α = 0.84). EFA on the self-efficacy items of sample B and sample C indicated that only one factor could be retained with respectively 14 items (α = 0.95) and 8 items (α = 0.98). Full results of EFA can be found online. Items were summed and divided by the number of items to create the variables on self-efficacy. The self-efficacy measures in sample A could range from 1 to 6, while self-efficacy scores were bounded between 0 and 100 in sample B and C (see Tables 4.2 and 4.3). Self-efficacy measures were standardized before analysis.
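As an illustration of this step, the following sketch shows how the oblique-rotation EFA, the internal consistencies, and the standardized scale scores could be obtained with the psych package in R; all object and column names are hypothetical, and the exact estimator used by the authors is not reported.

```r
# Hypothetical sketch of the EFA and scale construction (sample A as example)
library(psych)

efa <- fa(selfefficacy_items, nfactors = 3, rotate = "oblimin", fm = "ml")
print(efa$loadings, cutoff = 0.30)

# Internal consistency of one retained scale (item selection is a placeholder)
alpha(selfefficacy_items[, mastery_items])

# Scale score = items summed and divided by the number of items, then standardized
mastery   <- rowMeans(selfefficacy_items[, mastery_items], na.rm = TRUE)
mastery_z <- as.numeric(scale(mastery))
```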

Students’ performance refers to their total score on the physics problems (sample A) or the quality of their text (samples B and C). Total scores for problem-solving were calculated by adding up students’ scores on the criteria of both post-test problems (range 0 to 16). The quality of the texts was estimated using the comparative judgements of all experts. This resulted in a score per text (expressed in logits) which indicates the probability that this text would be judged as better than an average text. Thus, texts with a positive score are relatively better than an average text, while the opposite is true for texts with a negative score. Tables 4.2 and 4.3 show descriptive statistics of the variables operationalizing students’ performance. All performance scores were standardized before analysis.
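The chapter does not name the underlying measurement model, but comparative judgement data are typically analysed with a Bradley-Terry-Luce type model; assuming that convention, the logit score $\theta_i$ of text $i$ translates into preference probabilities as

$$
P(\text{text } i \succ \text{text } j) = \frac{\exp(\theta_i - \theta_j)}{1 + \exp(\theta_i - \theta_j)},
\qquad
P(\text{text } i \succ \text{average text}) = \frac{\exp(\theta_i)}{1 + \exp(\theta_i)},
$$

since an average text has $\theta = 0$. For example, a text with $\theta_i = 0.5$ would be preferred over an average text with a probability of about 0.62, while a negative score corresponds to a probability below 0.5.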

4.3.2 Feedback

To operationalize the quantity and content of the feedback, students’ open comments were qualitatively coded. First, the comments were divided into feedback statements that each referred to a single aspect, which resulted in 380 statements for sample B and 386 statements for sample C. The variable representing the amount of feedback was created by summing the number of unique aspects mentioned per piece of work. Similarly, variables representing the amount of positive and negative feedback were created for sample B (positive: 170, negative: 210) and sample C (positive: 181, negative: 187). Table 4.4 provides descriptive statistics for the variables that represent the total amount of feedback and the amount of positive and negative feedback. Overall, students provided feedback about at least one positive or negative aspect in 77.1% of the judgements in sample B and in all judgements in sample C. Figure 4.2 shows for both samples the relative share of the number of arguments provided (positive and negative) per judgement.
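As a minimal data-wrangling sketch (hypothetical column names; dplyr assumed), the amount of feedback per judgement can be obtained by counting the unique aspects mentioned per student, product, and valence:

```r
library(dplyr)

# 'statements' holds one row per coded feedback statement:
# student, product, valence (positive/negative), aspect
amount <- statements %>%
  distinct(student, product, valence, aspect) %>%  # count each aspect only once
  count(student, product, valence, name = "n_feedback")
```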

Table 4.4 Range (Min, Max), mean (M) and standard deviation (SD) of amount of (positive and negative) feedback statements for sample B and sample C
Fig. 4.2
Relative frequencies of the number of arguments provided per judgement for sample B and sample C

Then, the content of the feedback was operationalized by assigning each feedback statement to one of the categories also included in the criteria list. For sample B, this could be either ‘content’, ‘syntax’, ‘grammar’, ‘vocabulary’, or ‘spelling’. The first two categories were considered statements referring to higher-order aspects of writing in French in the fourth grade of secondary education; the three remaining categories were labelled as lower-order aspects. A distinction was made between positive and negative statements. Fourteen percent of all feedback statements were independently coded by two raters, resulting in ICCs ranging from 0.73 to 0.96. The feedback statements in sample C were also deductively coded (based on the criteria list), resulting in the categories ‘content’, ‘interpretation’, ‘scientific style’, and ‘language use’. Three categories were inductively added that referred to making a holistic judgement, the length of the report, or other aspects. The categories ‘content’, ‘interpretation’ and ‘scientific style’ were considered as referring to higher-order aspects of scientific writing. As in sample B, positive and negative statements were distinguished. Inter-rater reliability was calculated using Cohen’s kappa and ranged from acceptable to good (0.6 < κ < 1). Because a student sometimes referred more than once to the same category, all variables representing the content of the feedback were recoded to dummy variables taking a value of 0 (aspect not mentioned) or 1 (aspect mentioned). Table 4.5 presents descriptive statistics for the dummy variables on the content of the positive and negative feedback.
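The recoding into dummy variables and the inter-rater check could look as follows; this is a sketch with hypothetical object names, with the psych package used for Cohen’s kappa (the chapter does not state which implementation was used).

```r
library(psych)

# Dummy variable: 1 if an aspect was mentioned at least once for a product, 0 otherwise
feedback$content_neg <- as.integer(feedback$n_content_neg > 0)

# Cohen's kappa for the double-coded subset (two raters' category assignments)
cohen.kappa(cbind(double_coded$category_rater1, double_coded$category_rater2))
```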

Table 4.5 Absolute frequency (N) and relative frequency (%) of the dummy variables representing the content of the positive and negative feedback for sample B and sample C

4.4 Analyses

Before answering both research questions, the randomization of students was checked. The checks revealed only two noteworthy imbalances, neither of which was statistically significant. Students in the comparative condition of sample A (M = 0.13, SD = 1.02) scored 0.27 SD higher (t(75.78) = −1.21, p = 0.22, d = −0.27) on psychological state (self-efficacy) than students in the criteria condition (M = −0.13, SD = 0.97). In sample B, students’ self-efficacy was 0.22 SD higher (t(39.79) = 0.70, p = 0.49, d = 0.22) in the criteria (M = 0.11, SD = 1.04) than in the comparative condition (M = −0.11, SD = 0.97). Results of all randomization checks can be consulted online.

The effect of condition on the quality of the feedback provided (RQ1) was tested using generalized cross-classified linear mixed-effect models fitted with the R package lme4 (version 1.1-25; Bates et al., 2015). These models account for the hierarchy in the data by examining the effect of condition on the amount/content of feedback for an average student (fixed effects) while also taking differences in the amount/content of feedback between students and between products (random effects) into account (Fielding & Goldstein, 2006). The dependent variables were not normally distributed as they were either counts (amount of feedback) or binary variables (content of feedback). Therefore, generalized mixed-effect models assuming a Poisson distribution with log-link (amount of feedback) or a binomial distribution with logit-link (content of feedback) were used. Two effect sizes were calculated using the MuMIn package (version 1.43.17; Bartoń, 2022). The marginal R2 represents differences in the amount/content of the feedback attributable to the average effect of condition (fixed effects); its interpretation is analogous to that of the R2 in ordinary linear regression models. The conditional R2 represents the differences in the amount/content of feedback that can be explained by the whole model (fixed and random effects). Consequently, the difference between both R2-statistics gives an indication of the variation in the amount/content of the feedback attributable to differences between students and between products (random effects). However, these effect sizes should be interpreted cautiously given that their size depends on the location of the intercept (see Johnson, 2014; Nakagawa & Schielzeth, 2013).
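A condensed sketch of these models is given below. The variable and data frame names are hypothetical, but the lme4 and MuMIn calls follow the packages named above; the crossed random effects for students and products implement the cross-classified structure.

```r
library(lme4)
library(MuMIn)

# Amount of feedback: Poisson model with crossed random effects
m_amount <- glmer(n_feedback ~ condition + (1 | student) + (1 | product),
                  family = poisson(link = "log"), data = feedback)

# Content of feedback: binomial model on the dummy "aspect mentioned" (0/1)
m_content <- glmer(aspect_mentioned ~ condition + (1 | student) + (1 | product),
                   family = binomial(link = "logit"), data = feedback)

# Marginal (fixed effects only) and conditional (fixed + random effects) R2
r.squaredGLMM(m_amount)
r.squaredGLMM(m_content)
```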

To examine the effect of both assessment methods on students’ performance (RQ2), two analyses were performed. First, an independent-samples Welch t-test was applied to examine differences in average performance across both conditions. All t-tests were performed assuming unequal variances (see Delacre et al., 2017), and Cohen’s d was estimated to gain insight into the size of the effect. Then, a regression analysis was run with condition, prior knowledge and self-efficacy as independent variables and students’ performance as dependent variable. For sample A, these analyses were performed using the full data set and a data set excluding the students with maximum scores on prior knowledge (see Sect. 4.3.1).
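For RQ2, the analyses per sample reduce to a Welch t-test and a linear regression; a minimal sketch is given below (hypothetical names; the effectsize package is shown as one possible way to obtain Cohen’s d, not necessarily the one used here).

```r
# Welch t-test on standardized performance, assuming unequal variances
t.test(performance ~ condition, data = d, var.equal = FALSE)

# Cohen's d for the difference between conditions
effectsize::cohens_d(performance ~ condition, data = d)

# Regression controlling for prior knowledge and self-efficacy (all standardized)
fit <- lm(performance ~ condition + prior_knowledge + self_efficacy, data = d)
summary(fit)
confint(fit)  # 95% confidence intervals as reported in Tables 4.7 and 4.8
```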

5 Results

Results are discussed per research question. Findings on the effect of peer assessment method on the quality of the feedback (RQ1) are presented using visualizations. Tables with full results for RQ1 can be consulted in the appendix of this chapter.

5.1 Quality of Feedback

Results regarding sample B show that condition only impacts the number of positive feedback statements that an average student provides. An average student in the comparative condition mentions 0.8 arguments per judgement compared to 0.4 arguments for a student in the criteria condition (see Fig. 4.3). The marginal R2-statistic points to a moderate effect (marginal R2 = 0.09). The results also indicate that students differ in the total amount of feedback (SD = 0.93), in the amount of positive feedback (SD = 0.99) and in the amount of negative feedback (SD = 0.69) they give.

Fig. 4.3
Estimated average number of arguments per judgement by condition

In sample C, opposite results are found: an average student in the criteria condition provides more feedback per judgement (3.5 arguments) than a student in the comparative condition (2.8 arguments; see Fig. 4.3). An average student in the criteria condition also mentions more negative aspects per judgement (1.8 arguments) than a student in the comparative condition (1.2 arguments). The amount of positive feedback does not differ between conditions (1.7 positive arguments per judgement). Furthermore, students do not vary in the amount of feedback they provide, which is reflected in the small conditional R2-statistics (< 0.1). Consequently, differences can mainly be attributed to peer assessment condition. The marginal R2-statistics point to a small effect for the total amount of feedback (marginal R2 = 0.05) and to a moderate effect for the amount of negative feedback (marginal R2 = 0.07). Full results can be consulted in Table 4.10 of the appendix.

Further analysis of the content of the positive feedback shows that the probability of mentioning most aspects is, on average, the same across both conditions. Figure 4.4 visualizes the estimated average probability of mentioning each quality aspect when giving positive feedback for both samples.

Fig. 4.4
Estimated average probability of giving positive feedback for each quality aspect by condition

Only two differences are found that can be attributed to condition (see Fig. 4.4). First, the average student in the criteria condition of sample B has a lower probability of mentioning the higher-order aspect ‘Syntax’ than an average student in the comparative condition (4.2% versus 27.9%). This points to a moderate effect (marginal R2 = 0.13). Second, the probability of mentioning the higher-order aspect ‘Interpretation’ is higher in the criteria than in the comparative condition of sample C (24.1% versus 7.1%). Again, the marginal R2-statistic indicates a moderate effect (marginal R2 = 0.07). In addition to the effect of condition, it also appears that especially in sample B students differ in the probability of mentioning the higher-order aspect ‘Content’ (SD = 1.52) and the lower-order aspects ‘Grammar’ (SD = 2.16) and ‘Vocabulary’ (SD = 2.36). In sample C, differences between students are only found regarding the higher-order aspect ‘Content’ (SD = 1.17). The complete results of the analyses can be found in Table 4.11 of the appendix.

In-depth analysis of the content of the negative feedback does not reveal any average effect of condition. Figure 4.5 shows that the estimated average probability of mentioning a quality aspect is the same across both conditions in sample B and sample C. Moreover, all effect sizes are negligible or small (marginal R2 ≤ 0.03). Only differences between students or products in the probability of mentioning certain aspects are found. In sample B, the probability of mentioning the higher-order aspect ‘Syntax’ (SD = 1.33) and the lower-order aspect ‘Grammar’ (SD = 1.24) varied across students. Differences between products are found in sample C regarding the higher-order aspect ‘Content’: the probability of mentioning this aspect was higher for some products than for others (SD = 1.24). All results on the content of negative feedback can be consulted in Table 4.12 of the appendix.

Fig. 4.5
Estimated average probability of giving negative feedback for each quality aspect by condition

5.2 Effect on Performance

The results in Table 4.6 indicate that students’ performance after the intervention does not differ across conditions. The effect sizes (Cohen’s d) vary between −0.15 (sample A) and 0.16 (sample B). Hence, whether a student judged their peers’ work using criteria or made comparative judgements has no differential impact on their performance.

Table 4.6 Mean (M), standard deviation (SD) per condition and results of independent sample Welch t-tests

This lack of effect remains after controlling for prior knowledge and self-efficacy (see Tables 4.7 and 4.8). The difference between the criteria and the comparative condition ranges between −0.06 (sample B) and 0.13 (sample C). However, the 95% confidence intervals indicate that this effect cannot be generalized to the population of students in any sample (see Tables 4.7 and 4.8). Thus, assessment method has no differential effect on students’ performance in any of the samples.

Table 4.7 Estimates (Est.) and 95% confidence intervals (95% CI) of the regression models that examine the impact of condition on performance and control for prior knowledge and self-efficacy (SE) for sample A
Table 4.8 Estimates (Est.) and 95% confidence intervals (95% CI) of the regression models that examine the impact of condition on performance and control for prior knowledge and self-efficacy for samples B and C

6 Discussion

This study conceptually replicated the study of Bouwer et al. (2018) on the learning effects of peer assessment using either predefined criteria or comparative judgement. Three small-scale studies were set up, two in secondary education (problem-solving in physics, writing in French) and one in university education (scientific reporting of statistical results). After mapping students’ prior knowledge and self-efficacy, students were randomly allocated to a peer assessment condition. Students in the criteria condition assessed the work of their peers using a predefined criteria list, while students assigned to the comparative condition made comparative judgements. Students in the studies on writing in French and on scientific reporting were also allowed to provide open feedback on the strengths and weaknesses of the works they assessed. The analyses were analogous to the ones performed by Bouwer et al. (2018) and focused on the effect of both assessment methods on the quantity and content of the peer feedback and on students’ performance. Overall, the results of the conceptual replications show that the effects found by Bouwer and colleagues (2018) can only be replicated to a limited extent. Table 4.9 compares the results of the studies reported in this chapter to those of the Bouwer et al. (2018) study. Because the study of Stuulen et al. (2022) had a similar set-up, its findings are also added to the table.

Table 4.9 Comparison of the results regarding the impact of comparative judgement (CJ) and the use of criteria (CRIT) on the quantity and content of feedback and on students’ performance found in the three studies reported in this chapter, the study by Bouwer et al. (2018) and the study by Stuulen et al. (2022)

Regarding the quantity of the feedback, results of the conceptual replications are partly in line with those of Bouwer et al. (2018). The original study found a positive effect of comparative judgement on the total amount of feedback given. The replication studies also found differences in quantity of feedback across peer assessment conditions. However, the direction of the effect differed between samples. In one sample (secondary education, writing in French), students in the comparative condition gave more positive feedback (in line with Bouwer et al., 2018), while in the other sample (higher education, scientific reporting) the opposite was found as students in the criteria condition gave more negative feedback. These results are in line with those of Stuulen and colleagues (2022) who also found students in the comparative condition to give more positive feedback but less negative feedback than students in the criteria condition. One explanation for the inconclusive findings might be the difference in the number of works students assessed. In the studies by Bouwer et al. (2018), Stuulen et al. (2022) and two of the samples in this chapter, students in the comparative condition judged more pieces of work than their peers in the criteria condition. This gave these students more feedback opportunities than students in the criteria condition. Moreover, the number of judgements could also vary within the comparative condition. Therefore, future research should control for the number of judgements made across and within condition. Another explanation for the results relates to students’ initial task experience. Students in the sample on scientific reporting hardly had any experience with the task at hand before the intervention, while students in the study of Bouwer et al. (2018) and in the sample on writing in French all had prior experience with writing in English or French. Hence, students in the sample on scientific reporting might have been less able to fall back on their own understanding of quality than students in the other samples. This might have benefitted students in the criteria condition since they could draw on the predefined criteria to formulate feedback (Jonsson & Svingby, 2007; Panadero & Jonsson, 2013; Sadler, 2009). Future research can investigate to what extent the interaction between students’ prior task experience and assessment method influences the amount of feedback students give.

Looking at the results on the content of the feedback, a complex picture emerges. Whereas Bouwer and colleagues (2018) found differences between both assessment methods in the type of negative feedback provided, the replication studies presented in this chapter found only two differences across conditions in the content of the positive feedback that students gave. Students who comparatively judged French texts gave more positive feedback on the syntax of the texts, while students in the criteria condition of the sample on scientific reporting provided more positive feedback on the aspect of interpretation. Both effects concern higher-order aspects of writing, which is partly in line with the results of Bouwer et al. (2018) and of Stuulen et al. (2022). The reasons for these differences are unclear. One important aspect that might have been at play and has not been considered in any of the studies is which pieces of work the students assessed. Research into comparative judgement shows that the aspects assessors look at depend on the pair composition (Lesterhuis, 2018). This also applies when using criteria, because students are assumed to compare examples to their own internal standards or previous work (Nicol, 2020). Hence, future research should look at the impact of confronting students with specific (pairs of) examples.

The positive effect of comparative judgement on students’ writing performance found by Bouwer et al. (2018) was not replicated in any of the samples. This lack of effect is in line with the findings of Stuulen et al. (2022), who also did not find differences in performance due to peer assessment method. However, the design of the replication studies in this chapter (and of the Bouwer et al. study) did not allow drawing any conclusions regarding improvement in students’ performance due to assessment condition. Furthermore, according to Nicol (2020) and To et al. (2021), it might be beneficial if students already have some experience with a task before engaging in peer assessment. In that case, they have already developed a sense of quality and generated internal feedback on their own work (strengths, gaps). This allows them to focus more on information that is relevant for them during the peer assessment, which can enhance the learning effect of the peer assessment exercise. Together, this calls for future studies that capture students’ performance before the intervention and allow them to revise the same task after they have participated in a peer assessment exercise. Furthermore, future studies should also dig into students’ learning processes while engaging in peer assessment. Students’ feedback statements only capture those quality aspects that students were aware of and reported. These statements reveal neither how students came to notice these aspects nor how they cognitively processed the examples. Therefore, studies should combine feedback statements with online measures such as eye-tracking and log data to fully map students’ learning processes. Eye-tracking and log data provide objective measures of the cognitive processes that students engage in (e.g., attention allocation). Replaying students’ eye movements can also be used to capture retrospective cued-recall data on students’ learning processes.

This study set out to conceptually replicate the findings of the study by Bouwer et al. (2018). Overall, it can be concluded that the results are only replicated to a limited extent. According to the literature on replication research, conceptual replication studies that fail to replicate results add little insight to the scientific knowledge base (Hendrick, 1990). However, replicating the Bouwer et al. (2018) study three times in a different context sheds light on individual characteristics (e.g., differences in initial task experience) and characteristics of the peer assessment design (e.g., differences in the number of judgements) that might explain the variety in the results found. In this respect, a systematic replication study would be interesting, in which different foundational aspects of a study (e.g., number of judgements made, characteristics of respondents in the sample) are systematically varied, while the hypothesis behind the study (the learning effect of peer assessment method) is retained (Hendrick, 1990). This would provide systematic insight into which type of peer assessment method (use of criteria or comparative judgement) and peer assessment design (e.g., number of judgements, type of exemplars) is most beneficial for which students (e.g., high prior task experience).

Some additional limitations of this study should be mentioned. First, the results are based on small samples, which makes them more uncertain. Further replication research using larger samples is needed. Second, all samples were collected amid the COVID-19 pandemic. Consequently, some procedures were less controlled than is common in experimental designs. This is especially important for the peer assessment exercises, which were run online in all samples. Although this mirrors actual classroom conditions, it remains unclear to what extent students engaged in the peer assessment exercise as intended, which may have biased the results. Finally, students’ effort and time investment were not considered, which might have confounded the results of this study. Despite these limitations, this study provides insights into the effect of using criteria and comparative judgement in the context of peer assessment. Furthermore, it highlights the need for (conceptual) replication studies within the educational sciences, as these can shed light on the replicability of effects and provide an avenue for further research and theory development.