Introduction

When a student’s assessment is assigned a grade we want that grade to be entirely dependent on the quality of the student’s work and not at all dependent on the biases and idiosyncrasies of whoever happened to assess it. Where we succeed we can say that the assessment is reliable, and where we fail we can say the assessment is unreliable (Berkowitz et al. 2000). Reliability in high-stakes educational assessment is particularly important where students’ grades affect future educational and employment prospects.

One way to ensure reliability is to use so-called objective tests in which answers are unambiguously right or wrong (Meadows and Billington 2005). Examples are an arithmetic test in mathematics, a spelling test in languages, or a multiple-choice test in any subject. Objective tests also have the advantage of being quick and inexpensive to score and grade because the task can be automated (Sangwin 2013) and in practice often is (Alomran and Chia 2018). For these reasons—reliability and efficiency—objective tests are common in education systems around the world (Black et al. 2012).

However, objective tests are far from universal because they risk delivering reliability at the cost of validity (Assessment Research Group 2009; Wiliam 2001). Educationalists have argued that there is more to doing and understanding mathematics or languages than answering a series of closed questions. We typically want evidence that students can make connections between learned ideas, apply their understanding to novel contexts, construct arguments, demonstrate chains of reasoning, and so on (Newton and Shaw 2014; Wiliam 2010). Such achievements are not readily assessed using objective tests but instead require students to undertake a sustained performance that reflects the discourse and cultural norms of the subject (Jones et al. 2014). In mathematics this might be a student report in which mathematical ideas are creatively applied to a problem situation; in languages this might be a student report in which literary texts are critiqued. Scholars have argued that such assessments are more valid and fit for purpose than objective tests (e.g. Baird et al. 2017). However, assessing performance-based assessments is subjective and therefore tends to be less reliable than objective tests, in part because outcomes reflect to some extent the biases and idiosyncrasies of individual assessors (Murphy 1982; Newton 1996).

As such there exists a tension between ensuring an assessment is reliable and ensuring that it is valid (Wiliam 2001; Pollitt 2012). To negotiate this tension education systems commonly make use of three techniques. First, many tests are populated with short questions, typically worth several marks, which might be seen as a compromise between objective-style and performance-style questions (Hipkins et al. 2016). Second, many exams and assessments are designed with a rubric or marking scheme that assessors use as the basis for awarding marks. Rubrics typically attempt to anticipate the full range of student answers, although in practice professional judgement is often required when matching a script to a rubric (Suto and Nadas 2009). The third technique is for a sample of marked tests to be moderated, or re-marked, by an independent assessor.

However, these techniques produce imperfect outcomes. Short questions have been criticised for being too similar to objective questions and for contributing to a tick-box approach to assessment (Assessment Research Group 2009; Black et al. 2012). Where questions are more substantive, rubrics become so open to interpretation that reliable assessment outcomes are not achieved (Meadows and Billington 2005; Murphy 1982; Newton 1996). Moderation exercises can be applied only to a sample of student scripts without doubling the marking workload, and so a student commonly receives a grade without their particular script having been moderated.

In this paper we provide evidence in support of the reliability and validity of an alternative approach to producing student assessment outcomes, called comparative judgement (Pollitt 2012). We describe the origins and mechanics of comparative judgement and argue that there are educational assessment contexts in which it can deliver acceptable reliability and validity. We then present two studies in which we evaluated the application of comparative judgement to assess standard secondary school statistics and English assignments in New Zealand.

Comparative Judgement

Comparative judgement is a long-established research method that originates in the academic discipline of psychophysics. Work by the American psychologist Thurstone (1927) established that human beings are consistent with one another when asked to compare one object with another, but are inconsistent when asked to judge the absolute location of an object on a scale. For example, Thurstone found that participants were consistent when asked which of two weights is heavier and inconsistent when asked to judge the weight of a single object. Thurstone (1954) later used comparative judgement to construct scales of difficult-to-specify constructs such as beauty and social attitudes.

The past two decades have seen the growth of research and practice in applying comparative judgement to educational assessment. The internet has enabled use at a larger scale than is possible in traditional laboratory settings because student work can be digitised and presented to examiners remotely and efficiently (Pollitt 2012). A key motivation for considering comparative judgement rather than traditional assessment methods is the claim that comparative judgement is more reliable than marking for the case of open-ended assessments (Jones and Alcock 2014; Steedle and Ferrara 2016; Verhavert et al. 2019).

Once the judging is complete, the examiners’ binary decisions are fitted to a statistical model to produce a unique score for each script (Bramley 2007). It is not necessary to compare every script with every other script; the literature suggests that assessing n scripts requires a total of 10n to 37n judgements to produce a reliable scale (Jones and Alcock 2014; Verhavert et al. 2019), depending on the nature of the tests and the expertise of the assessors. The reliability and validity of outcomes can be estimated using the techniques described in the methods section. Good reliability and validity for assessment outcomes have been reported in a variety of contexts including mathematical problem solving (Jones et al. 2014; Jones and Inglis 2015) and essays (Heldsinger and Humphry 2010; Steedle and Ferrara 2016; van Daal et al. 2019).
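
To make the scoring step concrete, the sketch below shows one way the binary decisions could be fitted to a Bradley-Terry model in Python. This is a minimal illustration under our own assumptions, not the implementation used by nomoremarking.com: the function and variable names are ours, and a small ridge penalty is added only to make the optimisation well-posed.

```python
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(judgements, n_scripts, ridge=1e-4):
    """Fit Bradley-Terry scores to pairwise decisions.

    judgements: list of (winner_index, loser_index) pairs, one per decision.
    Returns one score per script, centred on zero.
    """
    winners = np.array([w for w, _ in judgements])
    losers = np.array([l for _, l in judgements])

    def neg_log_likelihood(theta):
        # P(winner beats loser) = 1 / (1 + exp(-(theta_winner - theta_loser)))
        diff = theta[winners] - theta[losers]
        return np.sum(np.logaddexp(0.0, -diff)) + ridge * np.sum(theta ** 2)

    result = minimize(neg_log_likelihood, np.zeros(n_scripts), method="L-BFGS-B")
    return result.x - result.x.mean()

# Toy example with three scripts: script 0 beat script 1 twice, script 2 beat script 0 once.
scores = fit_bradley_terry([(0, 1), (0, 1), (2, 0)], n_scripts=3)
```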

There are two key reasons that comparative judgement outcomes can be more reliable than traditional marking when assessing student responses to open assessment tasks. The first is people’s capacity to compare two objects against one another more consistently than they can rate an object in isolation, as discussed above. The second is that assessor bias has no scope for expression when making binary decisions rather than assigning marks (Pollitt 2012). For example, a harsh assessor is likely to assign a lower mark than a lenient assessor. However, when comparing two pieces of work, both assessors must simply decide which piece of work is better, and so they produce the same outcome.

The Research

We contribute to the literature on using comparative judgement for the case of secondary assessments in statistics (Study 1) and English (Study 2) in New Zealand. Our focus was on whether the good reliabilities reported in the literature could be replicated for the case of secondary assessments in New Zealand. We also explored validity, and were able to do so in depth for Study 1, where relevant data were available. Finally, we sought to understand the judges’ experience of the comparative judgement process using a survey.

Study 1: Statistics

Materials

The assessment task used was for NCEA Level 2 AS91264 ‘Use statistical methods to make an inference’ and is shown in Appendix 1. The task requires students to take a sample from a population and make an inference that compares the position of the median of two population groups.

Participants

The participants were 134 students in Year 12 (ages 16 and 17) from a single school in New Zealand. The school was decile 8 and had a roll of about 1400 students, with 60% of students of NZ European ethnicity, 14% of Maori ethnicity and 5% of Pasifika ethnicity. The school was recruited by contacting the Head of Department of mathematics through a contact of one of the authors, and individual teachers were invited to participate on a voluntary basis. The study followed NZQA ethical practices for conducting research and guidelines for assessment conditions of entry.

Method

The students completed the assessment task as a summative assessment of a unit of work that was taught over a period of 6–8 weeks. Students completed the assessment electronically in Google Drive over four days, and could work on the task in and out of normal lesson time. Twenty-one scripts were incomplete and were excluded from further analysis, leaving a total of 113 participant scripts. The anonymised student scripts were uploaded to the online comparative judgement engine nomoremarking.com for assessment. In addition, six boundary scripts produced by NZQA to exemplify levels of achievement (High Not Achieved, Low Achieved, High Achieved, Low Merit, High Merit, Low Excellence) were also uploaded. Therefore the total number of participant and boundary scripts to be judged was 119.

The scripts were judged by 21 judges: eight were teachers from the same school as the participants; eight were pre-service teachers undertaking a practicum at a range of New Zealand schools; and five were NZQA employees who work as National Assessment Moderators, one of whom had previously worked as a subject advisor. The project included two initial sessions of professional development for the judges prior to the assessment tasks being undertaken. The first session, led by the NZQA moderator, supported assessors to become fully familiar with the assessment activity and the corresponding achievement standard. The second session supported assessors to become fluent in the online comparative judgement process. The judges then undertook the assessment procedure using this online process (nomoremarking.com). The judges were allocated 99 pairwise comparisons each. Eighteen judges completed all 99 comparisons and the other three completed 16, 50 and 63 each. This resulted in 1911 judgements in total, with each script receiving between 31 and 36 judgements (mode = 32).

We also collected participants’ Grade Point Averages (GPAs) for mathematics and English from their Year 11 results; these data were available for 99 of the participants. It was not possible to calculate a GPA for the other 14 students because they arrived at the school from overseas, either as the result of family immigration or as exchange students.

Analysis and Results

Comparative Judgement Scores

The judgement decisions were fitted to the Bradley-Terry model to produce a unique score for each student script (Pollitt 2012). The scores had a mean of 0.0 and a standard deviation of 1.7. The distribution of scores is shown in Fig. 1.

Fig. 1 Comparative judgement scores for statistics scripts

Reliability

Reliability was explored using three techniques (Jones and Alcock 2014). First, we calculated the Scale Separation Reliability (SSR), which gives an overall sense of the internal consistency of the outcomes considered analogous to Cronbach’s alpha (Pollitt 2012). A high value, SSR ≥ 0.7, suggests good internal consistency and this was the case here, SSR = 0.90.
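
For readers unfamiliar with SSR, the sketch below shows one common formulation, in which reliability is the proportion of observed score variance not attributable to measurement error. This is an illustrative formulation under our assumptions; the standard errors would come from the fitted model, and judging software may compute the statistic somewhat differently.

```python
import numpy as np

def scale_separation_reliability(scores, standard_errors):
    """SSR as 'true' variance over observed variance: one common formulation,
    analogous to Cronbach's alpha for comparative judgement scales."""
    observed_variance = np.var(scores, ddof=1)
    error_variance = np.mean(np.square(standard_errors))  # mean squared standard error
    return (observed_variance - error_variance) / observed_variance
```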

Second, we estimated inter-rater reliability using the split-halves technique described in Bisson et al. (2016). The judges were randomised into two groups and each group’s decisions were refitted to the Bradley-Terry model to produce two sets of scores for the scripts. The Pearson product-moment correlation coefficient was calculated between the two sets of scores; this process was repeated 100 times and the median correlation coefficient taken as the estimate of inter-rater reliability. The inter-rater reliability in this case was strong, r = 0.81.
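
A sketch of the split-halves procedure is given below. It assumes the fit_bradley_terry function from the earlier sketch and hypothetical judgements and judge_ids sequences recording, for each decision, the winning script, the losing script and the judge who made it.

```python
import numpy as np
from scipy.stats import pearsonr

def split_halves_reliability(judgements, judge_ids, n_scripts, n_repeats=100, seed=0):
    """Median correlation between scores fitted separately to the decisions of
    two random halves of the judges, repeated n_repeats times."""
    rng = np.random.default_rng(seed)
    judges = np.unique(judge_ids)
    correlations = []
    for _ in range(n_repeats):
        shuffled = rng.permutation(judges)
        half_a = set(shuffled[: len(judges) // 2])
        decisions_a = [j for j, jid in zip(judgements, judge_ids) if jid in half_a]
        decisions_b = [j for j, jid in zip(judgements, judge_ids) if jid not in half_a]
        scores_a = fit_bradley_terry(decisions_a, n_scripts)
        scores_b = fit_bradley_terry(decisions_b, n_scripts)
        correlations.append(pearsonr(scores_a, scores_b)[0])
    return np.median(correlations)
```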

Third, we calculated an infit statistic for every judge and compared these statistics to a threshold value (two standard deviations above the mean of the infit statistics) to check for any ‘misfitting’ judges (Pollitt 2012). An infit statistic above the threshold value suggests that a judge’s decisions were not consistent with the other judges’ decisions. Only one judge exceeded the threshold value, and only marginally. We recalculated the comparative judgement scores with this misfitting judge removed and found the correlation with the original scores (with the misfitting judge included) to be perfect, r = 1. Therefore the misfitting judge had no effect on final outcomes and was included in the remainder of the analysis.
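
The infit check can be sketched as follows, using one common formulation: an information-weighted mean square of the residuals of each judge’s decisions under the fitted model, with judges above the mean plus two standard deviations flagged. This is an illustration under our assumptions rather than the exact statistic computed by the judging software.

```python
import numpy as np

def judge_infit(judgements, judge_ids, theta):
    """Infit per judge: sum of squared residuals over sum of model variances
    across that judge's decisions (the observed outcome is 1 for the chosen script)."""
    infits = {}
    for judge in np.unique(judge_ids):
        residual_sq, information = 0.0, 0.0
        for (winner, loser), jid in zip(judgements, judge_ids):
            if jid != judge:
                continue
            p = 1.0 / (1.0 + np.exp(-(theta[winner] - theta[loser])))
            residual_sq += (1.0 - p) ** 2
            information += p * (1.0 - p)
        infits[judge] = residual_sq / information
    return infits

# Flag judges whose infit exceeds the mean plus two standard deviations:
# values = np.array(list(judge_infit(judgements, judge_ids, theta).values()))
# threshold = values.mean() + 2 * values.std()
```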

Taken together, these analyses suggest that the scores were reliable; that is, we would expect the same distribution of scores had an independent sample of experts conducted the judging.

Validity

The validity of assessment outcomes is more difficult to establish than reliability (Wiliam 2001). This is in part because, whereas reliability is an inherent property of assessment outcomes, validity tends to require comparison with external data such as independent achievement scores and expert judgement (Newton and Shaw 2014) and needs to take into account the intended purpose and interpretation of test scores (Kane 2013). In terms of the purposes of test scores, a limitation of much assessment research is that tests are often administered in contexts that are low- or zero-stakes from the perspective of the student participants. This threat to ecological validity was overcome in the present study, which made use of responses gathered as part of students’ usual summative assessment activities.

For Study 1, we investigated validity using three techniques as described here. First, we explored criterion validity (Newton and Shaw 2014) by scrutinising the positioning of the boundary scripts in the final rank order. The boundary scripts represented grades at six levels (High Not Achieved, Low Achieved, High Achieved, Low Merit, High Merit, Low Excellence) and were determined using traditional marking methods. A valid comparative judgement assessment would be expected to position the boundary scripts in the correct order, from High Not Achieved to Low Excellence, and this was the case as shown in Fig. 1.

Second, we investigated convergent and divergent validity (Newton and Shaw 2014) using the Year 11 mathematics and English GPAs that were available for 99 of the 113 students who completed the statistics task. The validity of the comparative judgement scores would be supported by evidence of convergence with mathematics GPAs and divergence with English GPAs. To investigate this we regressed mathematics and English GPAs onto comparative judgement scores. The regression explained 26.9% of the variance in comparative judgement scores, F(2, 96) = 17.67, p < 0.001, η² = 0.269. Mathematics GPA was a significant predictor of comparative judgement scores, b = 0.052, p = 0.002, and English GPA was also a significant predictor, b = 0.044, p = 0.001. This was not expected because previous research conducted in the UK has consistently found that mathematics, but not English, attainment is a significant predictor of mathematics comparative judgement scores (Jones et al. 2013; Jones and Wheadon 2015; Jones and Karadeniz 2016). To understand the relationship between the comparative judgement scores and GPAs further we investigated the correlations. The comparative judgement scores correlated moderately with mathematics GPAs, ρ = 0.450, and moderately with English GPAs, ρ = 0.468, as shown in Fig. 2, and these correlations were not significantly different (Steiger 1980), z = −0.19, p = 0.851. Interestingly, the correlation between mathematics and English GPAs was similar, ρ = 0.390. Therefore, the correlations between the comparative judgement outcomes and mathematics and English GPAs are in line with what we might generally expect from different assessments.
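
The regression and correlation analyses can be reproduced along the following lines; cj_scores, maths_gpa and english_gpa are hypothetical names for the scores and GPAs described above. Note that the Steiger (1980) test for comparing the two dependent correlations is not provided by scipy or statsmodels and would need a separate implementation.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import spearmanr

# Hypothetical data frame: one row per student with a Year 11 GPA (n = 99).
df = pd.DataFrame({"cj_score": cj_scores,
                   "maths_gpa": maths_gpa,
                   "english_gpa": english_gpa})

# Regress mathematics and English GPAs onto the comparative judgement scores.
model = smf.ols("cj_score ~ maths_gpa + english_gpa", data=df).fit()
print(model.summary())  # R-squared, F statistic, coefficients and p-values

# Spearman correlations between comparative judgement scores and each GPA.
rho_maths, _ = spearmanr(df["cj_score"], df["maths_gpa"])
rho_english, _ = spearmanr(df["cj_score"], df["english_gpa"])
```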

Fig. 2 Scatter plots of the relationship between comparative judgement scores and GPAs

Third, to further explore criterion validity, teachers at the school (who were also part of the comparative judgement process) awarded grades (Not Achieved, Achieved, Merit, Excellence) to the scripts using the traditional marking and internal moderation process that was familiar to them. Following usual protocol, each teacher awarded a grade to the students in his/her class and the name of the student was visible to the teacher. A small sample of scripts from each class was check-marked by the Head of Department as part of internal moderation. The teachers’ grades were further moderated by a National Assessment Moderator employed by NZQA, who reviewed each script and either confirmed or overturned the teacher’s judgement. Where there was disagreement the moderator’s judgement took precedence. Teacher grades were available for 94 of the participants.

A valid comparative judgement assessment would be expected to produce increasing scores from scripts graded Not Achieved through to those graded Excellence. Indeed the correlation between comparative judgement scores and grades was moderate, ρ = 0.740, as shown in Fig. 3. A one-way ANOVA was conducted to compare the comparative judgement scores of scripts at each of the four grades. Scores differed significantly across grades, F(3, 96) = 44.81, p < 0.001. Using a Bonferroni-adjusted alpha level of 0.017, there was a significant difference between scripts graded by teachers as Not Achieved (M = −1.98, SD = 1.17) and Achieved (M = −0.14, SD = 0.86), t(32.49) = −5.69, p < 0.001. There was a marginal difference between scripts graded as Achieved and Merit (M = 0.64, SD = 1.12), t(47.57) = −2.44, p = 0.019. There was a significant difference between scripts graded as Merit and Excellence (M = 1.37, SD = 0.92), t(50.55) = −3.07, p = 0.003.
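
A sketch of this ANOVA and the pairwise comparisons is shown below. Here scores_by_grade is a hypothetical dictionary mapping each teacher grade to the comparative judgement scores of the scripts awarded that grade, and equal_var=False gives Welch’s t-test, consistent with the fractional degrees of freedom reported above.

```python
from scipy.stats import f_oneway, ttest_ind

grades = ["Not Achieved", "Achieved", "Merit", "Excellence"]
F, p = f_oneway(*[scores_by_grade[g] for g in grades])

alpha = 0.05 / 3  # Bonferroni adjustment for the three adjacent-grade comparisons
for lower, upper in zip(grades, grades[1:]):
    t, p_pair = ttest_ind(scores_by_grade[lower], scores_by_grade[upper], equal_var=False)
    verdict = "significant" if p_pair < alpha else "not significant"
    print(f"{lower} vs {upper}: t = {t:.2f}, p = {p_pair:.3f} ({verdict} at alpha = {alpha:.3f})")
```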

Fig. 3 Comparative judgement scores and teacher grades for the statistics scripts

Taken together these three analyses support the validity of using comparative judgement to assess the statistics task.

Study 2: English

Materials

The assessment task used was NCEA Level 1 AS90852 ‘Explain significant connection(s) across texts, using supporting evidence’, as shown in Appendix 2. The task required students to write a report in which they identify a connection across four literature texts. Three of the texts were teacher-selected and one was student-selected. Students identify examples of their chosen connection from each of the literature texts and then explain them. The assessment activity was open book and completed online within a specified timeline.

Participants

The participants were 253 students in Year 11 (ages 15 and 16) from a single school in New Zealand, a different school from that in Study 1. The school was decile 8 and had a roll of about 1300 students, with 55% of students of NZ European ethnicity, 27% of Maori ethnicity and 7% of Pasifika ethnicity. Recruitment and ethical practices were the same as for Study 1.

Method

The method was broadly the same as Study 1, although unlike Study 1 boundary scripts were not available and none were included in the judging pot. The 253 scripts were judged by 17 judges: ten were teachers from the same school as the participants and seven were NZQA employees who work as National Assessment Moderators. There were two initial sessions of professional development for the judges, led by the NZQA moderator. The first session supported assessors to become fully familiar with the assessment activity and the corresponding Achievement Standard, and the second session showed assessors how to comparatively judge scripts online. The judges were allocated 119 pairwise comparisons each; 15 judges completed all 119 comparisons and the other two completed 20 and 51 each. This resulted in 1856 judgements in total, with each script receiving between 13 and 19 judgements (mode = 15).

Analysis and Results

Comparative Judgement Scores

Fitting the decision data to the Bradley-Terry model produced comparative judgement scores with mean 0.0 and standard deviation 2.4; the distribution is shown in Fig. 4. The grade boundaries in Fig. 4 fall between two scripts, whereas in Fig. 1 the boundaries were marked by boundary scripts within the rank order. This difference across the two studies arose because boundary scripts were not available for inclusion in the judging pot for English, and consequently the grade boundaries were derived quite differently. Six scripts were selected using teacher knowledge of the students grounded in classroom formative assessment activities: two were expected to sit on the boundary of Not Achieved and Achieved, two on the boundary of Achieved and Merit, and two on the boundary of Merit and Excellence. In every case these expectations were confirmed, with one script falling above each boundary and one below. The six scripts were also graded by the Head of Department and the National Moderator, and their positions noted in the comparative judgement ranking. A further two scripts either side of each pair in the ranking were graded in the same way. Thus at each boundary six scripts were graded by the traditional method and the boundary within the comparative judgement ranking was established. For each boundary this proved to be a reliable process: the scripts either side of the initial two scripts were graded as weaker and stronger than the first two.

Fig. 4 Comparative judgement scores for English scripts

Reliability

The internal consistency of the comparative judgement scores was strong, SSR = 0.85, and the split-halves inter-rater reliability was moderate, r = 0.72. The reliability of comparative judgement scores is related to the number of judgements (Verhavert et al. 2019), and the lower values here compared to Study 1 (SSR = 0.90, r = 0.81) would be expected because there were fewer judgements per script. There were no misfitting judges for the English assessment. Taken together, these analyses provide some support for the reliability of the assessment outcomes.

Validity

Unlike Study 1, GPAs were not available for the participating students in Study 2. Instead, two experienced teachers from schools not involved in the comparative judgement study independently assigned grades (Not Achieved, Achieved, Merit, Excellence) to a random sample of 50 scripts using traditional marking processes. The agreement between the examiners’ grades was acceptable, κ = 0.76.
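
Inter-examiner agreement of this kind can be computed with Cohen’s kappa, for example using scikit-learn; a minimal sketch, assuming two hypothetical lists of grades for the same 50 scripts in the same order:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical grade lists for the same scripts, in the same order (50 entries each).
examiner_1 = ["Merit", "Achieved", "Excellence", "Achieved"]   # ...
examiner_2 = ["Merit", "Achieved", "Merit", "Achieved"]        # ...

kappa = cohen_kappa_score(examiner_1, examiner_2)
```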

To evaluate criterion validity we investigated whether the comparative judgement scores increased from scripts graded Not Achieved through to those graded Excellence. This was confirmed, as shown in Fig. 5. A one-way ANOVA was conducted to compare the comparative judgement scores of scripts at each of the four grades for the first examiner. Scores differed significantly across grades, F(3, 46) = 21.34, p < 0.001. Using a Bonferroni-adjusted alpha level of 0.017, there was a significant difference between scripts graded as Not Achieved (M = −2.14, SD = 1.20) and Achieved (M = 0.23, SD = 1.93), t(27.16) = −4.19, p < 0.001. There was no significant difference between scripts graded as Achieved and Merit (M = 0.60, SD = 1.59), t(15.94) = −0.62, p = 0.545. There was a significant difference between scripts graded as Merit and Excellence (M = 3.02, SD = 1.46), t(16.99) = −3.45, p = 0.003. These results were replicated for the grades of the second examiner, although the difference between scripts graded as Merit and Excellence was marginal at the Bonferroni-adjusted alpha level of 0.017, t(17.30) = −2.56, p = 0.020.

Fig. 5 Comparative judgement scores and two sets of independent examiner grades for the English scripts

Survey

A survey was conducted to further investigate the validity, from the perspective of the judges, of the assessment process across both studies. The survey was designed to collect quantitative data on judges’ demographics and experience of conducting comparative judgement, including how they made decisions. The survey also collected qualitative data through two open-text questions that asked for any other feedback about the experience of assessing using comparative judgement. All judges involved in the project were invited to complete the survey and 26 (68%; 14 for statistics, 12 for English) did so. The quantitative questions and responses are shown in Table 1, and the open-text questions and responses are shown in Appendix 3.

Table 1 Summary of judges’ responses to the demographic and quantitative survey questions

Table 1 shows that the demographic and quantitative questions received broadly similar patterns of responses from judges in both studies. Overall the respondents reported that the mechanics of the process were relatively easy (Q1 and Q2), and that making judgement decisions was moderately difficult (Q3 to Q5).

An interesting difference between the studies is that the statistics judges reported being less likely to use intuition (Q6) than the English judges, Mann–Whitney U = 137.0, p = 0.005. Conversely, the statistics judges reported being more likely to match responses to the standard (Q7) than the English judges, Mann–Whitney U = 30.0, p = 0.004. This difference might reflect our expectations for each subject; assessing English involves more subjective judgement whereas assessing statistics is a more objective activity of judging correctness. Indeed, evidence of mathematics teachers and examiners mentally ‘converting’ scripts to grades and then conducting a comparative judgement between those grades has been reported previously (Jones et al. 2014). Nevertheless, judges in both studies commonly converted scripts to grades in their heads and then compared these estimated grades, as reflected in Q7 as well as in eight of the open-text responses. For example, one English judge wrote “comparing two pieces of work that both were at, for example, Merit seemed pointless as we do not differentiate between high and low Merit”.

Converting to grades and then comparing those grades is not in the intended spirit of comparative judgement because mental grading involves comparing each script against an absolute standard rather than directly comparing one script against another. It is the authors’ experience that entering into the spirit of comparative judgement is a difficult transition for some teachers who are used to traditional marking methods. One statistics judge wrote “sometimes found it hard to break from tradition and I wanted to grade [the scripts]”. The mental application of grades to scripts prior to comparison left some judges of the view that comparative judgement is an inappropriate or ineffective method of assessment, a sentiment that was explicit or implicit in five of the open-text responses. This may have affected the reliability and validity of the outcomes.

A related concern, evident in four of the open-text responses, was that relative judgements cannot be used to grade scripts. As one English judge wrote, comparative judgement consists of “ranking texts that should be assessed against the standard; so ranking could be completely irrelevant (e.g. the highest may still not actually achieve)”. This reflects a common intuition that comparative judgement produces exclusively norm-referenced rather than criterion-referenced outcomes. However this is not the case: scores can be used to rank or to grade whether they result from comparative judgement or traditional marking, as we have demonstrated using different techniques in both studies reported here.

Eight of the open-text responses commented on the time required to do the judging. One respondent stated that it was a “time saver”, another that they were “fixated” on how long each judgement was taking, and the remaining six that it was time consuming, presumably in comparison to traditional marking. This is interesting and perhaps surprising in the light of claims that comparative judgement is more efficient in terms of assessor time (Steedle and Ferrara 2016). We can roughly estimate total judging time because the nomoremarking website records the time taken for each judgement. These time data are an overestimate because the timer continues when judges take a break, and we know from Q8 that the judges did not always log off when taking breaks. However, we can take the mean of the judges’ median judging times as an estimate, multiply this by 10, which is the typical number of judgements per script required for internally consistent outcomes (Jones and Alcock 2014; Jones and Inglis 2015), and then halve the result because every pairwise judgement involves two scripts. This rough calculation suggests that for both statistics and English each script takes about 7 min to assess on average. Anecdotally, statistics teachers inform us that this is approximately half the time taken to grade by a traditional marking method. For the case of English, the two examiners who marked a random sample of scripts each reported taking around 6 h to complete their 50 scripts, thereby matching the estimate of 7 min per script for comparative judgement. In addition, comparative judgement produces inherently moderated outcomes, because every script has been seen by several assessors, whereas the estimate for traditional marking does not include time for moderation.
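
The rough time calculation can be written out as follows. Here median_judgement_times is a hypothetical list holding each judge’s median judging time in seconds, and the 84-second figure in the final comment is purely illustrative of a value that would yield the 7-minute estimate.

```python
import numpy as np

judgements_per_script = 10   # typical requirement for internally consistent outcomes
scripts_per_judgement = 2    # every pairwise judgement involves two scripts

# median_judgement_times: one median judging time (seconds) per judge,
# taken from the timing data recorded by the judging website.
mean_of_medians = np.mean(median_judgement_times)
seconds_per_script = mean_of_medians * judgements_per_script / scripts_per_judgement
print(f"approximately {seconds_per_script / 60:.1f} minutes of judging per script")

# e.g. a mean median judging time of 84 seconds gives 84 * 10 / 2 = 420 seconds,
# i.e. roughly the 7 minutes per script estimated in the text.
```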

Discussion

In the studies reported here we explored the application of comparative judgement to the assessment of statistics and English in New Zealand. Across both applications we reported that the comparative judgement procedure used here produced reliable outcomes; i.e., outcomes were largely independent of the particular individuals who conducted the comparative judging online. We also found evidence supporting the validity of the approach in each discipline, by comparing comparative judgement outcomes with achievement data and independently marked scripts. Further evidence for validity was provided by responses to a survey from 68% of the judges across the two studies. This evidence also revealed differences between the two subjects, with statistics judges reporting greater matching of scripts to Achievement Standards, and English judges reporting greater use of intuition. Finally, we considered the efficiency of the comparative judgement process relative to traditional marking methods, and concluded that comparative judgement produces moderated outcomes in the same or less time than marking produces unmoderated outcomes.

Therefore we conclude that the comparative judgement technique as applied here efficiently produced moderated assessment outcomes that were reliable and valid. This conclusion applies to the context of particular secondary students’ responses to standard statistics and English assessment tasks, and contributes this context to the growing literature evaluating the application of comparative judgement to a varied range of educational assessment activities. In the remainder of the discussion we consider limitations of the study and issues that might arise were a comparative judgement-based approach to assessment to be applied more generally.

The key limitation of the study design is common to assessment research: it is relatively straightforward to evaluate the reliability of an assessment, at least in the sense of the extent to which the final scores were independent of the particular assessors who produced them, as was the focus here. However, evaluating validity is less straightforward and requires the triangulation of quantitative and qualitative data to construct a validation argument that accounts for the intended purposes of the scores (Kane 2013; Newton and Shaw 2014). We attempted this here with the data available to us, including prior achievement data, consideration of grade boundaries, independent assessment of responses and a survey of assessors. However, validation can never be fully demonstrated or completed. Future research might harness further validation methods including the use of independent measures of the assessed construct (e.g. Bisson et al. 2016), standardised measures of potential covariates (e.g. Jones et al. 2019), content analysis (Hunter and Jones 2018), comparison of expert and non-expert judge outcomes (Jones and Alcock 2014), and other methods.

There are also limitations to the study design and potential applications of comparative judgement that are particular to the standard assessments considered in this paper. For example, each student response was assigned one of four grades, whereas comparative judgement results in a unique score for each script. As such, comparative judgement scores then need to be translated into grades, as was done here through the use of boundary scripts (statistics) or boundary setting (English). A consequence of translating scores into grades is that some pairwise judgements contribute little to the final outcome; as reported above, one judge commented that there is little point in choosing one Merit script over another Merit script. Comparative judgement might therefore be too precise a measure of achievement for certain NCEA Level 1 and 2 tasks; or, to put it another way, comparative judgement might be too reliable for contexts where grades are the only required outcome. However, reducing the reliability through, say, using fewer pairwise judgements to construct scores might not be viable because it would risk blurring the boundaries and thereby increase the number of students who receive the ‘wrong’ grade. To investigate this we conducted a post-hoc analysis on the statistics data using only one third of the available judgements, resulting in a set of comparative judgement scores with lowered reliability (SSR = 0.74 rather than the SSR = 0.90 reported in Study 1). We then assigned one of four grades (Not Achieved, Achieved, Merit, Excellence) to each script with reference to the boundary scripts for both the original and the lowered-reliability sets of scores. We found that 44 of the students (39%) would be assigned a different grade if the lowered-reliability scores were used in place of the original Study 1 scores. Therefore, while the precision of comparative judgement might be wasted on discriminating between scripts that are well within grade boundaries, it is required to ensure adequate discrimination near grade boundaries. Although comparative judgement was as efficient as, or more efficient than, traditional marking in both studies, there nevertheless seems to be an inherent inefficiency when the method is used in contexts where scores are translated into grades.
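
This post-hoc analysis can be sketched as follows. The sketch assumes the fit_bradley_terry function and judgements list from the earlier sketches, a full_scores array from the complete fit, and a hypothetical boundary_pairs list giving, for each grade boundary, the indices of the two boundary scripts that straddle it; taking the midpoint of each pair as the cut-off is our own simplification of how boundary scripts define cut-offs.

```python
import numpy as np

def cutoffs_from_boundaries(scores, boundary_pairs):
    """Derive ascending grade cut-offs as midpoints of each boundary-script pair
    (e.g. High Not Achieved / Low Achieved); an assumed simplification."""
    return sorted((scores[a] + scores[b]) / 2 for a, b in boundary_pairs)

def assign_grades(scores, cutoffs):
    """Map comparative judgement scores to the four NCEA grades."""
    labels = ["Not Achieved", "Achieved", "Merit", "Excellence"]
    return [labels[int(np.searchsorted(cutoffs, s, side="right"))] for s in scores]

# Refit on a random third of the judgements, re-derive cut-offs from the
# boundary scripts' new scores, regrade, and count how many grades change.
rng = np.random.default_rng(0)
subset = rng.choice(len(judgements), size=len(judgements) // 3, replace=False)
reduced_scores = fit_bradley_terry([judgements[i] for i in subset], n_scripts=119)

full_grades = assign_grades(full_scores, cutoffs_from_boundaries(full_scores, boundary_pairs))
reduced_grades = assign_grades(reduced_scores, cutoffs_from_boundaries(reduced_scores, boundary_pairs))
n_changed = sum(a != b for a, b in zip(full_grades, reduced_grades))
```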

A second possible weakness of the results reported here is the use of a small number of boundary scripts to set grade cut-offs. Boundary scripts have been used in other comparative judgement-based studies in mathematics (e.g. Jones and Sirl 2017) and in English (e.g. Heldsinger and Humphry 2010), but here the grade boundaries reflect quite specific and detailed NCEA standards, and we have assumed that judges made their comparisons on a dimension appropriate to the standard. Our survey data provide some support that judges were mindful of the standards when judging, but more systematic approaches to boundary setting could be employed. For example, Heldsinger and Humphry (2010) reported a systematic comparative judgement-based method for generating boundary scripts for standard setting in primary English in Australia. Such a method might be adapted to current NCEA practice by using a standard-setting procedure in which a sample of scripts is read and holistically compared with the standard for a grade.

Alternatively, comparative judgement methods might be better suited to contexts in which the task and assessment criterion are designed with pairwise judgement by subject experts in mind from the start (Jones and Inglis 2015). That is, educational assessment tasks should be designed with the method of converting student responses into scores, whether by multiple-choice, rubrics and marking, or indeed comparative judgement, considered from the outset. For the case of comparative judgement this might lead to more open and less predictable assessment tasks than those currently offered. Given the consensus that a rich diet of assessment is necessary to maximise validity and equity in educational outcomes (e.g. Wiliam 2010), comparative judgement might have an important part to play in the national education systems of the future.