Introduction

When a student’s assessment is assigned a grade we want that grade to be entirely dependent on the quality of the student’s work and not at all dependent on the biases and idiosyncrasies of whoever happened to assess it. Where we succeed we can say that the assessment is reliable, and where we fail we can say the assessment is unreliable (Berkowitz et al. 2000). Reliability in high-stakes educational assessment is particularly important where students’ grades affect future educational and employment prospects.

One way to ensure reliability is to use so-called objective tests in which answers are unambiguously right or wrong (Meadows and Billington 2005). Examples are an arithmetic test in mathematics, a spelling test in languages, or a multiple-choice test in any subject. Objective tests also have the advantage of being quick and inexpensive to score and grade because the task can be automated (Sangwin 2013) and in practice often is (Alomran and Chia 2018). For these reasons—reliability and efficiency—objective tests are common in education systems around the world (Black et al. 2012).

However, objective tests are far from universal because they risk delivering reliability at the cost of validity (Assessment Research Group 2009; Wiliam 2001). Educationalists have argued that there is more to doing and understanding mathematics or languages than answering a series of closed questions. We typically want evidence that students can make connections between learned ideas, apply their understanding to novel contexts, construct arguments, demonstrate chains of reasoning, and so on (Newton and Shaw 2014; Wiliam 2010). Such achievements are not readily assessed using objective tests but instead require students to undertake a sustained performance that reflects the discourse and cultural norms of the subject (Jones et al. 2014). In mathematics this might be a student report in which mathematical ideas are creatively applied to a problem situation; in languages this might be a student report in which literary texts are critiqued. Scholars have argued that such assessments are more valid and fit for purpose than objective tests (e.g. Baird et al. 2017). However, assessing performance-based assessments is subjective and therefore tends to be less reliable than objective tests, in part because outcomes reflect to some extent the biases and idiosyncrasies of individual assessors (Murphy 1982; Newton 1996).

As such there exists a tension between ensuring an assessment is reliable and ensuring that it is valid (Wiliam 2001; Pollitt 2012). To negotiate this tension education systems commonly make use of three techniques. First, many tests are populated with short questions, typically worth several marks, which might be seen as a compromise between objective-style and performance-style questions (Hipkins et al. 2016). Second, many exams and assessments are designed with a rubric or marking scheme that assessors use as the basis for awarding marks. Rubrics typically attempt to anticipate the full range of student answers, although in practice professional judgement is often required when matching a script to a rubric (Suto and Nadas 2009). The third technique is for a sample of marked tests to be moderated, or re-marked, by an independent assessor.

However, these techniques produce imperfect outcomes. Short questions have been criticised for being too similar to objective questions and for contributing to a tick-box approach to assessment (Assessment Research Group 2009; Black et al. 2012). Where questions are more substantive, rubrics become so open to interpretation that reliable assessment outcomes are not achieved (Meadows and Billington 2005; Murphy 1982; Newton 1996). Moderation exercises can be applied only to a sample of student scripts without doubling the marking workload, and so a student commonly receives a grade without their particular script having been moderated.

In this paper we provide evidence in support of the reliability and validity of an alternative approach to producing student assessment outcomes, called comparative judgement (Pollitt 2012). We describe the origins and mechanics of comparative judgement and argue that there are educational assessment contexts in which it can deliver acceptable reliability and validity. We then present two studies in which we evaluated the application of comparative judgement to assess standard secondary school statistics and English assignments in New Zealand.

Comparative Judgement

Comparative judgement is a long-established research method that originates in the academic discipline of psychophysics. Work by the American psychologist Thurstone (1927) established that human beings are consistent with one another when asked to compare one object with another, but are inconsistent when asked to judge the absolute location of an object on a scale. For example, Thurstone found that participants were consistent when asked which of two weights is heavier and inconsistent when asked to judge the weight of a single object. Thurstone (1954) later used comparative judgement to construct scales of difficult-to-specify constructs such as beauty and social attitudes.

The past two decades have seen the growth of research and practice in applying comparative judgement to educational assessment. The internet has enabled use at a larger scale than is possible in traditional laboratory settings because student work can be digitised and presented to examiners remotely and efficiently (Pollitt 2012). A key motivation for considering comparative judgement rather than traditional assessment methods is the claim that comparative judgement is more reliable than marking for the case of open-ended assessments (Jones and Alcock 2014; Steedle and Ferrara 2016; Verhavert et al. 2019).

Once the judging is complete, the examiners’ binary decisions are fitted to a statistical model to produce a unique score for each script (Bramley 2007). It is not necessary to compare every script with every other script; the literature suggests that assessing n scripts requires a total of 10n to 37n judgements to produce a reliable scale (Jones and Alcock 2014; Verhavert et al. 2019), depending on the nature of the tests and the expertise of the assessors. The reliability and validity of outcomes can be estimated using the techniques described in the methods section. Good reliability and validity for assessment outcomes have been reported in a variety of contexts including mathematical problem solving (Jones et al. 2014; Jones and Inglis 2015) and essays (Heldsinger and Humphry 2010; Steedle and Ferrara 2016; van Daal et al. 2019).
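
To make the scoring step concrete, the sketch below shows one way the binary decisions could be fitted to a Bradley-Terry model in Python. This is a minimal illustration under our own assumptions, not the implementation used by nomoremarking.com: the function and variable names are ours, and a small ridge penalty is added only to make the optimisation well-posed.

```python
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(judgements, n_scripts, ridge=1e-4):
    """Fit Bradley-Terry scores to pairwise decisions.

    judgements: list of (winner_index, loser_index) pairs, one per decision.
    Returns one score per script, centred on zero.
    """
    winners = np.array([w for w, _ in judgements])
    losers = np.array([l for _, l in judgements])

    def neg_log_likelihood(theta):
        # P(winner beats loser) = 1 / (1 + exp(-(theta_winner - theta_loser)))
        diff = theta[winners] - theta[losers]
        return np.sum(np.logaddexp(0.0, -diff)) + ridge * np.sum(theta ** 2)

    result = minimize(neg_log_likelihood, np.zeros(n_scripts), method="L-BFGS-B")
    return result.x - result.x.mean()

# Toy example with three scripts: script 0 beat script 1 twice, script 2 beat script 0 once.
scores = fit_bradley_terry([(0, 1), (0, 1), (2, 0)], n_scripts=3)
```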

There are two key reasons that comparative judgement outcomes can be more reliable than traditional marking when assessing student responses to open assessment tasks. The first is people’s capacity to compare two objects against one another more consistently than they can rate an object in isolation, as discussed above. The second is that assessor bias has no scope for expression when making binary decisions rather than assigning marks (Pollitt 2012). For example, a harsh assessor is likely to assign a lower mark than a lenient assessor. However, when comparing two pieces of work, both assessors must simply decide which piece of work is better, and so they produce the same outcome.

The Research

We contribute to the literature on using comparative judgement for the case of secondary assessments in statistics (Study 1) and English (Study 2) in New Zealand. Our focus was on whether the good reliabilities reported in the literature could be replicated for the case of secondary assessments in New Zealand. We also explored validity, and were able to do so in depth for Study 1, where relevant data were available. Finally, we sought to understand the judges’ experience of the comparative judgement process using a survey.

Study 1: Statistics

Materials

The assessment task used was for NCEA Level 2 AS91264 ‘Use statistical methods to make an inference’ and is shown in Appendix 1. The task requires students to take a sample from a population and make an inference that compares the position of the median of two population groups.

Participants

The participants were 134 students in Year 12 (ages 16 and 17) from a single school in New Zealand. The school was decile 8 and had a roll of about 1400 students, with 60% of students of NZ European ethnicity, 14% of Maori ethnicity and 5% of Pasifika ethnicity. The school was recruited by contacting the Head of Department of mathematics through a contact of one of the authors, and individual teachers were invited to participate on a voluntary basis. The study followed NZQA ethical practices for conducting research and guidelines for assessment conditions of entry.

Method

The students completed the assessment task as a summative assessment of a unit of work that was taught over a period of 6–8 weeks. Students completed the assessment electronically in Google Drive over four days, and could work on the task in and out of normal lesson time. Twenty-one scripts were incomplete and were excluded from further analysis, leaving a total of 113 participant scripts. The anonymised student scripts were uploaded to the online comparative judgement engine nomoremarking.com for assessment. In addition, six boundary scripts produced by NZQA to exemplify levels of achievement (High Not Achieved, Low Achieved, High Achieved, Low Merit, High Merit, Low Excellence) were also uploaded. Therefore the total number of participant and boundary scripts to be judged was 119.

The scripts were judged by 21 judges: eight were teachers from the same school as the participants; eight were pre-service teachers undertaking a practicum at a range of New Zealand schools; and five were NZQA employees who work as National Assessment Moderators, one of whom had previously worked as a subject advisor. The project included two initial sessions of professional development for the judges prior to the assessment tasks being undertaken. The first session, led by the NZQA moderator, supported assessors to become fully familiar with the assessment activity and the corresponding achievement standard. The second session supported assessors to become fluent in the online comparative judgement process. The judges then undertook the assessment procedure using this online process (nomoremarking.com). The judges were allocated 99 pairwise comparisons each. Eighteen judges completed all 99 comparisons and the other three completed 16, 50 and 63 each. This resulted in 1911 judgements in total, with each script receiving between 31 and 36 judgements (mode = 32).

We also collected participants’ Grade Point Averages (GPAs) for mathematics and English from their Year 11 results; these data were available for 99 of the participants. It was not possible to calculate a GPA for the other 14 students because they arrived at the school from overseas, either as the result of family immigration or as exchange students.

Analysis and Results

Comparative Judgement Scores

The judgement decisions were fitted to the Bradley-Terry model to produce a unique score for each student script (Pollitt 2012). The scores had a mean of 0.0 and a standard deviation of 1.7. The distribution of scores is shown in Fig. 1.

Fig. 1 Comparative judgement scores for statistics scripts

Reliability

Reliability was explored using three techniques (Jones and Alcock 2014). First, we calculated the Scale Separation Reliability (SSR), which gives an overall sense of the internal consistency of the outcomes considered analogous to Cronbach’s alpha (Pollitt 2012). A high value, SSR ≥ 0.7, suggests good internal consistency and this was the case here, SSR = 0.90.
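
For readers unfamiliar with SSR, the sketch below shows one common formulation, in which reliability is the proportion of observed score variance not attributable to measurement error. This is an illustrative formulation under our assumptions; the standard errors would come from the fitted model, and judging software may compute the statistic somewhat differently.

```python
import numpy as np

def scale_separation_reliability(scores, standard_errors):
    """SSR as 'true' variance over observed variance: one common formulation,
    analogous to Cronbach's alpha for comparative judgement scales."""
    observed_variance = np.var(scores, ddof=1)
    error_variance = np.mean(np.square(standard_errors))  # mean squared standard error
    return (observed_variance - error_variance) / observed_variance
```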

Second, we estimated inter-rater reliability using the split-halves technique described in Bisson et al. (2016). The judges were randomised into two groups and each group’s decisions were refitted to the Bradley-Terry model to produce two sets of scores for the scripts. The Pearson product-moment correlation coefficient was calculated between the two sets of scores; this process was repeated 100 times and the median correlation coefficient taken as the estimate of inter-rater reliability. The inter-rater reliability in this case was strong, r = 0.81.
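
A sketch of the split-halves procedure is given below. It assumes the fit_bradley_terry function from the earlier sketch and hypothetical judgements and judge_ids sequences recording, for each decision, the winning script, the losing script and the judge who made it.

```python
import numpy as np
from scipy.stats import pearsonr

def split_halves_reliability(judgements, judge_ids, n_scripts, n_repeats=100, seed=0):
    """Median correlation between scores fitted separately to the decisions of
    two random halves of the judges, repeated n_repeats times."""
    rng = np.random.default_rng(seed)
    judges = np.unique(judge_ids)
    correlations = []
    for _ in range(n_repeats):
        shuffled = rng.permutation(judges)
        half_a = set(shuffled[: len(judges) // 2])
        decisions_a = [j for j, jid in zip(judgements, judge_ids) if jid in half_a]
        decisions_b = [j for j, jid in zip(judgements, judge_ids) if jid not in half_a]
        scores_a = fit_bradley_terry(decisions_a, n_scripts)
        scores_b = fit_bradley_terry(decisions_b, n_scripts)
        correlations.append(pearsonr(scores_a, scores_b)[0])
    return np.median(correlations)
```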

Third, we calculated an infit statistic for every judge and compared these statistics to a threshold value (two standard deviations above the mean of the infit statistics) to check for any ‘misfitting’ judges (Pollitt 2012). An infit statistic above the threshold value suggests that a judge’s decisions were not consistent with the other judges’ decisions. Only one judge exceeded the threshold value, and only marginally. We recalculated the comparative judgement scores with this misfitting judge removed and found the correlation with the original scores (with the misfitting judge included) to be perfect, r = 1. Therefore the misfitting judge had no effect on final outcomes and was included in the remainder of the analysis.
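
The infit check can be sketched as follows, using one common formulation: an information-weighted mean square of the residuals of each judge’s decisions under the fitted model, with judges above the mean plus two standard deviations flagged. This is an illustration under our assumptions rather than the exact statistic computed by the judging software.

```python
import numpy as np

def judge_infit(judgements, judge_ids, theta):
    """Infit per judge: sum of squared residuals over sum of model variances
    across that judge's decisions (the observed outcome is 1 for the chosen script)."""
    infits = {}
    for judge in np.unique(judge_ids):
        residual_sq, information = 0.0, 0.0
        for (winner, loser), jid in zip(judgements, judge_ids):
            if jid != judge:
                continue
            p = 1.0 / (1.0 + np.exp(-(theta[winner] - theta[loser])))
            residual_sq += (1.0 - p) ** 2
            information += p * (1.0 - p)
        infits[judge] = residual_sq / information
    return infits

# Flag judges whose infit exceeds the mean plus two standard deviations:
# values = np.array(list(judge_infit(judgements, judge_ids, theta).values()))
# threshold = values.mean() + 2 * values.std()
```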

Taken together, these analyses suggest that the scores were reliable; that is, we would expect the same distribution of scores had an independent sample of experts conducted the judging.

Validity

The validity of assessment outcomes is more difficult to establish than reliability (Wiliam 2001). This is in part because, whereas reliability is an inherent property of assessment outcomes, validity tends to require comparison with external data such as independent achievement scores and expert judgement (Newton and Shaw 2014) and needs to take into account the intended purpose and interpretation of test scores (Kane 2013). In terms of the purposes of test scores, a limitation of much assessment research is that tests are often administered in contexts that are low- or zero-stakes from the perspective of the student participants. This threat to ecological validity was overcome in the present study, which made use of responses gathered as part of students’ usual summative assessment activities.

For Study 1, we investigated validity using three techniques as described here. First, we explored criterion validity (Newton and Shaw 2014) by scrutinising the positioning of the boundary scripts in the final rank order. The boundary scripts represented grades at six levels (High Not Achieved, Low Achieved, High Achieved, Low Merit, High Merit, Low Excellence) and were determined using traditional marking methods. A valid comparative judgement assessment would be expected to position the boundary scripts in the correct order, from High Not Achieved to Low Excellence, and this was the case as shown in Fig. 1.

Second, we investigated convergent and divergent validity (Newton and Shaw 2014) using the Year 11 mathematics and English GPAs that were available for 99 of the 113 students who completed the statistics task. The validity of the comparative judgement scores would be supported by evidence of convergence with mathematics GPAs and divergence with English GPAs. To investigate this we regressed mathematics and English GPAs onto comparative judgement scores. The regression explained 26.9% of the variance in comparative judgement scores, F(2, 96) = 17.67, p < 0.001, η² = 0.269. Mathematics GPA was a significant predictor of comparative judgement scores, b = 0.052, p = 0.002, and English GPA was also a significant predictor, b = 0.044, p = 0.001. This was not expected because previous research conducted in the UK has consistently found that mathematics, but not English, attainment is a significant predictor of mathematics comparative judgement scores (Jones et al. 2013; Jones and Wheadon 2015; Jones and Karadeniz 2016). To understand the relationship between the comparative judgement scores and GPAs further we investigated the correlations. The comparative judgement scores correlated moderately with mathematics GPAs, ρ = 0.450, and moderately with English GPAs, ρ = 0.468, as shown in Fig. 2, and these correlations were not significantly different (Steiger 1980), z = −0.19, p = 0.851. Interestingly, the correlation between mathematics and English GPAs was similar, ρ = 0.390. Therefore, the correlations between the comparative judgement outcomes and mathematics and English GPAs are in line with what we might generally expect from different assessments.
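
The regression and correlation analyses can be reproduced along the following lines; cj_scores, maths_gpa and english_gpa are hypothetical names for the scores and GPAs described above. Note that the Steiger (1980) test for comparing the two dependent correlations is not provided by scipy or statsmodels and would need a separate implementation.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import spearmanr

# Hypothetical data frame: one row per student with a Year 11 GPA (n = 99).
df = pd.DataFrame({"cj_score": cj_scores,
                   "maths_gpa": maths_gpa,
                   "english_gpa": english_gpa})

# Regress mathematics and English GPAs onto the comparative judgement scores.
model = smf.ols("cj_score ~ maths_gpa + english_gpa", data=df).fit()
print(model.summary())  # R-squared, F statistic, coefficients and p-values

# Spearman correlations between comparative judgement scores and each GPA.
rho_maths, _ = spearmanr(df["cj_score"], df["maths_gpa"])
rho_english, _ = spearmanr(df["cj_score"], df["english_gpa"])
```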

Fig. 2 Scatter plots of the relationship between comparative judgement scores and GPAs

Third, to further explore criterion validity, teachers at the school (who were also part of the comparative judgement process) awarded grades (Not Achieved, Achieved, Merit, Excellence) to the scripts using the traditional marking and internal moderation process that was familiar to them. Following usual protocol, each teacher awarded a grade to the students in his/her class and the name of the student was visible to the teacher. A small sample of scripts from each class was check-marked by the Head of Department as part of internal moderation. The teachers’ grades were further moderated by a National Assessment Moderator employed by NZQA, who reviewed each script and either confirmed or overturned the teacher’s judgement. Where there was disagreement the moderator’s judgement took precedence. Teacher grades were available for 94 of the participants.

A valid comparative judgement assessment would be expected to produce increasing scores from scripts graded Not Achieved through to those graded Excellence. Indeed the correlation between comparative judgement scores and grades was moderate, ρ = 0.740, as shown in Fig. 3. A one-way ANOVA was conducted to compare the comparative judgement scores of scripts at each of the four grades. Scores differed significantly across grades, F(3, 96) = 44.81, p < 0.001. Using a Bonferroni-adjusted alpha level of 0.017, there was a significant difference between scripts graded by teachers as Not Achieved (M = −1.98, SD = 1.17) and Achieved (M = −0.14, SD = 0.86), t(32.49) = −5.69, p < 0.001. There was a marginal difference between scripts graded as Achieved and Merit (M = 0.64, SD = 1.12), t(47.57) = −2.44, p = 0.019. There was a significant difference between scripts graded as Merit and Excellence (M = 1.37, SD = 0.92), t(50.55) = −3.07, p = 0.003.
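
A sketch of this ANOVA and the pairwise comparisons is shown below. Here scores_by_grade is a hypothetical dictionary mapping each teacher grade to the comparative judgement scores of the scripts awarded that grade, and equal_var=False gives Welch’s t-test, consistent with the fractional degrees of freedom reported above.

```python
from scipy.stats import f_oneway, ttest_ind

grades = ["Not Achieved", "Achieved", "Merit", "Excellence"]
F, p = f_oneway(*[scores_by_grade[g] for g in grades])

alpha = 0.05 / 3  # Bonferroni adjustment for the three adjacent-grade comparisons
for lower, upper in zip(grades, grades[1:]):
    t, p_pair = ttest_ind(scores_by_grade[lower], scores_by_grade[upper], equal_var=False)
    verdict = "significant" if p_pair < alpha else "not significant"
    print(f"{lower} vs {upper}: t = {t:.2f}, p = {p_pair:.3f} ({verdict} at alpha = {alpha:.3f})")
```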

Fig. 3 Comparative judgement scores and teacher grades for the statistics scripts

Taken together these three analyses support the validity of using comparative judgement to assess the statistics task.

Study 2: English

Materials

The assessment task used was NCEA Level 1 AS90852 ‘Explain significant connection(s) across texts, using supporting evidence’, as shown in Appendix 2. The task required students to write a report in which they identify a connection across four literature texts. Three of the texts were teacher-selected and one was student-selected. Students identify examples of their chosen connection from each of the literature texts and then explain them. The assessment activity was open book and completed online within a specified timeline.

Participants

The participants were 253 students in Year 11 (ages 15 and 16) from a single school in New Zealand, a different school from that in Study 1. The school was decile 8 and had a roll of about 1300 students, with 55% of students of NZ European ethnicity, 27% of Maori ethnicity and 7% of Pasifika ethnicity. Recruitment and ethical practices were the same as for Study 1.

Method

The method was broadly the same as Study 1, although unlike Study 1 boundary scripts were not available and none were included in the judging pot. The 253 scripts were judged by 17 judges: ten were teachers from the same school as the participants and seven were NZQA employees who work as National Assessment Moderators. There were two initial sessions of professional development for the judges, led by the NZQA moderator. The first session supported assessors to become fully familiar with the assessment activity and the corresponding Achievement Standard, and the second session showed assessors how to comparatively judge scripts online. The judges were allocated 119 pairwise comparisons each; 15 judges completed all 119 comparisons and the other two completed 20 and 51 each. This resulted in 1856 judgements in total, with each script receiving between 13 and 19 judgements (mode = 15).

Analysis and Results

Comparative Judgement Scores

Fitting the decision data to the Bradley-Terry model produced comparative judgement scores with mean 0.0 and standard deviation 2.4; the distribution is shown in Fig. 4. The grade boundaries in Fig. 4 fall between two scripts, whereas in Fig. 1 the boundaries were marked by boundary scripts within the rank order. This difference across the two studies arose because boundary scripts were not available for inclusion in the judging pot for English, and consequently the grade boundaries were derived quite differently. Six scripts were selected using teacher knowledge of the students grounded in classroom formative assessment activities: two were expected to sit on the boundary of Not Achieved and Achieved, two on the boundary of Achieved and Merit, and two on the boundary of Merit and Excellence. In every case these expectations were confirmed, with one script falling above each boundary and one below. The six scripts were also graded by the Head of Department and the National Moderator, and their positions noted in the comparative judgement ranking. A further two scripts either side of each pair in the ranking were graded in the same way. Thus at each boundary six scripts were graded by the traditional method and the boundary within the comparative judgement ranking was established. For each boundary this proved to be a reliable process: the scripts either side of the initial two scripts were graded as weaker and stronger than the first two.

Fig. 4 Comparative judgement scores for English scripts

Reliability

The internal consistency of the comparative judgement scores was strong, SSR = 0.85, and the split-halves inter-rater reliability was moderate, r = 0.72. The reliability of comparative judgement scores is related to the number of judgements (Verhavert et al. 2019), and the lower values here compared to Study 1 (SSR = 0.90, r = 0.81) would be expected because there were fewer judgements per script. There were no misfitting judges for the English assessment. Taken together, these analyses provide some support for the reliability of the assessment outcomes.

Validity

Unlike Study 1, GPAs were not available for the participating students in Study 2. Instead, two experienced teachers from schools not involved in the comparative judgement study independently assigned grades (Not Achieved, Achieved, Merit, Excellence) to a random sample of 50 scripts using traditional marking processes. The agreement between the examiners’ grades was acceptable, κ = 0.76.
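
Inter-examiner agreement of this kind can be computed with Cohen’s kappa, for example using scikit-learn; a minimal sketch, assuming two hypothetical lists of grades for the same 50 scripts in the same order:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical grade lists for the same scripts, in the same order (50 entries each).
examiner_1 = ["Merit", "Achieved", "Excellence", "Achieved"]   # ...
examiner_2 = ["Merit", "Achieved", "Merit", "Achieved"]        # ...

kappa = cohen_kappa_score(examiner_1, examiner_2)
```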

To evaluate criterion validity we investigated whether the comparative judgement scores increased from scripts graded Not Achieved through to those graded Excellence. This was confirmed, as shown in Fig. 5. A one-way ANOVA was conducted to compare the comparative judgement scores of scripts at each of the four grades for the first examiner. Scores differed significantly across grades, F(3, 46) = 21.34, p < 0.001. Using a Bonferroni-adjusted alpha level of 0.017, there was a significant difference between scripts graded as Not Achieved (M = −2.14, SD = 1.20) and Achieved (M = 0.23, SD = 1.93), t(27.16) = −4.19, p < 0.001. There was no significant difference between scripts graded as Achieved and Merit (M = 0.60, SD = 1.59), t(15.94) = −0.62, p = 0.545. There was a significant difference between scripts graded as Merit and Excellence (M = 3.02, SD = 1.46), t(16.99) = −3.45, p = 0.003. These results were replicated for the grades of the second examiner, although the difference between scripts graded as Merit and Excellence was marginal at the Bonferroni-adjusted alpha level of 0.017, t(17.30) = −2.56, p = 0.020.

Fig. 5 Comparative judgement scores and two sets of independent examiner grades for the English scripts

Survey

A survey was conducted to further investigate the validity, from the perspective of the judges, of the assessment process across both studies. The survey was designed to collect quantitative data on judges’ demographics and experience of conducting comparative judgement, including how they made decisions. The survey also collected qualitative data through two open-text questions that asked for any other feedback about the experience of assessing using comparative judgement. All judges involved in the project were invited to complete the survey and 26 (68%; 14 for statistics, 12 for English) did so. The quantitative questions and responses are shown in Table 1, and the open-text questions and responses are shown in Appendix 3.

Table 1 Summary of judges’ responses to the demographic and quantitative survey questions

Table 1 shows that the demographic and quantitative questions received broadly similar patterns of responses from judges in both studies. Overall the respondents reported that the mechanics of the process were relatively easy (Q1 and Q2), and that making judgement decisions was moderately difficult (Q3 to Q5).

An interesting difference between the studies is that the statistics judges reported being less likely to use intuition (Q6) than the English judges, Mann–Whitney U = 137.0, p = 0.005. Conversely, the statistics judges reported being more likely to match responses to the standard (Q7) than the English judges, Mann–Whitney U = 30.0, p = 0.004. This difference might reflect our expectations for each subject; assessing English involves more subjective judgement whereas assessing statistics is a more objective activity of judging correctness. Indeed, evidence of mathematics teachers and examiners mentally ‘converting’ scripts to grades and then conducting a comparative judgement between those grades has been reported previously (Jones et al. 2014). Nevertheless, judges in both studies commonly converted scripts to grades in their heads and then compared these estimated grades, as reflected in Q7 as well as in eight of the open-text responses. For example, one English judge wrote “comparing two pieces of work that both were at, for example, Merit seemed pointless as we do not differentiate between high and low Merit”.

Converting to grades and then comparing those grades is not in the intended spirit of comparative judgement because mental grading involves comparing each script against an absolute standard rather than directly comparing one script against another. It is the authors’ experience that entering into the spirit of comparative judgement is a difficult transition for some teachers who are used to traditional marking methods. One statistics judge wrote “sometimes found it hard to break from tradition and I wanted to grade [the scripts]”. The mental application of grades to scripts prior to comparison left some judges of the view that comparative judgement is an inappropriate or ineffective method of assessment, a sentiment that was explicit or implicit in five of the open-text responses. This may have affected the reliability and validity of the outcomes.

A related concern, evident in four of the open-text responses, was that relative judgements cannot be used to grade scripts. As one English judge wrote, comparative judgement consists of “ranking texts that should be assessed against the standard; so ranking could be completely irrelevant (e.g. the highest may still not actually achieve)”. This reflects a common intuition that comparative judgement produces exclusively norm-referenced rather than criterion-referenced outcomes. However this is not the case: scores can be used to rank or to grade whether they result from comparative judgement or traditional marking, as we have demonstrated using different techniques in both studies reported here.

Eight of the open-text responses commented on the time required to do the judging. One respondent stated that it was a “time saver”, another that they were “fixated” on how long each judgement was taking, and the remaining six that it was time consuming, presumably in comparison to traditional marking. This is interesting and perhaps surprising in the light of claims that comparative judgement is more efficient in terms of assessor time (Steedle and Ferrara 2016). We can roughly estimate total judging time because the nomoremarking website records the time taken for each judgement. These time data are an overestimate because the timer continues when judges take a break, and we know from Q8 that the judges did not always log off when taking breaks. However, we can take the mean of the judges’ median judging times as an estimate, multiply this by 10, which is the typical number of judgements per script required for internally consistent outcomes (Jones and Alcock 2014; Jones and Inglis 2015), and then halve the result because every pairwise judgement involves two scripts. This rough calculation suggests that for both statistics and English each script takes about 7 min to assess on average. Anecdotally, statistics teachers inform us that this is approximately half the time taken to grade by a traditional marking method. For the case of English, the two examiners who marked a random sample of scripts each reported taking around 6 h to complete their 50 scripts, thereby matching the estimate of 7 min per script for comparative judgement. In addition, comparative judgement produces inherently moderated outcomes, because every script has been seen by several assessors, whereas the estimate for traditional marking does not include time for moderation.
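
The rough time calculation can be written out as follows. Here median_judgement_times is a hypothetical list holding each judge’s median judging time in seconds, and the 84-second figure in the final comment is purely illustrative of a value that would yield the 7-minute estimate.

```python
import numpy as np

judgements_per_script = 10   # typical requirement for internally consistent outcomes
scripts_per_judgement = 2    # every pairwise judgement involves two scripts

# median_judgement_times: one median judging time (seconds) per judge,
# taken from the timing data recorded by the judging website.
mean_of_medians = np.mean(median_judgement_times)
seconds_per_script = mean_of_medians * judgements_per_script / scripts_per_judgement
print(f"approximately {seconds_per_script / 60:.1f} minutes of judging per script")

# e.g. a mean median judging time of 84 seconds gives 84 * 10 / 2 = 420 seconds,
# i.e. roughly the 7 minutes per script estimated in the text.
```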

Discussion

In the studies reported here we explored the application of comparative judgement to the assessment of statistics and English in New Zealand. Across both applications we reported that the comparative judgement procedure used here produced reliable outcomes; i.e., outcomes were largely independent of the particular individuals who conducted the comparative judging online. We also found evidence supporting the validity of the approach in each discipline, by comparing comparative judgement outcomes with achievement data and independently marked scripts. Further evidence for validity was provided by responses to a survey from 68% of the judges across the two studies. This evidence also revealed differences between the two subjects, with statistics judges reporting greater matching of scripts to Achievement Standards, and English judges reporting greater use of intuition. Finally, we considered the efficiency of the comparative judgement process relative to traditional marking methods, and concluded that comparative judgement produces moderated outcomes in the same or less time than marking produces unmoderated outcomes.

Therefore we conclude that the comparative judgement technique as applied here efficiently produced moderated assessment outcomes that were reliable and valid. This conclusion applies to the context of particular secondary students’ responses to standard statistics and English assessment tasks, and contributes this context to the growing literature evaluating the application of comparative judgement to a varied range of educational assessment activities. In the remainder of the discussion we consider limitations of the study and issues that might arise were a comparative judgement-based approach to assessment to be applied more generally.

The key limitation of the study design is common to assessment research: it is relatively straightforward to evaluate the reliability of an assessment, at least in the sense of the extent to which the final scores were independent of the particular assessors who produced them, as was the focus here. However, evaluating validity is less straightforward and requires the triangulation of quantitative and qualitative data to construct a validation argument that accounts for the intended purposes of the scores (Kane 2013; Newton and Shaw 2014). We attempted this here with the data available to us, including prior achievement data, consideration of grade boundaries, independent assessment of responses and a survey of assessors. However, validation can never be fully demonstrated or completed. Future research might harness further validation methods including the use of independent measures of the assessed construct (e.g. Bisson et al. 2016), standardised measures of potential covariates (e.g. Jones et al. 2019), content analysis (Hunter and Jones 2018), comparison of expert and non-expert judge outcomes (Jones and Alcock 2014), and other methods.

There are also limitations to the study design and potential applications of comparative judgement that are particular to the standard assessments considered in this paper. For example, each student response was assigned one of four grades, whereas comparative judgement results in a unique score for each script. As such, comparative judgement scores then need to be translated into grades, as was done here through the use of boundary scripts (statistics) or boundary setting (English). A consequence of translating scores into grades is that some pairwise judgements contribute little to the final outcome; as reported above, one judge commented that there is little point in choosing one Merit script over another Merit script. Comparative judgement might therefore be too precise a measure of achievement for certain NCEA Level 1 and 2 tasks; or, to put it another way, comparative judgement might be too reliable for contexts where grades are the only required outcome. However, reducing the reliability through, say, using fewer pairwise judgements to construct scores might not be viable because it would risk blurring the boundaries and thereby increase the number of students who receive the ‘wrong’ grade. To investigate this we conducted a post-hoc analysis on the statistics data using only one third of the available judgements, resulting in a set of comparative judgement scores with lowered reliability (SSR = 0.74 rather than the SSR = 0.90 reported in Study 1). We then assigned one of four grades (Not Achieved, Achieved, Merit, Excellence) to each script with reference to the boundary scripts for both the original and the lowered-reliability sets of scores. We found that 44 of the students (39%) would be assigned a different grade if the lowered-reliability scores were used in place of the original Study 1 scores. Therefore, while the precision of comparative judgement might be wasted on discriminating between scripts that are well within grade boundaries, it is required to ensure adequate discrimination near grade boundaries. Although comparative judgement was as efficient as, or more efficient than, traditional marking in both studies, there nevertheless seems to be an inherent inefficiency when the method is used in contexts where scores are translated into grades.
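
This post-hoc analysis can be sketched as follows. The sketch assumes the fit_bradley_terry function and judgements list from the earlier sketches, a full_scores array from the complete fit, and a hypothetical boundary_pairs list giving, for each grade boundary, the indices of the two boundary scripts that straddle it; taking the midpoint of each pair as the cut-off is our own simplification of how boundary scripts define cut-offs.

```python
import numpy as np

def cutoffs_from_boundaries(scores, boundary_pairs):
    """Derive ascending grade cut-offs as midpoints of each boundary-script pair
    (e.g. High Not Achieved / Low Achieved); an assumed simplification."""
    return sorted((scores[a] + scores[b]) / 2 for a, b in boundary_pairs)

def assign_grades(scores, cutoffs):
    """Map comparative judgement scores to the four NCEA grades."""
    labels = ["Not Achieved", "Achieved", "Merit", "Excellence"]
    return [labels[int(np.searchsorted(cutoffs, s, side="right"))] for s in scores]

# Refit on a random third of the judgements, re-derive cut-offs from the
# boundary scripts' new scores, regrade, and count how many grades change.
rng = np.random.default_rng(0)
subset = rng.choice(len(judgements), size=len(judgements) // 3, replace=False)
reduced_scores = fit_bradley_terry([judgements[i] for i in subset], n_scripts=119)

full_grades = assign_grades(full_scores, cutoffs_from_boundaries(full_scores, boundary_pairs))
reduced_grades = assign_grades(reduced_scores, cutoffs_from_boundaries(reduced_scores, boundary_pairs))
n_changed = sum(a != b for a, b in zip(full_grades, reduced_grades))
```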

A second possible weakness of the results reported here is the use of a small number of boundary scripts to set grade cut-offs. Boundary scripts have been used in other comparative judgement-based studies in mathematics (e.g. Jones and Sirl 2017) and in English (e.g. Heldsinger and Humphry 2010), but here the grade boundaries reflect quite specific and detailed NCEA standards, and we have assumed that judges made their comparisons on a dimension appropriate to the standard. Our survey data provide some support that judges were mindful of the standards when judging, but more systematic approaches to boundary setting could be employed. For example, Heldsinger and Humphry (2010) reported a systematic comparative judgement-based method for generating boundary scripts for standard setting in primary English in Australia. Such a method might be adapted to current NCEA practice by using a standard-setting procedure in which a sample of scripts is read and holistically compared with the standard for a grade.

Alternatively, comparative judgement methods might be better suited to contexts in which the task and assessment criterion are designed with pairwise judgement by subject experts in mind from the start (Jones and Inglis 2015). That is, educational assessment tasks should be designed with the method of converting student responses into scores, whether by multiple-choice, rubrics and marking, or indeed comparative judgement, considered from the outset. For the case of comparative judgement this might lead to more open and less predictable assessment tasks than those currently offered. Given the consensus that a rich diet of assessment is necessary to maximise validity and equity in educational outcomes (e.g. Wiliam 2010), comparative judgement might have an important part to play in the national education systems of the future.