Centre assessment grades in 2020: a natural experiment for investigating bias in teacher judgements

Magowan, Louis

doi:10.1007/s42001-023-00206-x


Research Article
Open access
Published: 15 May 2023

Volume 6, pages 609–653, (2023)

Journal of Computational Social Science


Louis Magowan ORCID: orcid.org/0000-0001-7105-1057¹


Abstract

The COVID-19 pandemic meant that, in 2020, students in England were unable to sit their examinations and instead received predicted grades, or “centre assessment grades” (CAGs), from their teachers to allow them to progress. Using the Grading and Admissions Data for England (GRADE) dataset for students from 2018 to 2020, this study treats the use of CAGs as a natural experiment for causally understanding how teacher judgements of academic ability may be biased according to the demographic and socio-economic characteristics of their students. A variety of machine learning models were trained on the 2018–19 data and then used to generate predictions for what the 2020 students were likely to have received had their examinations taken place as usual. The differences between these predictions and the CAGs that students received were calculated and then averaged across students’ different characteristics, revealing what the treatment effects of the use of CAGs were likely to have been for different types of students. No evidence of absolute negative bias against students of any demographic or socio-economic characteristic was found, with all groups of students having received higher CAGs than the grades they were likely to have received had they sat their examinations. Some evidence for relative bias was found, with consistent, but insubstantial differences being observed in the treatment effects of certain groups. However, when higher-order interactions of student characteristics were considered, these differences became more substantial. Intersectional perspectives which emphasise interactions and sub-group differences should be used more widely within quantitative educational equalities research.


Introduction

This study will look at education in England in the 2020 academic year, and how it was disrupted by the COVID-19 pandemic, as a lens through which to examine how educational inequalities may result from the use of teacher judgements in the assessment of academic ability. For context, on the 20th of March 2020, the Secretary of State for Education decided to close all schools and colleges in England to try to slow the spread of COVID-19 [11]. Furthermore, it was announced that summer examinations for that year would be cancelled and that General Certificate of Secondary Education (GCSE), Advanced Subsidiary Level and Advanced Level (AS and A-Level) students (these are all UK secondary school-leaver examinations) would instead receive calculated grades to allow them to progress into the labour market and higher education [21]. Following this decision, teachers were instructed to produce centre assessment grades (CAGs) for their students to represent what they thought the students would have achieved had schools remained open and exams gone ahead [35].

It is important that this process was as fair as possible, as substantial educational inequalities already exist in the UK. In terms of free school meals (FSM), which are a proxy measure for socio-economic status (SES), the results for the 2019 GCSEs showed that only 22.5% of students who were eligible for FSM received grade 5 or above in English and Maths, compared with 46.6% of students who were not eligible [10]. In other words, lower SES students tend to perform worse academically. Similarly, clear ethnic divisions can be seen in the results with 37.8% of Black, 42.4% of White and 76.3% of Chinese students achieving those grades. Educational inequalities can also be found in terms of gender, whether English is an additional language (EAL) for a student, and whether the student has special educational needs (SEN) [10, 12, 24]. In the interests of brevity, these characteristics (SES, ethnicity, gender, EAL and SEN) will be referred to as “protected characteristics”^{Footnote 1}. This study will focus primarily on SES and ethnicity, as there is strong evidence that they are some of the most important contributing factors to educational inequality. For example, Strand [44] finds the impact of ethnicity and SES to be three and nine times larger, respectively, than the impact of gender on mean attainment of 14-year-olds in the UK.

There are both intrinsic and extrinsic reasons why such inequalities should be reduced [3]. Intrinsically, one might deem an extremely large gap between those with the highest educational attainment and those with the lowest to be undesirable – particularly if that gap is delineated along the lines of a characteristic such as ethnicity or SES. Extrinsically, there are many consequences of educational inequality that can make its reduction worthwhile. Educational inequalities in younger years can propagate with age and certain poor-performing students may not have access to the same range of subjects (e.g., Higher tier GCSEs in the UK) as their better-performing counterparts. Poor-performing students may also find they are unable to progress as far as they would like with their education, such as into university/higher education, or, in the UK, to their A-Levels. This can have material effects on their social mobility, labour market participation and even lifetime earnings [50]. Indeed, it has been shown that, across a range of countries, making education distributions more equal plays a significant role in making income distributions more equal [19]. Educational equality begets income equality and other external benefits.

Given the substantial inequalities already alluded to, the study of the CAG process could be regarded as worthwhile in its own right – as it is important the process was as equitable as possible. However, drawing on a case study typology [47], the CAG process can be thought of as a subject that helps to explicate the object of the use of teacher judgements in the assessment of academic ability. Appreciating the CAG process as a case in this way gives the study relevance beyond the summer 2020 exams that were cancelled in England. Furthermore, given how frequently teacher judgements are used to assess academic ability, it is important to understand how they may or may not be biased according to students’ protected characteristics. For example, in the UK teacher judgements are used as the basis of the predicted grades that A-Level students rely on in their applications to universities [49]. They also inform various Key Stage assessments, including being a component in the Key Stage 2 assessments that determine a pupil’s transition from primary to secondary school [43]. Teacher judgements also play a role in determining academic progression in many educational settings outside the UK [31, 48].

The CAG process has created a unique opportunity for investigating how teacher judgements may be biased. Indeed, the fact that CAGs were awarded to all English students in 2020 has resulted in the largest dataset on teacher grading judgements that is available in the UK [45]. Moreover, it has created a natural experiment. Natural experiments, to use causal inference parlance, are observational studies in which some naturally occurring phenomenon allows us to regard the assignment mechanism of some treatment to units as “as if” or virtually random [14]. In this instance, 2018, 2019 and 2020 English GCSE students can be regarded as essentially homogeneous (see descriptive statistics section), except that the 2020 students received an exogenous treatment – examinations being cancelled and replaced with CAGs.

This study aims to exploit this natural experiment to assess causally how the use of teacher judgements in CAGs impacted students of various protected characteristics. A range of models will be trained and evaluated on 2018–19 data (and will be discussed in greater detail in the methodological section). The model with the highest predictive accuracy will then be used to generate predictions of GCSE examination grades for students in 2020. These predictions will then be compared with the CAGs that students of different protected characteristics received (e.g., a Chinese student; a low SES student; a low SES, Chinese student etc.), thereby throwing any causal impacts of the use of CAGs/teacher judgements into relief.
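The comparison described above can be illustrated with a minimal Python sketch. It is not the study's implementation: the records are invented toy data (the GRADE data is only accessible through the SRS), the group labels are hypothetical, and a simple mean exam grade per prior-attainment band stands in for the machine learning models that the study trains. The sketch only shows the logic of fitting on 2018–19 exam grades, predicting for 2020 students, and averaging the gap between CAG and prediction within each group.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (year, prior_attainment_band, group, grade).
# For 2018-19 rows the grade is an exam grade; for 2020 rows it is the CAG.
records = [
    (2018, "low", "A", 3), (2018, "low", "B", 5), (2019, "low", "A", 4),
    (2018, "high", "A", 7), (2019, "high", "B", 7), (2019, "high", "A", 7),
    (2020, "low", "A", 5), (2020, "high", "A", 8),
    (2020, "low", "B", 4), (2020, "high", "B", 8),
]


def fit_band_means(control):
    """Stand-in 'model' (an assumption): mean exam grade per attainment band."""
    by_band = defaultdict(list)
    for _, band, _, grade in control:
        by_band[band].append(grade)
    return {band: mean(grades) for band, grades in by_band.items()}


def mean_difference_by_group(treated, model):
    """Average (CAG - predicted exam grade) within each student group."""
    diffs = defaultdict(list)
    for _, band, group, cag in treated:
        diffs[group].append(cag - model[band])
    return {group: mean(values) for group, values in diffs.items()}


control = [r for r in records if r[0] in (2018, 2019)]
treated = [r for r in records if r[0] == 2020]

model = fit_band_means(control)
effects = mean_difference_by_group(treated, model)
print(effects)
```

A positive value for a group means its CAGs were, on average, higher than the grades the stand-in model predicts; differences between groups correspond to what the study calls relative bias.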

Literature review

Psychology of bias: stereotyping

Before considering the potential evidence for bias in teacher judgements, it is helpful to give a theoretical justification for it. In general terms, social bias can be classified into one of three forms [13]:

1. Prejudice: Individual-level attitudes which create or maintain hierarchical status relations between groups (can be subjectively positive or negative).
2. Discrimination: Behaviour that creates or reinforces an advantage for a group/group-member over another group/group-member.
3. Stereotyping: Beliefs about the characteristics and attributes of a group and its members that shape how people think about and respond to the group.

It is hoped, at least in a UK context where educational equality commissions and standards are well-established, that any bias that may arise in teacher judgements is primarily due to implicit, unconscious stereotyping rather than explicit discrimination or prejudice. That stereotyping is the main component of bias in teacher judgements would be hard to verify; however, Campbell [5] does find evidence that stereotypes according to income-level, gender, SEN, and ethnicity all play a part in forming biases in teacher judgements. Using data from the Millennium Cohort Study, Campbell demonstrates that certain categories of student were less likely to be judged “above average” by their teachers in terms of reading and maths ability when compared to students of other categories, despite having scored similarly in reading and maths tests. Even if stereotyping is not the main component of bias, it clearly plays a significant role.

There are several schools of thought on the psychological processes behind how stereotypes are formed and maintained. Some stereotypes stem from accurate, real group differences—accurate, at least, in terms of the local reality of the person who perceives them [23]. However, much psychological literature emphasises pathways in which stereotypes can be formed independently of any real, group differences. A widely cited example of such a pathway is that of the “self-fulfilling prophecy” [39]. This is the idea that the expectations teachers hold for their students can cause the students to alter their behaviour such that they end up aligning with their teachers’ expectations. Initially, there may have been no real, group differences in the academic performances of the students – but the teachers’ expectations manifest one, thereby maintaining the stereotype. Other, more recent literature on stereotypes highlights their interactive nature [26]. Stereotypes and other individuating factors (such as behaviour or personality) are not processed serially. Instead, each piece of information is combined by the mind in a simultaneous, rather than additive, fashion. In this way, stereotypes can jointly influence each other, interacting to produce a distinct impression about someone. Given that stereotypes are likely an important contributor to teacher bias and have themselves been shown to be influenced by protected characteristics, any study of teacher bias should therefore pay attention to interactions between protected characteristics.

Examples of bias in teacher judgments

A considerable amount of research on bias in teacher judgements has been conducted both in the UK and internationally. In a sample of 53 Flemish primary schools, Boone and Van Houtte [4] found that, regardless of prior achievement, pupils of lower socio-economic backgrounds were less likely to be advised by their teachers to enroll in academically oriented school tracks than their counterparts from higher socio-economic backgrounds. Similar results have been found within the Dutch context. In a study of 500 classes [48] it was found that teachers held higher academic expectations for students from more affluent families, even after controlling for the students’ performance. Higher expectations were also observed for girls in this study. Some evidence of SES impacts on teacher judgements has also been found in the UK. Murphy and Wyness’ [33] study of A-Level predicted grades found small but significant differences in the predicted grades received by high-achieving students, depending on their school type and SES. Among high-achieving students, state school students received 0.16 fewer predicted grade points than their privately educated counterparts and low SES students got 0.059 fewer predicted grade points than their higher SES counterparts. Based on these studies, gender, school type, and particularly SES would seem to have an impact on teacher judgements—although the SES effect may be working interactively with prior attainment.

SES and gender impacts on teacher judgements are not found in all literature on the topic, however. Jussim and Eccles’ [25] study of 100 teachers in the US found no evidence of teachers being biased against students from lower social class backgrounds, or against either gender. They also found no evidence of bias against African American students. However, other US-based investigations would seem to contradict this last result. Zucker and Prieto [52] asked 280 special education teachers to indicate whether placement into special education classes would be appropriate for a given set of children. They found evidence of a significant main effect for ethnicity, with special class placement being deemed more appropriate for Mexican American children than for white children. Shiner and Modood’s [42] investigation of UK A-Level predicted grades contradicts both of these studies – instead of finding a negative ethnic bias or none at all, they found evidence of a positive one. They found that while teachers’ A-Level predictions generally tended towards optimism when wrong, this was particularly the case for ethnic minorities. On average, predicted scores were 2 grade points higher than the final, achieved scores for White candidates, compared with 5, 4 and 3 points higher for Black Caribbeans/Black Africans, Indians/Pakistanis/Bangladeshis, and Chinese candidates, respectively, in their sample. Indeed, Murphy and Wyness’ [33] study, dealing with a similar sample of UK students’ A-Level predictions, reveals a similar pattern – with Asian and Black students being more likely than other ethnicities to be severely^{Footnote 2} overpredicted.

Overall, the role that ethnicity plays in affecting teacher judgements seems to be unclear, though it may be a contributor to relative, positive bias for certain students in a UK context. It should also be noted that much of the existing UK research focusses solely on AS/A-Level students, as this was where teacher prediction data was most readily available previously. However, AS/A-Levels are not compulsory for all students like GCSEs are and so are not as representative of the UK population. For example, there are SES differences between GCSE and AS/A-Level cohorts, with low SES students being significantly less likely to progress to AS/A-Level [41]. Studying GCSE teacher prediction data rather than AS/A-Level data could help ensure results are more generalisable to the UK population. Furthermore, given that educational inequalities can be seen even in early childhood and propagate with age [6], it could be worthwhile to consider students of a younger age range than AS/A-Level students – as GCSE students are.

Meta-analyses of bias in teacher judgements: contradictory findings

Given the large amount of research on the topic of bias in teacher judgements and the contradictory findings reported across it, it can be helpful to instead consider meta-analyses of the topic. Dusek and Joseph’s [15] meta-analysis of 77 studies found that both social class and race were significant bases in how teachers formed expectancies about their students’ academic ability and that gender was not. Middle SES students were expected to perform better academically than low SES students and White students were expected to perform better than Black or Mexican students. Tenenbaum and Ruck’s [46] review of 32 US studies also found differences in terms of race for the expectations that teachers held for their students. They found small, but statistically significant effects that suggested teachers held lower expectations for African American and Latino/a children than for European American children.

While Dusek and Joseph’s results on the importance of ethnicity in teacher judgements would seem to be corroborated by this second meta-analysis, their results around the impact of gender are contradicted by a third. A review of 30 studies, mainly from the US and the UK [20], found that there was strong evidence of bias in teacher judgements in terms of both gender and SEN. Indeed, across the studies included in the three previous meta-analyses (and in the works reviewed earlier in this study), many of the magnitudes and even the signs of the coefficients relating protected characteristics to teacher judgements disagree. Even the conclusions between the reviews/meta-analyses themselves are not consistent, as was noted in Ofqual’s [28] recent literature review on the topic. Something that is consistent between these literature reviews and the studies they discuss, however, is that few, if any, of them have had access to a dataset of teacher judgements that is as large or as representative as that provided by the CAG process. The analysis of such a dataset and the natural experiment context it is set in could help bring greater clarity to an area of research that is full of contradictions. Furthermore, much of the existing literature only considers a small number of protected characteristics at a time. However, the dataset that is available around the CAG process is extremely rich and includes many protected characteristics among its features. This means that potential inequalities in teacher assessments can be explored across a larger number of features at the same time.

Prior research on centre assessment grades: an intersectional perspective

Some research into the use of CAGs has already been conducted. In an analysis by He and Black [22], exam results for the 2020 year were compared with those of the preceding year. GCSEs were on average three-fifths of a grade higher in 2020.^{Footnote 3} This suggests that the CAG predictions were optimistic overall. This is to be expected, as previous research by Wyness [51] on the UK university application system (which relies heavily on teacher-predicted grades) has shown that as many as 75% of applicants in 2013–15 received lower grades than they were predicted. An interesting difference between these studies, however, is that while He and Black found the correlations between prior attainment and grades to be the same in 2019 as in 2020, Wyness’ findings somewhat contradict this. Her study showed that high-achieving disadvantaged students were more likely to be under-predicted than their more advantaged counterparts. Additionally, low-achieving students (who were disproportionately low SES) were far more over-predicted. This could imply that there is an interaction between SES and prior attainment that is not being considered in He and Black’s analysis.

Interactions such as this are why an “intersectional” perspective on the CAG process could be helpful. Intersectionality is a concept derived from feminist theory that views categories of ethnicity, class, gender, etc. as interrelated and mutually shaping one another [8]. Though the concept has not been frequently applied within quantitative educational research, it can be highly appropriate if the underlying data is rich and granular [7], as the CAG data is. An intersectional approach emphasises how different types of (dis)advantage are not the same for everyone who experiences them and stresses the importance of interactions and sub-group differences, rather than just the main effects of e.g., protected characteristics. Bias in teacher judgements may operate in complex ways, which may not be noticed if viewed in purely additive terms.

It cannot be assumed, however, that the teacher judgements used for CAGs were in fact biased. Indeed, two key investigations of the topic by Ofqual (the UK examinations watchdog) concluded that systematic bias was unlikely. The first investigation, a student-level equalities analysis, did not find evidence of bias against students in terms of their protected characteristics [27]. The second study looked more directly at the use of teacher judgements, trying to determine if the factors related to grades in 2020 were different from those related to grades in previous years in any consistent way [45]. Overall, grading patterns between 2020 and previous years were found to be similar – with only minor differences in the relationships between student and centre-level features with grades. While Stratton, Zanini and Noden [45] do consider some interactions^{Footnote 4} in their analysis, they are at most two-way interactions – and many possible two-way interactions of protected characteristics are left unexplored. Higher-order interactions are not considered in Lee, Stringer and Nadir’s [27] work either. By drawing on an intersectional perspective and considering more (and higher order) interactions, biases could potentially be revealed that are nuanced, complex and would otherwise be hidden. Furthermore, to the best of the author’s knowledge, no non-Ofqual studies on the topic have been conducted. Ultimately, given the size and richness of the dataset, and the significance of the subject matter, it is important that the CAG process be investigated with a variety of perspectives and methodological tools.

Research questions

This study seeks to answer two research questions. Following the case study typology [47], it uses a specific question about the subject, or case, of the 2020 CAGs to address a more general question about the object: the use of teacher judgements in the assessment of academic ability. Importantly, these research questions do not assume anything about the presence or direction of bias in teacher judgements according to protected characteristics during 2020, leaving space for a finding of no bias.

Object: Which protected characteristics of students, if any, affect teachers’ judgements of their academic ability, and how?

Subject: What were the total grade point differences for English students of different protected characteristics between the CAGs they received and the grades they were likely to have received in 2020 had COVID interruptions not occurred?

Methodology

Data collection

This study uses secondary data from the Grading and Admissions Data for England (GRADE) data-sharing project that is available through the Office for National Statistics’ (ONS) Secure Research Service (SRS). This is joined, student-level data taken from Ofqual and the Department for Education’s (DfE) National Pupil Database (NPD) which contains anonymised examination results, demographic information, and prior attainment indicators for English students. The full GCSE datasets for the 2018–2020 years will be considered. The analysis could also have been extended to cover summer 2021, as teacher assessments were used to replace exams then too [37]. However, pupils received in-person teaching for even less of that year than in 2020. Analysis of those assessments would likely be impacted by the differences in home learning environments [18] of pupils (access to private tuition, computers, internet), on which data is not readily available. Similarly, the analysis does not extend back further than 2018. There have been major GCSE reforms since then [34] which have meant changing curricula and marking schemes. In restricting the analysis to 2018–2020, the data should be reasonably comparable across years.

The highly sensitive nature of the GRADE data was a key constraint for this study. To ensure non-disclosure, results could only be shared in aggregated form, and only if they belonged to a sub-group of at least 100 students.
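The disclosure rule above can be sketched as a small aggregation step. This is an illustrative assumption about how such a rule might be applied, not the SRS's actual output-checking procedure; the function name, group labels and values are hypothetical. Only the threshold of 100 students comes from the text.

```python
from statistics import mean

MIN_GROUP_SIZE = 100  # threshold stated in the text


def aggregate_with_suppression(rows, min_size=MIN_GROUP_SIZE):
    """Aggregate (subgroup, value) pairs and drop sub-groups that are too small.

    Returns {subgroup: (n_students, mean_value)} for sub-groups with at
    least `min_size` members; smaller sub-groups are suppressed entirely.
    """
    groups = {}
    for label, value in rows:
        groups.setdefault(label, []).append(value)
    return {
        label: (len(values), mean(values))
        for label, values in groups.items()
        if len(values) >= min_size
    }


# Hypothetical data: one sub-group of 120 students and one of 40.
rows = [("group_x", 0.5)] * 120 + [("group_y", 1.0)] * 40
released = aggregate_with_suppression(rows)
print(released)
```

One consequence of this kind of rule is that the higher the order of an interaction, the more sub-groups fall below the threshold and cannot be reported.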

Data pre-processing

Bearing in mind the limited time and computational resources for this study (Appendix D), and the need for interpretability, several variables needed to be filtered out or collapsed into fewer categories. Many of these steps are outlined in Table 2. There were also, however, some pre-processing steps involving variables that were not used for analysis/prediction that are outlined in Appendix F.

Only GCSEs that had been reformed since 2018 were considered, though this still covers many of the most popular subject choices [34]. Additionally, only results for students who took at least 8 GCSEs including English and Mathematics^{Footnote 5} were included. The data was then split into a control group of 2018–2019 data and a treatment group of 2020 data.
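The eligibility filter and the control/treatment split can be sketched as follows. The row layout, the subject labels and the helper names are hypothetical, and the sketch treats "English" and "Mathematics" as single subject labels, which may be simpler than the definition the study uses (see Footnote 5).

```python
from collections import defaultdict


def eligible_students(results, min_subjects=8, required=("English", "Mathematics")):
    """Return IDs of students with enough GCSEs, including the required subjects."""
    subjects = defaultdict(set)
    for student_id, _, subject, _ in results:
        subjects[student_id].add(subject)
    return {
        student_id
        for student_id, taken in subjects.items()
        if len(taken) >= min_subjects and all(s in taken for s in required)
    }


def split_control_treatment(results):
    """Control = 2018-19 results, treatment = 2020 results, eligible students only."""
    keep = eligible_students(results)
    control = [r for r in results if r[0] in keep and r[1] in (2018, 2019)]
    treatment = [r for r in results if r[0] in keep and r[1] == 2020]
    return control, treatment


def make(student_id, year, subjects):
    """Build hypothetical rows (student_id, year, subject, grade) for one student."""
    return [(student_id, year, subject, 5) for subject in subjects]


others = ["Biology", "Chemistry", "Physics", "History", "Geography", "French", "Art"]
results = (
    make("s1", 2018, ["English", "Mathematics"] + others[:6])    # 8 subjects, kept
    + make("s2", 2020, ["English", "Mathematics"] + others[:6])  # 8 subjects, kept
    + make("s3", 2020, ["English"] + others)                     # no Mathematics
    + make("s4", 2019, ["English", "Mathematics"] + others[:5])  # only 7 subjects
)
control, treatment = split_control_treatment(results)
print(len(control), len(treatment))
```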

Splitting the data in this way acts as the “as if” random assignment mechanism that forms the basis of natural experiments [14]. COVID happened in 2020, and GCSE students in that year received the treatment of being given CAGs rather than sitting their exams, but COVID could just as easily have happened in 2018 or 2019. This pseudo-randomisation balances, at least in expectation, all observed and unobserved pre-treatment covariates between treatment and control groups. This is where the internal validity of the study lies, as, provided the groups are homogeneous, it creates a reasonably strong basis for inference about the effects of the treatment on the students within the dataset [38].

As Table 1 shows, even after these pre-processing steps, the amount of data was considerable – with over 1 million results from over 100,000 students each year. Yet despite the size of the remaining sample, it had some important limitations, such as the systematic missingness outlined in Appendix F. The pre-processing steps taken also limit the representativeness of the sample, with each filtering out of certain categories of students reducing the external validity of results to students of other years, other nations,^{Footnote 6} or indeed students from the same years that were dropped from the sample. However, it is hoped that this sacrifice in external validity is compensated by having a more manageable sample and more interpretable results.

Table 1 Sample sizes by year
