Background

Cluster randomised trials (CRTs) are studies in which groups, or clusters, of individuals are allocated to trial arms rather than the individuals themselves [1]. The clusters may be geographic areas, health organisations or social units. CRTs are used when the intervention is delivered to the entire cluster or there is a chance of contamination between trial arms if individuals are randomised [2].

CRTs can be more complex to design and analyse than individually randomised controlled trials. The most widely documented methodological consideration for CRTs is that observations on participants from the same cluster tend to be more similar to each other than observations on participants from different clusters [2]. This similarity is quantified by the intra-cluster correlation coefficient (ICC), defined as the proportion of the total variability in the trial outcome that is between clusters as opposed to between individuals within clusters [3]. The statistical dependence between observations within clusters must be taken into account when calculating the sample size and analysing data in CRTs [1]. Using standard methods that ignore clustering may result in a sample size that is too small to detect the intervention effect, and in analysis results that exaggerate the evidence for an intervention effect. Estimates of the ICC, or of the between-cluster coefficient of variation, for the outcome from previous studies are required to calculate the design effect: the factor by which the number of individuals required in an individually randomised trial must be inflated to account for within-cluster correlation in the sample size calculation. In addition, when calculating the sample size in CRTs, a degrees of freedom correction should be incorporated to take account of the uncertainty with which variability in the outcome across clusters is estimated in the analysis [4], and a further inflation of the sample size should be considered to allow for the loss of efficiency that results from recruiting unequal numbers of participants from the clusters [5].
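The sample size adjustments described above can be illustrated with a minimal Python sketch for a continuous outcome and two arms of equal size. It assumes the usual design effect, 1 + (m - 1) × ICC for clusters of size m, and a commonly used extension for unequal cluster sizes, 1 + ((CV² + 1) × m̄ - 1) × ICC, where m̄ is the mean cluster size and CV is the coefficient of variation of cluster size. As a crude stand-in for a formal degrees of freedom correction, it simply adds one cluster per arm. This heuristic, the function names and the example values are assumptions for illustration, not methods used by the reviewed studies.

```python
import math
from statistics import NormalDist


def design_effect(mean_cluster_size, icc, cv_cluster_size=0.0):
    """Design effect for a parallel-group CRT.

    With cv_cluster_size = 0 this reduces to 1 + (m - 1) * ICC; a positive
    coefficient of variation of cluster size inflates it further.
    """
    m = mean_cluster_size
    return 1 + ((cv_cluster_size ** 2 + 1) * m - 1) * icc


def clusters_per_arm(delta, sd, mean_cluster_size, icc, cv_cluster_size=0.0,
                     alpha=0.05, power=0.8, extra_cluster=True):
    """Clusters per arm to detect a difference in means `delta` (two-sided test)."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Pupils per arm for an individually randomised trial.
    n_individual = 2 * z_sum ** 2 * sd ** 2 / delta ** 2
    # Inflate by the design effect, then convert pupils into whole clusters.
    n_cluster_trial = n_individual * design_effect(mean_cluster_size, icc, cv_cluster_size)
    k = math.ceil(n_cluster_trial / mean_cluster_size)
    # Crude small-sample allowance (hypothetical heuristic assumed here,
    # not a formal degrees of freedom correction): one extra cluster per arm.
    return k + 1 if extra_cluster else k


# Hypothetical example: effect size 0.25 SD, 30 pupils per cluster, ICC 0.05.
print(design_effect(30, 0.05))                 # about 2.45
print(clusters_per_arm(0.25, 1.0, 30, 0.05))   # 22
```

Setting cv_cluster_size above zero shows how quickly unequal recruitment across schools increases the number of clusters required for the same power.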

When estimating the intervention effect from the resulting trial data, the main analytical approaches are either to apply standard statistical methods to summary statistics that represent the response in each cluster (cluster-level analyses) or to use individual-level methods that account for within-cluster correlation, either in the model or by weighting the analysis. Another important methodological consideration in CRTs is the potential for recruitment bias in studies where the participating individuals are recruited after the clusters are randomised. Finally, when findings from studies that use the CRT design are pooled in a meta-analysis, reviewers need to consider how best to incorporate effect estimates from studies that did not allow for clustering in the analysis, and the extent to which differences in the types of clusters randomised are a source of heterogeneity. These considerations are detailed in several textbooks [1, 2, 6,7,8].
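A cluster-level analysis can be sketched in Python as follows: each cluster is reduced to its mean outcome, and the arms are compared with a pooled-variance two-sample t statistic on those cluster means, with degrees of freedom equal to the number of clusters minus two. This is one simple, unweighted variant chosen for illustration; the function name and data are hypothetical, and the p-value would be obtained from the t distribution with the returned degrees of freedom.

```python
from statistics import mean, variance


def cluster_level_t_test(intervention, control):
    """Pooled two-sample t statistic computed on cluster means.

    `intervention` and `control` are lists of clusters; each cluster is a
    list of individual outcome values. Returns (difference, t, df).
    """
    a = [mean(cluster) for cluster in intervention]  # one summary per cluster
    b = [mean(cluster) for cluster in control]
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2  # degrees of freedom depend on clusters, not pupils
    # Pooled variance of the cluster means across the two arms.
    pooled = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    diff = mean(a) - mean(b)
    se = (pooled * (1 / n_a + 1 / n_b)) ** 0.5
    return diff, diff / se, df


# Hypothetical pupil scores grouped by cluster.
intervention = [[5, 7], [6, 8], [7, 9]]
control = [[4, 4], [5, 5], [6, 6]]
diff, t, df = cluster_level_t_test(intervention, control)
print(diff, round(t, 3), df)  # difference 2, t about 2.449, df 4
```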

CRTs are increasingly used to evaluate non-pharmacological interventions for improving child health outcomes [9,10,11]. Although CRTs have long been used to evaluate the effectiveness of interventions for improving educational outcomes [12, 13], their use to evaluate health interventions in schools is more recent [10]. Because children spend so much of their time at school, schools provide a natural environment in which to recruit children, deliver public health interventions to them and measure their outcomes [10]. Cluster randomisation is consistent with the natural clustering found within school settings (i.e., classrooms within year groups within schools). School-based CRTs share common challenges with CRTs in other settings, but some issues, such as consent procedures, may be more challenging when schools are randomised [10, 14].

In 2011, a methodological systematic review on the characteristics and quality of reporting of CRTs involving children reported a marked increase in such studies [9]; three quarters of the included studies randomised schools. To date, no systematic review has focussed specifically on the characteristics of school-based CRTs for improving pupil health outcomes. Such a review would help identify common methodological challenges, obtain estimates of parameters (e.g., the ICC) that are of use to researchers planning similar trials and inform the design of simulation studies that use synthetic data to evaluate the properties of statistical methods applied in the context of school-based CRTs with health outcomes.

The aim of this methodological systematic review is to describe the characteristics and practices of school-based CRTs for improving health outcomes in pupils in the United Kingdom (UK).

Methods

This is a systematic review of school-based CRTs with pupil health outcomes that were conducted in the UK. The review focussed on the UK so that it could be completed with the available resources and so that richer data on CRT methodology could be collected within a single education system.

Data sources and search methods

The systematic review was registered with PROSPERO (CRD42020201792) and the protocol has been published [15]. After extensive scoping of the subject area, a pragmatic decision was made to search only MEDLINE (through Ovid), to make the review more time-efficient and align with available resources. MEDLINE was searched from inception to 30th June 2020 for peer-reviewed articles of school-based CRTs. The search strategy (Table 1) was developed in consultation with information specialists, based on a sensitive MEDLINE search strategy for identifying CRTs [16]. The cluster design-related terms ‘cluster*’, ‘group*’ and ‘communit*’ were combined with the terms ‘random’ and ‘trial’, along with the ‘Schools’ Medical Subject Heading (MeSH) term. The search was limited to English-language publications.

Table 1 Systematic review search strategy

Inclusion and exclusion criteria

The systematic review included school-based definitive CRTs of the effectiveness of an intervention versus a comparison group that evaluated health outcomes on pupils. The population of interest was children in full-time education in the UK. Studies that took place outside the UK were excluded. The pragmatic decision was made to limit the population to educational settings within the UK as it made the review more focussed and applicable to a specific setting. Eligible studies included pupils in pre-school, primary school and secondary school. The types of eligible clusters included schools themselves, year groups, classes, teachers or any other relevant school-related unit. All school types were eligible, including special schools. Any health-related intervention(s) and control groups were considered. The primary outcome had to be related to pupils’ health. Studies for which the primary outcome was not health-based (e.g., academic attainment) were excluded. All types of CRT design were eligible including parallel group, factorial, crossover and stepped wedge studies.

If more than one publication of the primary outcome result for an eligible CRT was identified, a key study (index) report was designated and used for data extraction. Papers that did not report the primary outcome were excluded along with pilot/feasibility studies, protocol/design articles, process evaluations, economic evaluations/cost-effectiveness studies, statistical analysis plans, commentaries and mediation/mechanism analyses.

Sifting and validation

Two reviewers (KP and OU) independently screened the titles and abstracts of all references (downloaded into EndNote [17]) for eligibility against the inclusion criteria. Studies whose eligibility the reviewers were uncertain about were taken forward to full-text screening. Full-text articles were assessed by the same reviewers against the inclusion criteria using a pre-piloted coding method. Any discrepancies that could not be resolved through discussion were referred to a third reviewer (ZMX) for a decision.

Data extraction and analysis

For each eligible study, data were extracted using a pre-piloted form in Microsoft Excel. Data were extracted by two reviewers (KP and OU), and any discrepancies that could not be resolved through discussion were sent to a third reviewer (ZMX) for a final decision. Missing information that was not available in the index papers was sought from corresponding protocol papers and other “sibling” publications.

The following items of information were extracted:

  • Publication details: year of publication and journal name.

  • Setting characteristics: country/region, school level and type of school.

  • Intervention: health area and intervention type.

  • Primary outcome: name, health area, reporter of outcome and method of data collection.

  • Study design and analysis methods: unit of randomisation (i.e., type of cluster), justification for using the cluster trial design, method used to sample schools, method used to balance the randomisation, length and number of follow-ups, design of follow-up (cohort versus repeated cross-sectional design) and method used to account for clustering in the analysis.

  • Sample size calculation: target sample size (i.e., number of clusters and pupils) and assumptions underlying the sample size calculation (e.g., assumed ICC, percentage loss to follow-up).

  • Ethics and consent procedures: activities covered by the consent agreements and use of “opt-out” consent.

  • Other study characteristics of methodological interest: number of clusters and pupils that were recruited and lost to follow-up, estimate of the ICC of the primary outcome.

Study characteristics were described using medians, interquartile ranges (IQRs) and ranges for continuous variables, and numbers and percentages for categorical variables, using Stata software [18]. Formal quality assessment was not performed as it was not an objective of this review to estimate intervention effects in the included studies. Some information relevant to the quality of CRTs was, however, extracted and summarised as part of the review.

Results

Search results

After deduplication, 3103 articles were identified through MEDLINE, 159 were full-text screened and 64 were included in the review [19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82]. Of 95 excluded studies, 88 did not meet the inclusion criteria, and 7 studies met inclusion criteria but were subsequently excluded because they were sibling reports of an index paper. The PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram is in Fig. 1.

Fig. 1

PRISMA flowchart summarising the results of the literature search and screening for eligibility

Study characteristics

The included papers were published in 36 different journals, including: British Medical Journal (n = 9 papers); BMC Public Health (n = 4); International Journal of Behavioral Nutrition and Physical Activity (n = 4); Archives of Disease in Childhood (n = 3); BMJ Open (n = 3); Journal of Epidemiology and Community Health (n = 3); Public Health Nutrition (n = 3); and The Lancet (n = 3). The CRT design has been increasingly used in the UK school setting to evaluate health interventions for pupils since the first paper was published in 1993 (Fig. 2). Twenty-three papers were published between 2001 and 2010, compared with 37 between January 2011 and June 2020.

Fig. 2

Published cluster randomised trials indexed in MEDLINE from inception to 30th June 2020 (N = 64)

Table 2 summarises the characteristics of included studies.

Table 2 Characteristics of included studies (N = 64)

Setting

Almost three quarters of the studies were conducted exclusively in England (n = 47; 73%); most studies (50 of the 52 that provided the data) took place in one or two geographic regions (e.g., West Midlands). Just over half the studies (56%) were based exclusively in primary schools (age 5–11 years), and 38% exclusively in secondary schools (age 11–16 years). Of the 44 studies that reported the types of schools recruited [83], 93% included state-funded schools.

Intervention type

Eighteen (28%) studies evaluated interventions that targeted nutrition, 15 (23%) physical activity, 15 (23%) socioemotional function and its influences, 7 (11%) dental health, 5 (8%) smoking and 5 (8%) injury, amongst others. Physical activity interventions have become increasingly prominent (13 published since 2011, compared with just 2 before then). Of the 15 studies targeting socioemotional function and its influences, 13 were published since 2011, indicating increasing use of the CRT design in this area. Of the 7 CRTs related to dental health, the most recent was published in 2011. The vast majority of interventions (94%) were aimed at primary prevention.

In 53 (83%) studies, the intervention had at least one component that necessarily had to be delivered to entire clusters (“cluster-cluster” interventions [1]). Such components often included educational lessons (e.g., classroom-based lessons [23], physical activity [43] and gardening [25]). Less common components included breakfast clubs [46, 73], funding/resources [37], changes in school policy [50] and advertisements [40]. Eleven (17%) studies had intervention components that directly targeted individual pupils (“individual-cluster” interventions [1]), such as the use of fluoride varnish [72]. Thirty-three (52%) studies had “professional-cluster” interventions [1]: in 30 (47%) studies teachers were either trained in, or provided with guidance on, delivering components of the intervention; in 3 studies pupils were trained to deliver peer-led intervention components [21, 26, 42]; and in 1 study the school nurse was trained [66]. Half the studies (n = 32) had “external-cluster” interventions [1], in which people external to the school delivered intervention components (e.g., researchers [23], trained facilitators [53], dental professionals [51], dance instructors [41] and student volunteers [47]).

Two studies [53, 78] had two control groups (one “usual care” and one active), and 16 (25%) used a delayed intervention (waitlist) design.

Primary outcome

Health areas assessed by the primary outcomes are summarised in Table 2. Pupils reported the primary outcome in 53% of the studies, researchers in 20%, teachers in 8% and parents in 8%. In 28% of the studies the person reporting the primary outcome was blind to allocation status (some authors specifically commented on the challenges of blinding to trial arm [33, 36, 56, 60]), and 22% measured the outcome using an objective method.

Study design and analysis methods

Explicit justification for using the CRT design was provided in only 17 (27%) studies; the most common reason was to avoid contamination (13 studies altogether). Most studies (n = 56; 88%) randomised schools, while classes and year groups were allocated in 6 (9%) and 2 (3%) studies, respectively. The authors of two studies stated that classes rather than schools were randomised in order to maintain power, and that this may have led to contamination between the intervention and control arms [22, 28]. Nearly all studies used a parallel group design (n = 61; 95%); the remaining 3 used a factorial design [21, 37, 39]. Of the 46 studies with sufficient information to establish how schools were sampled, 33 initially invited all potentially eligible schools to participate, 5 used random sampling, 4 used purposive sampling, 3 used convenience sampling, and 1 used a mixed random/convenience sampling approach.

Eighty percent of studies reported using a restricted allocation method to balance cluster-level characteristics between the trial arms. A measure of socio-economic status (SES) was the most common balancing factor (48%), with a third of studies (21/64) specifically balancing the allocation on the percentage of pupils eligible for free school meals. Other commonly used balancing factors are described in Table 3. Few studies justified their choice of balancing factors.
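Restricted allocation on a cluster-level factor such as the percentage of pupils eligible for free school meals can be illustrated with a minimal Python sketch of one common approach, restricted randomisation: enumerate every allocation of schools to two equal arms, keep those in which the arm means of the balancing factor differ by no more than a chosen tolerance, and pick one acceptable allocation at random. The school labels, percentages and tolerance are hypothetical, and the reviewed studies may have used other methods such as stratification or minimisation.

```python
import random
from itertools import combinations
from statistics import mean


def restricted_allocation(fsm_percent, tolerance, seed=None):
    """Randomly choose an allocation from those balanced on `fsm_percent`.

    fsm_percent maps school label -> % of pupils eligible for free school meals.
    Returns (intervention schools, control schools, number of acceptable allocations).
    """
    schools = sorted(fsm_percent)
    half = len(schools) // 2
    acceptable = []
    # Each combination is one possible set of intervention schools.
    for arm in combinations(schools, half):
        other = [s for s in schools if s not in arm]
        imbalance = abs(mean(fsm_percent[s] for s in arm)
                        - mean(fsm_percent[s] for s in other))
        if imbalance <= tolerance:
            acceptable.append((list(arm), other))
    if not acceptable:
        raise ValueError("no allocation meets the balance tolerance")
    rng = random.Random(seed)
    intervention, control = rng.choice(acceptable)  # the random step
    return intervention, control, len(acceptable)


# Hypothetical schools and FSM percentages.
fsm = {"School A": 12.0, "School B": 35.5, "School C": 20.1, "School D": 28.4}
print(restricted_allocation(fsm, tolerance=5.0, seed=2020))
```

The tolerance involves a trade-off: if very few allocations qualify, the chosen allocation is close to deterministic, which undermines the randomisation.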

Table 3 Cluster-level characteristics used to balance the randomisation (N = 64)

One of the challenges of CRTs is to avoid recruitment bias that might occur if participants are recruited after the clusters are randomised [88, 89]. One third (33%) of studies avoided this by recruiting pupils before the clusters were randomised; furthermore, 25% collected baseline data before randomisation. This information, however, was unclear in many studies (41% and 33%, respectively). Generally, insufficient information was provided on whether recruitment bias was avoided in studies where pupils were recruited after randomisation of clusters. A notable exception was one study [57] where recruitment bias was avoided because allocation was not revealed to the schools until after recruitment and baseline assessment.

Nearly all studies used the cohort design as their method of follow-up (n = 62, 97%), where the same pupils provided data at each study wave. One study used a repeated cross-sectional design where different pupils provided data at each wave [46], and one used an a priori mixed design incorporating elements of the cohort and repeated cross-sectional designs, with only a subset of participating pupils providing data at each wave [49].

Seventy-two percent of studies analysed their data using individual-level methods that allow for clustering, 16% used cluster-level analysis methods, and 12% did not allow for clustering in their analysis.

Sample size calculation

Seventy-eight percent of studies accounted for clustering in their sample size calculation and 72% reported the ICC or coefficient of variation [90] that was assumed for the outcome. None of the studies made a degrees of freedom correction to the sample size calculation. Only two studies [57, 63] allowed for unequal cluster sizes in their sample size calculation, and only one of these [57] specified the anticipated variation in the number of pupils across clusters. The median (range) assumed ICC for school clusters was 0.05 (0.005 to 0.175), based on the 37 studies that provided these data. Of the 3 studies that specified the coefficient of variation of the outcome, 2 assumed a value of 0.2 [42, 60] and 1 assumed 0.25 [19]. The median (range) assumed design effect was 2.21 (1.22 to 8.11). The median target sample size was 30 clusters and 964 pupils. Most studies (94%) did not state whether their sample size calculation allowed for loss to follow-up of clusters.
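When the between-cluster coefficient of variation of the outcome (k) is assumed instead of the ICC, one widely cited formula for comparing two proportions, attributed to Hayes and Bennett and assumed here, gives the number of clusters per arm directly; its leading 1 acts as a small-sample allowance. The minimal Python sketch below assumes equal cluster sizes m, a two-sided test and hypothetical proportions; it is an illustration, not the calculation used in any reviewed study.

```python
import math
from statistics import NormalDist


def clusters_per_arm_cv(p0, p1, m, k, alpha=0.05, power=0.8):
    """Clusters per arm for comparing proportions p0 and p1, with m pupils
    per cluster and between-cluster coefficient of variation k:

    c = 1 + (z_{1-alpha/2} + z_{1-beta})^2
            * [p0(1-p0)/m + p1(1-p1)/m + k^2 (p0^2 + p1^2)] / (p0 - p1)^2
    """
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    within = (p0 * (1 - p0) + p1 * (1 - p1)) / m   # binomial variation within clusters
    between = k ** 2 * (p0 ** 2 + p1 ** 2)         # variation of true cluster proportions
    c = 1 + z_sum ** 2 * (within + between) / (p0 - p1) ** 2
    return math.ceil(c)


# Hypothetical: reduce a prevalence from 30% to 20%, 50 pupils per school.
print(clusters_per_arm_cv(0.3, 0.2, m=50, k=0.2))   # 11
print(clusters_per_arm_cv(0.3, 0.2, m=50, k=0.25))  # 14
```

The comparison between k = 0.2 and k = 0.25 shows how sensitive the required number of schools is to the assumed between-cluster variation.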

Ethics and consent procedures

Information regarding consent procedures was poorly reported, and consent for the participation of the cluster was often implied rather than explicitly described. In 63% of studies, both parents/guardians and pupils were stated to have provided consent or assent for study participation. Forty-five percent of studies reported that opt-out consent [14] from the parent/guardian and/or the pupil was used for participation.

Other study characteristics of methodological interest

A median (IQR) of 31.5 (21 to 50) clusters, 29 (15 to 50) schools and 1308 (604 to 3201) pupils were recruited. The CRT studies that used a cohort design and reported both targeted and achieved recruitment figures at the cluster (n = 45) and pupil (n = 43) levels achieved those recruitment targets in 89% and 77% of studies, respectively. Some authors noted challenges with recruitment at the cluster [45, 47, 50] and pupil [24, 55] levels. Based on the 33 studies that provided data, the median (IQR) percentage of pupils categorised as “White” was 76.8% (51.5% to 86.2%). Thirty out of 62 (48%) studies that provided information reported that at least one cluster was lost to follow-up. Missing data resulting from whole schools dropping out was highlighted as a problem in some reports (e.g., [42, 48, 54]). The median percentage of pupils followed up was 79.9%.

Only 26 (41%) studies overall, and 18 of the 37 (49%) studies published after 2010, reported the ICC from the analysis of the primary outcome; the specific ICC values are reported in Table 4. The median (range) ICC for school clusters was 0.028 (0.0005 to 0.21). For many studies that reported both values, there was a marked difference between the observed school-level ICC in the study data and the corresponding ICC assumed in the sample size calculation (Fig. 3). The median (range) of the differences between the observed and assumed ICCs was -0.006 (-0.117 to 0.16), indicating that, on average, the observed ICC was slightly smaller than the assumed ICC; at one extreme, the observed ICC in one study was 0.117 smaller than the assumed value [25], and at the other extreme, the observed ICC in one study was 0.16 larger than the assumed value [68]. The intraclass correlation coefficient of agreement between the observed and assumed ICCs was 0.24.
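An agreement statistic of this kind can be computed in more than one way, and the form used here is an assumption: the two-way, absolute-agreement, single-measure intraclass correlation (often written ICC(A,1), following McGraw and Wong), which penalises systematic differences between paired values as well as scatter. The minimal Python sketch below uses hypothetical values, not the review data.

```python
from statistics import mean


def icc_absolute_agreement(x, y):
    """Two-way, absolute-agreement, single-measure ICC, ICC(A,1), for paired
    values x[i], y[i] (e.g. observed and assumed ICC in study i)."""
    n, k = len(x), 2  # n studies, k = 2 "raters" (observed, assumed)
    rows = list(zip(x, y))
    grand = mean(list(x) + list(y))
    row_means = [mean(r) for r in rows]
    col_means = [mean(x), mean(y)]
    # Two-way ANOVA sums of squares.
    ss_rows = k * sum((rm - grand) ** 2 for rm in row_means)
    ss_cols = n * sum((cm - grand) ** 2 for cm in col_means)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical observed and assumed ICCs for five studies.
observed = [0.02, 0.01, 0.08, 0.03, 0.12]
assumed = [0.05, 0.02, 0.05, 0.05, 0.05]
print(round(icc_absolute_agreement(observed, assumed), 3))
```

Because it uses absolute agreement, a constant offset between observed and assumed values lowers the statistic even when the two sets are perfectly correlated.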

Table 4 Reported intra-cluster correlation coefficients for primary outcomes (N = 26)

Fig. 3

Observed ICC for primary outcomes versus ICC assumed in sample size calculation (N = 20)

Seven studies [24, 26, 44, 59, 68, 71, 74] that reported ICCs had a binary primary outcome, but none of these stated whether the ICC was calculated on the proportions scale or the logistic scale [3]. The five of these studies [24, 26, 68, 71, 74] that used mixed-effects (“multilevel”) models [91] to analyse the data may have reported the ICC on the logistic scale, which could account for some of the differences between the observed and assumed ICCs. Further scrutiny of the data, however, revealed marked differences for only two of these studies: 0.21 for the observed ICC versus 0.05 for the assumed ICC in Mulvaney and colleagues [68], and 0.028 versus 0.1, respectively, in Obsuth and colleagues [71].
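The distinction between the two scales can be made concrete with a short Python sketch. On the proportions scale, the ICC can be estimated directly from the 0/1 outcomes with the one-way analysis of variance estimator (which allows for unequal cluster sizes); on the logistic scale, a random-intercept logistic model with between-cluster variance σ²_u gives ICC = σ²_u / (σ²_u + π²/3), because the latent individual-level logistic residual has variance π²/3. The function names and data are hypothetical; in practice σ²_u would come from the fitted mixed-effects model.

```python
import math
from statistics import mean


def icc_anova(clusters):
    """One-way ANOVA estimator of the ICC (needs at least 2 clusters).
    With 0/1 outcomes this is the ICC on the proportions scale.
    `clusters` is a list of lists of individual outcomes."""
    k = len(clusters)
    sizes = [len(c) for c in clusters]
    n_total = sum(sizes)
    means = [mean(c) for c in clusters]
    grand = sum(sum(c) for c in clusters) / n_total
    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(sizes, means))
    ss_within = sum((y - m) ** 2 for c, m in zip(clusters, means) for y in c)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Adjusted mean cluster size, allowing for unequal cluster sizes.
    m0 = (n_total - sum(s ** 2 for s in sizes) / n_total) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (m0 - 1) * ms_within)


def icc_logistic(sigma2_u):
    """ICC on the logistic (latent) scale from the random-intercept variance."""
    return sigma2_u / (sigma2_u + math.pi ** 2 / 3)


# Hypothetical binary outcomes in three schools.
print(icc_anova([[1, 1, 0, 0, 1], [0, 0, 0, 1, 0], [1, 1, 1, 0, 1]]))
print(icc_logistic(0.2))
```

Because π²/3 (about 3.29) is a fixed latent-scale variance, the two estimates for the same data generally differ, which is why the scale should be reported alongside the ICC value.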

Discussion

The number of UK school-based CRTs evaluating the effects of interventions on pupil health outcomes has increased in recent years, reflecting growing recognition of the role that schools can play in improving the health of children [10, 92,93,94,95]. The findings of this systematic review indicate a number of methodological considerations that are worthy of reflection.

Interpretation

Seventy-two percent of the studies reported the level of clustering assumed in their sample size calculation, slightly more than the 62% observed in a 2015 review of the reporting of sample size calculations in CRTs [96]. Our review found that the observed ICC in the study data often differed markedly from the ICC assumed in the sample size calculation. This will be partly due to sampling variation and adjustment for prognostic factors in the analysis, but it may also reflect a lack of good estimates of the ICC at the time of the sample size calculation. Knowledge of the ICC for pupil health outcomes in the school setting is less well established than for patient health outcomes in the primary care setting, where general practices are allocated as clusters [1, 97]. It has been reported that general practice-level ICCs for health outcomes are generally less than 0.05 [98]; in our review, only 13 of the 23 studies that randomised school clusters and reported observed ICCs had values less than 0.05. School-based ICC estimates are widely available for educational outcomes [99], but these are markedly higher than those reported in this review for pupil health outcomes; this is to be expected given that the primary role of the school is to provide education. The importance of reporting ICCs from study data for planning future similar CRTs has long been established [100], and the 2012 CONSORT extension for CRTs includes a specific reporting item for this [101]. Only two-fifths (41%) of studies in this review, however, reported the ICC for the primary outcome; this figure rises to 48% (16/33) for studies published after 2012. Improved reporting of the ICC in the growing number of school-based CRTs, and further papers written specifically to report ICCs [102, 103], will provide valuable knowledge.

This review focussed on CRTs in the UK setting; a useful area to investigate is the extent to which school-based ICC estimates for health outcomes from other countries (e.g., [102, 104]) are similar to those in the UK.

Representativeness of school and pupil characteristics in school-based trials is important for external validity and inclusiveness. For most studies in this review, schools were recruited from only one or two geographic regions/counties. A median of 23% of participating pupils were from a minority ethnic group, lower than the national percentages reported by the UK Department for Education (33.5% of primary school pupils and 31.3% of secondary school pupils) [105]. The study reports generally provided little information on specific aspects of the recruitment process, such as why some schools declined to participate and the characteristics of those schools. Many of the studies evaluated interventions that involved classroom lessons and required teachers to be trained to deliver the intervention. Additionally, teachers reported pupil outcomes in some studies [32, 34, 60, 73, 82]. Insufficient school resources to deliver the intervention and support the wider trial may be a barrier to participation and may result in a lack of representation of certain types of schools.

Eighty percent of the studies used some form of restricted allocation to balance the randomisation on cluster-level characteristics, a higher proportion than reported in previous methodological reviews of CRTs [106,107,108,109]. The percentage of pupils in the school who are eligible for free school meals was often used as a balancing factor, perhaps partly because this information is readily available from the UK Department for Education [110]. School characteristics that predict the study outcomes, account for within-cluster correlation or influence the effectiveness of the intervention are candidates on which to balance the randomisation [1, 111]; previous school-based CRTs could be used to identify such factors.

Strengths

This systematic review used a defined search strategy tailored to identify school-based CRTs. The strategy was developed through an iterative process and allowed us to achieve an appropriate balance of sensitivity and specificity given our available resources. Identifying reports of CRTs is challenging because many articles do not use the term ‘cluster’ in their title or abstract. We therefore included terms such as ‘group’ and ‘community’ to improve sensitivity. The ‘Schools’ MeSH term was also used to identify publications that randomised any type of school-related unit. Our screening and data extraction procedures were piloted, and both were conducted by two independent reviewers, improving accuracy. The review identified school-based CRTs with interventions spanning a variety of health conditions/areas.

Limitations

A potential limitation of the review is that the search was limited to one database. MEDLINE was used because the review focussed on describing the characteristics of trials that evaluate the impact of health interventions on pupils’ health outcomes, but we may have missed eligible publications that are not indexed in MEDLINE. However, translating our search to the EMBASE, DARE, PsycINFO and ERIC databases and screening for potential inclusions published in the last 3 years revealed only one additional eligible school-based CRT.

Given resource constraints, we focussed the review on the UK, choosing to collect rich data on CRT methodology in a single education system. As a result, the findings are readily applicable to a specific context. Despite this focus, the findings are likely to be of wider international interest. Other high-income countries, such as Australia, have school systems similar to that of the UK, and many of our findings may be applicable in those settings. Furthermore, some of the methodological challenges in the design of CRTs will be similar across different settings.

Future directions

The results provide a summary of the methodological characteristics of school-based CRTs with pupil health outcomes in the UK. To our knowledge, there has been no systematic review of the characteristics of school-based CRTs for evaluating interventions for improving education outcomes, despite the fact that the use of the CRT design is more established in that area. A comparison of methodology between health-based CRTs and education-based CRTs in the school setting would be valuable to both areas. The results in our review indicate that better information on the ICC is needed to design school-based CRTs with health outcomes. Cataloguing of ICCs from previous studies will help researchers choose better values for the assumed ICC when calculating sample size.

Conclusions

CRTs are increasingly used in the school setting for evaluating interventions for improving children’s health and wellbeing. The emerging pool of published trials in the UK provides investigators and methodologists with relevant experiential knowledge for the design of future similar studies. This review of school-based CRTs has highlighted the need for more information on the ICCs to calculate the required sample size. Better reporting of the recruitment process in CRTs will help to identify common barriers to obtaining representative samples of schools and pupils. Finally, previous school-based CRTs may provide a useful source of data to identify the school-level characteristics that are strong predictors of pupil health outcomes and, therefore, potentially good factors on which to balance the randomisation.