Background

To combat the worldwide childhood overweight and obesity epidemic, a large variety of healthy eating and physical activity promotion programmes targeting youth have been developed [1]. Schools are regarded as a suitable setting for obesity prevention programmes, as they provide access to almost all children, regardless of their ethnicity or socioeconomic status [2]. School-based obesity prevention programmes that target healthy eating, physical activity and sedentary behaviour seem promising in reducing or preventing overweight and obesity among children [1, 3, 4]. However, when those evidence-based programmes are implemented in real-world settings, their effectiveness is often disappointing [5]. One of the reasons is that programmes are not implemented in the way intended by programme developers, which can be labelled a ‘lack of fidelity’ or ‘programme failure’ [5]. On the other hand, some degree of programme adaptation is inevitable and may actually have beneficial effects [6]. A better understanding of implementation processes is important to determine if and when disappointing effects can be ascribed to programme failure.

Although there is growing recognition of the value of comprehensive process evaluations, most studies still focus on programme effectiveness, and process evaluations are often an afterthought [7]. We believe it is important not only to evaluate whether a programme was effective, but also to understand how a programme was implemented and how this affected programme outcomes, starting in the early stages of intervention development and evaluation [8]. The literature distinguishes four evaluation phases: 1) formative evaluation: evaluating the feasibility of a health programme; 2) efficacy evaluation: evaluating the effect of a health programme under controlled conditions (internal validity); 3) effectiveness evaluation: evaluating the effect of a health programme under normal conditions (internal and external validity); and 4) dissemination evaluation: evaluating the adoption of a large, ongoing health programme in the real world [9, 10]. Although the scope of each evaluation phase differs, process evaluations can support studies in each phase by providing a detailed understanding of implementers’ needs, implementation processes, specific programme mechanisms of impact and contextual factors promoting or inhibiting implementation [8, 10].

One of the aspects captured in process evaluations is fidelity [5, 11, 12, 13]. Fidelity is an umbrella term for the degree to which an intervention was implemented as intended by the programme developers [11]. Measurements of fidelity can tell us what was implemented and how, what changes (i.e. adaptations) were made to the programme, and how these adaptations could have influenced effectiveness [11]. Until recently, many claimed that all forms of adaptation indicate a lack of fidelity and are, therefore, a threat to programme effectiveness [11, 12, 14, 15]. However, there is increasing recognition of the importance of mutual adaptation between programme developers (e.g. researchers) and programme providers (e.g. students, teachers, school directors) [11, 16]. Mutual adaptation denotes a bidirectional process in which a proposed change is modified to the needs, interests and opportunities of the setting in which the programme is implemented [11]. Wicked problems such as obesity require that programme developers are open to (major) bottom-up input and adjustments, rather than exerting controlling top-down influence [17]. This implies the need to accurately determine whether programmes were adapted, to measure fidelity, and to relate both to programme outcomes [10].

Several frameworks have been proposed and applied to measure fidelity in health-promotion programme evaluations [11, 18, 19], for example the Key process evaluation components defined by Linnan and Steckler [19] and the Conceptual framework for implementation fidelity by Carroll et al. [18]. Yet, there is much heterogeneity in the operationalisation and measurement of the concept of fidelity [20]. While some focus on quantitative aspects of fidelity, such as the dose delivered to the target group expressed as a percentage of use, others focus on whether the programme was implemented as intended, often operationalised as ‘adherence’ [11, 12, 18, 21]. Moreover, methods to measure fidelity vary considerably in quality: while some rely on teacher self-reports completed at programme completion, others use observer data collected during the implementation period, which appears to be more valid [11]. No standardised operationalisation and methodology exists for measuring fidelity, partially due to the complexity and variety of school-based obesity prevention programmes.

In 2003, Dusenbury et al. reviewed the literature on implementation fidelity of school-based drug prevention programmes spanning a 25-year period [11]. According to their review, fidelity can be divided into five components: 1) adherence (i.e. the extent to which the programme components were conducted and delivered according to the theoretical guidelines, plan or model); 2) dose (i.e. the amount of exposure to programme components received by participants, such as the number or duration of the lessons delivered); 3) quality of programme delivery (i.e. how programme providers delivered the programme components, for example the teacher’s enthusiasm, confidence or manner of presentation); 4) participant responsiveness (i.e. the extent to which participants are engaged with the programme, for example their enthusiasm, interest in the programme or willingness to participate); and 5) programme differentiation (i.e. the identification of the programme components essential for effective outcomes) [11].

To date, there is no clear overview of how fidelity is assessed in school-based obesity prevention programmes. In order to move the field of obesity prevention forward, we reviewed the literature to identify the current methods used to operationalise, measure and report fidelity. Building on the conceptualisation of the elements of fidelity by Dusenbury et al. [11], the aims of this review are to: 1) identify which fidelity components have been measured in school-based obesity prevention programmes; 2) identify how fidelity components have been measured; and 3) score the quality of these methods.

Methods

Literature search

A literature search was performed by RO, RS and FvN, based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement, see Additional file 1. To identify all relevant publications, we performed systematic searches in the bibliographic databases PubMed, Embase.com, CINAHL (via EBSCO), The Cochrane Library (via Wiley), PsycINFO (via EBSCO) and ERIC (via EBSCO) from January 2001 up to October 2017. Search terms included controlled terms (e.g. MeSH in PubMed, Emtree in Embase) as well as free-text terms. We used free-text terms only in The Cochrane Library.

The search strategy combined (AND) terms representing the target setting (e.g. ‘school’), measures of fidelity (e.g. ‘adherence’ OR ‘dose’), the target group (e.g. ‘child’) and health promotion interventions (e.g. ‘health promotion’ OR ‘program’) with a block of terms for at least one energy balance-related behaviour (e.g. ‘sitting’ OR ‘physical activity’ OR ‘eating’) or obesity (e.g. ‘obesity’ OR ‘overweight’), combined with OR within the block. The full search strategy for all databases can be found in the appendix, see Additional file 2.
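As a minimal sketch of the Boolean structure described above, the following Python snippet composes such a query. The term lists are abbreviated placeholders, not the actual search terms; the full controlled and free-text terms per database are those in Additional file 2.

```python
# Illustrative term lists only; see Additional file 2 for the real strategy.
blocks = [
    ["school"],                                    # target setting
    ["adherence", "dose", "fidelity"],             # measures of fidelity
    ["child", "adolescent"],                       # target group
    ["health promotion", "program", "programme"],  # intervention
    ["sitting", "physical activity", "eating",     # energy balance-related
     "obesity", "overweight"],                     # behaviour OR obesity
]

def or_block(terms):
    """Join the terms of one concept with OR, quoted for phrase searching."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Concepts are combined with AND; terms within a concept with OR.
query = " AND ".join(or_block(terms) for terms in blocks)
print(query)
```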

Selection process

The following inclusion criteria were applied: 1) a population of school-aged children (4–18 years); 2) a school-based intervention aimed at preventing obesity (targeting sedentary behaviour, nutrition and/or physical activity); 3) at least one fidelity component of Dusenbury et al. [11] as an outcome measure; and 4) fidelity evaluated with quantitative methods, i.e. questionnaires, observations, structured interviews or logbooks. The following exclusion criteria were applied: 1) a population of younger (< 4 years) or older (> 18 years) children; 2) programmes implemented only after school time; and 3) fidelity evaluated with qualitative methods only. Only full-text articles published in English were included. Studies describing elements of fidelity were included irrespective of the evaluation phase (i.e. formative, efficacy, effectiveness or dissemination evaluation) [9, 10]. Two reviewers (RS, FvN) independently screened all retrieved titles and abstracts and independently reviewed all selected full-text articles. Disagreements were discussed until consensus was reached. Finally, the reference lists of the selected articles were checked for relevant studies.

Data extraction

Data extraction was performed to identify which fidelity components were measured and how. Information was collected on author, programme name, year of publication, country of delivery, programme characteristics, setting, target group, programme provider, theoretical framework and measures of implementation fidelity. We used the classification of Dusenbury et al. [11] to review measures of implementation fidelity in school-based obesity prevention programmes. For each measured fidelity component (i.e. adherence, dose, quality of delivery, responsiveness and differentiation), we extracted the following data: definition of the fidelity component, data collection methods and timing, subject of evaluation (i.e. student, teacher or school), a summary of the results (i.e. mean value or range) and the relation between the fidelity component and programme outcomes. One reviewer (RS) performed the data extraction, which was then checked by a second reviewer (FvN). Disagreements about the extracted data were discussed until consensus was reached. Results are reported per study: if two articles reported on the same trial, we merged their results. Studies describing the same programme but evaluated in different trials were reported separately in the data extraction tables. Moreover, if a study referred to another publication describing the design or other relevant information, that publication was read to perform additional data extraction. Three authors were contacted by email to request additional information for the data extraction.

Quality assessment

The quality assessment was performed to score the quality of the methods used to measure each fidelity component (i.e. adherence, dose, quality of delivery, responsiveness and differentiation) in each study. Two reviewers (RS and FvN) independently performed the methodological quality assessment; disagreements were discussed and resolved. Given the absence of a standardised quality assessment tool for measuring implementation fidelity, the quality assessment was based on a tool used in two earlier reviews of process evaluation data [7, 22]. The quality of methods was scored using seven criteria (see Table 1). As a first step, we assessed whether each of the seven criteria had been applied to each measured fidelity component in each study. Per fidelity component, each criterion was scored either positive (+) or negative (−) (see Table 1). For the components adherence and quality of delivery, however, criterion two (i.e. level of evaluation) was scored not applicable (NA), as these components can only be evaluated at the teacher level and not at two or more levels.

Table 1 Criteria list for assessment of the methodological quality of fidelity components

Second, based on the scores of the seven criteria, a sum quality score (the percentage of positively scored criteria) was calculated for each fidelity component in each study, ranging from 0% (all seven criteria scored negative) to 100% (all seven criteria scored positive). For adherence and quality of delivery, which scored not applicable on criterion two, the sum quality score was calculated on the basis of the remaining six criteria: 0% if all six were scored negative and 100% if all six were scored positive. Finally, each fidelity component was rated as having high (> 75% positive), moderate (50–75% positive) or low (< 50% positive) methodological quality.
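The scoring arithmetic can be summarised in a short Python sketch; the criterion scores in the example are hypothetical, but the calculation and cut-offs follow the procedure above.

```python
def sum_quality_score(criteria):
    """Percentage of positive scores among applicable criteria.

    criteria: list of '+', '-' or 'NA' for the 7 criteria in Table 1;
    'NA' (criterion 2 for adherence and quality of delivery) shrinks
    the denominator from 7 to 6.
    """
    applicable = [c for c in criteria if c != "NA"]
    return 100 * sum(c == "+" for c in applicable) / len(applicable)

def quality_rating(score):
    """Map a sum quality score to the review's three quality bands."""
    if score > 75:
        return "high"
    if score >= 50:
        return "moderate"
    return "low"

# Hypothetical example: adherence in one study, criterion 2 scored NA,
# four of the remaining six criteria scored positive.
score = sum_quality_score(["+", "NA", "+", "-", "+", "+", "-"])
print(round(score), quality_rating(score))  # 67 moderate
```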

Results

In total, 26,294 articles of potential interest were retrieved: 9408 in Embase, 6957 in PubMed, 2504 in CINAHL, 3115 in PsycINFO, 2227 in The Cochrane Library and 2083 in ERIC. After screening first on title and abstract and subsequently on full text, 73 published articles [21, 23–94] reporting on 63 different studies were included. Reasons for excluding articles are reported in Fig. 1.

Fig. 1 Flow diagram of study selection process

Study characteristics

In this section we describe the study characteristics and which fidelity components were measured in the included studies (Aim 1). Table 2 provides an overview of the study characteristics and the measured fidelity components (i.e. adherence, dose, quality of delivery, responsiveness or differentiation), including references. Studies were conducted in 12 different countries, most often in the United States (N = 31) and the Netherlands (N = 11); two studies were conducted across multiple European countries. All studies aimed to prevent obesity: 16 studies targeted physical activity, 27 targeted healthy eating, 12 targeted both physical activity and healthy eating, three targeted both physical activity and sedentary behaviour, and five targeted physical activity, healthy eating and sedentary behaviour. In total, 45 studies were conducted in primary schools, 12 in secondary schools and six in both primary and secondary schools. Seventeen studies used a theory, framework or model to design their process evaluation. Nine different theories, frameworks or models were used: the Key process evaluation components defined by Steckler and Linnan, the Concepts in process evaluations by Baranowski and Stables, the Conceptual framework of process evaluations by McGraw, the How-to guide for developing a process evaluation by Saunders, the Logic model by Scheirer, the RE-AIM framework by Glasgow, the Probabilistic mechanistic model of programme delivery by Baranowski and Jago, the Taxonomy of outcomes for implementation research by Proctor, and the Theory of diffusion of innovations by Rogers.

Table 2 Study characteristics

In total, the 63 studies reported 120 fidelity components. Dose was measured most often (N = 50), followed by responsiveness (N = 36), adherence (N = 26) and quality of delivery (N = 8). The component differentiation was not measured in any of the studies. In 24 studies only one fidelity component was measured; two components were measured in 22 studies and three components in 16 studies. Only one study measured four fidelity components, meaning that no study measured all five.

In the following sections, each of the measured fidelity components is described in more detail. We describe how the fidelity components were measured (Aim 2) and the quality scores of these methods (Aim 3). Tables 3 and 4 provide an overview of the measurements of the fidelity components, including references. The full data extraction and quality assessment can be found in the appendix, see Additional files 3 and 4.

Table 3 Characteristics of methods used to measure fidelity component
Table 4 Quality of methods used to measure fidelity components

Adherence

The component adherence was reported in 26 studies (see Table 2). Adherence was defined or operationalised in 15 studies, mostly as fidelity, or as the programme being carried out or delivered as planned or intended. Adherence was mostly measured with observations (N = 15), followed by logbooks (N = 10), questionnaires (N = 7) and structured interviews (N = 2) (see Table 3). Teachers were the subject of evaluation in almost all cases; schools (i.e. school leaders) were the subject in one study. Regarding frequency of data collection, 22 studies measured adherence on more than one occasion (see Table 4). Adherence was related to programme outcomes in five studies.

Quality scores for adherence ranged from 17 to 83%, with a mean score of 46%. Adherence scored low methodological quality 11 times, moderate 11 times and high four times (see Table 4). Adherence scored highest on criterion six (i.e. frequency of data collection) and lowest on criterion seven (i.e. fidelity component related to programme outcome), as only five studies assessed the relation between adherence and programme outcomes.

Dose

The component dose was reported in 50 studies (see Table 2). Dose was defined or operationalised in 28 studies, mostly as the proportion, amount, percentage or number of activities or components that were delivered, used or implemented. Dose was measured by means of questionnaires (N = 27), logbooks (N = 24), observations (N = 7) and structured interviews (N = 6) (see Table 3); in one study the method was not reported [23]. In 40 studies the subject of evaluation was the teacher, in ten studies the student and in only one study the school. Regarding frequency, 34 studies measured dose on more than one occasion (see Table 4). Dose was related to programme outcomes in 17 studies.

Quality scores for dose ranged from 0 to 86%, with a mean score of 37%. Dose scored low methodological quality 34 times, moderate 14 times and high two times (see Table 4). Dose scored highest on criterion six (i.e. frequency of data collection) and lowest on criterion two (i.e. level of evaluation), as only two studies evaluated dose at two or more levels.

Quality of delivery

Quality of delivery was measured in eight studies (see Table 2). The five studies that defined or operationalised quality of delivery referred to it as the quality of the programme; for example, studies measured whether teachers were prepared for their lessons, how they communicated with children or how confident they were in demonstrating lessons. Measurement methods were questionnaires (N = 3), logbooks (N = 3) and observations (N = 2) (see Table 3). All studies performed measurements at the teacher level, and all but two performed measurements on multiple occasions (see Table 4). Quality of delivery was related to programme outcomes in two studies.

Quality scores for quality of delivery ranged from 17 to 67%, with a mean score of 46%. Quality of delivery scored low methodological quality three times and moderate five times (see Table 4). It scored highest on criterion six (i.e. frequency of data collection) and lowest on criterion four (i.e. data collection methods), as only one study reported the use of more than one data collection technique.

Responsiveness

Responsiveness was measured in 36 studies (see Table 2). Responsiveness was defined or operationalised in 13 studies, generally as satisfaction, appreciation, acceptability or enjoyment of the programme. Responsiveness was mostly measured with questionnaires (N = 34), followed by observations (N = 5), logbooks (N = 3) and structured interviews (N = 1) (see Table 3). The subject of evaluation was at the student level 31 times and at the teacher level 17 times. Measurements were conducted on multiple occasions in 14 studies (see Table 4). Responsiveness was related to programme outcomes in seven studies.

Quality scores for responsiveness ranged from 0 to 86%, with a mean score of 34%. Responsiveness scored low methodological quality 29 times, moderate six times and high once (see Table 4). Responsiveness scored highest on criterion five (i.e. quantitative fidelity measures) and lowest on criterion four (i.e. data collection methods), as only five studies reported the use of multiple data collection methods.

Discussion

This review aimed to gain insight into the concepts and methods employed to measure fidelity in school-based obesity prevention programmes and into the quality of these measurements. The results indicate that measurements of fidelity are multifaceted, encompassing different concepts and varying operationalisations of fidelity components. Moreover, the methods used varied widely and were mostly of low methodological quality.

The studies included in our review defined fidelity and its components in different ways. One of the main concerns is that definitions of fidelity components were used interchangeably and inconsistently: the same definition was used for different fidelity components. For example, adherence and dose were both defined as ‘implementation fidelity’. Definitions, if provided, were rather short, and 59 of the 120 fidelity components were not accurately defined, meaning that only the ‘name’ of the component was mentioned without further explanation of how the authors defined the measured component. This lack of consistency in fidelity component definitions is in line with other reviews in implementation science [11, 95] and may be due to the small number of studies that base their process evaluation on a theory, framework or model [96].

Relatedly, no standardised theory, framework or model exists to guide the measurement of fidelity in school-based obesity prevention programmes [96]. As a result, process evaluation theories, frameworks and models from different research areas were applied, which may also contribute to inconsistent fidelity component definitions. For example, the Concepts in process evaluations by Baranowski and Stables originated from health promotion research [65], while the Taxonomy of outcomes for implementation research by Proctor originated from quality-of-care research [13]. Researchers in implementation science therefore need to agree upon the definitions employed for fidelity components, base their design for fidelity measurements on a theory, framework or model, and describe this design and the included fidelity components thoroughly.

According to Dusenbury et al. [11], fidelity has generally been operationalised in five components: 1) adherence, 2) dose, 3) quality of delivery, 4) responsiveness and 5) differentiation. In line with their findings, the current review showed that the number and type of fidelity components measured in the included studies varied considerably, with most studies assessing only one to three components. Moreover, the majority of the studies investigated dose, which reflects researchers’ primary interest in actual programme delivery and participation levels rather than in how a programme was delivered (i.e. quality of delivery). None of the studies measured the unique programme components, operationalised as differentiation by Dusenbury et al. [11]. One explanation could be that differentiation is very complex to measure in school-based obesity prevention programmes, as they usually include many different interacting programme components that contribute to programme outcomes as a whole.

The finding that no study assessed every component of fidelity as defined by Dusenbury et al. [11] confirms previous findings [20]. Although it can be argued that measuring all fidelity components provides a full overview of the degree to which programmes are implemented as planned [12], this may often be unfeasible, as the choice of which fidelity components to include in process evaluations partially depends on the context, resource constraints and measurement burden [10]. As such, fidelity measurements are determined through both a top-down and a bottom-up approach, involving programme developers as well as programme providers. Additionally, the claim that measuring all fidelity components is important can be contested, as it is still unknown which of the fidelity components contributes most to positive programme outcomes. Consequently, too much focus is often placed on measuring as many fidelity components as possible, which may decrease the quality of the fidelity measurements. We may therefore learn most from carefully selecting and measuring the most relevant components with high quality. This may bring us one step further in understanding the relation between fidelity components and positive programme outcomes.

Most fidelity components (77 of 120) were measured with low methodological quality. While measurements of fidelity components were performed with a wide variety of techniques (i.e. observations, questionnaires, structured interviews or logbooks), the majority of the studies lacked data triangulation, and measurements were often not conducted at two or more levels (i.e. teacher, student or school). A possible explanation is that high-quality process evaluations may not always be practical in real-world settings [97]: researchers need to balance high-quality methods against keeping the burden for participants and researchers low. The complexity of school-based obesity prevention programmes may therefore play a role in the quality of the methods used in process evaluations [10]. Another explanation could be that many studies focus more on effect outcomes than on process measures. A multifaceted approach, encompassing both outcome and process measures, is needed, as the implementation of school-based obesity prevention programmes is a complex process [10]. We thus need to stimulate researchers to focus more on process measures, using high-quality methods with data triangulation and measurements conducted at different levels.

Studies rarely discussed the validity or reliability of the measures used to assess fidelity. This may be because process evaluation instruments are scarcely validated and are often adapted to the setting in which a programme is implemented; identification and knowledge of well-validated instruments in the field of implementation science is limited. In response to this lack of knowledge, the Society for Implementation Research Collaboration (SIRC) systematically reviewed quantitative instruments assessing fidelity [98]. Their review is a relevant and valuable resource for identifying high-quality instruments. In addition, the SIRC provides information on the use of these instruments in certain contexts, which is important for implementation research in real-world settings. Hence, to move the field of implementation science forward, future studies need to report in detail which methods were used to measure fidelity and the inherent strengths and limitations of these methods; even more importantly, journals need to support publication of these types of articles.

The importance of relating fidelity to programme outcomes in school-based obesity programmes is increasingly recognised [10, 20]. The majority of the studies included in this review did not investigate the relation between fidelity components and programme outcomes, which is in line with the conclusions of other studies [20]. Investigating this relation is important, as high fidelity simply does not exist in real practice: programmes that were adapted to the setting in which they were implemented (i.e. mutual adaptation) were more effective than programmes that were implemented exactly as intended (i.e. high fidelity) [11]. More flexibility in programme delivery could address the needs of the target group or the context in which the programme is implemented, and may increase the likelihood that the programme will be adopted in real practice and, thereby, result in more positive outcomes [11, 99]. However, the process evaluations included in our review were often part of an RCT (i.e. implementation under controlled conditions) and mainly focused on the effectiveness of a programme as the main outcome. Randomisation is often neither desirable nor feasible in real-world obesity prevention approaches [100]. Instead, more research is needed on the degree of implementation in real-world settings, on the extent to which adaptations to the programme have been made, and on how this affected the implementation and sustainability of changes.

Strengths and limitations

To our knowledge, this is the first systematic review to provide an overview of the methods used to measure fidelity in school-based obesity prevention programmes. Other strengths are the use of the framework of Dusenbury et al. [11] to conceptualise fidelity, the systematic selection of studies, and the data extraction and quality assessment performed independently by two reviewers. However, some limitations of this review should be acknowledged. First, formative or effectiveness evaluation studies may not have reported their process evaluations in the title or abstract, as implementation is rarely a key focus of school-based obesity prevention programmes; as a result, some relevant studies may have been missed. Nevertheless, we included a large number of studies and therefore assume that this review provides a good overview of fidelity in school-based obesity prevention programmes. Another limitation is the possibility that we overlooked or misinterpreted relevant data during data extraction and quality assessment. We tried to minimise this bias by having two researchers conduct the data extraction and quality assessment to obtain greater accuracy and consistency, and by contacting authors for clarification when needed.

Conclusions

There is no consensus on how to measure fidelity in school-based obesity prevention programmes, and the quality of the methods used is weak. Researchers therefore need to agree upon the operationalisation of fidelity concepts, report clearly on the methods employed to measure fidelity, and increase the quality of fidelity measurements. Moreover, it is important to determine the relation between fidelity and programme outcomes, to understand what level of fidelity is needed to ensure that programmes are effective. Finally, more research is needed on the degree of implementation in real-world settings. As such, researchers should not focus only on top-down measurements. In line with mutual adaptation approaches in intervention development and implementation, process evaluations should include a bidirectional process, wherein researchers examine whether and under which conditions adaptations to the programme have been made, while the programme remains effective and sustainable in real-world settings.