The search returned 23,372 papers. Of these, 7,815 duplicates were removed, 14,741 were excluded based on title and abstract, and a further 765 were excluded based on full text (see Fig. 1). The remaining 51 papers met inclusion criteria, and one additional paper was identified as a reference within one of the included studies. The final review, therefore, included 52 papers covering 35 separate studies. The complete list of included papers and their characteristics is presented in Table 1, with further detail available in Online Resource 2.
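The screening counts reported above can be checked with simple arithmetic. The following is a minimal sketch; the variable names are illustrative and are not taken from the review.

```python
# Screening flow as reported in the text (variable names are illustrative).
records_identified = 23_372
duplicates_removed = 7_815
excluded_title_abstract = 14_741
excluded_full_text = 765
added_from_references = 1

# Papers remaining after the three exclusion stages.
met_criteria = (records_identified - duplicates_removed
                - excluded_title_abstract - excluded_full_text)

# Final paper count after adding the paper found through reference lists.
papers_included = met_criteria + added_from_references

print(met_criteria, papers_included)  # prints: 51 52
```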
Characteristics of identified studies
The client groups in the included studies were children and adolescents with behaviour problems including offending (N = 18, 51%), substance abuse (N = 9, 26%), emotional disorders (N = 7, 20%) or autism spectrum disorder (ASD) (N = 1, 3%). Interventions were primarily family therapy (N = 16, 46%) or cognitive behavioural therapy (CBT) (N = 7, 20%), followed by parenting interventions (N = 4, 11%), youth non-CBT interventions (N = 3, 9%) and non-CBT interventions with youth and parent components (N = 6, 17%). Detail on the specific interventions within each category is available in Online Resource 3.
Measures of therapist adherence, competence and composite fidelity differed in content and complexity. Some studies measured the frequency with which certain core strategies were implemented, whilst others also evaluated the thoroughness of their use. The number and timing of the assessments on which overall scores were based also varied: some studies assessed implementation at every session, some at regular intervals or at specific pre-planned sessions, whilst others took ratings from one or more randomly selected sessions over the course of therapy [46, 47].
The relationship between adherence and outcome was measured in 29 studies across 43 papers [7, 8, 10–12, 41–78]. Adherence was rated most frequently by observers (N = 13, 45%), followed by therapists (N = 4, 14%), clients (N = 4, 14%), supervisors (N = 4, 14%) and a composite of client and therapist ratings (N = 4, 14%). A significant relationship between adherence and at least one youth outcome was reported in 24 studies (83%), while five (17%) reported no relationship.
The relationship between competence and outcome was measured in nine studies across ten papers [11, 12, 46, 51, 53, 59, 64, 74, 79, 80]. Competence was rated mostly by observers (N = 7, 78%), and in two instances by supervisors. A significant relationship between competence and at least one youth outcome was reported in five studies (56%), while four (44%) found no relationship.
The relationship between composite fidelity and outcome was measured in five studies across seven papers [81–87]. Composite fidelity was rated mostly by observers (N = 4, 80%), and in one instance by supervisors. A significant relationship between composite fidelity and at least one youth outcome was reported in two studies (40%), while three (60%) found no relationship.
A small number of studies considered potential moderators of the strength or direction of the relationship. Sexton and Turner found an interaction for youth risk, such that adherence more strongly predicted outcome in the presence of high peer risk. Schoenwald et al. found an interaction between therapist adherence and organisational structure and climate, such that therapist job satisfaction predicted improvements in behaviour only when therapist adherence was low. Two studies measuring alliance found that it did not interact with the relationship between therapist adherence or fidelity and outcome [47, 82], whilst one study found that the relationship between therapist competence and outcome became non-significant when controlling for alliance. In the two studies that investigated it, no interaction between adherence and competence was found in the prediction of outcomes [11, 12].
Complete details of study quality ratings are available in Online Resource 4. A low rate of uptake of the intervention or research study by eligible cases was reported in 43% of studies (N = 15), and risk of bias due to dropout or missing data was reported in 40% of studies (N = 14). In addition, 46% of studies (N = 16) failed to report clear and valid selection criteria. Finally, low power or small sample size was identified in 34% of studies (N = 12).
Risk of bias in the reliability and validity of the adherence or competence measures was identified in 66% of studies (N = 23). For most of these studies (N = 17), the risk of bias related to using a non-independent informant (i.e. therapist, client or supervisor) rather than an observer. Where observer ratings were used and inter-rater reliability was reported, the majority of studies met the inter-rater reliability thresholds, indicating that the reliability of observer ratings was rarely a concern. Inter-rater reliability scores for each study are presented in Online Resource 2. In contrast, only 11% of studies (N = 4) reported outcome measures at risk of bias. Those that did used subjective ratings, such as therapist-reported criminal behaviour or adapted forms of validated measures, rather than validated questionnaires.
Baseline symptom severity was controlled for in 83% of studies (N = 29), such that any relationship between implementation and outcome was independent of the influence baseline severity may have on both factors. However, 43% of studies (N = 15) failed to report consideration of any other confounding variables, and amongst those that did consider other confounders there was considerable variation in the variables controlled. These included demographic or treatment variables such as age or gender (N = 8), dosage, time in treatment or assessment interval (N = 5), parent marital status (N = 4), income (N = 3) and therapeutic alliance (N = 3).
Two adherence studies and one competence study with insufficient data to compute effect sizes were excluded from the analysis [71, 72, 79]. Two papers reported outcomes for two independent samples [8, 12].
The 29 adherence-outcome effect sizes ranged from −0.070 to 0.444 (see Fig. 2), and a small but statistically significant relationship between therapist adherence and outcome was identified, r = 0.096 (95% CI = 0.058, 0.134), z = 4.938, p < 0.001 (see Table 2). Variance in effect sizes was significantly greater than would be expected by sampling error alone [Q (28) = 62.352, p < 0.001] and was, therefore, likely affected by differences between studies, although no significant moderation effect was identified (see Table 3). However, consideration of individual effects for each moderator group suggested a small number of circumstances under which adherence was not significantly associated with outcome. These were youth non-CBT intervention (r = 0.006, 95% CI = −0.145, 0.158, z = 0.082, p = 0.935), and where client informants were used to rate adherence (client informant: r = 0.040, 95% CI = −0.034, 0.113, z = 1.049, p = 0.294; client and therapist composite: r = 0.119, 95% CI = −0.048, 0.280, z = 1.400, p = 0.162). All other moderator categories tested were significantly associated with outcome (clinical group categories: r = 0.071–0.127; intervention type categories: r = 0.089–0.169; informant categories: r = 0.105–0.148).
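As a rough consistency check on the pooled adherence effect, the z statistic can be approximated from the reported correlation and its confidence interval. The sketch below assumes the interval was computed on Fisher's z scale and back-transformed, which is a common convention but is not stated in the text; the function name `fisher_z_test` is hypothetical.

```python
import math

def fisher_z_test(r, ci_low, ci_high, crit=1.959964):
    """Approximate the z statistic and two-sided p-value for a pooled r.

    Assumption (not confirmed by the source): the 95% CI was built on
    Fisher's z scale (atanh) and then back-transformed to r.
    """
    z_r = math.atanh(r)
    # Standard error recovered from the CI width on the Fisher z scale.
    se = (math.atanh(ci_high) - math.atanh(ci_low)) / (2 * crit)
    z_stat = z_r / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return z_stat, p_value

# Pooled adherence-outcome effect as reported in the text.
z_stat, p_value = fisher_z_test(0.096, 0.058, 0.134)
print(f"z = {z_stat:.2f}, p < 0.001: {p_value < 0.001}")
```

Under this assumption the recovered z is close to, though not identical with, the reported z = 4.938; small differences are expected because the published values are rounded to three decimals.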
The nine competence-outcome effect sizes ranged from 0.000 to 0.173 (see Fig. 3); however, competence did not have a statistically significant association with outcome, r = 0.026 (95% CI = −0.020, 0.073), z = 1.119, p = 0.263 (see Table 2). There was no significant variance in effect sizes [Q (7) = 2.595, p = 0.957], indicating the studies likely represent a common population mean. Although there were insufficient levels to test informant as a moderator, there was no significant moderation effect for clinical group or intervention modality (see Table 3).
The five fidelity-outcome effect sizes ranged from −0.273 to 0.213 (see Fig. 4); however, composite fidelity did not have a significant association with outcome, r = 0.06 (95% CI = −0.070, 0.191), z = 0.915, p = 0.360 (see Table 2). There was no significant variance in effect sizes [Q (4) = 7.700, p = 0.103], indicating the studies likely represent a common population mean, although there were insufficient levels to conduct any moderation analysis for this effect.
Using a subsample of 22 adherence effect sizes with the lowest risk of methodological bias, the overall size of effect remained very similar to that seen in the main analysis when all effects were included [r = 0.097 (95% CI = 0.052, 0.141), z = 4.253, p < 0.001]. Tests for heterogeneity also indicated that variance in effect sizes remained significantly greater than would be expected by sampling error alone [Q (21) = 44.579, p < 0.01], although the extent of observed variation was reduced.
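One way to make the change in heterogeneity concrete is to convert each Q statistic to Higgins' I², the percentage of total variation attributable to between-study differences. I² is not reported in the text, so the values produced by this sketch are illustrative derivations from the reported Q and degrees of freedom, not results from the review; the function name `i_squared` is hypothetical.

```python
def i_squared(q, df):
    """Higgins' I^2 (%), computed as max(0, (Q - df) / Q) * 100."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

# Q statistics and degrees of freedom as reported in the text.
main_analysis = i_squared(62.352, 28)   # all 29 adherence effect sizes
low_bias = i_squared(44.579, 21)        # 22 lowest-risk effect sizes

print(f"main: {main_analysis:.1f}%, low-bias subsample: {low_bias:.1f}%")
# prints: main: 55.1%, low-bias subsample: 52.9%
```

On this measure the subsample shows slightly less heterogeneity than the main analysis, consistent with the statement that observed variation was reduced but not eliminated.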
The Begg and Mazumdar random effects rank correlation test (see Table 4) indicated no risk of publication bias, although the test may have lacked power to detect an effect given the relatively small sample size. Sensitivity analysis indicated that correction for moderate publication bias would reduce the strength of all effects, and that correction for severe publication bias could reverse their direction (see Table 4).