Background

Health care providers (HCPs) are critical to increasing coverage of health interventions and producing better health outcomes. However, inadequate performance of HCPs in low-income and middle-income countries (LMICs) is common [1, 2]. Unsafe and ineffective medical care in LMICs has led to a considerable burden in terms of reduced productivity, disability-adjusted life years lost, and death [2,3,4].

For more than 40 years, supervision has been recommended as a strategy to support health programs and improve HCP performance in LMICs, where HCPs often work in isolated settings [5,6,7,8,9,10,11]. Kilminster et al. defined supervision as “the provision of guidance and feedback on matters of personal, professional and educational development in the context of a trainee’s experience of providing safe and appropriate patient care” [12]. Bosch-Capblanch et al. emphasized that supervision helps connect peripheral health units and the district center [13]. Supervision typically includes a range of activities, usually at the supervisee’s workplace, such as problem-solving, reviewing records, and observing clinical practice [13, 14]. Many terms have been used to label supervision, such as routine supervision [13], managerial supervision [13], primary health care supervision [14], enhanced supervision [15], supportive supervision [16, 17], and facilitative supervision [18].

Substantial resources have been used to support supervision. For example, the World Health Organization’s Programme for the Control of Diarrheal Diseases provided supervisory skills training to health staff in more than 120 countries [19]. From 2017 to 2019, 69 of 72 country-specific annual plans of the President’s Malaria Initiative funded supervision [20,21,22]. For the 2018–2020 funding cycle, grant recipients of The Global Fund to Fight AIDS, Tuberculosis, and Malaria (Global Fund) budgeted US$311 million for supervision (personal communication; Rasheed Raji; Global Fund; June 1, 2021).

Although HCP supervision activities are ubiquitous in LMICs, evidence about their effectiveness is unclear. A recent analysis of data from an extensive systematic review of studies from LMICs (the Health Care Provider Performance Review, HCPPR) found that supervision had a moderate effect on HCP practices that varied widely across studies (median improvement: 14.8 percentage-points (%-points); range: –6.1, 56.3; interquartile range (IQR): 6.2, 25.2) [1]. That analysis, however, combined different strategies (e.g., routine supervision, audit with feedback, and peer review) with diverse implementation approaches (e.g., varying frequency and support for supervisors) into a single supervision category. Other reviews of supervision have key limitations, such as providing only non-quantitative, narrative summaries; including few studies, few studies from LMICs, or studies with weak designs; or not accounting for the effect of non-supervisory co-interventions [13,14,15,16, 23, 24].

We performed a secondary analysis of HCPPR data to: (1) characterize the effectiveness of different supervision strategies, and (2) identify attributes associated with greater effectiveness of routine supervision. We present evidence that can help decision-makers select and design more effective supervision strategies, and we reveal important knowledge gaps about supervision effectiveness.

Methods

This report uses the same methods as those used in an HCPPR-based analysis of training strategies [25]. We analyzed data from the HCPPR (PROSPERO registration CRD42016046154). Details of the HCPPR’s inclusion criteria, literature search, data abstraction, risk of bias assessment, effect size estimation, and assessment of publication bias have been published elsewhere [1, 26]. A summary is presented below.

Inclusion criteria

The HCPPR included published and unpublished studies from LMICs in the public and private sectors that quantitatively evaluated a strategy to improve HCP performance. HCPs were broadly defined as hospital-, health facility-, or community-based health workers; pharmacists; and shopkeepers who sell medicines. Eligible study designs included pre- versus post-intervention studies with a randomized or non-randomized comparison group, post-intervention only studies with a randomized comparison group, and interrupted time series (ITS). The HCPPR included studies on any health condition in any language.

For this report, we only included studies that tested strategies with an HCP supervision-related component, although many strategies also had other intervention components. Additionally, we only analyzed HCP practice outcomes (e.g., patient assessment, diagnosis, treatment, counseling, and referral). These outcomes, which are typically the focus of supervision, were the most frequent ones studied in the HCPPR.

Literature search and data abstraction

The HCPPR searched 52 electronic databases for published studies and 58 document inventories and websites for unpublished studies from the 1960s to 2016. The literature search also involved screening personal libraries and bibliographies from previous reviews and asking colleagues for unpublished studies.

To identify eligible reports, titles and abstracts were screened, and when necessary, a report’s full text was reviewed. Data were abstracted independently by two investigators or research assistants using a standardized form. Discrepancies were resolved through discussion. Data elements included HCP type, improvement strategies, outcomes, effect sizes, and risk of bias domains. Study investigators were queried about details not available in study reports.

Strategy definitions

The HCPPR coded the presence of 207 strategy components for each study arm exposed to an improvement strategy and grouped them into categories. For HCP supervision strategies, six specific categories were created (e.g., routine supervision) (Box 1, top part). Eleven other categories were more general (e.g., training, group problem-solving).

Risk of bias assessment

Risk of bias was categorized at the study level as low, moderate, high, or very high, based on guidance from the Cochrane Effective Practice and Organisation of Care Group [27]. Randomized studies, ITS, and non-randomized studies were initially categorized as low, moderate, and high risk of bias, respectively. Risk of bias domains (e.g., dataset completeness, balance in baseline outcome measurements, etc.) were then assessed. A study’s risk-of-bias category was dropped by one level for every applicable domain that was “not done” and for every two applicable domains that were “unclear”.
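To make this downgrading rule concrete, here is a minimal sketch in code; the function name, domain labels, and example ratings are illustrative assumptions, not part of the HCPPR protocol:

```python
# Illustrative sketch of the risk-of-bias downgrading rule described above.
# The function name and domain labels are ours, not from the HCPPR.

LEVELS = ["low", "moderate", "high", "very high"]

def risk_of_bias(initial_level: str, domain_ratings: list[str]) -> str:
    """Drop one level per "not done" domain and one per two "unclear" domains."""
    drops = domain_ratings.count("not done") + domain_ratings.count("unclear") // 2
    index = min(LEVELS.index(initial_level) + drops, len(LEVELS) - 1)
    return LEVELS[index]

# Example: a randomized study (initially "low") with one "not done" domain and
# two "unclear" domains is downgraded two levels, to "high".
print(risk_of_bias("low", ["done", "not done", "unclear", "unclear"]))  # high
```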

Estimating effect sizes

The primary outcome measure was the effect size, which was defined as an absolute %-point change in an HCP practice outcome and calculated such that positive values indicate improvement. For study outcomes that decreased to indicate improvement (e.g., percentage of patients receiving unnecessary treatments), we multiplied effect sizes by –1. For non-ITS studies with percentage outcomes (e.g., percentage of patients treated correctly), effect sizes were calculated using Eq. 1. Effect sizes were based on the baseline value closest in time to the beginning of the strategy and the follow-up value furthest in time from the beginning of the strategy:

$$\text{Effect size} = \left( \text{follow-up} - \text{baseline} \right)_{\text{intervention}} - \left( \text{follow-up} - \text{baseline} \right)_{\text{control}} \tag{1}$$
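As a worked illustration of Eq. 1, using hypothetical values rather than data from any included study:

```python
def effect_size_percentage(baseline_int, followup_int, baseline_ctl, followup_ctl):
    """Eq. 1: difference-in-differences of percentage outcomes, in %-points."""
    return (followup_int - baseline_int) - (followup_ctl - baseline_ctl)

# Hypothetical example: the intervention group improves from 40% to 55% while
# the control group improves from 42% to 46%.
print(effect_size_percentage(40, 55, 42, 46))  # 11 %-points
```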

For non-ITS studies with unbounded continuous outcomes (e.g., average consultation time per patient in minutes), effect sizes were calculated using Eq. 2. If the baseline value for either the intervention or control group equaled zero, the effect size was undefined and thus excluded:

$$\text{Effect size} = 100\% \times \left[ \left( \frac{\text{follow-up} - \text{baseline}}{\text{baseline}} \right)_{\text{intervention}} - \left( \frac{\text{follow-up} - \text{baseline}}{\text{baseline}} \right)_{\text{control}} \right] \tag{2}$$
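A corresponding illustration of Eq. 2, again with hypothetical values; note the exclusion when a baseline value is zero:

```python
def effect_size_continuous(baseline_int, followup_int, baseline_ctl, followup_ctl):
    """Eq. 2: difference in relative changes, expressed in %-points.

    Returns None when either baseline is zero (effect size undefined, so the
    outcome is excluded).
    """
    if baseline_int == 0 or baseline_ctl == 0:
        return None
    change_int = (followup_int - baseline_int) / baseline_int
    change_ctl = (followup_ctl - baseline_ctl) / baseline_ctl
    return 100 * (change_int - change_ctl)

# Hypothetical example: mean consultation time rises from 5 to 8 minutes in the
# intervention group and from 5 to 6 minutes in the control group.
print(round(effect_size_continuous(5, 8, 5, 6), 1))  # 40.0 %-points
```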

For ITS studies, segmented linear regression modeling was performed to estimate a summary effect size that incorporated both level and trend effects [28]. The summary effect size was the outcome level at the mid-point of the follow-up period as predicted by the regression model minus a counterfactual value that equaled the outcome level based on the pre-intervention trend extended to the mid-point of the follow-up period.
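A minimal sketch of this ITS calculation, assuming a standard four-term segmented linear model fit by ordinary least squares (the HCPPR's exact specification, e.g., any handling of autocorrelation, may differ):

```python
import numpy as np

def its_summary_effect(time, outcome, intervention_start, followup_mid):
    """Modeled outcome at the follow-up midpoint minus the counterfactual
    projected from the pre-intervention trend (simple OLS sketch)."""
    post = (time >= intervention_start).astype(float)
    time_post = post * (time - intervention_start)
    # Segmented model: outcome = b0 + b1*time + b2*post + b3*time_post
    X = np.column_stack([np.ones_like(time), time, post, time_post])
    b0, b1, b2, b3 = np.linalg.lstsq(X, outcome, rcond=None)[0]
    predicted = b0 + b1 * followup_mid + b2 + b3 * (followup_mid - intervention_start)
    counterfactual = b0 + b1 * followup_mid  # pre-intervention trend, extended
    return predicted - counterfactual  # combines level (b2) and trend (b3) effects

# Hypothetical monthly data: intervention at month 12 (level jump of 8 %-points
# plus a trend change of 0.5 %-points/month), follow-up midpoint at month 18.
t = np.arange(24, dtype=float)
y = 30 + 0.2 * t + np.where(t >= 12, 8 + 0.5 * (t - 12), 0)
print(round(its_summary_effect(t, y, 12, 18), 1))  # 11.0 %-points
```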

Analysis

For objective 1 (characterize effectiveness of supervision strategies), we analyzed five types of study comparisons (Box 2). To estimate strategy effectiveness, the effect size for each study comparison was defined as the median of all effect sizes (MES) within the comparison. For example, if a study had three outcomes (e.g., percentages of patients correctly assessed, diagnosed, and treated) and one effect size per outcome, the MES was the median of the three effect sizes. For each supervision strategy, the MES distribution was described with a median, IQR, minimum, and maximum. Results were stratified by outcome scale (percentage versus continuous), HCP cadre (professional [generally health facility-based health workers] versus lay [generally community health workers]), whether the supervision was combined with other intervention components, and study type (equivalency versus non-equivalency).
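For illustration, a minimal sketch of the MES calculation and its summary, with hypothetical effect sizes and column names of our own choosing:

```python
import pandas as pd

# Hypothetical effect sizes (in %-points) from three study comparisons.
effects = pd.DataFrame({
    "comparison": ["A", "A", "A", "B", "B", "C"],
    "effect_size": [12.0, 8.5, 20.0, -3.0, 4.0, 30.0],
})

# Median effect size (MES) for each study comparison ...
mes = effects.groupby("comparison")["effect_size"].median()  # A: 12.0, B: 0.5, C: 30.0

# ... then describe the MES distribution for the strategy.
print("median:", mes.median())                      # 12.0
print("IQR:", mes.quantile([0.25, 0.75]).tolist())  # [6.25, 21.0]
print("min/max:", mes.min(), mes.max())             # 0.5 30.0
```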

Results

Literature search

The HCPPR screened 216 483 citations and included 2272 reports (Additional file 1: Figure S1). Of those, 165 reports were eligible for this analysis. These reports presented 338 effect sizes from 90 comparisons in 81 studies (see Additional file 1: Tables S1–S4 for sample size details and Additional file 2 for study-level details and citations). These studies were conducted in 36 LMICs and represented a diversity of methods, geographical settings, HCP types, work environments, health conditions, and practices (Additional file 1: Tables S7–S10). Only one of the 81 studies involved an equivalency comparison that included a gold standard control group (Box 2, footnote d; Additional file 1: Table S1, Note). Nearly two-thirds of studies (63.0%) had randomized designs, and 42.0% had a low or moderate risk of bias. The median follow-up time per study was 6.0 months (from 74 studies that reported follow-up time; IQR: 3.0–11.5), median number of health facilities per study was 23 (from 64 studies reporting health facility sample size; IQR: 11–75), and median number of HCPs per study was 92 (from 45 studies reporting HCP sample size; IQR: 43–168). Most studies (81.5%) were published since 2000. We found no evidence of publication bias (Additional file 1: Figure S2).

Effectiveness of supervision strategies (objective 1)

Table 1 presents the effects of supervision strategies on the practices of professional HCPs. As most results are based on one or two study comparisons and thus have limited generalizability, our discussion focuses on strategies tested by at least three study comparisons (i.e., at least three comparisons with percentage outcomes or at least three comparisons with continuous outcomes). The following findings are supported by low-quality evidence primarily because many studies had a high risk of bias.

Table 1 Effectiveness of supervision strategies on the practices of professional health care providers

For routine supervision alone, for percentage outcomes and when compared to controls, the median improvement in HCP practices was 10.7%-points (Table 1, row 1; Fig. 1). For example, for a percentage outcome with a typical baseline performance level of 40% and supervision effect of 10.7%-points, the post-supervision performance level would be 50.7%. Furthermore, supervision effects were very heterogeneous. One-quarter of MES values were relatively small (≤ 6.9%-points) and one-quarter were relatively large (27.9–67.8%-points). The marginal effect of routine supervision when added to other non-supervision strategy components was 4.1%-points (Table 1, row 3).

Fig. 1

Effectiveness of supervision strategies for professional health care providers in low- and middle-income countries, as assessed with outcomes expressed as percentages. N = number of study comparisons. Red indicates results from a single study, which should be interpreted with caution. The numbers next to each spoke are the median of median effect sizes, in percentage-points, and (in parentheses) the number of study comparisons. For each comparison, the arrow points toward the study group with greater effectiveness. For example, routine supervision was more effective than controls by a median of 10.7 percentage-points. ᵃThese are non-supervision strategy components (e.g., training) that could vary among study comparisons, but are the same for any two arms of a given study comparison (e.g., routine supervision plus training versus training)

Audit with feedback alone, compared to controls, typically had an effect similar in magnitude to that of routine supervision. For percentage outcomes, the median effect of audit with in-person feedback alone was 15.0%-points (Table 1, row 6). For six study comparisons involving audit with in-person feedback, either alone or combined with written feedback (Table 1, rows 6 and 12), the median effect was 10.1%-points (IQR: 6.2, 23.7; Table 1, footnote e).

We found only four eligible studies (one with a moderate risk of bias and three with a high risk of bias) of supervision strategies to improve lay HCP practices (Additional file 1: Table S11). Most findings were supported by low-quality evidence. The effect of routine supervision was difficult to characterize, as results varied widely across the few studies that tested this strategy.

Attributes associated with effectiveness of routine supervision (objective 2)

All results on supervision attributes are supported by low-quality evidence because many studies had a high risk of bias. See Additional file 1: Tables S5–S6 for sample sizes and risk-of-bias categories for all three modeling databases. Modeling of the “supervision alone” database included 9 comparisons from 9 studies (the same 9 comparisons as in Table 1, row 1); it found no supervision attribute with a univariable p-value < 0.10, so results from this database are not discussed further. Adjusted R² values of the models for the other two databases (supervision/training and supervision/other) ranged from 0.11 to 0.27, indicating that they explain only a small amount of the variation in effect sizes.

Modeling of the supervision/other database showed that the mean effect of supervision in which supervisors received supervision was 8.8 to 11.5%-points higher than when supervisors had not received supervision (p-values: 0.051 to 0.097). The effect of supervisors participating in problem-solving with HCPs was large (14.2 to 20.8%-points, p-values: 0.032 to 0.098).

The effects of supervision frequency (i.e., number of visits per year) and dose (i.e., the number of supervision visits during a study) were unclear. One head-to-head study of lay HCPs with a low risk of bias found that monthly supervision was somewhat more effective than supervision every two months, by 7.5%-points [31]. However, modeling results from studies of routine supervision among professional HCPs compared to no-intervention controls did not show a relationship between supervision dose and improvement in HCP practices (from univariable models from the supervision only, supervision/training, and supervision/other databases: effects of –0.4 to –1.5%-points per additional supervision visit, p-values: 0.12 to 0.50).

Training for supervisors, supervisors’ use of a standard checklist, and explicit inclusion of feedback during supervision visits were not associated with the effectiveness of routine supervision. Univariable modeling results for the effect of training for supervisors from the supervision only, supervision/training, and supervision/other databases were 14.3%-points (p = 0.17), 6.4 to 8.0%-points (p-values: 0.19 to 0.41), and –0.03 to 0.4%-points (p-values: 0.94 to 0.99), respectively. Univariable modeling results for the effect of a standard checklist from the three databases ranged from 4.1 to 5.8%-points (p-values: 0.27 to 0.54). Univariable modeling results for the effect of explicit inclusion of feedback from the supervision only and supervision/training databases ranged from 5.9 to 9.4%-points (p-values: 0.22 to 0.32), while multivariable models from the supervision/other database showed effects of 4.1 to 5.6%-points (p-values: 0.14 to 0.27; Additional file 1: Table S13).

Modeling of the supervision/training database showed that the mean effect of supervision increased by 0.68 to 0.91%-points per month since the beginning of supervision (p-values: 0.012 to 0.081). However, modeling of the supervision/other database found that the effect of time was smaller (0.18%-points per month, p = 0.52). Increasing baseline HCP performance was consistently associated with decreasing supervision effectiveness, by 0.094 to 0.23%-points per 1%-point increase in baseline performance.

Cost of routine supervision

Among 67 study arms from 62 studies of professional HCPs exposed to routine supervision, data on cost or from an economic evaluation of any type were available for only 25 arms (37.3%). Only 6 arms from 5 studies had data that allowed us to calculate cost per HCP per supervision visit. These 5 studies were from Africa, Asia, and Latin America. The median number of supervisory visits per arm was 2.5 visits (range: 1, 4). The median cost per HCP per supervisory visit was $46 (IQR: 25, 72; range: 10, 343). Cost was not related to study year, which ranged from 2001 to 2012.

Among 6 study arms from 5 studies of lay HCPs exposed to routine supervision, data on cost or from an economic evaluation of any type were available for 2 arms. Only 1 study arm had data that allowed us to calculate cost per HCP per supervision visit: $77 (from a 1992 study in Paraguay, a middle-income country).

Discussion

In LMICs, health programs’ use of supervision to improve HCP performance is widespread and well-resourced, especially in recent years. We analyzed HCPPR data to compare the effectiveness of different supervision strategies and identify attributes associated with routine supervision effectiveness. Strengths of this study are that the data came from an extensive systematic review of evidence from LMICs, and we used multiple analytical approaches to gain a more comprehensive understanding of supervision effectiveness.

For professional HCPs, routine supervision was associated with moderate improvements in HCP practices when used as a sole intervention (median: 10.7%-points) and small marginal improvements when combined with other intervention components (median: 4.1%-points). Audit with feedback had similar effects. These findings were generally consistent with those from other reviews. The effect of supervision from Holloway et al., with all studies from LMICs, was 7.1%-points (personal communication from Kathleen Holloway, June 5, 2020) [32]. A review of audit with feedback (with 4 of 140 studies from LMICs) found a median effect of 4.3%-points for dichotomous outcomes [24]. It is likely no coincidence that the effects for supervision and audit with feedback are similar: although the labels for these strategies sound distinct, the intervention activities largely overlap.

The effect of benchmarking alone is unclear, as all studies with this strategy included other supervision-related intervention components. The effects of peer review and non-supervisory support for HCPs also are uncertain, as these strategies were tested in only 1 and 2 studies, respectively.

The effect of routine supervision for lay HCPs was difficult to characterize because few studies existed, and effectiveness in those studies varied considerably. A review by Gangwani et al. concluded that supervision “may enhance the quality of community health workers’ work” [17]; however, that review included some studies that were ineligible for our analysis because of weak study designs. Two trials from the Gangwani review would have been eligible for our analysis but were published after the HCPPR literature search ended; adding their results would not have changed our conclusions (see Note on Additional file 1: Table S11).

We found two attributes associated with higher effects of routine supervision for professional HCPs: supervisors received supervision, and supervisors participated in problem-solving with HCPs. Providing supervision is difficult, with supervisors facing many challenges, such as inadequate management skills, non-supervision duties that leave insufficient time for supervision, and loss of effective supervisors due to staff turnover [13,14,15, 33]. These challenges remind us that supervisors are health workers too [33], and they need regular supportive guidance and feedback to help overcome barriers to effective implementation of supervision.

Involving HCPs in problem-solving, as in the “improvement collaborative” approach, has been associated with large improvements in HCP performance [1, 34,35,36], and joint problem-solving between a supervisor and supervisee is considered a helpful behavior [12, 14]. A review by Bailey et al., however, noted that problem-solving during supervision “did not necessarily translate into consistent improvements in clinical practice, unless the supervisor was considered as friendly and supportive” [16].

We found inconclusive results on the effects of supervision frequency and dose. Our analysis, however, was limited by missing data on supervision frequency; potential reverse causality or confounding, if supervisors made more visits to health facilities where improvements were more difficult to achieve; and potential dilution of effect, if HCPs exposed to supervision were not the same HCPs surveyed [15]. Nevertheless, our results seem to reflect the current state of the literature. Two studies that performed within-study analyses found that increasing supervision dose was associated with better performance [37, 38], and a review of audit with feedback concluded that feedback might be more effective if it is provided more than once [24]. However, another review found that more intensive supervision (e.g., with more frequent visits) is not necessarily more beneficial [13].

Our results did not corroborate one review’s recommendation that training for supervisors would increase effectiveness [15]. Univariable modeling from several databases consistently found weak statistical evidence for the effect of training for supervisors.

Regarding the effect of supervision over time, we found improvements of 0.18 to 0.91%-points per month. Another analysis of HCPPR data by Arsenault et al. that examined the effect of time in a more nuanced fashion (using multiple follow-up time points per study) found inconsistent time trends for supervision: some analyses found positive time trends (mean improvements of 0.82 to 0.88%-points per month), while a key sensitivity analysis showed no improvement over time [34].

Our study’s finding about the association between baseline HCP performance and the effectiveness of routine supervision agreed with a review of audit with feedback, which concluded that feedback might be more effective when baseline performance is low [24].

The overall strength of evidence on supervision strategies to improve HCP practices is weak, and substantial knowledge gaps remain. Our understanding of supervision would benefit from additional studies using more rigorous designs and standardized methods to replicate key results (an essential part of the scientific method), investigate promising new supervision strategies, identify the optimal frequency of supervision, and expand the evidence base for lay HCPs (Box 4). Future studies should report details on supervision frequency, cost, context, and—of particular importance—the specific activities of the supervision process. Such process details could be used to classify and compare strategies more precisely in future reviews and thus facilitate decision-making by programs. Non-standardized strategy labeling is a challenge in quality improvement research in general, and researchers and implementers would be wise to move beyond the vague descriptors too often used for strategies such as supervision.

Conclusions

Although the evidence is limited, our study has characterized the effectiveness of several supervision strategies in LMICs and supports supervising supervisors and having supervisors engage in problem-solving with HCPs for more effective supervision. We also developed evidence-based recommendations for strengthening future research on supervision strategies. Supervision’s integral role in health systems in LMICs justifies a more deliberate research agenda to identify how to deliver supervision to optimize its effect on HCP practices, health programs, and health outcomes.