Background

Globally, there is recognition that quality care is a human right and that improving the quality of services is critical to achieving universal health coverage and helping countries meet the 2030 Sustainable Development Goals (SDGs) [1,2,3]. Tremendous progress in family planning has been made in the last 50 years, but the unmet need for contraceptives remains unacceptably high, especially in low- and middle-income countries (LMICs) [4]. A time trend analysis of data from over 70 LMICs found annual increases in the proportion of women with demand satisfied and reductions in inequalities by wealth and geographic area [5]. However, only 44 of the 62 LMICs included in the analysis were projected to meet the SDG target of greater than 75% of demand for family planning services satisfied by modern methods [5]. This coverage gap varies by region: West and Central Africa have the lowest percentage of demand satisfied (33%), while South Asia, Latin America, and the Caribbean have the highest (70%) [6]. Improving the quality of service provision may help close this gap.

The World Health Organization (WHO) and the Institute of Medicine (IOM) have outlined domains of health service quality that include safety, effectiveness, timeliness, efficiency, equity, accessibility, and patient-centeredness [7, 8]. Judith Bruce’s seminal 1990 family planning quality of care framework includes elements of service such as method choice, service appropriateness, and continuity of care. This and other frameworks define technical competence (i.e., provider knowledge and skills) and interpersonal relationships (i.e., client-provider interactions and client experience) as interrelated elements. Later frameworks redefined quality of family planning care through a more client-centered, rights-based lens, such as the 2014 World Health Organization recommendations for scaling up rights-based contraceptive programs, the 2016 International Planned Parenthood Federation framework, and the 2018 Jain commentary [9,10,11].

The classic Donabedian quality framework defines what is needed to measure quality: structures, processes, and outcomes [12]. Structural quality is the setting of care; processes include the standards and elements of care delivery; and outcomes are the client-level health, behavior, knowledge, and satisfaction effects of the processes of care [12]. In general, we aim to measure process quality of care. Structural quality is a prerequisite for, but does not guarantee, quality services, and outcomes are difficult to measure and to attribute to process or structural quality, likely due to heterogeneous measures, tools, and definitions [7]. For instance, Weidert et al. found positive associations of counseling on contraceptive methods and provider supervision with long-acting contraceptive use in Togo, and Chang et al. found significant but inconsistent associations of facility-level quality measures with method continuation across two sites in Pakistan and Uganda [13, 15].

To improve quality services, we need well-defined quality measures with a clear linkage to population-level impacts, and tools that reliably and accurately measure them. Many tools have been developed and implemented to measure process quality of care, but no agreed-upon standardized tool or method has emerged [16, 17]. Process quality of care can be measured by assessor observations of client-provider interactions in clinical settings (also known as “direct observation”), interviews with clients after their family planning visits (“exit interviews”), provider interviews/quizzes or clinical vignettes on knowledge of quality care practices (“provider knowledge assessments”), medical record review, or simulated client assessments in which either trained staff or women recruited from the community act as “mystery” clients. These tools measure different elements (provider knowledge, provider practices, and client perspectives of quality care), but they all aim to capture the same construct of process quality.

Program implementers in LMICs could measure each of these elements on a routine basis to ensure quality services, but a comprehensive quality of care assessment may not be feasible in some low-resource settings. Direct observation and simulated client methods are considered “gold standard” tools because they capture provider practices directly, either openly by an assessor or covertly through a simulated client, but neither method is feasible to implement regularly. Direct observation is limited by the expense of field-based data collection and the time it takes to administer, and simulated client assessments by the time and skill required to recruit and train the simulators. Other tools, such as client exit interviews or provider knowledge assessments, may provide a proxy, but there is little information on how these measures compare to the “gold standard” (validity). Information on validity and reliability (i.e., how well the tools measure the same element of care) will help implementers select the most appropriate tool for their program, the aims of the quality assessment, and the context.

We aim to synthesize the literature on the validity and reliability of family planning quality of care measurement tools through a systematic review. We define validity as how well a tool measures provider actions during a family planning consultation compared to a “gold standard” assessment, usually an observed or simulated client-provider interaction (otherwise known as concurrent validity). Validity tells us how well these tools proxy actual processes of care. Reliability describes how consistently different tools measure the same quality indicator, which tells us how to interpret the data they produce. The findings of this review can help refine and delineate best practices for measuring the quality of family planning care.
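To make these definitions concrete, the comparison measures that recur throughout this review can be written against a 2 × 2 cross-tabulation of a comparison tool and a gold standard for a single binary quality indicator; the notation below is a general sketch rather than a formula reported by any individual study.

```latex
% T = comparison tool, G = gold standard, for a binary indicator
% (1 = action performed/reported, 0 = not performed).
% Concurrent validity of T against G:
\[
\text{Sensitivity} = P(T = 1 \mid G = 1) = \frac{TP}{TP + FN}, \qquad
\text{Specificity} = P(T = 0 \mid G = 0) = \frac{TN}{TN + FP}
\]
% Reliability (chance-corrected agreement between two tools), where p_o is
% the observed agreement and p_e is the agreement expected by chance:
\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]
```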

Methods

This study is a quantitative systematic review with no meta-analysis, owing to the heterogeneous methods, analyses, and outcomes of the studies identified during the review.

Search strategy and selection criteria

We developed three search concepts: family planning, quality of care, and comparability of tool metrics, with a filter for LMICs. We searched PubMed, Embase, Popline, Global Index Medicus, and Scopus. The search terms were pretested to confirm that they identified two known relevant articles. Additional file 1: Appendix S1 lists all search terms used for the PubMed database. For the initial search, we extracted all relevant citations through 17 October 2017, with follow-up searches through 31 March 2019 and 16 March 2021. The Popline database was retired in September 2019; therefore, only papers identified up to 31 March 2019 (the second search) were included from that source. We also searched ProQuest’s Dissertations and Theses database for additional relevant studies, and we identified authors currently working in family planning quality of care measurement and contacted them for any unpublished findings.

We included studies that quantitatively measured quality of care for family planning in LMICs (Box 1). The studies must have compared at least two tools (e.g., direct observation and client exit interviews) and reported a quantitative measurement of the comparison. We did not require that authors report uncertainty measures such as standard errors or 95% confidence intervals. Google Translate was used to screen French- and Spanish-language titles and abstracts, and a staff member with language fluency was recruited for full-text review and data extraction of manuscripts not written in English. We restricted the studies to quality of care measurement tools for health providers (excluding school- or peer-based assessments) and included all papers published through 16 March 2021.

Results

The search yielded 8172 articles, of which 908 were duplicates (Fig. 1). The full-text review identified 37 articles for exclusion: 21 did not compare the measurements generated from two or more tools, seven did not include a quantitative comparison, seven did not take place in an LMIC setting or did not measure family planning quality of care, and two could not be located in full text (one the minutes of a 1995 meeting at the University of North Carolina, the other a 1998 report from the International Planned Parenthood Federation). We included eight studies in our final analysis [25,26,27,28,29,30,31].

Fig. 1

PRISMA flow chart. Quality of care (QoC); family planning (FP); low- and middle-income country (LMIC)

The studies were a mix of single-country and multi-country studies with data from Sub-Saharan Africa, South Asia, and Central and South America (Table 1). They were published from 1998 to 2020; six were published in peer-reviewed journals and two were not peer reviewed. Three were secondary analyses of Service Provision Assessment (SPA) data (Choi 2018), Quick Investigation of Quality (QIQ) data (Bessinger 2001), or Situation Analysis data (Ndhlovu 1998), and five were primary data collection studies. Direct observation (n = 5) and simulated client (n = 3) protocols served as the “gold standard”, that is, the tool against which the other comparisons were evaluated. Comparator tools included client exit interviews, provider interviews, and medical record review. One study (Tumlinson 2014) compared direct observation to simulated clients as the “gold standard”, since providers do not know they are being assessed during a simulated client encounter.

Table 1 Description of the eight studies included in the summative analysis

A variety of test statistics were used to compare the tools, and sample sizes varied widely (range: 49–6429 clients); the secondary analyses pooling multi-country data tended to have larger samples than the primary data collection studies. Only three studies mentioned the provider type, and four mentioned the type of contraceptive covered in their quality assessment tools. All studies evaluated quality related to counseling, information given to clients, interpersonal relationships, or respectful care. We rated one study as low risk of bias, six as medium risk, and one as high risk of bias for the reported findings (Additional file 1: Appendix S2).

Four of the eight studies reported validity measures against a gold standard: Choi (2018) and Hermida (1999) used direct observation as the gold standard, while Tumlinson (2014) and Ratanajamit (2001) used simulated clients (Fig. 2). The four remaining studies reported kappa statistics (Bessinger 2001 and Ndhlovu 1998) or percentage point differences (Tavrow 1997 and Thongmixay 2020).
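To illustrate how these three families of comparison statistics relate to the same paired indicator data, the sketch below (our own illustration with hypothetical data, not analysis code from any of the included studies) computes sensitivity, specificity, Cohen’s kappa, and the percentage point difference for a single binary quality indicator recorded by a gold standard tool and a comparison tool.

```python
# Minimal sketch: one binary quality indicator (1 = action observed/reported,
# 0 = not) recorded by a gold standard tool and a comparison tool for the
# same visits. Hypothetical data; not drawn from the reviewed studies.

def compare_tools(gold, comparison):
    pairs = list(zip(gold, comparison))
    tp = sum(1 for g, c in pairs if g == 1 and c == 1)
    tn = sum(1 for g, c in pairs if g == 0 and c == 0)
    fp = sum(1 for g, c in pairs if g == 0 and c == 1)
    fn = sum(1 for g, c in pairs if g == 1 and c == 0)
    n = len(pairs)

    sensitivity = tp / (tp + fn)  # validity: tool detects care that was given
    specificity = tn / (tn + fp)  # validity: tool detects care that was NOT given

    # Cohen's kappa: observed agreement corrected for chance agreement
    p_o = (tp + tn) / n
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (p_o - p_e) / (1 - p_e)

    # Percentage point difference in the proportion recording the action
    pp_diff = abs(sum(comparison) - sum(gold)) / n * 100

    return {"sensitivity": sensitivity, "specificity": specificity,
            "kappa": kappa, "pp_difference": pp_diff}

# Example: direct observation (gold standard) vs. client exit interview
observation    = [1, 1, 0, 0, 1, 0, 1, 0, 1, 1]
exit_interview = [1, 1, 1, 0, 1, 1, 1, 0, 0, 1]
print(compare_tools(observation, exit_interview))
```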

Fig. 2

Quantitative measures comparing client exit interviews and medical record review with direct observation

Most of the studies (n = 5) compared client exit interviews to direct observation (Fig. 2). Three of these studies reported comparability for counseling on method selection and use. The measurements varied from very poor agreement (kappa: 0.23) for “Methods discussed during consultation” to high agreement (kappa: 0.8) for “Client received her method of choice” (Fig. 2). Thongmixay (2020) reported percentage point (pp) differences from 20 to 60 pp depending on the indicator. Four studies investigated the comparability of side effects counseling as reported during the client interviews and as assessed during the direct observation. Again, the results ranged widely depending on the specific indicator. Choi (2018) reported good sensitivity (73.6–90.7 across countries) for “Gives information on side effects”, while Ndhlovu (1998) reported very low agreement (kappa: 0.23). Thongmixay (2020) found approximately a 60 pp discrepancy between client report and direct observation regarding whether the client was given information on side effects.

Only Bessinger (2001) reported on the comparability of privacy and respectful care measures, finding high agreement (kappa: 0.75–0.98). We included a domain of “Other general counseling” for measurements that could not be easily grouped with other studies (Fig. 2). Hermida (1999) reported high sensitivity (96.2) and medium specificity (57.1) for whether the client received any counseling on family planning, Bessinger (2001) found high agreement on whether the client received accurate information on the method they received, and Ndhlovu (1998) found low agreement on whether clients were told it was possible to switch methods. Hermida (1999) also compared medical record review to direct observation and found high sensitivity (100%) and low specificity (14%) for whether the client received any family planning counseling (Fig. 2).

Most of the data using simulated clients as the “gold standard” come from one study (Fig. 3). Tumlinson (2014) compared client exit interviews, provider interviews, and direct observations with simulated clients, focusing on specificity as the validity measure: for example, did the health worker provide the same level of care during direct observation, when they knew they were being assessed, as with the simulated clients, when they (presumably) did not know it was a quality assessment? Lower specificity means the health workers provided higher quality care during the direct observation than with the simulated clients. Tumlinson generally found low specificity, except for two measurements: provider “helped client select a method” and “discussed warning signs”. For the first indicator, specificity was higher: 67% of providers who did not help the simulated client select a method also did not do this for actual clients (as reported during the exit interviews). Similarly, for the second indicator, providers who did not counsel the simulated client on danger signs also did not mention danger signs during the knowledge quizzes.
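To make the reading of the 67% figure concrete, consider hypothetical counts (chosen only for illustration; the study’s cell counts are not reproduced here):

```latex
% Gold standard (simulated client): 30 providers did NOT help select a method.
% Comparison tool (exit interview): 20 of those 30 were also reported as not helping.
\[
\text{Specificity}
  = P(\text{not helped per exit interview} \mid \text{not helped per simulated client})
  = \frac{20}{30} \approx 67\%
\]
```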

Fig. 3

Quantitative measures comparing client exit interviews, provider interviews, and direct observation with simulated client

Tavrow (1997) reported some degree of concurrence between simulated clients and exit interviews, reflected in low percentage point differences in the proportion of providers describing side effects (12.6 pp difference) and explaining how to manage them (8.5 pp difference) (Fig. 3). However, a much higher proportion of simulated clients (83.6%) indicated that the provider showed the client how to use the method compared with what clients recounted during the exit interviews (47.1%), a 36.5 pp difference. When comparing provider knowledge quizzes and direct observation to simulated clients, Tumlinson (2014) found poor specificity but higher positive predictive values depending on the indicator, indicating that providers performed better on quizzes and observational assessments than with simulated clients, when they did not realize they were being assessed (Fig. 3). Similarly, Ratanajamit (2001) found high sensitivity (94%) and low specificity (20%) of the provider quizzes compared to simulated clients, indicating that providers demonstrated a higher level of knowledge than they practiced.
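As a reminder of the distinction between the two statistics (a general definition, not specific to how any one study reported its results), specificity conditions on the gold standard G (here, the simulated client), while the positive predictive value conditions on the comparison tool T:

```latex
\[
\text{Specificity} = P(T = 0 \mid G = 0) = \frac{TN}{TN + FP}, \qquad
\text{PPV} = P(G = 1 \mid T = 1) = \frac{TP}{TP + FP}
\]
```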

Discussion

Through our systematic review, we found only eight studies comparing measurements from family planning quality of care tools used in LMIC settings. These studies were heterogeneous in their methods and in the quality measurements they defined and compared, so there were insufficient data to estimate overall summary measures of validity or other comparison measures of the tools.

The problem with this heterogeneity is twofold. One, without better standardization of tools, indicators, and methods, it is difficult to understand program and policy impacts on quality care, especially for cross-country comparisons and time-trend analyses. Two, the heterogeneity means there is little data on the validity or reliability to guide tool selection for measuring process quality. For instance, it may not be feasible to conduct routine direct observations to monitor quality but other, less intensive protocols such as client interviews or clinical vignettes could be used as a proxy, depending on the program, the assessment aims, and the setting. More evidence is needed on the validity and reliability of these tools to guide tool selection for family planning program monitoring and evaluation.

Despite the difficulty of comparing the tools, some patterns emerged within the limited information available. Overall, the specificity of measurements from client exit interviews was low even when sensitivity was high: clients reported health provider quality activities that were not recorded by the gold standard observation (n = 5 studies). Clients may have been reporting their general knowledge of family planning or their experience with providers seen previously rather than during the visit being evaluated. Program implementers and evaluators should therefore be cautious when using client exit interviews to measure provider quality actions. Other studies have found that women have difficulty accurately reporting on more technical aspects of quality care, particularly for delivery and newborn care. Two related 2016 studies in Kenya and Mexico found that women could not accurately report on some process quality measurements for delivery and newborn care [32, 33]. A 2021 study in Bangladesh, Nepal, and Tanzania comparing direct observation and patient exit interview data found similarly low levels of validity [34]. This evidence suggests exit interviews may be more appropriate for measuring client experience, perceptions, and general knowledge rather than technical quality. Even exit interviews on experiential quality should be interpreted carefully, since clients may respond more positively about their experience when interviewed while exiting a facility than when interviewed at home (known as courtesy bias) [35, 36].

We have very limited information on the validity of medical record review and provider interviews. Hermida (1999) found medical records to be an adequate tool for identifying providers who did not counsel on family planning (100% sensitivity for identifying performance failures), but the records performed poorly for indicating whether the provider counseled the client, since providers would counsel the client but presumably forget to document this in the register (Fig. 2). Tumlinson (2014) and Ratanajamit (2001) found provider knowledge of quality activities to be higher than observed performance of those activities (low specificity). This discrepancy between higher provider knowledge and lower performance is also known as the “know-do” gap and has been reported elsewhere in LMIC health systems [37,38,39]. Tumlinson (2014) also compared direct observation to simulated clients and found little comparability between the methods where sample size was sufficient for comparison (Fig. 3). Summarizing across the eight studies, specificity for many quality measurements is low, evidence that providers may change their behavior when an assessor is observing, also known as reactivity bias [40].

Most of the quality measurements compared in these studies focus on counseling and interpersonal relationships. None reported on more clinical, technical competencies such as sterile technique or correct application of contraceptives, which is understandable since exit interview respondents and simulated clients are non-clinicians and cannot accurately assess these clinical competencies. Even within each tool, the quality measurements differ across the studies. Some are subjective (“Helped client select a method” or “Treats client with respect”) while others are more objective (“Asks client preference” or “Mentions HIV/AIDS”). It is likely that the more subjective measurements vary by tool and have high inter-rater variability.

During the screening phase of the review, we found many studies that used multiple tools for measuring family planning process quality but did not report any comparability measures. Protocols used globally, such as the SPA and QIQ, and other facility-based assessments include exit interview, provider interview, and direct observation tools that could be compared to gauge the reliability of the data. When performing a quality assessment, researchers should consider including some key quality measurements in multiple tools to test the reliability of the data, particularly in studies that rely on client exit interviews for their quality of care measurements.

One limitation of this study is that we selected studies that specifically reported on the validity or comparability of tools. It is possible that many studies evaluated this but did not report it. We contacted several authors for unpublished data and searched grey literature databases to address this, but publication bias may still be present.

Conclusion

To measure family planning quality of care consistently and accurately in LMICs, a standardized suite of tools is needed, along with an established method of combining them into a comprehensive picture of quality care. Heterogeneous tools and metrics make it difficult to measure intervention or policy impacts on quality and to clearly describe the association of quality (e.g., counseling completeness) with outcomes (e.g., contraceptive continuation).

Family planning quality of care tools were crafted to measure different aspects of process quality: provider assessments measure knowledge; direct observations, simulated client protocols, and medical record reviews measure provider practice; and exit interviews measure client knowledge, satisfaction, and experience. More research is needed on how well these tools proxy the actual processes of care provided to clients under everyday clinical conditions, especially since those interested in routine quality measures (annually or more frequently) are unlikely to have the time and resources to implement multiple tools. Selecting one or two methods is more feasible, but there is little data on how the various tools compare and little guidance on which would be most appropriate for a given context.

Although the number of studies in this review is small, there is emerging evidence of important differences in the same quality measurement produced by different tools. Though we have the most data on comparability of client exit interviews, they are a poor proxy of actual processes of care received and should be used primarily to measure client experience and knowledge. It is likely there are other important differences, but more studies of validity and comparability are needed.

Improving the reliability and accuracy of the methods used to measure quality of care will allow governments and program implementers to better monitor, understand, and improve the quality of and access to family planning services. As LMICs continue to scale up quality-focused family planning services, accurate and timely measures of quality will inform and improve programs to reduce the unmet need for contraceptives, meeting the goals laid out in the SDGs, the promise of universal health coverage, and access to quality care as a basic human right [11].