Background

The growing focus on patient-centered outcomes in health care has been accompanied by increasing interest in targeted, individualized recommendations for clinical care, including screening, preventive interventions, and treatment. This is driven in large part by the rising recognition that medical interventions do not affect all patients in the same way, a situation referred to as heterogeneity of intervention (treatment) effects [1]. Guideline developers and other users of systematic reviews seek information about whether a preventive or medical intervention is likely to benefit some patients more (or less) than the average and to understand which patients are at greatest (or least) risk of intervention-related harm [2]. As such, there has been a need to develop methods for dealing with heterogeneity of intervention effects to help address concerns about the inappropriate clinical application of average effects and to aid guideline developers in making recommendations tailored to specific subpopulations of patients when appropriate. Two recent surveys of existing practices and published guidance for considering clinical heterogeneity in systematic reviews found that there is little consensus and limited clear guidance to support consistent approaches to this important issue, although these are very much needed [2, 3].

To this end, we have developed and piloted a process for incorporating patient subpopulation considerations into all phases of systematic reviews with two explicit goals: (1) to provide a consistent, systematic assessment of the evidence base for specific subpopulations within a given systematic review and (2) to provide the U.S. Preventive Services Task Force (USPSTF) with the information necessary to inform judgments about the appropriateness of general population versus subpopulation-specific clinical practice recommendations. The approach described in this paper was developed to provide practical guidance for the consistent application of subpopulation considerations in systematic reviews conducted by Evidence-based Practice Centers (EPCs) funded by the Agency for Healthcare Research and Quality (AHRQ) to support primary care clinical preventive service recommendations made by the USPSTF.

Background work conducted for this project included an examination of available information on how major guideline developers and groups setting standards for systematic reviews address subgroup analyses and subpopulation issues. We chose ten groups with particular relevance to primary care preventive services in the USA or internationally recognized for their well-developed methods (see Table 1). In November 2015, we reviewed their websites (and manuals or written procedures, where available) for descriptions of methods used to evaluate subgroup-specific evidence and address subpopulation issues.

Table 1 Summary of subgroup-specific information addressed by select guideline developers and groups setting standards for systematic reviews

Based on our 15 years’ experience conducting systematic reviews for the USPSTF, and informed by relevant literature discussing subgroup analysis and subpopulation issues, we developed tools and methods for incorporation of subpopulation considerations into each of four phases of the systematic review process: (I) topic scoping and work plan (protocol) development, (II) data abstraction and critical appraisal, (III) data analysis and synthesis, and (IV) reporting and interpretation. We presented these tools and processes to the Subpopulation Workgroup of the USPSTF and revised our draft approach based on feedback from workgroup members. We refined our proposed methods based on pilot testing the approach on three reviews conducted for the USPSTF: Aspirin for the Primary Prevention of Cardiovascular Events, Screening for Lipid Disorders in Adults, and Screening for Obstructive Sleep Apnea in Adults [4–6].

Main text

Key concepts and definitions

We use the terms “subgroup” and “subpopulation” to refer to distinct elements, such that “subpopulations” refer to groups of individuals that are the target of policy or practice recommendations, and “subgroups” refer to specific types of analyses undertaken on a subset of participants (see Table 2). In the context of systematic reviews, differences between studies (i.e., heterogeneity) must be considered to appropriately summarize a body of evidence, including making decisions about whether or not to quantitatively combine results [7]. Heterogeneity considerations should inform decisions made about the review scope and methods during protocol development, including planned approaches to data abstraction and data synthesis, as well as final interpretation of review findings. The set of included studies for a systematic review question can differ somewhat or substantially in dimensions underlying heterogeneity: the populations studied, the interventions investigated, and the outcomes measured, as well as in the methods underpinning each study’s findings. These differences can be understood as clinical, methodological, and statistical heterogeneity, all of which inform the synthesis of evidence within a systematic review (see Table 2) [7].

Table 2 Definitions of terminology used

Clinical heterogeneity reflects variation between studies in the populations enrolled, in the active interventions and comparison interventions they receive, or in selection and timing of measured outcomes [2]. When there are variable intervention effects across studies, investigating how these differences may be related to effect variation can inform targeting or tailoring research information to specific populations, situations, or circumstances. Clinical heterogeneity is the type of heterogeneity most related to subgroup and subpopulation issues and therefore of major interest to clinical and policy-level decision-makers [3]. Methodological heterogeneity reflects differences in study design and conduct, including risk of bias, across studies in the systematic review [7]. Methodological heterogeneity can lead to differences in measured intervention effects, but these reflect artifacts of the research process rather than clinically relevant differences. Finally, statistical heterogeneity is revealed through statistical testing as to whether measured differences in intervention outcomes between studies in the body of evidence are greater than would be expected due to chance (generally p < 0.05) [2]. The job of the systematic reviewer is to understand the interrelationships of these factors, to control for (or investigate) them in assembling and analyzing a body of evidence to answer a particular question, and to summarize and communicate their implications for decision-makers.

Existing guidance

Both guideline developers and systematic reviewers working on behalf of guideline developers have a strong interest in specifying credible, relevant methods for fairly and consistently considering subgroup findings from primary studies and subpopulation differences in intervention effects. To inform our work, we therefore examined how selected prominent guideline developers or groups setting standards for systematic reviews address these two issues. Table 1 summarizes the subgroup-specific information addressed by selected guideline and review groups for each phase of a systematic review. Our review revealed that prominent guideline developers typically provide little detailed information about how to plan and use subgroup analyses. While some of the review or guideline groups addressed the issue of handling subgroup data conceptually or in detail for a specific aspect of the systematic review, no group outlined a comprehensive approach to integrating subgroup considerations and analyses into all phases of the systematic review process.

GRADE provided the most comprehensive guidance on inclusion of subgroups in all phases of systematic reviews and was the only group that addressed credibility assessment of subgroup analyses [8–10]. The Cochrane Collaboration Handbook also included guidance on the use of subgroup analyses in reviews and addressed a priori selection of a small number of study characteristics for subgroup analyses that are supported by scientific evidence, how to analyze subgroup data to investigate heterogeneity, and interpretation of subgroup analyses, including caveats such as the potential for bias since subgroup comparisons are not usually accounted for by the randomization approach [3, 7]. In their description of methods for child health reviews, the Cochrane Child Health team also provides questions regarding age-based treatment effects to guide the planning of a priori subgroup analyses [11].

Selected other groups included in our scan (i.e., the Institute of Medicine (IOM), the National Institute for Health and Care Excellence (NICE), the Community Preventive Services Task Force (CPSTF), and the Canadian Task Force on Preventive Health Care (CTFPHC)) touched briefly on subgroup considerations for one or two of the systematic review phases. Description of subgroup methods during review scoping and work plan development was generally limited to a series of questions to guide the inclusion of subgroups [12, 13] or to specifying the standard that reviewers should describe and justify a priori any planned subgroup analyses, in the case of the IOM [14]. Information from these groups on reporting and interpretation of subgroup findings consisted of a few elements that should be reported by reviewers (e.g., clinical and methodological characteristics of studies) [7, 14]. Related efforts on equity-focused reviews and clinical guidelines by the PRISMA-Equity Bellagio group [15] and NICE [16], respectively, highlight the importance of addressing health disparities in systematic reviews. The PRISMA-Equity Bellagio group focuses on improved reporting in equity-focused systematic reviews, including items such as presentation of subgroup analyses [15]. The NICE guideline on addressing equality issues includes scoping discussions about inequalities in prevalence, risk factors, or severity and a priori identification of relevant subpopulations [16].

The professional society websites we reviewed (i.e., American Academy of Family Physicians (AAFP), American Academy of Pediatrics (AAP), American Congress of Obstetricians and Gynecologists (ACOG), American College of Physicians (ACP)) did not include any information about how subpopulations are considered in their guidelines or address how subgroup considerations are incorporated in the reviews of the evidence on which their guidelines are based.

Proposed approach

Below we describe the methods we developed for incorporating subpopulation considerations into the four major phases of a systematic review: (I) topic scoping and work plan (protocol) development, (II) data abstraction and critical appraisal, (III) data analysis and synthesis, and (IV) reporting and interpretation (Fig. 1). We developed these approaches primarily to support our work conducting systematic reviews for the USPSTF given their need to judge the appropriateness of general population versus subpopulation-specific clinical practice recommendations. The process of translating the subpopulation evidence presented in systematic reviews into clinical practice recommendations is described in another manuscript [17]. Below we provide examples for many of our processes and tools based on our systematic review experiences with the USPSTF.

Fig. 1 Major phases of systematic reviews and corresponding subpopulation processes

Phase I: topic scoping and work plan development

Topic scoping

Decisions about which subpopulations will be investigated in a systematic review should be based on understanding of the existing evidence base [18]; therefore, the first step in exploration of important subpopulations during topic scoping involves targeted literature searches informed by clinical consultation as necessary. These literature searches include:

  • How other guideline groups have recently handled subpopulation considerations for the topic

  • How other recent, well-conducted systematic reviews have handled subpopulation considerations for the topic

  • Data on incidence, prevalence, morbidity, and mortality for the condition of interest by age, sex, race/ethnicity, and important topic-specific clinical characteristics

The collected information is used to identify presumptive subpopulations of interest, understand the issues within the literature related to relevant subpopulations, and develop a set of questions for key informant interviews with two to four clinical and content experts in the systematic review topic area. Key informant candidates may include, for example, previous reviewers for the specific content area, authors of validated risk assessment tools, principal investigators of large trials that include subgroup analyses, leaders in professional societies relevant to the clinical topic, or members of clinical guideline panels. The purpose of conducting key informant interviews is to learn which subpopulations experts would be most concerned about receiving a general population screening and/or treatment recommendation, as opposed to a subpopulation-specific recommendation, and why.

Key informants can help determine what is known about sources of heterogeneity of intervention effects (e.g., prior subgroup analyses, dose-response relationships, or differences in outcomes) and what is known or of concern regarding potential subpopulation differences for the topic. Candidate patient-level variables to define subpopulations include age, sex, race, ethnicity, comorbidities, baseline disease risk, disease severity or other important disease features, genetic variants, or psychosocial variables with a clear scientific rationale as a treatment effect modifier [19]. Key informant questions can confirm or query important issues on potential mechanisms of preventive services heterogeneity within specific subpopulations (e.g., differing baseline risk of disease-related outcomes, competing risks/limited life expectancy, varying risk(s) of intervention harm(s), variable responsiveness to the preventive intervention, differential impact of time to benefit or to harm, and differing values for patient-important outcomes). Table 3 shows how these mechanisms might affect questions about heterogeneity for different types of clinical preventive services to support development of questions for key informants. Experts can help identify epidemiological data to support potential mechanisms of heterogeneity as well as validated risk assessment tools or large multivariable analyses showing the combined impact of potential subgroup factors on outcomes for the condition of interest. Table 4 provides sample questions to guide reviewers in developing topic-specific questions for obtaining feedback from key informants.

Table 3 Potential drivers of heterogeneity of intervention effects for different types of clinical preventive services
Table 4 Key informant interview sample questions

In our experience implementing this approach, we confirmed the value of eliciting expert input into our subpopulation considerations early in the review process. These experts can often provide guidance about important resources (e.g., presentations from relevant professional society meetings) or ongoing research that would otherwise take considerable effort to locate. Because the interviews efficiently helped us understand the perspectives of clinical and research experts, we could more quickly focus on subpopulations with sufficient prior evidence or controversy to guide our protocol development. We also found that a conference call format (conducted one-on-one or with a few individuals) may be more conducive to gathering detailed expert perspectives with accompanying rationale and allows for easy clarification of complex statements. Eliciting expert feedback via email, however, can still provide valuable information with limited time and effort expended.

Work plan development

Work plan development for a systematic review includes drafting an analytic framework, research questions, and inclusion/exclusion criteria that specify the logic and scope of the review, including the populations, interventions, comparators, and outcomes of interest. An analytic framework is a graphic representation of linkages between interventions and outcomes that helps to identify the questions that the review is addressing [7, 20–22]. The background searches and key informant interviews described above help determine whether and how relevant subpopulations will be incorporated into the analytic framework, research questions, and inclusion/exclusion criteria that guide the literature searches, data abstraction, and analysis processes in later phases of the systematic review.

We developed a summary table to assist reviewers in presenting the findings from the topic scoping process, including the key informant interviews, and outlining recommendations for incorporation of specific subpopulations into the work plan for consideration and approval by AHRQ and the USPSTF. Table 5 provides an example of a completed summary table for a review on aspirin for the primary prevention of cardiovascular events [5]. The primary purpose of the table was to support the a priori selection of a limited number of patient subpopulations to be examined in the systematic review and to provide the rationale for inclusion of these subpopulations. The six columns in the table are defined as:

Table 5 Summary of work plan guidance for subpopulation considerations—example. Aspirin for prevention of cardiovascular events
  (A) Previous systematic review’s approach: How each subpopulation of interest was addressed in the previous systematic review, if at all.

  (B) Separate recommendation statement: Whether the guidance included a separate recommendation statement for each subpopulation of interest.

  (C) Importance: Initial summary rating of the importance of each subpopulation relative to others suggested for inclusion in the systematic review to inform parsimonious selection.

  (D) Rationale: Summary of information that supports each subpopulation as important and relevant to the systematic review (e.g., epidemiological trends, biological plausibility), including how key informant input supports the rationale for each subpopulation.

  (E) Policy context: How recent reviews, meta-analyses, and clinical practice guidelines address preventive services recommendations for each identified subpopulation, including any disagreement across guidelines and reviews and how key informant input supports the policy importance of each subpopulation.

  (F) Proposed work plan approach: Whether each subpopulation is proposed to be one of the a priori subpopulations for this review, and potential approaches to including it in the work plan, including hypothesized direction of effect, impact on net benefit, and mechanisms of action, if known.

As a result of the application of this process, the review designated age and sex as the a priori subgroup analyses and subpopulations of highest importance for the systematic review to update evidence addressing both benefits and harms [5]. We listed other cardiovascular disease (CVD) risk factors (including smoking, diabetes, blood pressure, and peripheral artery disease (PAD)) as important to examine for potential effect modification in terms of aspirin’s benefits, in particular, and listed selected medications, including selective serotonin reuptake inhibitors and non-aspirin non-steroidal anti-inflammatory drugs, as modifiers of potential harms of treatment only. A focused, a priori approach can be criticized for not being comprehensive; however, it conforms to guidance for parsimonious selection of a priori subgroups [19] and is important for feasibility. It does not preclude exploratory findings, when noted as such, or limit the span of issues that can be addressed in future updates.

Phase II: data abstraction and critical appraisal

Data abstraction

Data abstraction is one of the most important and time-consuming steps of a systematic review [7]. The data collection instrument (e.g., evidence table, database, web-based systematic review software) is designed to extract critical and relevant data from eligible studies, including the details of the study design and conduct, characteristics of the population, specific outcomes assessed at specific times, intervention details, types of comparators, and, when appropriate to the topic, baseline risk levels of the study population. These components may be further categorized and summarized during data analysis and synthesis to allow for investigation of variability in methodological or clinical factors (see phase III: data analysis and synthesis).

In order to capture specific types of subgroup analyses conducted in each study, reviewers can make note during routine data abstraction of which a priori subpopulations identified in the work plan had subgroup-specific analyses reported. For the purposes of tracking the types of subgroup data available in studies, reviewers may also make note of which other subpopulations not specified a priori in the work plan had subgroup analyses reported.

After initial data abstraction, a working table (Table 6) can be used to audit the availability of subgroup-specific analyses in the body of evidence to determine whether it is feasible or worthwhile to further investigate a priori subpopulations of interest. Results from the audit (Table 6—Column 5) provide the rationale for whether or not to pursue further investigation of subgroup analysis results and can later be reported in the methods section of the evidence synthesis report. Within the working table, it is helpful to track the number of studies reporting subgroup analyses for the subpopulation of interest out of the total number of included studies in the review. As warranted, relevant subpopulation-specific summary tables can be developed during data analysis and synthesis.
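The tally described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tool used in the pilot reviews: the study names, the `abstracted` mapping, and the function `audit_subgroup_availability` are hypothetical, and the output rows only approximate the kind of information tracked in Table 6.

```python
from collections import Counter

# Hypothetical abstraction notes: for each included study, the set of
# subpopulations for which subgroup-specific analyses were reported.
abstracted = {
    "Trial A": {"age", "sex"},
    "Trial B": {"sex"},
    "Trial C": set(),
    "Trial D": {"age", "sex", "diabetes"},
}

# Subpopulations named a priori in the (hypothetical) work plan.
a_priori = ["age", "sex"]


def audit_subgroup_availability(abstracted, a_priori):
    """Count studies reporting subgroup analyses for each subpopulation."""
    n_total = len(abstracted)
    counts = Counter(s for reported in abstracted.values() for s in reported)
    # A priori subpopulations come first (even if no study reports them),
    # followed by any other subpopulations that studies happened to report.
    others = sorted(set(counts) - set(a_priori))
    return [
        {
            "subpopulation": name,
            "a_priori": name in a_priori,
            "n_reporting": counts[name],
            "n_total": n_total,
        }
        for name in list(a_priori) + others
    ]


for row in audit_subgroup_availability(abstracted, a_priori):
    label = "a priori" if row["a_priori"] else "not a priori"
    print(f'{row["subpopulation"]} ({label}): '
          f'{row["n_reporting"]} of {row["n_total"]} studies')
```

Keeping the a priori flag on each row preserves the distinction between planned subpopulations and those noted only for tracking purposes.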

Table 6 Audit for decision support

Critical appraisal

Another essential step in systematic reviews is critical appraisal of the design and conduct of studies [7, 23]. In addition to rating the quality of individual studies, assessing the credibility of subgroup analyses reported in studies is necessary when addressing subpopulation considerations in a review [8]. Many subgroup-specific claims made in trial reports are not credible, and key criteria for credibility should be addressed by the study authors [24, 25]. These criteria consider type I errors (spurious findings due to chance or confounding) and type II errors (failure to detect effects due to insufficient power).

General study quality issues (e.g., differential attrition) may affect the interpretation of subgroup-specific findings. Similarly, issues that affect subgroup validity may impact overall ratings of study quality. Subgroup analyses from poor-quality studies are at high risk of bias regardless of the credibility of the subgroup analyses.

Systematic reviewers can assess the credibility of subgroup findings for a priori subpopulations using Tables 7 and 8. Reviewers may consider collecting the data necessary for evaluating the credibility of subgroup analyses (e.g., a priori specification of analyses, interaction testing) during data abstraction to obviate the need for another close reading of the article. Using Table 7, for each study, reviewers can enter a row for each a priori subpopulation that specifies whether a subgroup effect was detected (based on interaction testing or point estimates and confidence intervals) and provide assessments of three domains related to credibility: (1) the likelihood that positive subgroup effects are spurious, (2) the potential for confounding in a subgroup analysis by another study variable (relevant to positive or negative subgroup findings), and (3) whether a trial was powered to detect subgroup differences, which is primarily relevant to a finding of no subgroup differences.

Table 7 Credibility assessment of subgroup analyses
Table 8 Framework for assessing credibility of subgroup analyses

Table 8 outlines specific questions about spurious findings, confounding, and power limitations to assist reviewers in their credibility assessment of subgroup-specific analyses for a priori subpopulations. Based on responses to the questions outlined in Table 8 and whether observed subgroup effects are biologically plausible and consistent with evidence from related studies [8, 24], systematic reviewers can assess the credibility of each subgroup analysis reported by the study by judging each of the three domains (spurious, confounding, power) as very likely, somewhat likely, unlikely, or unclear—usually due to inadequate reporting (Table 7). The spurious effects domain would also include a “not applicable” option when indicating credibility assessment for situations when a study does not detect a difference in subgroup effect.

Reviewers can summarize their study-level subgroup analysis-specific credibility assessment with an overall rating (e.g., low, medium, high, or uncertain) that incorporates the results of each relevant domain (Table 7). This overall subgroup analysis credibility rating represents a summary judgment as to the credibility of the subgroup-specific analyses conducted in each study of interest and is therefore taken into consideration within the larger context of the study’s internal validity (risk of bias) from the critical appraisal process.
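One way to record these judgments consistently is a small validated record per study and subpopulation. The sketch below is an assumption about how such a row might be stored, not a reproduction of Table 7: the class `SubgroupCredibility` and its field names are hypothetical. It only enforces the permitted categories described above and deliberately does not compute the overall rating, which remains a reviewer judgment.

```python
from dataclasses import dataclass

# Judgment categories taken from the text; "not applicable" is allowed
# only for the spurious-effects domain when no subgroup effect was detected.
DOMAIN_JUDGMENTS = {"very likely", "somewhat likely", "unlikely", "unclear"}
OVERALL_RATINGS = {"low", "medium", "high", "uncertain"}


@dataclass(frozen=True)
class SubgroupCredibility:
    """One hypothetical row: a study's subgroup analysis for one subpopulation."""
    study: str
    subpopulation: str
    effect_detected: bool
    spurious: str       # likelihood that a detected subgroup effect is spurious
    confounding: str    # likelihood of confounding by another study variable
    underpowered: str   # likelihood that power was too low to detect a difference
    overall: str        # reviewer's summary credibility rating

    def __post_init__(self):
        allowed_spurious = set(DOMAIN_JUDGMENTS)
        if not self.effect_detected:
            allowed_spurious.add("not applicable")
        if self.spurious not in allowed_spurious:
            raise ValueError(f"invalid spurious judgment: {self.spurious!r}")
        for name in ("confounding", "underpowered"):
            if getattr(self, name) not in DOMAIN_JUDGMENTS:
                raise ValueError(f"invalid {name} judgment: {getattr(self, name)!r}")
        if self.overall not in OVERALL_RATINGS:
            raise ValueError(f"invalid overall rating: {self.overall!r}")


row = SubgroupCredibility(
    study="Trial A", subpopulation="sex", effect_detected=False,
    spurious="not applicable", confounding="unlikely",
    underpowered="very likely", overall="low",
)
print(row)
```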

Finally, studies that only enrolled an a priori subpopulation (e.g., 100% female) can be assessed for quality as part of the routine quality rating process for all studies. Ancillary reports from included studies reporting relevant subgroup analyses can also be assessed for credibility using Tables 7 and 8.

Phase III: data analysis and synthesis

Investigating potential sources of heterogeneity at the body of evidence level

During the data analysis and synthesis phase of systematic reviews, reviewers summarize the body of evidence, appropriately considering differences between studies in terms of clinical, methodological, and statistical heterogeneity (Table 2). Guided by a priori considerations, reviewers can supplement their systematic consideration of the similarities and differences across trials in the body of evidence using the PICOTS (population, intervention, comparator, outcome, timing, and study design) rubric (Table 9) [26]. Population factors may drive important clinical heterogeneity based on issues such as baseline study group risk for intervention-related benefits or harms. In contrast, between-study differences in the study design or conduct can represent methodological heterogeneity that is not clinically meaningful, while intervention and comparator differences may or may not be clinically relevant. The consistency and variability in the body of evidence may not be evident when abstracting data from individual studies, so reviewers should consider the consistency and variability in all factors across included studies at this point in the process.

Table 9 Framework for reviewing factors influencing heterogeneity across included studies

When looking at the body of evidence, reviewers should consider variability across studies in the baseline population risk for the primary outcome for which the intervention is intended since this is one of the primary drivers of heterogeneity, along with variable risk for intervention-related harms or presence of competing risks. Even when an intervention has the same relative effects across subpopulations, the absolute benefits will vary, producing much larger beneficial effects in those at higher baseline risk. Thus, understanding the range of baseline risks represented across the body of evidence can be important to interpret findings, whether represented by absolute or relative effect measures.
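The arithmetic behind this point can be made concrete with a short sketch. The relative risk and baseline risks below are purely illustrative numbers, not results from any review, and the function names are hypothetical.

```python
def absolute_risk_reduction(baseline_risk, relative_risk):
    """Absolute risk reduction when a relative risk is applied to a baseline risk."""
    return baseline_risk * (1.0 - relative_risk)


def number_needed_to_treat(baseline_risk, relative_risk):
    """Reciprocal of the absolute risk reduction (defined only for a benefit)."""
    arr = absolute_risk_reduction(baseline_risk, relative_risk)
    if arr <= 0:
        raise ValueError("no absolute benefit at this relative risk")
    return 1.0 / arr


# Illustrative assumption: the same relative risk of 0.80 in both groups.
relative_risk = 0.80
for label, baseline in [("lower-risk group", 0.02), ("higher-risk group", 0.20)]:
    arr = absolute_risk_reduction(baseline, relative_risk)
    nnt = number_needed_to_treat(baseline, relative_risk)
    print(f"{label}: baseline {baseline:.2f}, ARR {arr:.3f}, NNT {nnt:.0f}")
```

With an identical relative effect, the absolute risk reduction is ten times larger in the group whose baseline risk is ten times higher (about 0.040 versus 0.004 here), and the number needed to treat falls from about 250 to about 25.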

Observed variation in population risk (as sometimes approximated by control group event rates) across studies may reflect not only different patient populations with variability in baseline risk among selected groups but also other factors such as length of study follow-up [27]. The control group event rate can also be viewed as a study-level proxy for disease severity, concomitant treatments, and follow-up duration [28]. Visual inspection of scatter plots or inspection of data in a spreadsheet to consider the extent of variability in baseline population risk (or any factor) across the body of evidence can be a useful initial assessment of heterogeneity [29].

For example, Fig. 2 shows a scatter plot of control group event rates for the primary outcome of sexually transmitted infections by follow-up time [30]. The broad range of control group rates across 3 to 24 months of follow-up suggests potential population differences in baseline risk of sexually transmitted infection. Scatter plots may also be used to consider the extent of variability in intervention-related risks (control group event rates for harms by follow-up time) or the relationship between baseline risk and absolute benefit (intervention group event rates by control group event rates). Reviewers may also use forest plots to investigate heterogeneity of intervention effects, stratified by population type or other important variables for appropriate time points.
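As a minimal sketch of the data preparation behind such a scatter plot, the following computes control group event rates paired with follow-up time. The study data are hypothetical and are not taken from the review shown in Fig. 2; the resulting pairs could be passed to any plotting tool.

```python
# Hypothetical study-level data: (study, follow-up in months,
# control group events, control group participants).
studies = [
    ("Study 1", 3, 12, 150),
    ("Study 2", 6, 30, 200),
    ("Study 3", 12, 9, 300),
    ("Study 4", 24, 50, 250),
]


def control_event_rates(studies):
    """Return (follow-up months, control group event rate) pairs, sorted by follow-up."""
    points = [(months, events / n) for _, months, events, n in studies]
    return sorted(points)


points = control_event_rates(studies)
rates = [rate for _, rate in points]

for months, rate in points:
    print(f"{months:>2} months: control group event rate {rate:.2f}")
print(f"range across studies: {min(rates):.2f} to {max(rates):.2f}")
```

A wide range that does not simply increase with follow-up time, as in this made-up example, is the kind of pattern that would prompt a closer look at population differences in baseline risk.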

Fig. 2 Scatter plot of control group event rates for the primary outcome of sexually transmitted infections (by longest time points for each study) [30]

Systematic reviewers consider whether intervention effects are relatively homogeneous or appear to show variable effects on primary outcomes, including benefits and harms. This examination includes, but is not limited to, using appropriate statistical methods to examine the consistency and precision of the overall findings. The decision to mathematically combine data depends on critical judgment [31]. A meta-analysis should only be conducted when a group of studies is considered homogeneous enough in terms of population, interventions, and outcomes that combining would produce a meaningful summary [7]. The underlying biology should suggest that it is plausible that the magnitude of effect on the key outcomes should be more or less the same across the range of patients and interventions [9, 26]. If meta-analyses are deemed appropriate given the body of evidence, systematic reviewers should determine appropriate statistical methods for meta-analyses and explorations of heterogeneity by first consulting respected scientific literature and statisticians when necessary. A detailed discussion of these methods is beyond the scope of this paper.

Systematic reviewers should formally assess potential heterogeneity using common statistical approaches to detect and quantify the degree of heterogeneity (i.e., Cochran’s Q test, I² index) [32, 33]. If reviewers determine that statistical heterogeneity is present, further exploration is needed to investigate the potential sources of heterogeneity. Even when statistical heterogeneity is not present, a priori factors may still need to be explored [19], particularly since lack of statistical heterogeneity does not confirm lack of either clinical or methodological heterogeneity and statistical tests are generally considered to be underpowered to detect differences in subgroup effects [34]. Common approaches for such investigations include stratified meta-analyses, sensitivity analyses, and meta-regression [35]. For example, Fig. 3 is a stratified meta-analysis that provides pooled estimates for subgroups defined by sleep apnea severity at baseline for the effect of continuous positive airway pressure (CPAP) on sleepiness as measured by the Epworth Sleepiness Scale [6]. This type of approach provides information on the degree to which effect sizes differ between groups of studies and also shows whether a substantial portion of the statistical heterogeneity was caused by combining sets of studies into one meta-analysis. When conducting these types of analyses, reviewers must consider potential limitations, such as confounding, inadequate variability, ecological fallacy, and power [8, 11, 17]. Reviewers may also employ graphical methods that more broadly identify potential sources of heterogeneity, being careful to distinguish a priori from post hoc factors [36].
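To illustrate the statistics named above, the sketch below computes Cochran's Q and the I² index and then partitions Q into within-stratum and between-stratum components, which is one simple way to see how much heterogeneity arises from combining groups of studies. It assumes a fixed-effect, inverse-variance model; the effect sizes, variances, and stratum labels are hypothetical values chosen for easy arithmetic, not data from the CPAP review, and the function names are illustrative. A p value for Q would come from a chi-square distribution with the stated degrees of freedom and is omitted here to keep the example dependency-free.

```python
def fixed_effect_pool(effects, variances):
    """Inverse-variance weighted pooled estimate and the weights used."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return pooled, weights


def cochran_q(effects, variances):
    """Cochran's Q: weighted sum of squared deviations from the pooled estimate."""
    pooled, weights = fixed_effect_pool(effects, variances)
    return sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))


def i_squared(q, df):
    """I-squared as a percentage: share of variation beyond that expected by chance."""
    if q <= 0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0


def partition_q(effects, variances, strata):
    """Split total Q into within-stratum Q values and a between-stratum remainder."""
    q_total = cochran_q(effects, variances)
    q_within = {}
    for stratum in sorted(set(strata)):
        idx = [i for i, label in enumerate(strata) if label == stratum]
        q_within[stratum] = cochran_q([effects[i] for i in idx],
                                      [variances[i] for i in idx])
    q_between = q_total - sum(q_within.values())
    return q_total, q_within, q_between


# Hypothetical mean differences (intervention minus control) and variances.
effects = [-1.0, -1.0, -3.0, -3.0]
variances = [0.5, 0.5, 0.5, 0.5]
strata = ["mild", "mild", "severe", "severe"]

q_total, q_within, q_between = partition_q(effects, variances, strata)
df_total = len(effects) - 1
print(f"Q = {q_total:.2f} on {df_total} df, I2 = {i_squared(q_total, df_total):.1f}%")
print(f"Q within strata = {q_within}, Q between strata = {q_between:.2f}")
```

In this contrived example all of the heterogeneity lies between the two strata, so pooling within each stratum leaves no residual heterogeneity. Real data would rarely separate this cleanly, and the limitations noted above (confounding, ecological fallacy, low power) still apply.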

Fig. 3 Forest plot of the effect of continuous positive airway pressure (CPAP) on sleepiness, by obstructive sleep apnea (OSA) severity at baseline [6]

If meta-analyses are not appropriate given the body of evidence, reviewers should provide a narrative synthesis of results, stratified by potential sources of heterogeneity identified a priori. Systematic reviewers should describe the individual study results in the context of the apparent heterogeneity (or lack thereof) in the evidence. In the absence of formal quantitative synthesis, forest plots may still be used to display intervention effects, stratified by population type or other important variables for appropriate time points, to enhance communication.

Summarizing findings at the subpopulation level

During this phase, reviewers also consider the findings from relevant subgroup data abstracted during phase II for each subpopulation. This requires summarizing whether subgroup-specific findings were available from individual studies and how credible they were, as well as their overall coherence across studies. Considered together with results from examining the body of evidence for important heterogeneity of intervention effects, these findings will carry forward to inform judgments by the guideline developer about the possible need for subpopulation-specific clinical practice recommendations.

In order to summarize subgroup findings for examination, systematic reviewers can complete Table 10 for each a priori subpopulation as appropriate after reviewing the credibility and availability of subgroup analyses abstracted across all of the included studies during phase II. This can be most useful when there are a sufficient number of studies reporting subgroup-specific or related analyses for an a priori subpopulation of interest (e.g., age-, sex-, or race-specific). If there are few subgroup analyses reported, text descriptions will usually suffice. If there are extensive subgroup analyses reported, reviewers may want to limit the analyses abstracted to those with at least a moderate overall credibility rating. Additionally, if reviewers have noted a consistently reported set of subgroup analyses for an important subpopulation, or studies targeting that same subpopulation, but the subpopulation was not identified a priori, it may be appropriate to summarize this information post hoc in text or in a summary table, with clear labeling that these represent exploratory findings.

Table 10 Subpopulation-specific summary table with example

The synthesis of subpopulation-specific findings considers the (1) volume and credibility of subgroup analyses, (2) overall coherence of findings, and (3) limitations. The volume and credibility of subgroup analyses will depend on the total number of participants represented and the number of studies reporting subpopulation-specific results out of the total number of included studies, as well as the quality of the evidence, judged by threats to credibility of available subgroup-specific study results and availability of within-study versus between-study subpopulation comparisons. The overall coherence of findings can be assessed by reviewing the consistency of subgroup/subpopulation findings across trials [26], the way subgroups are defined, credibility of subgroup analyses, comparability of studies focused on the specific subpopulations within the body of evidence in terms of PICOTS, number of studies reporting results for each subgroup by outcome, and comparison of within-study to between-study subgroup results. Finally, systematic reviewers should summarize the limitations of the evidence, including potential confounders in individual study subgroup analyses, potential confounders in the study designs, and gaps or deficiencies in the subpopulation-specific results.

Reviewers may create summary plots for outcomes of interest to facilitate considerations of net benefit. These should always include both benefits and harms. After transformation to allow statistically combined estimates to reflect the appropriate direction for a finding (i.e., toward benefit or toward harm), summary estimates can be reflected on a plot (Fig. 4).
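One way such a transformation could be carried out is sketched below in Python. This is an assumption about the mechanics, not a description of the procedure used to produce Fig. 4: the function name `orient_toward_benefit` and the example values are hypothetical. The sketch re-expresses an estimate and its confidence interval so that larger values always favor the intervention, negating differences and inverting ratios where lower values of the outcome are better.

```python
def orient_toward_benefit(estimate, lower, upper, higher_is_better, ratio=False):
    """Re-express an estimate and its CI so larger values favor the intervention.

    higher_is_better -- True if larger values of the outcome are desirable
    ratio            -- True for ratio measures (e.g., relative risk),
                        False for difference measures (e.g., mean difference)
    """
    if higher_is_better:
        return estimate, lower, upper
    if ratio:
        # Inverting a ratio swaps the roles of the interval bounds.
        return 1.0 / estimate, 1.0 / upper, 1.0 / lower
    # Negating a difference also swaps the interval bounds.
    return -estimate, -upper, -lower


# Hypothetical mean difference on a symptom scale where lower scores are better.
md = orient_toward_benefit(-2.0, -3.0, -1.0, higher_is_better=False)

# Hypothetical relative risk for an adverse event where fewer events are better.
rr = orient_toward_benefit(0.5, 0.25, 0.8, higher_is_better=False, ratio=True)

print(md)
print(rr)
```

Whatever convention is chosen, it should be stated on the plot so that readers can tell which direction represents benefit and which represents harm.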

Fig. 4 Example summary plot for relevant subpopulation-specific outcomes

Phase IV: reporting and interpretation

Reporting

The value of a systematic review depends on the methods, findings, and clarity of reporting [37]. Transparency and consistency are keys to any systematic approach, with methods and the rationale for decisions and subjective judgments clearly articulated. As such, systematic reviewers should clearly communicate the approach taken in the review to assess heterogeneity, the types of data available, judgments about the presence or absence of important clinical heterogeneity, and appropriate limitations and caveats, as determined by thorough investigation of data at both the overall body of evidence and subpopulation levels. Findings must be sufficiently clear to inform judgments about the adequacy of the evidence base for specific subpopulations and appropriateness of general population versus subpopulation-specific clinical practice recommendations, as well as allow for incorporation into future research considerations. A structured approach to reporting facilitates interpretation as well as communication of data throughout this phase.

A list of elements to include when reporting patient subpopulation findings in a systematic review is displayed in Table 11. Authors should adhere to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) Statement [37] when reporting a systematic review [19]. We have added elements specific to subpopulations and heterogeneity of intervention effects to augment this suggested reporting approach.

Table 11 Summary of elements to include when reporting a systematic review

The most informative approach to summarizing the results of subgroup analyses may be to use the review’s overall summary (or strength) of evidence table and stratify the body of evidence by subpopulation within the appropriate key question(s), especially when the subgroup data may be the basis for considering a subpopulation-specific recommendation or clinical consideration. Using the summary of evidence table allows reviewers to consistently and transparently present summary evaluations of each evidence domain (e.g., consistency, precision, reporting bias, body of evidence limitations, strength of evidence, applicability) for important subpopulations [38]. The summary of evidence table can show how the subpopulation-specific information fits within the overall body of evidence and organization of the topic. The level of stratification used for subpopulations depends on the way a topic is conceptualized; for example, some topics may be stratified by intervention type first, with the subpopulation as the second order of stratification. For other topics, subpopulation evidence may vary only for specific domains and so would be presented only for those domains (e.g., precision).

For example, a review of screening for obstructive sleep apnea (OSA) assessed whether benefits of treatment with CPAP differ for subpopulations defined by OSA severity (among other subpopulation questions considered) [39]. The review conducted subgroup meta-analyses by OSA severity categories. One approach to presenting those findings in an overall summary of evidence table would be to enter data in separate rows for the full sample and for each of the subpopulations, such that treatment with CPAP has a row for overall findings (for the full population) and also has rows for each subpopulation (e.g., mild OSA, moderate OSA, and severe OSA). Such an approach might be most useful when there are significant differences for multiple evidence domains between the overall population and subpopulations (such that reviewers want to highlight the details of similarities and differences). The main conclusions of credibility assessments from phase II would contribute to subpopulation domain entries for quality/risk of bias and body of evidence limitations. Alternatively, depending on how the topic was conceptualized and the specific review findings, the results for subpopulations might be highlighted (1) only in the applicability domain or (2) within a single row dedicated to the effects of treatment with CPAP that first shows effects for the overall population for each domain and then (below the overall findings) describes any differences for subpopulations that were identified and the credibility of those findings.

Interpretation

Systematic reviewers must consider how to interpret the overall credibility of subgroup analyses reported by the studies included in a review. Considerations for judging the credibility of subpopulation findings at the body of evidence level include:

  • Are the subgroup analyses upon which any subpopulation analyses are based credible and consistent across studies and outcomes?

  • Do subpopulation findings avoid ecologic fallacy (i.e., are they based upon meta-regression involving only appropriate study-level variables or using appropriate individual participant data meta-analyses for patient-level variables)?

  • Were the subpopulation analyses in the systematic review specified a priori in a specific hypothesized direction?

  • Was the total number of subpopulation investigations in the systematic review limited to a small number?

  • Does statistical analysis suggest chance is an unlikely basis for subpopulation differences?

  • Are subpopulation findings supported by within-study findings rather than, or in addition to, between-study comparisons?

  • To what extent are subpopulation findings biologically plausible? [10]
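The question of whether chance is an unlikely basis for a subpopulation difference can be illustrated with a simple test. The Python sketch below pools two hypothetical subgroups of studies separately (fixed-effect, inverse-variance) and compares the pooled estimates with a z-test. The function names and data are assumptions for illustration only. Because it compares different sets of studies, this is a between-study comparison and remains subject to the confounding and ecological fallacy concerns listed above; it also makes no adjustment for multiple subgroup comparisons.

```python
import math


def pool(effects, std_errors):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    estimate = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    return estimate, math.sqrt(1.0 / sum(weights))


def subgroup_difference(pooled_a, pooled_b):
    """z-test for the difference between two independent pooled estimates."""
    (est_a, se_a), (est_b, se_b) = pooled_a, pooled_b
    diff = est_a - est_b
    z = diff / math.sqrt(se_a ** 2 + se_b ** 2)
    # Two-sided p value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return diff, z, p


# Hypothetical mean differences (and standard errors) for two subgroups of studies.
mild = pool([-1.0, -1.0], [0.5, 0.5])
severe = pool([-3.0, -3.0], [0.5, 0.5])

diff, z, p = subgroup_difference(mild, severe)
print(f"difference = {diff:.2f}, z = {z:.2f}, p = {p:.5f}")
```

A small p value here indicates only that the observed difference is unlikely under the null hypothesis of equal subgroup effects. It does not by itself establish credibility, which still depends on the remaining considerations in the list above.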

Table 12 provides caveats to assist systematic reviewers in their interpretation and understanding of the available subpopulation-specific data. The caveats stress the importance of caution in the interpretation of subgroup analyses due to the risk of false positive or false negative subgroup effects. Guidelines based on spurious subgroup analyses could result in subpopulations of patients receiving inappropriate treatment or being denied beneficial treatment. When data are not definitive, the average intervention effect is considered the best estimate [40, 41]. Pilot testing confirmed that this phase IV guidance provides useful caveats to explain the limitations of subpopulation findings and ensures that clear reporting of subpopulation evidence is not neglected.

Table 12 Caveats for interpreting and understanding subpopulation-specific data

Conclusions

In our work conducting systematic reviews for the USPSTF, we increasingly face the need to provide information on how treatment effects differ for some groups of patients to inform decisions about the appropriateness of subpopulation-specific clinical practice recommendations. Therefore, among a set of reviewers working across Evidence-based Practice Centers—and in conjunction with the USPSTF Subpopulation Workgroup—we developed the guidance described in this paper for addressing subpopulation considerations in systematic reviews. We would welcome engaging in an international consortium effort to develop consensus methods as a next step.

Rigor and comprehensiveness are important to good systematic review methods, but reviewers have to work within the time and resource constraints imposed by those commissioning the review and the guideline developers or others who will use the results. Therefore, it is essential to consider whether the value of adding the subpopulation processes detailed here justifies the additional time and effort expended. The additional work necessary to define subpopulations of interest a priori during initial planning can actually reduce the time and effort spent in later stages of the review by limiting subpopulation examinations to those of most significance to a particular topic. For some topics, early investigations of subpopulations during topic scoping may result in a conclusion that further consideration of subpopulations is not warranted. Systematic investigation of potential heterogeneity in a body of evidence, along with quantitative and narrative analysis and synthesis of subgroup data, represents considerable time and effort and adds a substantial amount of work to the overall review process. The net value of this process is therefore contingent on the effectiveness of earlier phases of the review in identifying the most important subpopulations for a topic and determining the availability of credible subgroup data.

Understanding how treatment benefits and harms differ across patient populations is necessary for optimal patient care and is an increasing focus of “precision medicine”; therefore, methods to incorporate subpopulation considerations, including credible subgroup analyses, into systematic reviews and clinical practice guidelines are increasingly important. Our proposed approach is intended to allow systematic reviewers to more robustly and routinely provide information about which subpopulations differ enough in the likelihood of benefits (and/or risk of harms) from a preventive intervention that they may warrant different clinical preventive recommendations. Gaps in the evidence on important subpopulations identified by applying this process in systematic reviews can also suggest future research needs. Although the processes we describe here were developed for systematic reviews to support recommendations made by the USPSTF, they are likely generalizable to systematic reviews in other clinical and policy contexts with minimal modification. We anticipate that this approach will undergo further refinement with additional use in reviews for the USPSTF and may require revisions to provide utility to the producers and users of systematic reviews beyond the context of the USPSTF and to broaden its application to reviews of evidence from non-randomized studies.