Background

Systematic reviews (SRs) aim to answer a specific clinical question by identifying, selecting, appraising and synthesizing all relevant primary studies using explicit and well-defined methods [1]. The number of published SRs is constantly increasing [2]. To help manage this information overload, overviews of reviews (overviews) have emerged as an increasingly popular knowledge synthesis product. Overviews use explicit and systematic methods to integrate information from multiple related SRs to provide a comprehensive synthesis of all SR evidence related to a specific clinical question [3]. As a result, overviews are broader in scope than any individual SR, and often examine evidence from multiple SRs to assess the efficacy or effectiveness of multiple interventions for preventing or treating one specific clinical condition. Overviews can include both SRs published in and outside of the Cochrane Database of Systematic Reviews (CDSR; referred to as “Cochrane SRs” and “non-Cochrane SRs”, respectively). An estimated 48-86% of published overviews include both Cochrane and non-Cochrane SRs, while the remaining overviews include Cochrane SRs only [46].

There is consensus in the research community that researchers conducting overviews of healthcare interventions ought to assess and report the methodological quality of the SRs included in their overview [7]. These assessments should ideally be conducted by two independent reviewers, with a process for consensus, and reported transparently [3, 7]. However, researchers conducting overviews have indicated that assessing methodological quality of SRs may be difficult and time-consuming [7]. Studies have indicated that only 37-64% of published overviews assess and report the methodological quality of their included SRs, and among these overviews, there is variation in the methods used [46]. This variation is not surprising, as to date there is limited guidance regarding the specific methods that should be used to assess SR quality.

Quality assessments of SRs are important in overviews for two main reasons. First, quality assessments should be used by overview authors when making conclusions in overviews (e.g., to help contextualize the evidence by providing insight into whether and to what extent SR methods may have affected the comprehensiveness and results of overviews). However, it is not known whether and how existing quality assessment criteria need to be modified for use in overviews [7]. Assessing the quality of SRs in the context of overviews may pose unique challenges, and decision rules may be helpful to promote consistent assessments both within and across overview topics. Second, results of quality assessments may help inform inclusion decisions [7]. This may be especially relevant when including non-Cochrane SRs in overviews. On average, non-Cochrane SRs have lower methodological rigor than Cochrane SRs, and the methods and reporting of non-Cochrane SRs can vary widely [810]. Researchers conducting overviews have indicated that including lower-quality SRs in overviews can increase the complexity of the overview process because data may be missing, poorly reported, or inconsistently reported in the SRs, and it is unclear what to do in these situations (e.g., should overview authors refer back to the relevant primary studies, or attempt to use the poorly conducted and/or reported SRs?) [7]. However, existing methodological guidance on this topic is conflicting [7]. One potential solution proposed by researchers [7] and employed by overview authors [1118] is to use the results of methodological quality assessments to identify and exclude SRs with gross deficiencies in conduct and/or reporting that would be difficult to include and use in overviews. However, using results of quality assessments to inform inclusion decisions may introduce bias if the results and conclusions of these SRs differ systematically from other well-conducted and reported SRs.

A MeaSurement Tool to Assess systematic Reviews (AMSTAR) is the most frequently mentioned tool for assessing SR quality in overviews [7]. AMSTAR consists of eleven questions designed to assess the appropriateness of the methods used at different stages of the SR process, and it has been shown to be reliable, valid, and easy to use when assessing the quality of published SRs [1921]. The objectives of the present study were: 1) to examine methodological considerations involved in using the AMSTAR tool to assess the quality of Cochrane and non-Cochrane SRs in overviews of healthcare interventions, 2) to identify challenges involved when using AMSTAR in overviews and to develop potential decision rules to overcome these challenges, and 3) to examine the potential impact of considering methodological quality when making inclusion decisions in overviews. To achieve the above objectives, we examined AMSTAR assessments, inter-rater reliability of AMSTAR, the association between AMSTAR assessments and inter-rater reliability, and the association between AMSTAR assessments and results and conclusions of SRs, for both Cochrane and non-Cochrane SRs.

Methods

Sample selection

This descriptive study used a convenience sample of seven overviews of healthcare interventions that were selected from overviews conducted by the Alberta Research Centre for Health Evidence between 2010 to 2016. These overviews examined questions related to the efficacy or effectiveness of multiple interventions for preventing or treating clinical conditions related to pediatric health [2228]. For each overview topic, all published English-language Cochrane and non-Cochrane SRs that met the overview’s inclusion criteria were identified from the reference list of the published overview and included in the study sample. All seven overviews included Cochrane SRs, and four also included non-Cochrane SRs. For the three overviews that did not include non-Cochrane SRs [2325], we conducted additional literature searches to locate and include relevant non-Cochrane SRs. The literature searches were conducted by an information specialist using the inclusion criteria and search dates from each overview. Screening and inclusion were conducted independently by two reviewers, with discrepancies resolved by consensus or third party adjudication. For feasibility, we restricted the scope of one overview topic by population (outpatients only) [24]. Search strategies for all overview topics are available in published overviews and upon request.

AMSTAR assessments

Two reviewers used the AMSTAR tool to independently assess the methodological quality of each SR included in the sample. Each of the eleven questions in the AMSTAR tool was answered “yes”, “no”, “can’t answer”, or “unable to assess”, and discrepancies between reviewers for individual AMSTAR questions were resolved via consensus or third party adjudication. In accordance with other empirical studies assessing measurement properties of AMSTAR [20, 21, 2932], all items scoring “yes” received one point, and points were summed to a maximum of eleven for each SR.

When conducting AMSTAR assessments, reviewers also independently documented any challenges or issues that arose when assessing AMSTAR in the context of the overview of interest, including which question(s) of the AMSTAR tool were impacted by each challenge and potential reasons why each challenge posed difficulties. Reviewers also independently developed decision rules that could be used to address the challenges identified. Challenges and decision rules were discussed between reviewers until agreement was reached and were then summarized narratively.

Result and conclusion statement assessments

The following data about the results and conclusions of each included SR were extracted: the outcome data for the first outcome listed in the corresponding overview (see Table 1); and the authors’ conclusion regarding that outcome, as stated in the abstract, discussion and/or conclusion section of each SR. For SRs that did not contain results data for the overview’s first-listed outcome, data were extracted for the overview’s second or third-listed outcome, if available. For SRs that included more than one comparison, the outcome data and conclusion statement for the comparison that was listed first in the relevant overview were extracted (as a proxy for the most clinically relevant comparison). Outcome data from the SRs contained within the procedural sedation overview were not extracted, because data for the comparator group were often not available.

Table 1 Overview topics and their included systematic reviews

Results and conclusions from each SR were classified based on published criteria [33, 34]. Results were classified as “favourable” (p < 0.10 in favour of the intervention, or finding described as ‘significant’), “neutral” (p > 0.10, or finding described as ‘not different between groups’), or “unfavourable” (p < 0.10 in favour of the comparator, or finding described as ‘favouring non-intervention comparator’). Conclusions were classified as “positive-strong” (authors stated that there was clear evidence of effectiveness, and no further research was required), “positive-weak” (authors stated that there seemed to be evidence of effectiveness, but more research was required to confirm the findings), “neutral” (authors stated that there was no or insufficient evidence about whether the intervention was effective or not, and more research was required to reach a conclusion), “negative-weak” (authors stated that there seemed to be evidence against use of the intervention, but more research was required to confirm the findings), or “negative-strong” (authors stated that there was clear evidence against use of the intervention, and no further research was required). One reviewer extracted and classified results data, and a second reviewer verified the classifications. Two reviewers independently extracted and classified conclusion statements, and discrepancies were resolved by consensus.

Data analysis

AMSTAR assessments were summarized using means and standard deviations (SD), and independent samples t-tests were used to compare Cochrane and non-Cochrane SRs. Medians and ranges were also examined. The number and percentage of positive responses per AMSTAR question were calculated, and chi square tests were used to compare Cochrane and non-Cochrane SRs. For descriptive purposes, AMSTAR assessments were divided into categories using established criteria (AMSTAR assessments of 0-3, 4-7, and 8-11) [17, 35, 36].

Inter-rater reliability for AMSTAR was calculated using the alternative chance-correlated coefficient (AC1) statistic, with 95% confidence intervals (CIs) [37, 38]. The AC1 statistic was used in place of the kappa statistic in order to overcome the limitation of the “kappa paradox”, which occurs when high agreement between reviewers results in low kappa scores [39, 40]. Interpretation of the AC1 statistic is similar to the kappa statistic: AC1 ranges from -1.00 (perfect disagreement) to 1.00 (perfect agreement), with a value of zero indicating reliability equivalent to chance. Accordingly, inter-rater reliability was classified using criteria established by Landis & Koch: “less than chance” (<0.00), “slight” (0.00 − 0.20), “fair” (0.21 − 0.40), “moderate” (0.41 − 0.60), “substantial” (0.61 − 0.80), and “almost perfect” (0.81 − 0.99) [41, 42]. An additional level of classification, “perfect” (1.00), was also added. In addition, percentage agreement, with 95% CIs, was also calculated (overall, and per AMSTAR question), and chi square tests were used to compare Cochrane and non-Cochrane SRs.

Pearson correlation coefficients were used to correlate AMSTAR assessments and inter-rater reliability for Cochrane and non-Cochrane SRs. The strength of the resulting correlations was described using established criteria as “negligible” (0.00 − 0.30), “low” (0.30 − 0.50), “moderate” (0.50 − 0.70), “high” (0.70 − 0.90), or “very high” (0.90 − 1.00) [43]. For non-Cochrane SRs, a post-hoc regression analysis using a quadratic model was also examined (with AMSTAR assessments as the independent variable). The relationships were then depicted graphically.

The distributions of result and conclusion assessments were summarized using the number and percentage of SRs obtaining each classification. Due to the ordinal nature of the data, Mann-Whitney U-tests were used to examine differences in the breakdown of result and conclusion assessments for Cochrane compared to non-Cochrane SRs, and Spearman correlation coefficients were used to correlate AMSTAR assessments with result and conclusion assessments.

A narrative summary of challenges involved in using AMSTAR in overviews was also provided, and potential solutions were described. AgreeStat 2015.5 was used to calculate AC1 statistics (Advanced Analytics LLC., Gaitherburg, MD, USA). SPSS version 23 was used to analyze numerical data (SPSS Inc., Chicago, IL, USA).

Results

Study sample

The study sample included 95 SRs—30 Cochrane SRs and 65 non-Cochrane SRs—across seven overview topics (Table 1). A list of included SRs, along with their AMSTAR assessments, can be found in Additional file 1. The mean AMSTAR assessment (/11) for the 95 SRs was 6.8 (SD: 2.9), with ratings ranging from 1 to 11. The mean AMSTAR assessment was 9.6 (SD: 1.6) for Cochrane SRs and 5.5 (SD: 2.4) for non-Cochrane SRs. AMSTAR assessments were significantly higher for Cochrane compared to non-Cochrane SRs by a mean of 4.1 points (95% CI: 3.2, 5.1; p < 0.001). This pattern of results was consistent across all overview topics (with the exception of the procedural sedation topic, which had no Cochrane SRs), with mean AMSTAR assessments ranging from 1.4 points to 5.5 points higher per topic area for Cochrane compared to non-Cochrane SRs (Additional file 2, first table). Eighty-seven percent of the Cochrane SRs had AMSTAR assessments of eight or more, compared to 22% of non-Cochrane SRs. On the other hand, 22% of the non-Cochrane SRs had AMSTAR assessments of three or less, compared to 0% of Cochrane SRs (Fig. 1a). Although AMSTAR assessments were not normally distributed, mean and median assessments were very similar, and median AMSTAR assessments were higher for Cochrane compared to non-Cochrane SRs both overall and per topic area (Additional file 2, first table).

Fig. 1
figure 1

AMSTAR assessments (a), and inter-rater reliability (b), for Cochrane and non-Cochrane systematic reviews. * p < 0.001 in favour of Cochrane systematic reviews (independent samples t-test); † Mean inter-rater reliability was one level higher for Cochrane compared to non-Cochrane SRs (“almost perfect” vs. “substantial”)

On average, Cochrane SRs received more positive responses for each of the eleven questions of the AMSTAR tool than the non-Cochrane SRs. This difference was statistically significant for eight questions (Q1-Q7, Q11; p ≤ 0.045). For Cochrane SRs, all eleven AMSTAR questions received positive responses more than 50% of the time (range: 53-100% positive responses per question), compared to 5/11 questions for non-Cochrane SRs (range: 14-88% positive responses per question) (Table 2, first column).

Table 2 Positive responses and inter-rater reliability per AMSTAR question, for Cochrane and non-Cochrane systematic reviews

Inter-rater reliability

The mean inter-rater reliability between reviewers for the 95 included SRs, as classified using the Landis & Koch levels of classification [41], was “substantial” (AC1: 0.74; 95% CI: 0.70, 0.79), with inter-rater reliability per SR ranging from “slight” (AC1: 0.09) to “perfect” (AC1: 1.00). The mean inter-rater reliability was one level higher for Cochrane compared to non-Cochrane SRs: “almost perfect” for Cochrane SRs (AC1: 0.84; 95% CI: 0.77, 0.91) compared to “substantial” for non-Cochrane SRs (AC1: 0.69; 95% CI: 0.64, 0.75) (Fig. 1b). For the six overview topics that included both Cochrane and non-Cochrane SRs, mean inter-rater reliability ranged from 0.05-0.45 points higher per topic area for Cochrane compared to non-Cochrane SRs, and was at least one level higher for Cochrane SRs for two of the six overview topics (Additional file 2, first table). The same pattern of results was observed when examining percentage agreement; namely, agreement was higher for Cochrane compared to non-Cochrane SRs both overall, and per topic area (Additional file 2, first table).

Inter-rater reliability for the eleven individual questions of the AMSTAR tool ranged from “substantial” (Q2, Q8, Q10, Q11) to “perfect” (Q5, Q6) for the Cochrane SRs, and from “moderate” (Q8) to “almost perfect” (Q6) for the non-Cochrane SRs. Inter-rater reliability was at least one level higher for Cochrane compared to non-Cochrane SRs for 8/11 questions (Q1, Q3-Q9) (Table 2, second column). A similar pattern was observed when examining percentage agreement between reviewers: Cochrane compared to non-Cochrane SRs had higher agreement for 9/11 questions (Q1, Q3-Q9, Q11), and this difference was significant for 3/11 questions (Q3, Q5, Q7) (Additional file 2, second table).

Challenges involved when using AMSTAR in overviews, and potential decision rules

Four main challenges were identified when assessing AMSTAR in the context of overviews. These four challenges primarily affected the AMSTAR questions concerned with quality assessments and data extraction and analysis (i.e., Q5-Q10). Each challenge is described below, along with the decision rule (and rationale) that was developed to help address each challenge (Table 3).

Table 3 Description of challenges identified when using AMSTAR in overviews, with corresponding recommendations

First, many non-Cochrane SRs provided limited detail when reporting the characteristics and evaluating the quality of their included primary studies. This often made it difficult to determine whether certain AMSTAR criteria were met, and it was unclear whether deficiencies in SRs were related to methodological quality or reporting. Overview authors rely upon the information reported in the included SRs when conducting their overview; therefore, we recommend awarding points only if the amount and quality of information reported in the SR is sufficient for use at the overview level.

Second, some SRs that examined the same interventions for the same disorder analyzed outcome data in different ways and/or reached different conclusions. It was difficult to determine whether SR methods were appropriate when different SRs used different methodologies, and we were uncertain whether multiple similar SRs should be compared against each other when conducting AMSTAR assessments. In these instances, it may not be possible to objectively determine which conclusions or methods of analysis are most appropriate or valid; therefore, we recommend awarding points only if SR authors provide appropriate justification for why they chose a certain method of analysis and/or why they reached a certain conclusion.

Third, many SRs were broader in scope than the clinical question posed in the overview, meaning that not all primary studies included in the SRs were subsequently included in the overview. Reviewers were unsure whether to assess the quality of the SRs in their entirety or whether to assess the quality of only those components of the SRs that were relevant to the overview topic. However, attempting to isolate only the components of interest in SRs was unnecessarily difficult and time-consuming, and we agreed that it was important to capture information about the conduct of the SRs as a whole. Therefore, we recommend that overview authors assess the quality of the overall SRs, without trying to “piece apart” only those components that are relevant to the overview topic.

Lastly, difficulties were encountered when assessing the quality of non-Cochrane SRs that included both primary studies and other SRs. It was unclear whether and how to assess the quality of the SRs that were embedded within the original SRs, and we were uncertain whether the AMSTAR assessments of the original SRs should be affected by the quality of the embedded SRs. When conducting AMSTAR assessments, we found that it was often not possible, nor desirable, to integrate the quality of the embedded SRs into the AMSTAR assessments of the original SRs. Therefore, when SRs include both primary studies and other embedded SRs, we recommend that overview authors treat each embedded SR as an independent publication by retrieving and assessing the full text of that SR for inclusion into the overview. The AMSTAR assessments for the overview can then proceed as usual by assessing the quality of each included SR separately.

Association between AMSTAR assessments and inter-rater reliability

For Cochrane SRs, there was a significant positive linear correlation of “moderate” strength between AMSTAR assessments and inter-rater reliability (AC1), r(29) = 0.62, p < 0.001. Thus, inter-rater reliability increased as quality of Cochrane SRs increased. There was no evidence of a linear correlation between AMSTAR assessments and inter-rater reliability for non-Cochrane SRs (p = 0.38). However, visual examination of the scatterplot (Fig. 2) suggested a quadratic (curvilinear) relationship, with higher inter-rater reliability for non-Cochrane SRs that received both lower and higher assessments and lower inter-rater reliability for non-Cochrane SRs that received moderate assessments. Therefore, a quadratic model was examined. Though not statistically significant (p = 0.09), results suggest that inter-rater reliability may be lower for non-Cochrane SRs with moderate AMSTAR assessments and higher for non-Cochrane SRs with lower and higher ratings (Fig. 2).

Fig. 2
figure 2

Pearson correlation between AMSTAR assessments and inter-rater reliability, for Cochrane and non-Cochrane systematic reviews. Linear relationship (Cochrane): p < 0.001; Quadratic relationship (non-Cochrane): p = 0.09

Association between AMSTAR assessments and results and conclusions of systematic reviews

There was no significant difference in the distribution of the results assessments for Cochrane compared to non-Cochrane SRs (p = 0.14) (Table 4). There was also no significant evidence of correlation between AMSTAR assessments and the direction of effect for the main result of each SR when looking at all SRs combined (p = 0.53), Cochrane SRs only (p = 0.30), or non-Cochrane SRs only (p = 0.72).

Table 4 Distribution of result and conclusion assessments for Cochrane and non-Cochrane systematic reviews

The data indicated a significant difference in the distribution of the conclusion assessments for Cochrane compared to non-Cochrane SRs (p = 0.035). Specifically, Cochrane SRs reported significantly more “negative” conclusions, whereas non-Cochrane SRs reported significantly more “positive” conclusions (Table 4). Despite these group differences, AMSTAR assessments were not correlated with the direction and strength of effect for the main conclusion of each SR when looking at all SRs combined (p = 0.17), Cochrane SRs only (p = 0.68), or non-Cochrane SRs only (p = 0.80).

Discussion

The current study used a convenience sample of 95 SRs included across seven overview topics to provide empirical evidence on issues surrounding quality assessments of SRs in overviews. This study found that AMSTAR assessments and inter-rater reliability were higher for Cochrane compared to non-Cochrane SRs; these results were consistent within each overview topic and for many of the individual questions of the AMSTAR tool. Minor challenges were encountered when assessing quality of SRs in the context of overviews, but decision rules were developed and recommendations for overview authors were provided. Results also suggested that inter-rater reliability of Cochrane and non-Cochrane SRs may be lower for SRs with moderate AMSTAR assessments and higher for SRs that were assessed as strong; inter-rater reliability may also be higher for non-Cochrane SRs assessed as weak. Consistent with a previous study [33], we found that the conclusions, but not the results, of Cochrane and non-Cochrane SRs differed systematically; however, the current study found no evidence that AMSTAR assessments were correlated with results or conclusions of SRs.

Taken together, the results of the current study suggest that AMSTAR is a useful tool for assessing the quality of Cochrane and non-Cochrane SRs in overviews. Authors should be aware of some minor challenges they may face when applying AMSTAR in overviews. Specifically, there may be deficiencies in the reporting of some SRs; SRs examining similar topics may sometimes make different methodological decisions; the scope of some SRs may differ from, or be broader than, the scope of the overview; and some non-Cochrane SRs may include other SRs as well as primary studies. We recommend that overview authors use a priori decision rules, such as those presented in Table 4, to circumvent these challenges and help ensure consistent judgments across reviewers. In addition, Cochrane currently recommends that quality assessments of SRs in overviews be conducted independently, in duplicate, with a process for consensus [3], and the results of this study provide empirical evidence to support this recommendation. Specifically, Cochrane SRs showed some variation in AMSTAR assessments and inter-rater reliability, and non-Cochrane SRs had lower AMSTAR assessments and inter-rater reliability combined with higher variation for both of these variables. To promote transparency, we also recommend that overview authors provide breakdowns of individual AMSTAR questions for all included SRs.

The current study found that the AMSTAR tool can be used with high inter-rater reliability to successfully identify SRs with lower quality assessments that may be difficult to use in overviews due to gross deficiencies in conduct and reporting. This study also found that AMSTAR assessments were not correlated with results or conclusions of SRs. The lack of correlation may be due to a common criticism of AMSTAR— namely, that AMSTAR may actually assess quality of reporting as well as (or instead of) methodological quality [4446]. Quality of reporting may not necessarily be associated with SR results and conclusions. However, reporting is closely tied to usability of SRs in overviews, since overview authors cannot effectively use SRs in overviews if the data are missing, inadequately reported, or reported inconsistently [7]. The results of this study suggest that overview authors may consider using AMSTAR assessments to guide inclusion decisions (e.g., to identify and exclude poorly conducted and/or reported SRs that may be difficult to use in overviews). Using the AMSTAR tool to inform inclusion decisions may not introduce bias into the overview process since the results and conclusions of SRs assessed as weak and strong did not differ systematically. However, overview authors should use their judgment when deciding whether or not to use results of AMSTAR assessments to guide inclusion decisions. Factors to consider include the quality of the overall body of SR evidence and the purpose of the overview. For example, overview authors may not need to include poorly conducted and reported SRs when there are adequately conducted SRs that address all main interventions and outcomes of interest. In contrast, they may choose to retain poorly conducted and reported SRs when the overall body of SR evidence is generally poor or when the purpose of the overview is to describe the complete body of SR evidence on a topic. Though overall AMSTAR assessments may obscure effects of individual questions [47], SRs that score poorly across most AMSTAR domains likely have multiple serious limitations that subsequently make them difficult to include and use in overviews. To promote transparency, overview authors should establish a priori decision rules indicating whether and how quality assessments will be used to inform inclusion decisions; authors should also clearly indicate in their overview which SRs (if any) were excluded based on results of quality assessments or specific methodological deficiencies.

This study identified areas where authors can enhance the conduct and/or reporting of their Cochrane and non-Cochrane SRs. Previous research shows that Cochrane SRs generally have higher methodological rigour than non-Cochrane SRs [810], and the results of the current study extend this finding to SRs included in a sample of overviews. Non-Cochrane SRs showed room for improvement for all but two AMSTAR domains (Q6: study characteristics; Q9: methods to combine studies). However, not all Cochrane SRs received high assessments (13% were rated between 4-7 on AMSTAR), and two AMSTAR domains (Q10: publication bias; Q11: conflicts of interest) showed considerable room for improvement. This study also found that inter-rater reliability varied in conjunction with both AMSTAR assessments and type of SR (Cochrane, non-Cochrane). Inter-rater reliability was lower for SRs that obtained moderate AMSTAR assessments (as opposed to very weak or strong assessments); it was also lower for non-Cochrane SRs both overall and for many of the individual AMSTAR domains (e.g., Q3: search strategy; Q7: scientific quality; Q8: formulating conclusions). This may be because discrete “yes/no” judgments become more difficult when quality and/or reporting is mediocre, or when only some criteria for multi-part questions are addressed in the SR. The variable reporting of non-Cochrane SRs, combined with limits on manuscript length, may also contribute to difficulties conducting AMSTAR assessments. Adhering to accepted standards of conduct and reporting, such as the methods guidance contained within The Cochrane Handbook for Systematic Reviews of Interventions [48] and the Preferred Reporting Items for Systematic reviews and Meta-Analysis (PRISMA) reporting guidelines [49], can help ensure adequate quality and reporting of SRs (and may also increase the inter-rater reliability of quality assessments for SRs). This could, in turn, make SRs easier to assess, include, and use in overviews.

The current study used a convenience sample of overview topics that met specific inclusion criteria and shared certain characteristics (e.g., all overviews were conducted by our research group and examined interventions for disorders related to pediatric health). Though the researchers conducting this study were authors of all included overviews, they were only authors of three of the 95 included SRs, and duplicate independent quality assessments were conducted to mitigate the potential for reviewer bias. It is possible that the overview topics selected for this study may have influenced the results to some extent; however, to increase the generalizability of the knowledge gained from this study, overviews were selected that examined a range of populations (e.g., infants, children, adolescents), interventions (e.g., pharmacological, non-pharmacological), comparators (e.g., placebo, active comparators), research questions (e.g., prevention, treatment), and topic areas (e.g., acute respiratory infections, gastrointestinal diseases, skin disorders). In addition, we found that the AMSTAR assessments obtained for the SRs in our sample of seven overviews fell within the range of scores observed in a broader sample of overviews (all relevant overviews identified by [4] and [5] and contained within issue 12, 2016 of the CDSR). Our inter-rater reliability assessments were also similar to published data on agreement for AMSTAR [50]. Thus, the results of the current study, and subsequent recommendations for quality assessment of SRs in overviews, may generalize to a range of overviews examining healthcare interventions. However, results and recommendations should not be generalized to overviews that address broader or different clinical questions (e.g., diagnostic test accuracy, prognostic, or qualitative overviews).

It should be noted that there is debate surrounding whether or not overall AMSTAR scores should be calculated. The developers of AMSTAR addressed this concern by ensuring (through statistical analysis) that the component questions do not overlap and by validating the overall score against an external standard. Thus, they concluded that the overall score is meaningful [21]. However, overall quality scores assume that all questions are equal (which can be difficult to justify) [50, 51], summing individual items may artificially increase the precision of the assessment, and studies have shown that incorporating overall quality of primary studies into meta-analyses can alter effect estimates in SRs [52, 53]. Despite the uncertainty regarding use of summary scores, there is a precedent for calculating and reporting overall AMSTAR assessments both in empirical studies assessing measurement properties of AMSTAR [20, 21, 2932, 5456] and in overviews of healthcare interventions [16, 18, 35, 36, 5771], and incorporating overall quality of SRs into results of overviews has not been found to alter overview results [72]. Other potential limitations of AMSTAR may include difficulty meaningfully differentiating between several of the response options (“no”, “not applicable”, and “can’t answer”) and difficulty answering multi-part questions when only some criteria are met [44, 45]. There are also no questions in the AMSTAR tool examining whether appropriate methods were used in SRs to assess the quality of a body of evidence or to conduct subgroup and/or sensitivity analyses [44, 45], and in the context of overviews AMSTAR cannot capture potentially important differences in comprehensiveness and recency of searches across SRs. As previously mentioned, the AMSTAR tool may also assess aspects related to quality of reporting as opposed to methodological quality [4446]. Despite these potential limitations, the results of this study demonstrate that reviewers can conduct AMSTAR assessments with adequate inter-rater reliability, using decision rules to help overcome some of the above-listed challenges.

In addition to AMSTAR, other quality assessment tools exist. AMSTAR 2 is currently being developed in response to feedback from users of the original AMSTAR tool [73], and the Risk Of Bias In Systematic reviews (ROBIS) tool was recently published to assess issues related to the risk of bias (as opposed to the methodological quality) of SRs [74]. Methodological research examining the reliability, validity, and feasibility of AMSTAR 2 and ROBIS in overviews would be valuable. In addition, research comparing AMSTAR, AMSTAR 2 and ROBIS on important outcomes (e.g., quality assessments, inter-rater reliability, and time to complete assessments) and across important comparisons (e.g., Cochrane vs. non-Cochrane SRs, SRs with meta-analyses vs. narrative summaries, and publication year of SRs) may provide insight into the trade-offs involved in selecting one tool over another for use in overviews.

Conclusions

There is currently limited guidance available for researchers conducting overviews of healthcare interventions. This gap in guidance is most pronounced when examining methods for conducting the latter stages of the overview process (e.g., quality assessments and data extraction and analysis). The current study plays a role in addressing this gap in guidance. It contributes empirical evidence and recommendations regarding the use of the AMSTAR tool to assess the methodological quality of SRs in overviews. Based on the results of this study, we show that AMSTAR can be used successfully to assess the methodological quality of both Cochrane and non-Cochrane SRs included in overviews of healthcare interventions. When using AMSTAR in overviews, individual assessments should be reported for each of the eleven questions of the AMSTAR tool. Results of quality assessments of SRs can then be used alongside quality assessments of primary studies and outcome data to help contextualize the results and conclusions of overviews.