Background

A Measurement Tool to Assess Systematic Reviews (AMSTAR) is a commonly used tool to assess the methodological quality of systematic reviews [1]. It has demonstrated satisfactory reliability and construct validity [2] for systematic reviews of randomized controlled trials of treatment interventions [3]. AMSTAR is widely used, and some users consider it the most appropriate (and best) available tool [4–6], while others have found it problematic [7–17] and have therefore modified it [7, 11, 15, 18–30]. In this commentary, we summarize our experience using AMSTAR along with the experiences of others, describe several key issues, and provide suggestions for improvement (Table 1).

Table 1 Concerns regarding AMSTAR items, instructions, responses, and suggested revisions

Main text

The stated objective of AMSTAR is to assess the methodological quality of systematic reviews [1], which refers to whether the authors of a study (or, presumably, a systematic review) did the best that they could [31]. The items of AMSTAR, however, largely address quality of reporting (e.g., items 5 and 6) [32] and risk of bias [33] (e.g., items 8 and 9) rather than methodological quality. Several items should be amended to be consistent with the stated objective.

AMSTAR encompasses most of the key constructs that are relevant to the assessment of the methodological quality of systematic reviews; however, one critical construct is missing, as other investigators have also noted [9, 34–36]: an explicit and reproducible method for assessing the quality of the body of evidence for each important outcome (i.e., the confidence in the estimates of effect [37]). We suggest revising item 8 to focus on this construct, separating it from the assessment of the quality of individual studies (item 7) (Table 1). AMSTAR also lacks an item that assesses subgroup and sensitivity analyses [9, 36]. Subgroup analyses are important to decision-makers because treatment effects may differ across populations. Similarly, sensitivity analyses specified a priori help to assess the robustness of the review’s findings [31]. An item addressing subgroup and sensitivity analyses should be added (new item 12, Table 1).

Some AMSTAR items and their instructions are unclear and need revision (Table 1). For example, item 4, regarding the “status of publication,” might refer to either the inclusion or the exclusion of gray literature. The instructions suggest that gray literature should be included; however, the relevance of gray literature depends closely on the review question, and its inclusion may not always be necessary. In AMSTAR [1], foreign-language publications are considered gray literature; however, this is not consistent with commonly used definitions [38].

The response options (yes, no, cannot answer, not applicable) are problematic [9, 39–43]. For example, “cannot answer” can be difficult to interpret and to distinguish from “no” when no information is provided. A common approach to quality assessment is to assume that if the authors did not report a step, then it did not happen; thus, “no” would be the appropriate response. The instructions, however, suggest that “cannot answer” should be used when the item is “relevant but not described,” which means a “no” response would rarely be used, as authors seldom report explicitly that they did not do something. In addition, “not applicable” is appropriate for only two items (items 9 and 10), and only when those steps are not possible or appropriate; all other items should always be addressed.

The guidance for scoring individual items and for obtaining a total score is unclear. In AMSTAR [1], if all criteria for an individual item are met (i.e., “yes”), the item receives a score of “1,” and the sum of all “yes” responses gives the total score out of 11. Systematic reviews, however, often only partially meet an item’s criteria, for example listing the databases and dates searched but, perhaps because of journal word limits, not providing the search strategies or keywords. To address the issue of evaluating multiple constructs within a single AMSTAR item, investigators have modified its scoring to allow points for partially fulfilled items [7, 9, 34, 35, 39]. Kung and colleagues developed R-AMSTAR [44], subdividing each item into four components, with a total score ranging from 11 to 44; higher scores indicate better methodological quality. R-AMSTAR has been used by a number of investigators [5, 45–50], and a comparison with AMSTAR concluded that R-AMSTAR provides greater guidance for each item and is more reliable and useful [51].
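To make the scoring ambiguity concrete, the following is a minimal sketch in Python (purely illustrative; AMSTAR publishes no reference implementation, and the partial-credit weights shown are hypothetical) contrasting the original all-or-nothing scoring with the kind of partial-credit variant that investigators have adopted:

```python
# Illustrative only: AMSTAR defines no reference implementation.
# Responses per item: "yes", "no", "cannot answer", "not applicable".

ITEMS = range(1, 12)  # AMSTAR has 11 items

def amstar_total(responses: dict) -> int:
    """Original scoring: 1 point per 'yes'; everything else scores 0."""
    return sum(1 for item in ITEMS if responses.get(item) == "yes")

def partial_credit_total(scores: dict) -> float:
    """Hypothetical partial-credit variant: e.g., 0.5 for a partly met
    item such as databases listed but no search strategy reported."""
    return sum(scores.get(item, 0.0) for item in ITEMS)

responses = {i: "yes" for i in ITEMS}
responses[4] = "cannot answer"   # gray literature search unclear
responses[9] = "not applicable"  # no meta-analysis performed
print(amstar_total(responses))   # 9: 'not applicable' counts the same as 'no'
```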

In addition, AMSTAR provides no guidance on how to combine individual item scores from multiple assessors other than stating that consensus should be reached for each item. We have averaged AMSTAR scores across assessors so that each independent evaluation is reflected [52]. Other investigators have used similar approaches, such as averaging scores between two assessors when discordant by one or two points and involving a third assessor when scores differed by three or more points [53, 54].
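A minimal sketch of a reconciliation rule of this kind, assuming the thresholds reported in refs [53, 54]; the function name and the use of None to signal escalation are our own illustrative choices:

```python
from typing import Optional

def combine_assessors(score_a: int, score_b: int) -> Optional[float]:
    """Average the two total scores when they differ by one or two
    points; return None to signal that a third assessor is needed
    when they differ by three or more (thresholds per refs [53, 54];
    this is a sketch, not part of AMSTAR itself)."""
    if abs(score_a - score_b) <= 2:
        return (score_a + score_b) / 2
    return None  # escalate to a third assessor

print(combine_assessors(8, 9))  # 8.5 (averaged)
print(combine_assessors(5, 9))  # None (discordant by 3 or more)
```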

AMSTAR was deliberately developed without guidance on how to translate the total score into categorical ratings for the overall assessment of the systematic review’s quality (e.g., good, fair, poor) [1, 55]. Investigators have used various thresholds to define quality categories (e.g., 0–4 vs. 0–3 for poor quality), making it difficult to compare assessments across reviews. AMSTAR was also designed under the assumption that each item carries equal weight in the systematic review’s overall quality [2]. Other investigators have dealt with this issue by assigning greater weight to items they consider more important [53, 56–58]. For example, Jacobs and colleagues rated systematic reviews as high quality if items 3, 6, 7, and 8 were met, regardless of the total score [57]. An additional problem with the current scoring method is the equivalence of “not applicable,” “no,” and “cannot answer” (all scored as zero), because an item rated as “not applicable” should not be taken into account in the total score. Clearer guidance on calculating a total score is needed, along with an acknowledgement of the limitations of scoring across all items, should users of AMSTAR choose to calculate a total score. We believe that obtaining a total score should be avoided, as it has been shown to be problematic [59].
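To illustrate the “not applicable” problem, here is a sketch of one way a scorer could exclude inapplicable items from the denominator, reporting the proportion of applicable items met rather than a raw sum. This is our own illustrative construction, not an endorsed AMSTAR method, and, as noted above, we would avoid total scores altogether:

```python
def proportion_met(responses: dict) -> float:
    """Score only over the items that actually apply, so that
    'not applicable' no longer counts against the review the way
    'no' does. (A sketch under our stated assumptions, not an
    official AMSTAR scoring rule.)"""
    applicable = [r for r in responses.values() if r != "not applicable"]
    met = sum(1 for r in applicable if r == "yes")
    return met / len(applicable) if applicable else 0.0

responses = {i: "yes" for i in range(1, 12)}
responses[9] = responses[10] = "not applicable"  # no meta-analysis
print(f"{proportion_met(responses):.2f}")  # 1.00, vs. 9/11 under raw summing
```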

Conclusion

AMSTAR is a useful tool for assessing the quality of systematic reviews; however, some modifications would improve its usability, reliability, and validity. The issues discussed in this commentary are not limited to our own experiences but are shared by many investigators who have used the tool. We have provided suggestions for improving AMSTAR; however, any revised tool needs to be empirically tested for reliability and validity, and additional refinements will undoubtedly be needed. We look forward to further dialog on AMSTAR and to subsequent revisions and evaluations.