Background

With the ever-growing amount of published data, systematic reviews (SRs) and meta-analyses (MAs) have become recognised methods for summarising evidence in support of evidence-based decision-making in healthcare [1,2,3]. High-quality systematic reviews/meta-analyses (SR/MAs) are considered acceptable and important for decision-makers [4, 5]. However, with the increasing number of SR/MAs there are often issues of reliability, particularly when SR/MAs have conflicting results and suffer from extensive methodological shortcomings [1, 6, 7]. In this context, users of the literature must be able to distinguish lower- from higher-quality SR/MAs to support healthcare decision-making. Instruments to assess the quality of conduct of SR/MAs have been designed and validated for this purpose.

Currently, two instruments, namely AMSTAR-2 (‘A Measurement Tool to Assess Systematic Reviews, version 2’) and ROBIS (‘Risk of Bias in Systematic Reviews’), are commonly used to formally assess the quality of conduct of SR/MAs. Both instruments provide a structured approach for readers to perform rapid and reproducible assessments of quality, including a detailed evaluation of conduct and methodological rigour; however, their original constructs and specific details differ [8, 9]. AMSTAR-2 was developed as a critical appraisal tool for SR/MAs that include randomised or non-randomised studies of healthcare interventions and is an updated version of the widely accepted original AMSTAR, which has been in use for over a decade [10]. AMSTAR-2 comprises 16 items, of which seven were determined to be critically important to the validity of a review, while the other nine are considered non-critical. Users of AMSTAR-2 are asked to make an overall judgment of ‘high’, ‘moderate’, ‘low’, or ‘critically low’ confidence in the results of a SR/MA based on the assessment of the critical and non-critical items [11].

ROBIS focuses specifically on the risk of bias (RoB) in the SR/MA and comprises three phases: assessment of relevance (optional); identification of concerns within the review process that put the SR/MA at RoB; and judgement of RoB. The second phase is composed of four domains with 21 items highlighting specific issues that need to be considered. In the third phase, a judgement of ‘low’, ‘high’, or ‘unclear’ RoB is assigned after consideration of the assessments performed in the second phase [12].

Upon applying both instruments, users will find that they are similar in their general approach; however, differences do exist. A number of studies have investigated the similarity of assessments between the original AMSTAR and ROBIS tools [13,14,15]. So far, however, only one study has investigated the comparability of the two instruments in terms of their domains and corresponding items, demonstrating a satisfactory correlation between the overall ratings of AMSTAR-2 and ROBIS while highlighting the differences in the conceptual frameworks of the tools [16].

There has been a profusion of SR/MAs in the health sciences literature [1], and several studies have already investigated their quality [7, 17, 18]. Nutritional epidemiology is an area of scientific interest to the public, and while the quality of SR/MAs in the field has recently been shown to be sub-optimal [7], the related and burgeoning body of SR/MAs assessing nutrition for cancer prevention has not been systematically evaluated. In this study, performed within the context of a systematic survey addressing the trustworthiness of SR/MAs assessing nutrition for cancer prevention, we aimed to compare the similarities, inter-rater reliability (IRR), and methodological gaps of the instruments used for assessing the quality of conduct of those SR/MAs.

Methods

The protocol for the systematic survey was prepared a priori and registered in PROSPERO under identification number CRD42019121116.

Searches, eligibility, and sample selection

We systematically searched MEDLINE, Embase, and the Cochrane Library for SR/MAs published between January 2010 and November 2018 that examined the effects of any nutritional intervention/exposure for cancer prevention in the general population or in people at higher risk for cancer. Search strategies are provided in the Supplementary file. We accepted studies labelled as SR/MAs in the title, abstract, or full text whose eligibility criteria included primary studies with a comparator group (i.e., interventional studies with a control group, such as randomised or non-randomised controlled trials, or observational studies with participants categorised by intake or exposure level, e.g. lower versus upper quartiles). The methods have been described in detail in the companion paper [19].

Screening and data extraction

Following a calibration exercise, pairs of independent reviewers performed study selection, data extraction, and both AMSTAR-2 and ROBIS assessments, with conflicts resolved by discussion or by consultation with a third reviewer. Each step was preceded by a calibration exercise to ensure a common understanding of the inclusion criteria and to discuss any ambiguities. With respect to the quality assessments, a number of authors had considerable experience in conducting SRs and assessing their methodological quality (MJS, DS, JZ, MK, BCJ, MMB), while the remaining authors (PT, WS, MG, AS, AW, KK, JBC) underwent training. AMSTAR-2 and ROBIS assessments were piloted on a set of three studies.

Quality of conduct and risk of bias assessment instruments

AMSTAR-2 consists of 16 items for which ‘yes (Y)’ or ‘no (N)’ judgments can be applied. For five items (2, 4, 7, 8, 9), ‘partially yes (PY)’ can be selected in addition to ‘Y’ or ‘N’. Items 11, 12, and 15 are not considered if a meta-analysis was not undertaken. Among the 16 items, seven are considered critical: ‘development of the study protocol’ (item 2); ‘comprehensiveness of the literature search strategy’ (item 4); ‘providing a list of excluded studies with reasons’ (item 7); ‘appropriate assessment of the RoB of individual included studies’ (item 9); ‘use of appropriate meta-analytical methods’ (item 11); ‘consideration of RoB when interpreting and discussing the results’ (item 13); and ‘assessment of the presence of publication bias and discussion of its impact on the results’ (item 15). The remaining nine items are considered non-critical. After judging the 16 items, investigators make an overall judgment of ‘high’, ‘moderate’, ‘low’, or ‘critically low’ confidence in the results of the target SR/MA, as follows (see the code sketch after this list) [11]:

  • High: no major flaws in critical items and no more than one flaw in non-critical items;

  • Moderate: no major flaws in critical items and more than one flaw in non-critical items;

  • Low: one major flaw in a critical item, with or without flaws in non-critical items;

  • Critically low: more than one major flaw in critical items, with or without flaws in non-critical items.
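The following is a minimal sketch of this decision rule as a function (our own illustration; the function name and the flaw-count interface are simplifications we introduce here, since the tool asks raters to judge each of the 16 items rather than to count flaws directly):

```python
def amstar2_overall_rating(critical_flaws: int, non_critical_flaws: int) -> str:
    """Overall AMSTAR-2 confidence rating from counts of major flaws in
    critical items and flaws in non-critical items (per the rule above)."""
    if critical_flaws == 0:
        return "high" if non_critical_flaws <= 1 else "moderate"
    return "low" if critical_flaws == 1 else "critically low"

# Example: no critical flaws but three non-critical flaws -> 'moderate'.
print(amstar2_overall_rating(0, 3))
```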

ROBIS consists of 21 items assigned to four domains (study eligibility criteria; identification and selection of studies; data collection and study appraisal; synthesis and findings), for which respondents can answer ‘yes (Y)’, ‘probably yes (PY)’, ‘probably no (PN)’, ‘no (N)’, or ‘no information (NI)’. The concerns associated with each of the four domains are then judged as ‘low’, ‘high’, or ‘unclear’. On the basis of the domain assessments, supported by consideration of whether the SR/MA interpreted its findings correctly, whether the included studies were relevant to the SR/MA’s question, and whether the results were presented fairly and thoroughly, a final judgement is made on whether the SR/MA as a whole is at ‘low’, ‘high’, or ‘unclear’ risk of bias [12].

Domain matching

For data collection and analyses we used Microsoft Excel (version 2016). After reviewing all items of each instrument, we categorised the items of both instruments, based on conceptual similarities, under the four main domains of the ROBIS instrument:

  • Domain 1: Study eligibility criteria;

  • Domain 2: Identification and selection of studies;

  • Domain 3: Data collection and study appraisal;

  • Domain 4: Synthesis and findings.

After assessing the concept, approach, and definitions of each item, we matched items from each instrument to produce 11 comparisons encompassing 12 AMSTAR-2 and 14 ROBIS items. In some cases, two or more items from one instrument were combined within a single comparison (e.g. AMSTAR-2 item 4 was compared with ROBIS items 2.1, 2.2, 2.3, and 2.4). For 10 comparisons we judged the items of both instruments as satisfactorily comparable with respect to concept, approach, and definitions, while for one comparison (examination of publication bias/robustness of the results) we judged the items as only partially overlapping (i.e. robustness of SR/MA results includes an assessment of publication bias as well as other considerations). Four items in AMSTAR-2 and seven items in ROBIS did not sufficiently overlap in concept, approach, and description. Table 1 provides a summary of the overlapping and non-overlapping items.

Table 1 Comparison of matched AMSTAR-2 and ROBIS items
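To make the structure of these comparisons concrete, the following is a minimal sketch of how the matching could be represented as data; only the pairings explicitly named in this paper are shown, and the full set of 11 comparisons is given in Table 1:

```python
# Partial, illustrative representation of the item matching; only the
# pairings explicitly named in the text are included here.
comparisons = [
    {"construct": "comprehensive literature search",
     "amstar2_items": [4], "robis_items": ["2.1", "2.2", "2.3", "2.4"],
     "overlap": "full"},
    {"construct": "duplicate data extraction",
     "amstar2_items": [6], "robis_items": ["3.1"],
     "overlap": "full"},
    {"construct": "description of included studies",
     "amstar2_items": [8], "robis_items": ["3.2"],
     "overlap": "full"},
    # ... 7 further fully overlapping comparisons, plus the partially
    # overlapping publication bias / robustness comparison (see Table 1).
]
```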

Reliability

To compare the matched items across instruments, we calculated agreement using Gwet’s AC1 statistic (Gwet’s first-order agreement coefficient) [20, 21]. To do so, pairs of reviewers independently assessed each SR/MA using AMSTAR-2 and ROBIS. When we found ambiguities in our assessments, we discussed them, and if we could not reach consensus a third senior reviewer was consulted. Subsequently, the consensus appraisals of matched items for each study were used to calculate the IRR between the two instruments. Assumptions for each comparison are provided in the footnotes of Table 1. Based on established guidance, we classified agreement as poor (≤ 0.00), slight (0.01–0.20), fair (0.21–0.40), moderate (0.41–0.60), substantial (0.61–0.80), and almost perfect (0.81–1.00) [16, 22].
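Gwet’s AC1 corrects observed agreement for chance agreement using the average prevalence of each category across raters. The following is a minimal, hypothetical sketch of the computation for two sets of categorical ratings (our own illustration, not the software used in this study; here the two ‘raters’ could be the consensus AMSTAR-2 and ROBIS judgements on a matched item, dichotomised per the assumptions in Table 1):

```python
def gwet_ac1(ratings_a, ratings_b):
    """Gwet's first-order agreement coefficient (AC1) for two raters
    assigning categorical ratings to the same set of subjects."""
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    q = len(categories)
    # Observed agreement: proportion of subjects rated identically.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: pi_k is the average marginal proportion of
    # category k across both raters; p_e = sum(pi_k*(1-pi_k)) / (q-1).
    marginals = [(ratings_a.count(k) + ratings_b.count(k)) / (2 * n)
                 for k in categories]
    p_e = sum(pi * (1 - pi) for pi in marginals) / max(q - 1, 1)
    return (p_a - p_e) / (1 - p_e)

def classify_agreement(coef):
    """Map a coefficient to the agreement bands used in this study [16, 22]."""
    for upper, label in [(0.00, "poor"), (0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial"),
                         (1.00, "almost perfect")]:
        if coef <= upper:
            return label

# Hypothetical example: two instruments' judgements on five SR/MAs.
a = ["Y", "Y", "N", "Y", "N"]
b = ["Y", "Y", "N", "N", "N"]
print(gwet_ac1(a, b))                      # 0.6
print(classify_agreement(gwet_ac1(a, b)))  # 'moderate'
```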

Results

We identified 24,739 records, of which 20,413 were screened after duplicates were removed. Based on the eligibility criteria, we included 737 studies, from which a random sample of 101 articles was selected and analysed. The study flow is presented in Fig. 1 [23].

Fig. 1 PRISMA 2020 flow diagram. PRISMA - Preferred Reporting Items for Systematic Reviews and Meta-Analyses

The 11 comparisons produced varying levels of agreement, presented below.

Domain 1: study eligibility criteria

Two comparisons were created within this domain. The comparisons addressed the comprehensiveness of eligibility criteria and the prospective publication of review methods (protocol), both with almost perfect agreement: 0.87 (95% CI, 0.78 to 0.96) and 0.99 (95% CI, 0.97 to 1.00), respectively.

Domain 2: identification and selection of studies

Two comparisons were formed within this domain. One addressed the comprehensiveness of search strategies, with a substantial level of agreement: 0.79 (95% CI, 0.74 to 0.85); the other investigated duplicate study selection, with an almost perfect level of agreement: 0.87 (95% CI, 0.77 to 0.96).

Domain 3: data collection and study appraisal

Three comparisons were formed within this domain. One addressed duplicate data extraction, with an almost perfect level of agreement: 0.88 (95% CI, 0.79 to 0.98). A second explored the comparability of items regarding the adequate description of the characteristics of studies included in the review, showing a moderate level of agreement: 0.60 (95% CI, 0.44 to 0.76). A third addressed the use of appropriate RoB assessment methods, showing an almost perfect level of agreement: 0.88 (95% CI, 0.79 to 0.98).

Domain 4: synthesis and findings

Four comparisons were created within this domain. Three were considered fully overlapping, while one was partially overlapping. One comparison, concerning the appropriate statistical combination of results, showed an almost perfect level of agreement: 0.81 (95% CI, 0.69 to 0.92). Two comparisons, one regarding the assessment and interpretation of biases in included studies and one concerning appropriate consideration of heterogeneity within the results, both showed substantial levels of agreement: 0.77 (95% CI, 0.64 to 0.89) and 0.73 (95% CI, 0.59 to 0.86), respectively. The fourth comparison, addressing publication bias and robustness of the results (e.g. funnel plot or sensitivity analyses), was considered partially overlapping and showed a slight level of agreement: 0.18 (95% CI, − 0.03 to 0.38).

Methodological gaps

In addition to documenting the similarities and IRR between instruments, we also noted major methodological gaps in both tools. Both instruments could be improved with respect to guidance on the assessment of subgroup analyses, ideally based on an a priori, publicly available study protocol detailing the planned assessment of effect modification [24]. We also noted that neither instrument considers the presentation of results using absolute estimates of effect (e.g. risk difference for all dichotomous outcomes) [25], nor does either have an item on the overall certainty of evidence for each outcome (e.g. assessed using the Grading of Recommendations Assessment, Development and Evaluation (GRADE) approach) [26].

Discussion

Our study aimed to compare the similarity and reliability of the AMSTAR-2 and ROBIS instruments based on 101 SR/MAs assessing nutritional interventions/exposures for cancer prevention. AMSTAR-2 comprises 16 items and ROBIS 21 items, of which 12 and 14, respectively, were combined into 11 comparisons based on their conceptual similarities. Overall, we found that 70.3% (26/37) of items assess the same or similar methodological constructs. Ten comparisons were judged to fully overlap in concept and definitions, and one comparison was partially overlapping. A number of items (four in AMSTAR-2 and seven in ROBIS) were unique to each instrument and were not amenable to paired comparisons due to non-overlapping concepts, approaches, and descriptions. Neither instrument addresses the reporting of absolute estimates of effect or the overall certainty of the evidence.

The study by Pieper et al. was the first to compare both instruments in terms of validity, reliability, and applicability [16]. The authors matched relevant AMSTAR-2 and ROBIS items into 12 comparisons, of which 10 were considered fully overlapping and two partially overlapping (appropriateness of restriction of eligibility criteria, and publication bias/robustness of the results). Our approach was similar; however, we dismissed the partially overlapping comparison between AMSTAR-2 item 3 ‘Did the review authors explain their selection of the study designs for inclusion in the review?’ and ROBIS item 1.4 ‘Were all restrictions in eligibility criteria based on study characteristics appropriate?’, as we believe these items represent different constructs and are not similar enough based on their underlying definitions and assessment guidance. Furthermore, while for data extraction we compared AMSTAR-2 item 6 ‘Did the review authors perform data extraction in duplicate?’ with ROBIS item 3.1 ‘Were efforts made to minimize error in data collection?’, Pieper et al. additionally considered ROBIS item 3.5 ‘Were efforts made to minimize error in risk of bias assessment?’ within this comparison. We did not include ROBIS item 3.5 in this comparison because we believe duplicate RoB assessment and duplicate data extraction should be assessed separately.

Before AMSTAR-2 was published, researchers attempted to compare the reliability of ROBIS and the original AMSTAR tool; the correlation coefficients ranged from moderate to substantial [13]. Generally, apart from the comparison of AMSTAR-2 item 8 with ROBIS item 3.2, our calculations resulted in higher coefficient values than those reported by Pieper et al. [16]. Their agreement levels for similar methodological constructs were reported to be perfect for one comparison, substantial in six, moderate in two, fair in one, and slight in one. By contrast, our calculations yielded six comparisons with almost perfect agreement, three with substantial, one with moderate, and one with a slight level of agreement. One possible explanation for these discrepancies could be the quality of the included studies. In our sample of 101 articles published within the field of nutrition for cancer prevention, only 1% of SR/MAs were of high quality according to AMSTAR-2, and 3% were at low RoB according to ROBIS, indicating mostly low ratings on the majority of items of both instruments, which might inflate agreement coefficients. Alternatively, our coefficients may have been higher because, unlike in Pieper et al., pairs of reviewers participated in calibration and consensus procedures, which ensured that differences in assessments were discussed, thus reducing the number of outlying assessments. In Pieper et al., no consensus procedure between reviewers was introduced and the final judgement on items within each comparison was based on the majority of the raters’ judgments, allowing higher variation in assessments and thus lower agreement scores.

After performing assessments using both instruments, we were surprised that neither instrument had items devoted to the assessment of the magnitude of effects based on absolute estimates (e.g. risk difference) for dichotomous outcomes, or to the certainty of evidence for each outcome. Providing this information is supported by the GRADE guidance, the Cochrane Handbook, and the Joanna Briggs Institute Manual [27,28,29,30]. Rating the certainty of the evidence for each assessed health outcome improves the interpretation of SR/MA results and should be considered a vital characteristic of review quality. Regarding the magnitude of effects, authors commonly report effects as relative estimates, such as risk ratios or hazard ratios, while underreporting absolute measures such as the risk difference or number needed to treat [25]. Evidence suggests that reporting both relative and absolute estimates, and their corresponding certainty of evidence, allows for optimal interpretation of review findings [7, 25, 31,32,33]. Future updates of the ROBIS and AMSTAR-2 instruments should consider adding these items, and in the interim users of the instruments might consider them, particularly in nutrition research on long-term health outcomes, where absolute effects may be small and uncertain [34,35,36].
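As a brief, hypothetical numeric illustration of why absolute estimates matter (the numbers below are invented for illustration and are not drawn from the reviewed SR/MAs), the same relative risk implies very different absolute effects depending on the baseline risk:

```python
def risk_difference(baseline_risk: float, relative_risk: float) -> float:
    """Absolute risk difference implied by a relative risk at a given baseline risk."""
    return baseline_risk * relative_risk - baseline_risk

# A relative risk of 0.80 (a '20% relative reduction') at two baseline risks:
print(risk_difference(0.10, 0.80))  # -0.020 -> 20 fewer cases per 1,000
print(risk_difference(0.01, 0.80))  # -0.002 ->  2 fewer cases per 1,000
```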

We followed Cochrane guidance on systematic review methods, strengthening the validity of our findings through calibration exercises and duplicate screening, data abstraction, and quality assessment. Furthermore, our methods followed an a priori study protocol and included a random sample of 101 nutrition studies, a large sample from a single healthcare field. With regard to weaknesses, first, many items in the AMSTAR-2 and ROBIS instruments were dissimilar and did not always allow for reliability comparisons, so our coefficients could mislead readers into the impression that the instruments are the same, or nearly the same, for assessing SR/MA quality. That is, while many conceptual items overlapped (70.3%), a substantial number of items were dissimilar (11/37), and so applying each instrument to the same study could yield materially different conclusions about the quality of conduct of a SR/MA. Second, while our team’s experience was that ROBIS assessments took longer than AMSTAR-2 assessments, we did not formally measure the time reviewers needed to complete the assessments for each instrument. Previously reported comparisons have varied, ranging from AMSTAR assessment taking slightly longer than ROBIS to ROBIS assessment taking substantially longer than AMSTAR [13, 16, 37]. Third, we chose a random subsample of 101 of the 737 identified studies, as completing assessments of all identified studies was not deemed feasible given time constraints. Fourth, since the majority of included studies were rated critically low with AMSTAR-2 and few were at low RoB with ROBIS, the agreement coefficients between instruments in other fields of health care might differ from ours, particularly where there is more variability in the quality of SR/MAs or where higher-quality SR/MAs are included.

Conclusions

AMSTAR-2 and ROBIS are instruments designed to facilitate the assessment of SR/MA quality. Across the two instruments, 70.3% of items address the same or similar methodological constructs. While the IRR of these items was moderate to almost perfect in the fully overlapping comparisons, and slight in the partially overlapping comparison, each instrument also addresses unique methodological items. Further investigation based on samples of SR/MAs from different fields of medicine and health science might further elucidate similarities and discrepancies between the tools. Notably, neither AMSTAR-2 nor ROBIS addresses the reporting of absolute estimates of effect or the overall certainty of the evidence, both of which are important for the optimal interpretation of SR/MA findings. The choice to use one or both instruments should depend on the aim of the investigators or users of the SR/MAs (i.e. overall methodological quality versus RoB assessment only) and other factors such as experience with the instrument or time constraints. It has previously been suggested that both instruments have areas for improvement [16, 37], a finding that our systematic survey corroborates. A pragmatic instrument that fully considers RoB together with other methodological quality items, such as the presentation of both relative and absolute estimates and the certainty of those estimates, would optimally help users of SR/MAs better assess and interpret a review’s overall quality and the importance of its reported results.