Background

There are significant concerns about the low reproducibility of scientific studies [1, 2] and the limited effectiveness of peer review as a quality-control mechanism [3, 4]. Assessment of the quality of scientific studies is critical if they are to be used to inform policy decisions. Quality is a complex concept and the term has been used in different ways; a project using the Delphi consensus method with experts in the field of quality assessment of randomised controlled trials (RCTs) was unable to generate a single definition of quality acceptable to all participants [5]. Nevertheless, from the point of view of a policy-maker, the focus of assessing the quality of a study from environmental science is to establish (i) how near the ‘truth’ its findings are likely to be, and (ii) how relevant and transferable the findings are to the particular setting or population of policy interest. It is important to assess these aspects of study quality separately, as each component has different implications for how the findings of a study should be interpreted.

The first of these aspects of concern to policy-makers is referred to as the internal validity of the study; it is a measure of the extent to which a study is likely to be free from bias. A bias is a systematic error resulting from poor study design or issues associated with conduct in the collection, analysis, interpretation, and reporting of data [6]. Biases can operate in either direction, which if unaccounted for may ultimately affect the validity of the conclusions from a study [7]. Biases can vary in magnitude: some are small and trivial, and some are substantial such that an apparent finding may be entirely due to bias [7]. It is usually impossible to know the extent to which biases have affected the results of a particular study, although there is good empirical evidence that particular flaws in the design, conduct and analysis of studies lead to an increased likelihood of bias [8]. This empirical evidence can be used to support assessments of internal validity.

The second of these aspects of concern to policy-makers is referred to as the external validity of the study. The assessment of a study’s external validity depends partly on the purpose for which the study is to be used, and is less relevant without internal validity [7]. External validity is closely connected with the generalisability or applicability of a study’s findings. Its assessment often considers the directness of the evidence (i.e. the level of similarity between the population/environment/ecosystem studied and the population/environment/ecosystem of policy interest, and the level of similarity between the intervention/treatment conditions and the temporal and spatial scales of the study in relation to the situation of policy interest), and the precision of the study (in this case imprecision refers to random error, meaning that multiple replications of the same study will produce different findings because of sampling variation) [7]. Another component of external validity is the statistical conclusion validity, i.e. the degree to which conclusions about the relationship among variables based on the data are correct or ‘reasonable’ [9]. Statistical conclusion validity concerns the features of the study that control for Type I errors (finding a difference or correlation when none exists) and Type II errors (finding no difference when one exists). Such controls include the use of adequate sampling procedures, appropriate statistical tests, and reliable measurement procedures.

Assessments of study quality can be made informally using expert judgement. However, experts in a particular area frequently have pre-formed opinions that can bias their assessments [10–12]. In order to reduce the potential for reviewer bias and to ensure that the findings of a systematic review (SR) are transparent and reproducible, organisations such as the Cochrane Collaboration (who prepare, maintain and disseminate SRs on the effectiveness of healthcare interventions), the Campbell Collaboration (who prepare, maintain, and disseminate SRs on the effectiveness of social and behavioural interventions in education, social welfare, and crime and justice), and the Collaboration for Environmental Evidence (CEE - who prepare, maintain and disseminate SRs on environmental interventions) recommend the use of formal quality assessment tools, recognising that the merits of a formal approach outweigh the drawbacks.

Around 300 formal study quality assessment tools have been identified in the literature [13]. These tools are designed to provide a means of objectively assessing the overall quality of a study using itemised criteria, either qualitatively in the case of checklists or quantitatively in the case of scales [14]. However, perhaps unsurprisingly given the diverse range of criteria included within quality assessment tools, it has been empirically demonstrated that the use of different quality tools for the assessment of the same studies results in different estimates of quality, which can potentially reverse the conclusions of a SR and therefore potentially lead to misinformed policies [15–17]. For example, in a healthcare meta-analysis of 17 trials comparing the effectiveness of low-molecular-weight heparin (LMWH) with standard heparin for prevention of post-operative thrombosis, trials identified as ‘high quality’ by some of the 25 quality scales interrogated indicated that LMWH was not superior to standard heparin, whereas trials identified as ‘high quality’ by other scales led to the opposite conclusion - that LMWH was beneficial [18].

It is therefore very important to consider carefully the choice of quality assessment tool to be used in SRs [19]. At present, the CEE (2013) does not specify which quality assessment tool to use in SRs on environmental topics. Authors of CEE SRs are permitted to use any quality assessment tool as a basis for their specific exercise, but they must either explain why the tool was used without modification, or adapt it to their own review, in which case the decisions made must be stated and justified [20]. This stance on the use of quality assessment tools may leave readers and users of these reviews questioning the reliability of review findings. We argue that, in order to satisfy their purpose, quality assessment tools should possess the four features described in the Desirable features of a quality assessment tool section.

Desirable features of a quality assessment tool

(I) The tool should have construct validity (i.e. the included criteria measure what they purport to be measuring)

Some of the existing quality assessment tools have been criticised for (i) the lack of rationale for the criteria used, (ii) inclusion of criteria that are unlikely, or not proven, to be related to internal or external validity, and (iii) unjustified weighting for each of the criteria used [21, 22]. Inclusion of non-relevant criteria in assessment tools can dilute or distort the relevant criteria, resulting in assessments that have minimal correlation with the aspects of study quality that matter to policy-makers (i.e. internal and external validity). Empirical evidence about the importance of study design features in reducing the risk of bias has accumulated rapidly since the late 1990s [8]. This evidence has mainly been obtained by a powerful but simple technique of investigating variations in the results of studies of the same intervention according to features of their study design [23]. The process involves first identifying substantial numbers of studies both with and without the design feature of interest. Results are then compared between the studies fulfilling and not fulfilling each design criterion, to obtain an estimate of the systematic bias removed by the design feature.
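
To make the logic of this technique concrete, the sketch below pools hypothetical log odds ratios (the effect estimates and the concealment split are invented purely for illustration) separately for studies with and without allocation concealment, then expresses the difference as a ratio of odds ratios, an estimate of the systematic bias associated with lacking the feature:

```python
import math

# Hypothetical (log odds ratio, variance) pairs for studies of the same
# intervention, split by whether allocation was concealed.
concealed = [(-0.35, 0.04), (-0.30, 0.06), (-0.42, 0.05)]
unconcealed = [(-0.60, 0.05), (-0.55, 0.07), (-0.70, 0.06)]

def pooled(studies):
    """Fixed-effect (inverse-variance) pooled log odds ratio and its variance."""
    weights = [1.0 / v for _, v in studies]
    est = sum(w * y for (y, _), w in zip(studies, weights)) / sum(weights)
    return est, 1.0 / sum(weights)

y_c, v_c = pooled(concealed)
y_u, v_u = pooled(unconcealed)

# Ratio of odds ratios: how much more favourable the intervention appears
# in the unconcealed studies than in the concealed ones.
ror = math.exp(y_u - y_c)
se_log_ror = math.sqrt(v_c + v_u)
print(f"pooled OR (concealed):   {math.exp(y_c):.2f}")
print(f"pooled OR (unconcealed): {math.exp(y_u):.2f}")
print(f"ratio of odds ratios:    {ror:.2f} (SE of log RoR {se_log_ror:.2f})")
```

In real meta-epidemiological work this comparison is made across many meta-analyses, with adjustment for confounding between design features; the snippet only illustrates the core arithmetic.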

Assessment of the external validity of studies is more likely to require situation-specific criteria; for example the spatial and temporal scale of studies may be particularly important aspects to consider to determine the generalisability of environmental studies. Nevertheless, criteria should only be included if there is strong empirical evidence to support their implementation.

(II) The tool should facilitate good inter-reviewer agreement

The purpose of using a formal quality assessment tool, as opposed to informal expert judgement, is to reduce the potential for reviewer bias and to ensure that the assessment is transparent and reproducible. It is therefore essential that the tool used facilitates inter-reviewer agreement (i.e. that assessments are reproducible using different reviewers); otherwise there is little advantage over an informal expert judgement system. Inter-reviewer agreement should be tested across a range of studies during tool development, and should also be checked during the conduct of each SR to ensure that the reviewers are interpreting the tool correctly with regard to their specific review topic. Surprisingly, the inter-reviewer reliability of tools is not always assessed when tools are developed. For example, of the 60 ‘top tools’ identified by Deeks et al. [19], only 24 were tested for their inter-reviewer reliability during development. When inter-reviewer agreement is tested, common statistical measures include Kappa statistics (which adjust the proportion of records for which there was agreement by the amount of agreement expected by chance alone), and/or correlation coefficients. The level of inter-reviewer agreement is influenced by the design of the quality assessment tool (clarity, relevance to the study being assessed, and degree of subjectivity), but is also dependent on the experience of the reviewers: de Oliveira et al. [24] achieved higher agreement for experienced reviewers, whereas Coleridge Smith [25] found higher agreement for two methodologists combined than for clinicians alone or for a clinician/methodologist pairing.
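
As an illustration of the kappa statistic described above, the following sketch computes Cohen's kappa for two reviewers' risk-of-bias ratings (the ratings are hypothetical, not data from the cited studies):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    # Chance agreement: product of each rater's marginal proportions.
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    return (observed - expected) / (1 - expected)

# Two reviewers rating ten studies as low/high/unclear risk of bias.
a = ["low", "low", "high", "unclear", "low", "high", "high", "low", "unclear", "low"]
b = ["low", "high", "high", "unclear", "low", "high", "low", "low", "unclear", "low"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # kappa = 0.68
```

Values near 1 indicate strong agreement beyond chance; values near 0 indicate agreement no better than chance.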

(III) The tool should be applicable across study designs

If reviewers can only assess one study design type (e.g. RCTs) with a given tool, multiple tools will be required for any SR that includes multiple study design types. The use of multiple tools increases the complexity of the review process and complicates the interpretation of the findings of the SR for the readership, including policy-makers. Furthermore, if authors are free to pick and choose which tool to use for a given study design, this opens up the process to reviewer bias, whereby the reviewer may select a tool which emphasises research or studies that align with their own opinions, prejudices or commercial interests. A solution to this is to use a tool that is capable of assessing quality across different study design types. Downs and Black [26] argue that although different study design types have fundamental differences, common factors are measured: the intervention, potential confounders, and the outcome. Many study designs test for an association between the intervention and the outcome and aim to minimise flaws in the design that will bias the measurement of an association. The vulnerability of each design to different biases varies, but the kinds of biases that the study designs seek to exclude are the same, and therefore it is possible for a quality assessment tool to be applicable across study designs. It is important that any ‘universal’ quality assessment tool has clear guidance to aid interpretation of the criteria and to prevent erroneous classification when applied to different study designs. Furthermore, it is useful if aspects of the tool can be ‘switched off’ if not relevant to the particular study design being evaluated. Generic tools can be frustrating to use if they do not apply to the case in hand, and they may suffer from lower inter-reviewer reliability if significant subjective interpretation is required when applying generic criteria to specific cases.

(IV) The tool should be quick and easy to use

It is common for SRs to cite a significant number of studies. It is therefore important, from a practical point of view, for any quality assessment tool designed for use in SRs to be quick and easy to use. The time taken to make an assessment with a given tool depends on the number of criteria the tool comprises, and the degree of decision-making required to answer each criterion (a dichotomous response versus a more complex written response). The number of criteria in the 193 quality assessment tools reviewed by Deeks et al. [19] ranged from 3 to 162, with a median of 12. The tools selected as the ‘top 60 tools’ or the ‘best 14 tools’ according to the benchmarks laid out by Deeks et al. [19] had a higher median number of criteria than the unselected tools, and took on average between 5 and 30 minutes to complete per assessment.

Current best practice in healthcare

The Cochrane Collaboration, who are internationally renowned for their SRs in healthcare, have recently developed an approach to quality assessment that satisfies these four criteria. Until 2008, the Cochrane Collaboration used a variety of quality assessment tools, mainly checklists, in their SRs [27]. However, owing to knowledge of inconsistent assessments using different tools to assess the same studies [18], and to growing criticisms of many of these tools [28], in 2005 the Cochrane Collaboration’s Methods Groups (including statisticians, epidemiologists, and review authors) embarked on developing a new evidence-based strategy for assessing the quality of studies, focussing on internal validity [7]. The resultant product of this research was the Cochrane Collaboration’s Risk of Bias Tool (Risk of Bias Tool) [7]. The Risk of Bias Tool is used to make an assessment of the internal validity, or risk of bias, of a study through consideration of the following five key aspects (domains) of study design that are empirically proven to control risk of bias: (1) randomisation (minimises the risk of selection bias^a) [29–31], (2) allocation concealment and (3) blinding (minimise the risks of performance bias^b and detection bias^c due to participants’ or investigators’ expectations) [15, 22, 32–37], (4) follow-up (high follow-up of participants from enrolment to study completion minimises the risk of attrition bias^d) [33, 34], and (5) unselective reporting of outcomes (minimises the risk of reporting bias^e) [38, 39]. The Risk of Bias Tool provides criteria, shown in Additional file 1, to guide the assessment of each of these domains, classifying each domain as low, high or unclear risk of bias.

Reviewers must provide support for each judgement in the form of a succinct free-text description or summary of the relevant trial characteristic on which the assessment of risk of bias is based. This is designed to ensure transparency in how assessments are reached. The Cochrane Collaboration recommends that for each SR, judgements must be made independently by at least two people, with any discrepancies resolved by discussion in the first instance [7]. Some of the items in the tool, such as methods for randomisation, require only a single assessment for each trial included in the review. For other items, such as blinding and incomplete outcome data, two or more assessments may be used because they generally need to be made separately for different outcomes (or for the same outcome at different time points) [7]. The classification for each domain is presented for all studies in the manner shown in Additional file 2. To draw conclusions about the overall risk of bias for an outcome it is necessary to summarise these domains. Any assessment of the overall risk of bias involves consideration of the relative importance of different domains. Review authors will have to make judgements about which domains are most important in the current review [7]. For example, for highly subjective outcomes such as pain, authors may decide that blinding of participants (i.e. where information about the test that might lead to bias in the results is concealed from the participants) is critical. How such judgements are reached should be made explicit, and they should be informed by consideration of the empirical evidence relating each domain to bias, the likely direction of bias, and the likely magnitude of bias.
The Cochrane Collaboration Handbook provides guidance to support summary assessments of the risk of bias, but a suggested simple approach is that a low risk of bias classification is given to studies which have a low risk of bias for all key domains; a high risk of bias classification is given to studies which have a high risk of bias for one or more key domains; and an unclear risk of bias classification is given to studies which have an unclear risk of bias for one or more key domains.
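
This suggested simple approach is essentially a mechanical rule, which can be sketched as follows (the domain names below are illustrative; the actual key domains are chosen per review):

```python
def summary_risk_of_bias(domain_ratings):
    """Summary classification from per-domain ratings of 'low'/'high'/'unclear'.

    Any high-risk key domain makes the study high risk; otherwise any unclear
    key domain makes it unclear; otherwise it is low risk.
    """
    ratings = set(domain_ratings.values())
    if "high" in ratings:
        return "high"
    if "unclear" in ratings:
        return "unclear"
    return "low"

study = {"randomisation": "low", "allocation concealment": "low",
         "blinding": "unclear", "follow-up": "low", "selective reporting": "low"}
print(summary_risk_of_bias(study))  # prints "unclear"
```

Note this is only the simple default; the Handbook allows review authors to weight domains differently where justified.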

The Risk of Bias Tool was published in February 2008 and was adopted as the recommended method throughout the Cochrane Collaboration. A three-stage project to evaluate the tool was initiated in early 2009 [7]. This evaluation project found that the 2008 version of the Risk of Bias Tool took longer to complete than previous methods. Of 187 authors surveyed, 88% took longer than 10 minutes to complete the new tool, 44% longer than 20 minutes, and 7% longer than an hour, but 83% considered the time taken acceptable [7]. An independent study (a cross-sectional analytical study of a sample of 163 full manuscripts of RCTs in child health) also found that the 2008 version of the Risk of Bias Tool took longer to complete than some other quality assessment tools (the investigators took a mean of 8.8 minutes per person for a single predetermined outcome using the Risk of Bias Tool, compared with 1.5 minutes for a previous rating scale for quality of reporting) [40]. The same study reported that inter-reviewer agreement on individual domains of the Risk of Bias Tool ranged from slight (κ = 0.13) to substantial (κ = 0.74), and that agreement was generally poorer for those items that required more judgement. An interesting finding from the Hartling et al. [40] study was that the overall risk of bias as assessed by the Risk of Bias Tool differentiated effect estimates, with more conservative estimates for studies assessed to be at low risk of bias compared with those at high or unclear risk of bias. On the basis of the evaluation project, the first (2008) version of the Risk of Bias Tool was modified to produce a second (2011) version, which is the version shown in Additional file 1.

The Risk of Bias Tool enables reviewers to assess the internal validity of primary studies. Internal validity is, however, only one aspect of study quality that decision-makers should be interested in. In order to assess the overall quality of a body of evidence, the Cochrane Collaboration use a system developed by the Grades of Recommendation, Assessment, Development and Evaluation (GRADE) Working Group [41–44]. This approach is now used by the World Health Organisation (WHO) and the UK National Institute for Health and Care Excellence (NICE), amongst 20 other bodies internationally [7]. For the purposes of SRs, the GRADE approach describes four levels of quality. The highest quality rating is initially assigned to RCT evidence (see Table 1). Review authors can, however, downgrade evidence to moderate, low, or even very low quality, depending on the presence of the five factors described in detail in Schünemann et al. [45]: (1) risk of bias across studies, assessed using the Cochrane Collaboration’s Risk of Bias Tool, (2) directness of evidence (i.e. the extent to which the study investigates a similar population, intervention, control, and outcome), (3) heterogeneity in the findings, (4) precision of effect estimates, and (5) risk of publication bias (i.e. systematic differences in the findings of published and unpublished studies). Usually, the quality rating will fall by one level for each factor, up to a maximum of three levels for all factors. If there are very severe problems for any one factor (e.g. when assessing limitations in design and implementation, all studies were unconcealed, unblinded, and lost over 50% of their participants to follow-up), evidence may fall by two levels due to that factor alone. Review authors will generally grade evidence from sound observational studies (see Table 1) as low quality.
These studies can be upgraded to moderate or high quality if: (1) such studies yield large effects and there is no obvious bias explaining those effects, (2) all plausible confounding would reduce a demonstrated effect or suggest a spurious effect when results show no effect, and/or (3) there is evidence of a dose–response gradient. Guidance and justification for making these assessments is provided by Schünemann et al. [45]. The very low quality level includes, but is not limited to, studies with critical problems and unsystematic observations (e.g. case series/reports - see Table 1).
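
The default GRADE arithmetic described above can be sketched as a small function. This is an illustration only: as the GRADE developers themselves stress, the final rating involves judgement and should not be applied purely mechanistically.

```python
GRADE_LEVELS = ["very low", "low", "moderate", "high"]

def grade_quality(randomised, downgrades, upgrades=0):
    """Sketch of the default GRADE arithmetic.

    RCT evidence starts at 'high'; sound observational evidence starts at
    'low'. Each level of serious concern (risk of bias, indirectness,
    heterogeneity, imprecision, publication bias) moves the rating down one
    level; upgrading factors (large effect, confounding favouring the null,
    dose-response) move observational evidence up. Clamped to the scale.
    """
    start = 3 if randomised else 1          # index into GRADE_LEVELS
    level = start - downgrades + upgrades
    return GRADE_LEVELS[max(0, min(3, level))]

print(grade_quality(randomised=True, downgrades=2))               # prints "low"
print(grade_quality(randomised=False, downgrades=0, upgrades=1))  # prints "moderate"
```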

Table 1 Classification of study designs

Additional file 3 presents the decisions that must be made in going from assessments of the risk of bias (using the Risk of Bias Tool) to assessments about study limitations for each outcome included in a ‘Summary of findings’ table. As can be seen from this table, a rating of high quality evidence can be achieved only when most of the evidence comes from studies that met the criteria for low risk of bias (as determined using the Risk of Bias Tool).

The developers of the GRADE approach caution against a completely mechanistic approach toward the application of the criteria for rating the overall quality of the evidence up or down, arguing that although GRADE suggests the initial separate consideration of five categories of reasons for rating down the quality of evidence, and three categories for rating up, with a yes/no decision regarding rating up or down in each case, the final rating of overall evidence quality occurs in a continuum of confidence in the validity, precision, consistency, and applicability of the estimates. Fundamentally, the assessment of evidence quality is a subjective process, and GRADE should not be seen as obviating the need for, or minimising the importance of, judgement, or as suggesting that quality can be objectively determined. The use of GRADE will not guarantee consistency in assessment. There will be cases in which competent reviewers have honest and legitimate disagreements about the interpretation of evidence. In such cases, the merit of GRADE is that it provides a framework that guides reviewers through the critical components of the assessment, and an approach to analysis and communication that encourages transparency and an explicit accounting of the assessments. Testing of inter-reviewer agreement between blinded reviewers using a pilot version of GRADE showed varied agreement on the quality of evidence for individual outcomes (kappa coefficients for agreement beyond chance ranged from 0 to 0.82) [46, 47]. Nevertheless, most of the disagreements were easily resolved through discussion (GRADE assessments are conducted by at least two reviewers, blinded to each other’s ratings, before assessments are agreed). In general, reviewers of the pilot version found the GRADE approach to be clear, understandable and sensible [46]. On the basis of the evaluation research, modifications were made to the GRADE approach to improve inter-reviewer agreement [46, 47].

Applicability to environmental studies

The Cochrane Collaboration’s development of a single recommended tool to assess internal validity of primary studies (Risk of Bias Tool), combined with the use of the GRADE system to rate the overall quality of a body of evidence (considering factors affecting external validity), has promoted reproducible and transparent assessments of quality across the Cochrane Collaboration’s SRs. The question is, can the same be done for quality assessment of environmental studies?

Without significant trialling, it is difficult to know the extent to which tools such as the Risk of Bias Tool and the GRADE Tool could be adopted and applied usefully across different types of environmental studies. There are obvious similarities between healthcare studies and the studies on animal and plant health that form part of the environmental evidence-base, but for other topics in environmental science the similarities are perhaps less immediately obvious. Nevertheless, there are precedents that demonstrate the potential transferability of tools from healthcare to a range of environmental studies. For example:

There are six environmental SRs [48–53] that have used a hierarchy of evidence approach which was developed by Pullin and Knight [54], based on an adaptation of early hierarchical approaches used in healthcare. This approach is not too distant from the GRADE approach, although the Pullin and Knight [54] evidence hierarchy includes expert opinion as a form of evidence (GRADE does not do this), and in terms of application, the assessments with the Pullin and Knight [54] hierarchy are made on individual studies, not the overall body of evidence for each outcome (as is the case with GRADE). The use of an evidence hierarchy alone is not well justified: using this approach assumes that a study design at a higher level in the hierarchy is methodologically superior to one at a lower level and that studies at the same level are equivalent, but this ranking system is far too simplistic given that there are many design characteristics that can comprise a given study [55]. It is well known that an RCT (which under this system would be rated as the highest quality form of evidence) can suffer from bias if it has poor randomisation, no allocation concealment, no blinding of participants or investigators, uneven reporting of outcomes, or high unexplained attrition. These features can make an RCT more prone to bias than a well-designed observational study. This issue is explicitly recognised in the combined GRADE and Risk of Bias Tool approach used by the Cochrane Collaboration, but not explicitly recognised in the Pullin and Knight [54] approach described above.

There are a further ten environmental SRs [56–65] that have used an approach to quality assessment similar to the current best practice in healthcare, albeit using the Pullin and Knight [54] hierarchy of evidence in combination with an assessment of some of the individual aspects of study design that are empirically proven to be controls on selection bias, performance bias, detection bias and attrition bias. Although these environmental versions of the Cochrane Collaboration’s current approach are untested outside of each review team, are designed to be review-specific, use an evidence hierarchy with the limitations mentioned above, and make an assessment of overall quality for each study (rather than for the body of evidence overall for each outcome), the general approach is very similar, demonstrating that it is feasible to capitalise on the research and development that the healthcare field has completed on quality assessment.

Nevertheless, some changes to the healthcare tools are necessary. Table 2 (Environmental-Risk of Bias Tool) provides an example of the criteria that could be used to guide the assessment of the internal validity of environmental studies. The original clinical terminology from the Cochrane Collaboration’s Risk of Bias Tool has been removed and replaced with scientific terminology where appropriate. We have also added definitions of all of the sources of bias, as subtitles in Table 2, to facilitate understanding by the environmental community.

Table 2 Criteria for the assessment of internal validity of a study, using the Environmental-Risk of Bias Tool, adapted from the Cochrane Collaboration’s Risk of Bias Tool

Table 3 (Environmental-GRADE Tool) provides an example of the criteria that could be used to guide the assessment of the external validity of environmental studies. We modified this from the original GRADE approach so that the tool now provides an assessment of the quality of individual studies rather than the overall body of evidence. This modification was deemed necessary owing to the apparently higher diversity of study designs used in environmental science, which could make it difficult to acquire a meaningful summary of quality across studies. On the basis of this change in the GRADE assessment, we argue that the consideration of (i) heterogeneity of study findings and (ii) publication bias must be conducted separately during the SR process. Reviewers are advised to consider the use of forest plots and funnel plots to assess (i) heterogeneity of study findings and (ii) publication bias, respectively. The Cochrane Collaboration’s current Review Manager software (available free of charge from: http://tech.cochrane.org/revman/download) can create these plots relatively easily if the appropriate data are available. The assessment of study quality through use of the Environmental-Risk of Bias and Environmental-GRADE tools should aid reviewers in their interpretation of heterogeneity of study findings and publication bias, but not vice versa.
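
Alongside a forest plot, between-study heterogeneity can also be quantified with Cochran's Q and the I² statistic (the standard measures reported by meta-analysis software). A minimal sketch, using hypothetical effect estimates and variances:

```python
def i_squared(effects, variances):
    """Cochran's Q and the I^2 statistic for between-study heterogeneity."""
    weights = [1.0 / v for v in variances]                     # inverse-variance
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: proportion of variability in effect estimates due to heterogeneity
    # rather than sampling error, floored at zero.
    i2 = 100 * max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2

effects = [0.10, 0.25, 0.60, -0.05, 0.40]    # hypothetical study effect sizes
variances = [0.02, 0.03, 0.02, 0.04, 0.03]
q, i2 = i_squared(effects, variances)
print(f"Q = {q:.2f}, I^2 = {i2:.0f}%")
```

High I² values (conventionally above roughly 50%) suggest substantial heterogeneity worth investigating before pooling results.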

Table 3 Criteria for assessing the overall quality of an environmental study, using the Environmental-GRADE Tool, adapted from the GRADE approach used by the Cochrane Collaboration (Balshem et al. [69]; Schünemann et al. [45]; Guyatt et al. [70])

As illustrated in Table 3, the highest quality rating is initially for RCT evidence. Review authors can, however, downgrade RCT evidence to moderate, low, or even very low quality evidence, depending on the presence of the three factors discussed below:

(I) The risk of bias within the study

This is assessed using the Environmental-Risk of Bias Tool; criteria to guide this assessment are provided in Table 2. For making summary assessments of risk of bias for each study, the low risk of bias classification is given to studies that have a low risk of bias for all key domains; the high risk of bias classification is given to studies that have a high risk of bias for one or more key domains; and the unclear risk of bias classification is given to studies in which the risk of bias is uncertain for one or more key domains. The principle of considering risk of bias for observational studies is exactly the same as for RCTs. However, potential biases are likely to be greater for observational studies. Review authors need to consider (a) the weaknesses of the designs that have been used (such as noting their potential to ascertain causality), (b) the execution of the studies through a careful assessment of their risk of bias, especially (c) the potential for selection bias and confounding, to which all non-randomised studies are susceptible, and (d) the potential for reporting biases, including selective reporting of outcomes.

(II) The directness of the study

The directness of the study refers to the extent to which the study: investigates a similar species, population, community, habitat or ecosystem as that of policy interest; investigates a similar application of the intervention/treatment as that of policy interest; investigates the phenomenon at a similar spatial scale as that of policy interest; investigates the phenomenon at a similar temporal scale as that of policy interest. Reviewers should make assessments transparent when they believe downgrading is justified based on directness of evidence. These assessments should be supported with significant empirical evidence.

(III) The precision of effect estimates

In this case imprecision refers to random error, meaning that multiple replications of the same study would produce different effect estimates because of sampling variation. When studies have small sample sizes (considering the amount of group variability and the reliability of the outcome measures) the risk of imprecision increases. In these circumstances reviewers can lower their rating of the quality of the evidence. In order for reviewers to make this assessment, they will need information about the uncertainties associated with the study design, including the precision of sampling and measurements. Related to this is the statistical conclusion validity, which, as described in the Background section of this paper, is the degree to which conclusions about the relationship among variables based on the data are correct or ‘reasonable’. Statistical conclusion validity concerns the features of the study that control for Type I and Type II errors. Such controls include the use of adequate sampling procedures, appropriate statistical tests, and reliable measurement procedures. The most common sources of threats to statistical conclusion validity are low statistical power, violated assumptions of the test statistics^f, fishing and the error rate problem^g, unreliability of measures^h, and restriction of range^i. There are recognised statistical methods to test for common causes of violated assumptions of test statistics (e.g. normal distribution tests). If the study does not report the results of these statistical checks, it would be beneficial to try to obtain the statistical test results from the original authors. Reviewers should make their assessments transparent when they believe downgrading is justified based on the precision of effect estimates. These assessments should be supported with significant empirical evidence and guided through consultation with a statistical expert.
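
The relationship between sample size and imprecision is easy to demonstrate by simulation. The sketch below (with arbitrary population parameters) replicates the same hypothetical study many times and shows the spread of the resulting estimates, the empirical standard error, shrinking as the sample size grows:

```python
import random
import statistics

random.seed(1)

def replicate_study(n, replications=2000, mu=10.0, sigma=4.0):
    """Replicate a 'study' that estimates a population mean from n units.

    Returns the empirical standard error: the spread of the estimates
    across replications, which arises purely from sampling variation.
    """
    estimates = [statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
                 for _ in range(replications)]
    return statistics.stdev(estimates)

for n in (5, 20, 80):
    print(f"n = {n:>2}: standard error of the mean ~= {replicate_study(n):.2f}")
```

The empirical standard errors track the theoretical value sigma/sqrt(n), so quadrupling the sample size roughly halves the imprecision.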

Usually, the quality rating will fall by one level for each factor, up to a maximum of three levels across all factors. If there are very severe problems for any one factor (e.g. when assessing limitations in design and implementation, all studies were unconcealed, unblinded, and lost over 50% of their study units to follow-up), RCT evidence may fall by two levels due to that factor alone. Reviewers will generally grade evidence from sound observational studies as low quality. Such studies can be upgraded to moderate or high quality if: (1) they yield large effects and there is no obvious bias explaining those effects; (2) all plausible confounding factors would reduce a demonstrated effect, or would suggest a spurious effect when results show no effect; and/or (3) there is evidence of a dose–response gradient. Reviewers should make their assessments transparent when they believe upgrading is justified based on these factors. These assessments should be supported with empirical evidence where available, sourced either from evidence presented in the study itself or from other studies. The very low quality level includes, but is not limited to, studies with critical problems and unsystematic observations (e.g. case studies).
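The grading arithmetic described above can be summarised in a short sketch. This is our illustrative reading of the rules, not part of the published GRADE system; the `grade` function and its level names are hypothetical.

```python
# Illustrative sketch of the grading arithmetic described in the text.
LEVELS = ["very low", "low", "moderate", "high"]

def grade(study_design: str, downgrades: int, upgrades: int = 0) -> str:
    """Return a quality level for a body of evidence.

    RCT evidence starts 'high'; sound observational evidence starts 'low'.
    Each downgrading factor lowers the rating by one level (a very severe
    problem counts as two), capped at three levels in total. Each upgrading
    factor (large effect, confounding favouring the null, dose-response
    gradient) raises observational evidence by one level.
    """
    start = 3 if study_design == "rct" else 1   # index into LEVELS
    downgrades = min(downgrades, 3)             # at most three levels down
    idx = start - downgrades + upgrades
    return LEVELS[max(0, min(idx, 3))]          # clamp to the defined levels
```

For example, under this reading, RCT evidence with two downgrading factors is rated ‘low’, while observational evidence with one upgrading factor and no downgrades is rated ‘moderate’.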

Conclusions

This article examined current best practice for quality assessment in healthcare (the Cochrane Collaboration’s Risk of Bias Tool and the GRADE system) and investigated the extent to which these tools could be useful for assessing the quality of environmental science studies. We highlighted that the feasibility of this transfer has already been demonstrated in a number of existing SRs on environmental topics. It is therefore not difficult to imagine that the Cochrane Collaboration’s Risk of Bias Tool and the GRADE system could be adapted and applied routinely, as a preferred method, as part of the quality assessment of environmental science studies cited in CEE SRs. We propose pilot versions of the Environmental-Risk of Bias Tool and the Environmental-GRADE Tool for this purpose, and provide worked examples in Additional files 4, 5 and 6.

Some of the terminology used in the original Risk of Bias Tool has been changed from clinical terms to environmental science terms, and definitions of all of the sources of bias have been added as subtitles to the Risk of Bias assessment criteria. The original GRADE Tool has been modified more substantially, so that the Environmental-GRADE Tool now provides an assessment of the quality of individual studies rather than of the overall body of evidence. This modification was deemed necessary because of the greater diversity of study designs used in environmental science, which could make it difficult to produce a meaningful summary of quality across studies. On the basis of this change, we argue that the consideration of (i) heterogeneity of study findings and (ii) publication bias must be conducted during the evidence synthesis and meta-analysis stages of the SR process.

We suggest that once these tools have been used in a number of environmental SRs, it will be possible to evaluate the Environmental-Risk of Bias Tool and Environmental-GRADE Tool to understand how their ease of use, applicability across study designs, and degree of inter-reviewer agreement could be enhanced. This ‘learning by doing’ application of the pilot versions is exactly how the Cochrane Collaboration and the GRADE Working Group refined their respective tools, and both groups continue to refine them as new supporting evidence emerges. The environmental community should capitalise on this ongoing methodological research and development, harmonising its tools where appropriate. At the same time, there is a need to build capacity in methodological expertise within the environmental field. Finally, the environmental science community and research funding organisations should use the criteria contained within the Environmental-Risk of Bias Tool and Environmental-GRADE Tool to enhance the quality of funded studies in the future.

Endnotes

(a) Systematic differences between the baseline characteristics of the groups that are to be compared.

(b) Systematic differences between groups in the care that is provided, or in exposure to factors other than the interventions of interest, due to lack of blinding of investigators.

(c) Systematic differences between groups in how outcomes are determined.

(d) Systematic differences between groups in withdrawals from a study/loss of samples.

(e) Systematic differences between reported and unreported findings.

(f) Most statistical tests involve assumptions about the data that make the analysis suitable for testing a hypothesis. Violating these assumptions can lead to incorrect inferences about the cause–effect relationship, making tests more or less likely to commit Type I or Type II errors.

(g) Each hypothesis test carries a set risk of a Type I error. If a researcher searches or “fishes” through their data, testing many different hypotheses in search of a significant effect, they inflate their Type I error rate.

(h) If the dependent and/or independent variables are not measured reliably, incorrect conclusions can be drawn.

(i) Restriction of range, such as floor and ceiling effects, reduces the power of the experiment and increases the chance of a Type II error, because correlations are attenuated by the artificially reduced variability.
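The inflation of the Type I error rate described in endnote g can be illustrated numerically. The sketch below (illustrative only, not part of the proposed tools) shows how the family-wise error rate grows with the number of independent tests conducted at a nominal alpha of 0.05, and the corresponding Bonferroni-corrected per-test threshold.

```python
# Illustration of endnote g: the probability of at least one false
# positive grows quickly as more independent hypotheses are tested.
alpha = 0.05

def familywise_error(m: int, alpha: float = 0.05) -> float:
    """Probability of at least one Type I error across m independent tests."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20):
    print(f"{m:>2} tests: P(at least one false positive) = {familywise_error(m):.2f}")

# A Bonferroni correction tests each hypothesis at alpha/m instead,
# keeping the family-wise error rate at or below alpha.
print(f"Bonferroni per-test threshold for 20 tests: {alpha / 20:.4f}")
```

With 20 tests at alpha = 0.05, the chance of at least one spurious ‘significant’ finding is roughly 64 per cent, which is why undisclosed multiple testing is a threat to statistical conclusion validity.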

Authors’ information

Dr. Gary Bilotta: Dr. Bilotta is currently a Senior Lecturer in Physical Geography at the University of Brighton. His research links the disciplines of hydrology, geomorphology, biogeochemistry and freshwater ecology. He has particular expertise in developing water quality monitoring and assessment tools for international water resource legislation. At the time of writing this article Dr Bilotta was seconded to the UK Department for Environment, Food and Rural Affairs (Defra), as part of the Natural Environment Research Council (NERC) Knowledge Exchange Scheme. The aim of this secondment was to drive improvements in the policy evidence-base and its use.

Dr. Alice Milner: Dr. Milner is currently a Research Associate at University College London specialising in climatic and environmental change on a range of time scales: past, present and future. Her research is focussed on understanding how ecosystems respond to abrupt climate changes during previous warm intervals, using terrestrial and marine sediments as natural archives of environmental change. At the time of writing, Dr. Milner was seconded to the UK Defra as part of the NERC Knowledge Exchange Scheme to improve the use and management of evidence in policy.

Prof. Ian Boyd: Professor Boyd is currently a Professor in Biology at the University of St Andrews and Chief Scientific Adviser to the UK Defra. His career has evolved from physiological ecologist with the NERC’s Institute of Terrestrial Ecology, to a Science Programme Director with the British Antarctic Survey, Director at the NERC’s Sea Mammal Research Unit, Chief Scientist to the Behavioural Response Study for the US Navy, Director for the Scottish Oceans Institute, and acting Director and Chairman with the Marine Alliance for Science and Technology for Scotland. In parallel with his formal positions, he has chaired, co-chaired or directed international scientific assessments, with his activities focusing on the management of human impacts on the marine environment.