A primer on systematic reviews in toxicology
Systematic reviews, pioneered in the clinical field, provide a transparent, methodologically rigorous and reproducible means of summarizing the available evidence on a precisely framed research question. Having matured to a well-established approach in many research fields, systematic reviews are receiving increasing attention as a potential tool for answering toxicological questions. In the larger framework of evidence-based toxicology, the advantages and obstacles of, as well as the approaches for, adapting and adopting systematic reviews to toxicology are still being explored. To provide the toxicology community with a starting point for conducting or understanding systematic reviews, we herein summarized available guidance documents from various fields of application. We have elaborated on the systematic review process by breaking it down into ten steps, starting with planning the project, framing the question, and writing and publishing the protocol, and concluding with interpretation and reporting. In addition, we have identified the specific methodological challenges of toxicological questions and have summarized how these can be addressed. Ultimately, this primer is intended to stimulate scientific discussions of the identified issues to fuel the development of toxicology-specific methodology and to encourage the application of systematic review methodology to toxicological issues.
Keywords: Systematic review; Evidence-based toxicology; Narrative review; Evidence synthesis; Review steps
Evidence-based approaches are receiving growing attention in toxicology due to their potential to improve the field’s transparency, objectivity, consistency and reproducibility, and to inform regulatory decisions and policy more effectively (Guzelian et al. 2005; Hoffmann and Hartung 2006; Schreider et al. 2010; Woodruff and Sutton 2014; Thayer et al. 2014; National Toxicology Program 2015; Stephens et al. 2016). By analogy to evidence-based medicine (EBM), the umbrella term evidence-based toxicology (EBT) has been coined to group all approaches intended to implement evidence-based principles more effectively in toxicology in general, and in toxicological decision making in particular. Such approaches include, inter alia, the establishment and universal use of a common ontology, justified design and rigorous conduct of studies, consistently structured and detailed reporting of experimental evidence, structured frameworks for evidence synthesis that characterize confidence in the evidence, probabilistic uncertainty and risk assessment, and the development of synthesis methodology to integrate evidence from diverse streams, e.g., from human observational studies, animal studies, in vitro studies and in silico/mathematical modeling.
The core evidence-based tool is the systematic review. Much attention has been focused on the application of systematic review methodology to toxicological questions, in line with the efforts of government institutions on both sides of the Atlantic, such as the European Food Safety Authority (EFSA 2010) and the US National Toxicology Program (NTP) (Birnbaum et al. 2013). The work of these organizations has triggered the adoption and adaptation of systematic review approaches as a tool for conducting evidence-based assessments (EFSA 2010; Rooney et al. 2014; National Research Council 2014). As these initiatives are focused on the requirements and mandates of the respective institutions, the Evidence-Based Toxicology Collaboration (EBTC) and its stakeholders have identified the need to build on these recent developments and to provide a general introduction to systematic reviews for the broader toxicology community.
Historically, reviews in toxicology have been predominantly narrative in approach, whereby an expert uses literature to summarize a particular field, or attempts to address a specific research question, for example, regarding the potential toxicity of a chemical or drug for humans. A narrative review typically uses an implicit process to compile evidence to support the statements being made in the review. The reader often cannot tell how the available literature was identified, selected and compiled, why some studies were given more weight than others, and how the evidence was summarized to arrive at conclusions. It is often uncertain whether the author of a narrative review selectively cited reports that reinforced his or her preconceived ideas, or promoted specific views of a topic. Also, a quantitative summary of the literature is often absent in a narrative review.
Some differences between systematic and narrative reviews
Feature; Narrative review; Systematic review
Research question; Broad and informal (often not explicitly specified); Specified and specific
Literature sources and search; Usually not specified; Comprehensive sources (more than one database) and explicit search strategy
Study selection; Usually not specified; Explicit selection criteria
Quality assessment of included studies; Usually not present or informal (not explicitly specified); Critical appraisal on the basis of explicit criteria
Synthesis; Often a qualitative summary; Qualitative and sometimes also a quantitative summary (meta-analysis)
>1 year (usually)
Science, systematic review, literature searches, data analysis (including meta-analysis)
Moderate to high
Historically, clinical research reviews were expert-written narrative reviews as well, before the advent of the evidence-based medicine/healthcare (EBM/EBHC) movement. This movement established systematic review methodology as the best practice for summarizing all available evidence bearing on a research question. The need for reproducible, transparent, and comprehensive syntheses of the ever-growing volume of medical evidence triggered the development of increasingly rigorous approaches to review question formulation, literature search, evidence selection, and evidence integration. The field of clinical systematic reviews has grown into a large discipline with offshoot products, such as PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses; www.prisma-statement.org), which is a guideline for reporting systematic reviews. In addition, working groups such as GRADE (Grading of Recommendations Assessment, Development and Evaluation) have developed to better understand and interpret the results of systematic reviews. The Cochrane group (http://www.cochrane.org), previously known as the Cochrane Collaboration, has played an instrumental role in fostering the continued development of systematic review methodologies since 1993. The Cochrane Handbook for Systematic Reviews of Interventions provides detailed guidance and instructions for conducting systematic reviews in a medical context (Higgins and Green 2011). Cochrane is also drafting a handbook for systematic reviews of diagnostic test accuracy (DTA) that adapts the evidence-based approaches to the challenges associated with evaluating diagnostic testing. Many of the methods in both handbooks are widely applicable, so they can be directly adopted for conducting systematic reviews in non-clinical areas, such as toxicology, while other methods need to be adapted for the toxicology context.
However, it is important to note the specific differences between toxicology and clinical research, and the unique challenges associated with applying this framework to toxicological questions (Wikoff and Britt 2016). These include multiple evidence streams and the challenges of their integration, multiple animal species (and strains), multiple outcomes and endpoints that characterize hazards, exposures to complex mixtures, and the frequent lack of human data, which creates the need to extrapolate from other species to human outcomes. Further, the objectives of a toxicological review often involve the evaluation of adverse effects rather than of clinical interventions. These complexities make it clear that the process developed for systematically reviewing randomized clinical trials for medical interventions, while serving as a foundational framework, will have to be substantially adapted to be applicable and useful in toxicology.
The guidance documents that were considered
Application of systematic review methodology to food and feed safety assessments to support decision making; European Food Safety Authority (EFSA)
Cochrane handbook for systematic reviews of interventions; Higgins JPT, Green S
Finding what works in healthcare: standards for systematic reviews; Institute of Medicine (IOM): Committee on Standards for Systematic Reviews of Comparative Effectiveness Research
Guidelines for systematic reviews in environmental management; version 4.2 (March 2013); Collaboration for Environmental Evidence
Handbook for conducting a literature-based health assessment using OHAT approach for systematic review and evidence integration; National Toxicology Program OHAT
Methods guide for effectiveness and comparative effectiveness reviews; Agency for Healthcare Research and Quality
Review of EPA’s integrated risk information system (IRIS) process; National Research Council (NRC)
Systematic reviews: CRD’s guidance for undertaking systematic reviews in healthcare; Centre for Reviews and Dissemination (CRD), University of York, UK
Reliance on these guidance documents should not be interpreted as an endorsement of any particular approach at this point in time, as the field and methodology are continuing to develop. In addition, it should be noted that we deliberately did not conduct a systematic review on this topic. Because the purpose was to survey available guidance, to characterize the key components in the conduct of a systematic review and, in doing so, to highlight some of the challenges in applying existing frameworks to toxicological questions, we considered a narrative approach to be more suitable.
To date, the application of systematic review methodology to toxicological issues has focused primarily on questions regarding the impact of chemicals on human health, e.g., for perfluorooctanoic acid (PFOA) (Johnson et al. 2014; Koustas et al. 2014) or metals (Navas-Acien et al. 2007; Meyer-Baron et al. 2009). However, many other toxicological questions are suitable for systematic review. Examples include the risk associated with a specific exposure (Tsuji et al. 2014), the (eco-)toxicity of mixtures of substances (Cedergreen 2014), the relevance of a toxicity biomarker (Dello et al. 2013), the assessment of new toxicological test methods, the determination of toxicological mechanisms, e.g., in the frame of the Adverse Outcome Pathways (AOP) approach, the status of technology development relevant for toxicological questions (de Vries et al. 2013), and the evaluation of risks to human health posed by a chemical under specific regulations, such as the Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) regulation in Europe (Whaley et al. 2015). Note that the examples given here claimed to be systematic reviews; citing them here does not imply that these publications met the systematic review criteria in all cases. Indeed, an increasing number of publications claim to be systematic reviews but, upon closer examination, fail to meet basic criteria such as the production of a review protocol, the documentation of the literature search, or the appraisal of studies (Haddaway et al. 2017).
Because minimizing bias is a guiding principle of systematic reviews, even the initial planning should be conducted as rigorously, objectively, and transparently as possible. This step may involve iterative consideration of sponsor and stakeholder needs, scoping of the topic (including considerations of feasibility), and input and participation from a multidisciplinary team sharing a variety of roles.
The motivation to conduct a systematic review should be documented, e.g., to summarize a large amount of evidence on a given topic, to explore reasons for inconsistency in the results of studies, to resolve controversies or uncertainty about what the existing evidence demonstrates, or to identify data gaps. Once the plan to conduct a review takes shape, it has to be decided which type of review to perform. In some cases, a narrative approach may be chosen for any of a variety of reasons, such as limited available time or resources, limited data, or the wish to express an expert opinion. However, if the goal is to provide an objective and comprehensive summary of the evidence on a certain topic, a systematic review should be conducted. Table 1 (see “Preamble”) suggests that it is conceivable to adopt a mixed review approach that addresses some features in the manner of a narrative review and others in the manner of a systematic review. It is emphasized, however, that only a review conducted systematically in all steps is a systematic review. To avoid improper use of terminology, a mixed review approach should not be called a systematic review.
Various motivations exist to conduct a systematic review in toxicology. The motivated party may be researchers, who conduct systematic reviews to answer questions in their specific field of interest, or governmental, nonprofit or commercial organizations, which may conduct systematic reviews themselves or sponsor them. In the frameworks that have been created by agencies, including the NTP and EFSA, the motivation is driven by the respective public health mandates and needs of the conducting entity. Whether conducted by an agency or not, a systematic review may seek to clarify the health effects of an evidence-rich chemical. In other cases, a systematic review may be undertaken when evidence is scarce, to identify data gaps, or to assess the accuracy of a toxicological test method. Given that the systematic review framework is still an emerging practice in toxicology, it is also possible in these early days that a systematic review may be conducted, in part, to explore the proper translation of this methodology to the toxicological arena.
The questions addressed by systematic reviews should be meaningful to relevant stakeholders. Once the question and objective(s) for a systematic review have been at least roughly formulated, an effort should be made to make certain that no systematic review of sufficient quality and timeliness already exists. If a systematic review addressing a similar question is currently in progress, its results should be awaited before another one is considered. In this regard, a registry of ongoing or completed systematic reviews in toxicology would be helpful. The PROSPERO database (http://www.crd.york.ac.uk/PROSPERO/) may serve as such a registry, as its inclusion criteria have been expanded to systematic reviews with a health-related outcome. Other options are available, such as the publication of protocols for systematic reviews of laboratory animal studies offered by SYRCLE (SYstematic Review Centre for Laboratory animal Experimentation).
While not necessarily required, scoping the literature on the topic could be helpful in assessing the need for a systematic review. This approach is particularly useful in fields where little is known regarding the current state of the literature and previous systematic reviews have not been performed. Scoping may range from a simple non-systematic search in one or two databases to a more formalized, resource intensive scoping review (described in Levac et al. 2010 and in Peters et al. 2015). If a more formalized approach is adopted, it is recommended to consult with or involve a trained information specialist in conducting comprehensive literature searches for systematic reviews (McGowan and Sampson 2005). The findings of a scoping exercise may reveal that the question has already been adequately addressed or may confirm that better understanding of the evidence could provide clarity. A scoping search can inform the planning process by revealing important details such as the expertise required, the stakeholders that have interest in the topic, and the resources needed. Scoping may be conducted before or after a review team is formed, but the approach used should be transparent and objective.
A review team should be created and the roles and responsibilities of the team members should be defined during the planning phase. This team should be multidisciplinary and combine appropriate expertise and experience to conduct the systematic review. It should include expertise on (1) the topic, (2) systematic review methodology, (3) literature search and retrieval, and if required, (4) quantitative methods and statistics. Two or more members of the team should collaborate to allow cross-checking of essential systematic review steps, many of which require independent work conducted in parallel. The team should establish a leader who understands the task in detail and is skilled in facilitating multidisciplinary projects. Among the first tasks of the review team should be detailed planning of required resources, distribution of tasks and planning of the time frame. Furthermore, the team should engage with expected users of the review’s results and diverse stakeholders to collect their input. Bias in the review team should be minimized and disclosed. Members should be independent of parties with potential conflicts of interest. All members should complete a formal conflict of interest (COI) statement, e.g., using the International Committee of Medical Journal Editors (ICMJE) conflict of interest form (available at http://www.icmje.org), which may need to be revisited throughout the review, e.g., using a COI management plan.
An advisory group that includes representation of relevant stakeholders, especially potential users of the systematic review outcome with appropriate interests, skills and commitment, should be considered. The availability of an advisory group may prove especially valuable in supporting the review team by informing key decisions, particularly those that arise from the need to adapt general systematic review methodology to the needs of toxicology. In addition, the advisory group can help interpret and disseminate results, and collect stakeholder and user input.
If not yet available, potential sponsors that provide financial resources may be approached. Sponsors of systematic reviews can be governmental, nonprofit or commercial, although in the clinical field some organizations (e.g., Cochrane) prohibit commercial sponsors. Sponsors should not interfere with the independence of the review team. While allowed to give direction in the very early stages of a systematic review, sponsors should not exert any influence once the (broad) review question is defined. However, input on the scope as well as oversight of the review to ensure progress and timeliness should be allowed. All sponsorships should be acknowledged. The motivation of the sponsor to support a specific systematic review should be made known to all parties involved. Best practices regarding sponsorship are likely to evolve with the acceptance and application of systematic reviews in toxicology.
By the end of the planning stage, the decision to conduct the review will have been confirmed (or not). The resources and the timeframe will have been established, and the review team and advisory group will be in place.
2. Framing the question
Once the need for a systematic review has been established, that need should be translated into the review question(s) for conducting the review. Framing the review question(s) is a crucial step in a systematic review.
When reviewing clinical questions/interventions, the process of formulating a review question follows a structured framework that consists of a few essential components. One of the approaches for formulating systematic review questions used in medicine is captured in the PICO framework. This framework calls for the systematic review question to address the Population/participants, the Intervention, the Comparison or Control, and, if considered relevant, the Outcome. In addition, the review question may also specify the types of studies to be considered, e.g., randomized clinical trials, and may include the (clinical) setting, which is sometimes reflected by amending the framework to PICOTS.
In general, this framework should also be applicable to the toxicological context; it has been adapted to exposure-related review questions as PECO, replacing the Intervention component with Exposure, defined by exposure conditions, e.g., a substance or radiation, the route (e.g., oral, dermal, intravenous), the duration (ranging from a single acute exposure to daily chronic exposure for a lifetime) and the relevant exposure range. Appropriate definition of the exposure component is essential for the relevance of the review results for public health protection. An example of a PECO for a chemical health effect is to investigate whether chronic oral exposure to chemical X (exposure) induces health effect Y (outcome) in adult rats (population) as compared to non-exposed adult rats (control). The population should clearly define the evidence stream(s) and the subjects considered (e.g., adult, juvenile/children, pregnant, healthy, diseased, etc.).
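As a minimal sketch, the PECO elements of the rat example above could be recorded in a small structured form so that each component is stated explicitly in the protocol. The class name, field names and sentence template are illustrative assumptions, not part of any guidance document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PecoQuestion:
    """Structured PECO elements of a review question (hypothetical field names)."""
    population: str
    exposure: str
    comparator: str
    outcome: str

    def as_sentence(self) -> str:
        # Assemble the four elements into a single primary review question.
        return (
            f"Does {self.exposure} induce {self.outcome} in {self.population} "
            f"as compared to {self.comparator}?"
        )


# The rat example from the text, expressed as structured elements.
question = PecoQuestion(
    population="adult rats",
    exposure="chronic oral exposure to chemical X",
    comparator="non-exposed adult rats",
    outcome="health effect Y",
)
print(question.as_sentence())
```

Writing the components out separately makes it easier to see when one of them, such as the exposure route or duration, has been left underspecified.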
For systematic reviews assessing the accuracy of a test method compared with another, one would need to specify an index test and a comparator test. Note that not all components are relevant to every systematic review and that specific questions may require different components. Regardless of specifics, framework components are a general requirement and are specified in the systematic review protocol. The review team should plan sufficient time for framing the review question, including the generation of the associated rationale and context, and possibly for iterative modifications, as decisions made during problem formulation have significant impact on the scope and form of the systematic review.
What is the effect of exposure to fluoride used as additive for water fluoridation, compared to vehicle-only treatment, on neurobehavioral outcomes in whole non-human mammalian animals? (National Toxicology Program 2016).
What is the epidemiological evidence for an association of low-level arsenic exposure in drinking water with cardiovascular disease? (Tsuji et al. 2014).
What is the animal and human evidence for the usefulness of ophthalmate as a biomarker for oxidative stress and hepatic glutathione homeostasis? (Dello et al. 2013).
What is the association between intake of isoflavones from food supplements and adverse effects on the three target organs (mammary gland, uterus and thyroid) in peri- and post-menopausal women? (EFSA Panel ANS 2015).
Is developmental exposure to air pollution associated with autism spectrum disorder? (Lam et al. 2016).
For healthy adults, is caffeine intake above 400 mg/day, compared to intakes of 400 mg/day or less, associated with adverse effects on cardiovascular outcomes? (PROSPERO 2015: CRD42015026673; available from http://www.crd.york.ac.uk/PROSPERO/display_record.asp?ID=CRD42015026673).
While secondary review question(s) may be addressed, a clear single primary review question should drive the formulation of the review. Because this question will be the systematic review’s guiding element and principal goal, defining it precisely and appropriately is of crucial importance; the entire review team should be involved in the process. A properly framed review question will facilitate all the review’s subsequent steps, including the definition of the eligibility criteria and the literature search, how the evidence/data will be collected, and how the results will be presented and integrated. In particular, the question should help define the criteria for the inclusion and exclusion of research studies in a way that ensures that all relevant evidence is included to answer a particular question. For example, the review question could focus on a specific study type, such as chronic toxicity studies in animals, and would exclude any other study type, such as acute or sub-acute toxicity studies. An example of a question about a specific hazard of a substance would be ‘What is the current evidence from animal studies that substance X, compared to a control (e.g., vehicle or no treatment), can induce effect Y?’
Once the review question has been formulated, it can be modified if more detailed insight into the topic demands it. However, all modifications should be well documented and justified, agreed to by the entire review team, eventually approved by the advisory group and reported, preferably both in a protocol and in the final publication. Any introduction of bias, such as modifying the review question after data extraction, should be strictly avoided. Modifications to the question during protocol development may be indicated, for example, when the scope of the review question proves to be too narrow or too wide, the study type(s) to be considered need to be restricted or extended, or the outcome of interest is too specific or nonspecific (e.g., developmental effects vs malformations vs delayed ossification).
3. Developing and publishing the protocol
Protocol publication and registration
Once the review question has been defined, the protocol needs to be developed. Protocol development is often an iterative process. The protocol should specify the methods for conducting the systematic review in such detail that the review could be independently reproduced. Defining the process without detailed knowledge of the evidence reduces the potential for introducing bias. As the review team’s understanding of the topic evolves, issues may arise that make the need for adjustments to the protocol apparent. To ensure the transparency of any protocol modifications, the review team needs to document and justify them. Developing the protocol may require regular communication within the review team and with the advisory group. Standardized protocol formats are available for clinical systematic reviews (e.g., Higgins and Green 2011) and for animal intervention studies (e.g., de Vries et al. 2015), but not for systematic reviews of toxicological issues.
In particular, the protocol should describe the following:
The literature search strategy, including the databases and other sources to be searched, the languages to be considered, the publication period to be covered as well as database-specific search syntax.
The inclusion/exclusion criteria detailing how the studies/hits identified during the search, which can be in the thousands, will be screened for relevance, and a description of how these criteria will be applied. Usually, at least two reviewers screen the studies independently in duplicate, initially based on titles and abstracts, and later on full texts.
The data to be extracted and the process of data extraction, including an explanation of how the data will be retrieved from disparate data sources. This should involve at least two trained extractors who abstract data independently in duplicate using standardized forms/templates.
The criteria that will be used to assess each evaluated study’s quality, including internal validity/risk of bias, i.e., the degree to which a result of a study is likely to be true and free of bias, such as selection bias, performance bias or detection bias, but also other quality aspects, e.g., related to exposure, and a description of how these criteria will be applied. Usually, at least two reviewers assess the studies independently in duplicate. Instructions for assessing the risk of bias of the entire body of evidence should also be included.
How the data will be summarized and synthesized to develop conclusions, which may or may not include a quantitative analysis, e.g., a meta-analysis. Here, it is helpful to anticipate the likely data types to be encountered, e.g., ordinal or dichotomous, one- or multi-dimensional, and to determine how the (summary) data will be represented when the data are synthesized.
The process of determining confidence in the final result distilled out of the included studies, considering aspects such as precision, consistency, directness, magnitude, dose–response relationship, publication bias and aspects of quality, including external validity and internal validity/risk of bias.
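The duplicate, independent screening described in the protocol items above can be illustrated with a short sketch that compares the decisions of two reviewers and lists the records on which they disagree. The record identifiers and decisions are hypothetical, and the way disagreements are then resolved is left to the protocol.

```python
from typing import Dict, List


def find_conflicts(reviewer_a: Dict[str, bool], reviewer_b: Dict[str, bool]) -> List[str]:
    """Return the IDs of records on which two independent screeners disagree."""
    if reviewer_a.keys() != reviewer_b.keys():
        raise ValueError("Both reviewers must screen the same set of records")
    return sorted(rid for rid in reviewer_a if reviewer_a[rid] != reviewer_b[rid])


# Hypothetical title/abstract screening decisions (True = include).
decisions_a = {"rec1": True, "rec2": False, "rec3": True, "rec4": False}
decisions_b = {"rec1": True, "rec2": True, "rec3": True, "rec4": False}

# Records needing resolution before full-text screening.
conflicts = find_conflicts(decisions_a, decisions_b)

# Records both reviewers agreed to include.
agreed_includes = sorted(
    rid for rid in decisions_a if decisions_a[rid] and decisions_b[rid]
)

print("Conflicts:", conflicts)
print("Agreed includes:", agreed_includes)
```

Keeping the two sets of decisions separate until this comparison is what makes the screening independent; the same pattern would apply to duplicate data extraction.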
Once the review team considers the protocol complete, it should be made publicly available. This allows interested parties that were not involved or consulted in the preparation of the systematic review, such as additional experts or stakeholders, to provide constructive input at a stage that still allows the team to accommodate suggestions. The review team should be, within limits, responsive to the comments received. Additionally, protocol publication or registration serves as documentation of a priori decisions, a critical component of systematic reviews. For example, it safeguards against the introduction of bias via changing methods part-way through the review process, as it requires justifying the points at which the actual report of results deviates from the protocol. Protocols can be submitted at various stages, from the very initial stages of problem formulation onward, but always prior to the completion of data extraction. Several venues for protocol publication or registration are available. Aside from PROSPERO, HAWC (Health Assessment Workspace Collaborative), which is designed to facilitate development of human health assessments of chemicals, provides an opportunity for authors to document and make protocols publicly available. Other venues for publishing protocols include the Open Science Framework (www.osf.io), CAMARADES for animal systematic reviews (http://www.dcn.ed.ac.uk/camarades/), or any electronic repository that is publicly available and searchable (e.g., the UOttawa repository, https://www.ruor.uottawa.ca/).
4. Searching the evidence
Design of a search strategy
Implementation of the search
Sources to search
The literature search is at the heart of a systematic review. It needs to be sensitive enough that it does not inadvertently exclude evidence that is relevant to the review question, yet specific enough that it does not return an unmanageably large amount of irrelevant information. Care must be taken not to introduce bias during the literature search (for example, by accidentally searching only sources that tend to report significant findings). Therefore, the search strategy needs to be developed thoughtfully with the help of an information specialist experienced in systematic review searches, and documented in the protocol.
The search strategy should:
be guided by the primary question, e.g., in the selection of search terms
minimize potential sources of bias, e.g., by specifying the information sources to be searched
be in line with the pertinent inclusion criteria, e.g., publication date or language(s) to be considered
be developed using syntax specific to databases (e.g., MeSH terms in PubMed)
strike a balance between sensitivity, i.e., the ability to identify relevant evidence, and specificity, i.e., the ability to exclude irrelevant information
be double-checked for appropriateness, e.g., in a pilot phase
be appropriately documented in the protocol.
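One way to check the balance between sensitivity and specificity in a pilot phase is to compare the records a draft search retrieves against a benchmark set of records already known to be relevant. The sketch below makes two assumptions that are not taken from the guidance documents: the benchmark set is hypothetical, and precision (the share of retrieved records that are relevant) is used as a practical stand-in for specificity, because the number of irrelevant records a search correctly leaves out is usually unknown.

```python
from typing import Set, Tuple


def search_performance(retrieved: Set[str], known_relevant: Set[str]) -> Tuple[float, float]:
    """Return (sensitivity, precision) of a search against a benchmark set."""
    if not retrieved or not known_relevant:
        raise ValueError("Both sets must be non-empty")
    hits = retrieved & known_relevant
    # Sensitivity: share of the known relevant records that the search found.
    sensitivity = len(hits) / len(known_relevant)
    # Precision: share of the retrieved records that are known to be relevant.
    precision = len(hits) / len(retrieved)
    return sensitivity, precision


# Hypothetical benchmark of relevant records and hypothetical pilot search results.
benchmark = {"A", "B", "C", "D"}
pilot_results = {"A", "B", "C", "X", "Y", "Z", "W", "V"}

sensitivity, precision = search_performance(pilot_results, benchmark)
print(f"sensitivity={sensitivity:.2f}, precision={precision:.3f}")
```

In this illustration the draft search finds three of the four benchmark records while five of the eight retrieved records are not in the benchmark, which is the kind of trade-off a pilot phase is meant to reveal.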
Because including population(s) and outcome(s) in a search can render it too complex to be conducted efficiently, these parameters are not usually considered in clinical systematic review searches. However, a search strategy for a toxicology topic can be expected to include the evidence stream assessed, e.g., animal, human, in vitro or mechanistic studies, and often also the toxicological endpoint(s) of interest.
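To illustrate how the exposure, the evidence stream and the endpoint can be combined in one search, the sketch below joins synonyms for each concept with OR and then joins the concepts with AND. All search terms are hypothetical, and the output uses generic Boolean notation only; real databases each have their own syntax and controlled vocabulary, which an information specialist would apply.

```python
from typing import Iterable, List


def or_block(terms: Iterable[str]) -> str:
    """Join synonyms for one concept with OR, quoting multi-word phrases."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(concepts: List[List[str]]) -> str:
    """Combine the concept blocks with AND so every concept must be matched."""
    return " AND ".join(or_block(concept) for concept in concepts)


# Hypothetical term lists for the three concepts.
exposure = ["chemical X", "substance X"]
evidence_stream = ["rat", "rats", "mouse", "mice"]
endpoint = ["hepatotoxicity", "liver injury"]

query = build_query([exposure, evidence_stream, endpoint])
print(query)
```

Adding a concept block narrows the search, so each additional block trades some sensitivity for a more manageable number of hits.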
The search should be implemented in an objective manner. In this way, some biases, e.g., inclusions of studies known to the authors, or supporting the authors’ view (frequently observed in traditional narrative reviews) can be minimized. Furthermore, authors need to be especially aware of issues such as publication bias, i.e., systematic differences between the findings of published and unpublished research, selective outcome reporting, time-lag bias (time of publication depending on the results), citation bias [(non-) citations driven by the results], gray literature bias (publication in gray literature depending on the results) or multiple publication (Song et al. 2014). It remains to be explored which biases play a role for toxicological systematic reviews.
Some biases can be minimized by an appropriate choice of sources to search. Searches will be most efficient in bibliographic databases such as PubMed, EMBASE or Toxline, as they offer advanced search options and will in many cases include at least a substantial proportion of the relevant evidence. BIOSIS Previews, which inter alia includes proceedings, provides access to a certain proportion of gray literature. To determine which sources to search, review teams may wish to contact other experts in the topic of interest, such as research groups or manufacturers. Citation searching may be necessary to complement the database search, using tools such as the Science Citation Index (SCI) or SCI Expanded, as implemented, e.g., by SciSearch, the Web of Science or Scopus. These tools can be used to support the identification of relevant backward citations, i.e., references in eligible studies, as well as forward citations, i.e., later studies referring to an eligible study. It should be clearly stated what level of detail will be required for inclusion, e.g., whether posters or abstracts will be considered.
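The difference between backward and forward citation searching can be made concrete with a toy citation map. The sketch below is a hypothetical illustration in Python, not the interface of any citation index: the dictionary `references` maps each study to the studies it cites, and all study identifiers are invented.

```python
def backward_citations(references, eligible):
    """Studies cited by the eligible studies (their reference lists)."""
    eligible = set(eligible)
    found = set()
    for study in eligible:
        found.update(references.get(study, ()))
    return found - eligible


def forward_citations(references, eligible):
    """Later studies that cite at least one eligible study."""
    eligible = set(eligible)
    citing = {study for study, refs in references.items() if eligible & set(refs)}
    return citing - eligible


# Hypothetical citation map: study -> studies it cites
references = {"S3": {"S1", "S2"}, "S4": {"S3"}, "S5": {"S2"}}
backward = backward_citations(references, {"S3"})
forward = forward_citations(references, {"S3"})
print(sorted(backward), sorted(forward))
```

Records found in this way still need to pass the same eligibility assessment as records found through the database search.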
In addition, other sources of evidence may be considered, including general search engines (e.g., Google Scholar), subject-specific or regional databases (e.g., National Toxicology Program study databases), or dissertation and thesis databases.
Gray literature is defined here as material that is either unpublished or not controlled by commercial publishers. It often differs from journal articles in form (e.g., it is not a scientific article or report) and in the way it is available (i.e., not in journals or databases). It may comprise a variety of sources, such as government reports, theses, dissertations, conference proceedings, regulatory databases, case reports or interest group media, such as websites. Gray literature is potentially as important for toxicological systematic reviews as it is for those in the clinical field, but with a different emphasis. While in the clinical field gray literature has been shown to include more ‘negative’ results (Hopewell et al. 2007), no evidence on any possible pattern in gray toxicological literature is available. However, it can be expected that it will be associated with intellectual property. Therefore, some of the original toxicological studies might not be readily accessible to the review team, since these are owned by the study sponsors. It is essential to clearly define the gray literature sources to be searched and the type of information to be included [experimental studies, case studies, collections of information (e.g., on websites)]. Note that with the new channels of scientific communication, i.e., all forms of web publishing, including blogs, newsletters and websites, the definition of gray literature is currently undergoing adaptation, which may require a review of the use of this term in the context of systematic reviews.
Proper documentation of the search is required to allow its replication. While the overall strategy design is reported in the protocol, the database-specific search strategies should be included in an appendix to the review. Consequently, the exact strategies should be stored electronically with the date of the search and the number of unique records found. This information should be reported in the first step of a study flow diagram.
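One way to store the exact strategies electronically is a small structured log per database. The following Python sketch is a minimal illustration under stated assumptions: the field names, the query string and the record identifiers are hypothetical, and unique records are counted by comparing identifiers (e.g., DOIs) across databases.

```python
from dataclasses import dataclass, asdict
from datetime import date
import json


@dataclass(frozen=True)
class SearchLogEntry:
    database: str       # e.g., "PubMed"
    search_date: str    # ISO date on which the search was run
    query: str          # exact database-specific search string
    records_found: int  # number of records returned by this database


def total_unique(record_ids_by_database):
    """Count unique records across databases (IDs assumed comparable)."""
    unique = set()
    for ids in record_ids_by_database.values():
        unique.update(ids)
    return len(unique)


# Hypothetical query and results, for illustration only
entry = SearchLogEntry(
    database="PubMed",
    search_date=date(2016, 1, 15).isoformat(),
    query='("bisphenol A"[tiab]) AND (rats[MeSH Terms])',
    records_found=3,
)
ids = {"PubMed": {"doi:a", "doi:b", "doi:c"}, "EMBASE": {"doi:b", "doi:d"}}
log = {"searches": [asdict(entry)], "unique_records": total_unique(ids)}
print(json.dumps(log, indent=2))
```

The stored number of unique records is the figure that enters the first box of the study flow diagram.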
- 5. Selecting the evidence
Literature searches usually yield thousands of records. Many of these will either not be relevant or will have specific characteristics that disqualify them for evidence synthesis. To prevent subjectivity in the evidence selection, systematic reviews include definition of eligibility criteria that are used to identify the proportion of relevant studies that will ultimately be included in the systematic review. Driven by the review question, it is important to define the criteria as unambiguously as possible to allow consistent interpretation and application by the assessors. To minimize biases and to ensure reproducibility of the process of selecting the studies, the eligibility criteria need to be defined in advance in such a way that the scope for subjective judgment is reduced to a minimum.
Generally, the eligibility criteria for a systematic review will address the pivotal aspects of the framed question, i.e., in case of PECO, its four defining elements. Depending on the scope of the systematic review, the population may refer to a human study population (e.g., in terms of sex, age, geographical region); to animal species, strains and other characteristics for in vivo studies; or to cell source(s) for in vitro studies. Exposure criteria may address the preparation of the doses/concentrations or the administration/treatment scheme (route, characterization, stability, frequency, duration, optimal treatment window, etc.). Criteria related to the comparison will specify requirements related to experimental controls (e.g., for animal and in vitro studies, the need for a negative or vehicle control) and aspects like randomization and blinding. The eligibility of outcomes needs to be carefully considered in order not to bias the review. Systematic exclusion of outcomes may bias the results, for example by excluding evidence that opposes the effect determined by the included outcomes. By contrast, inclusion of a wide range of outcomes may result in a spectrum too broad for a meaningful synthesis, e.g., when considering all types of developmental effects. Other study characteristics that may be used as eligibility criteria include the animal species, reporting of required data and, although often discouraged, the language and date ranges. Studies’ internal validity/risk of bias may be used to exclude studies. However, this has the potential (a) to discard too much evidence that is perceived to be of low internal validity but may nevertheless be useful and/or (b) to inject a substantial risk of bias, as studies first need to undergo detailed assessment, which may influence reviewers. Therefore, the internal validity of studies is usually accounted for in a sensitivity or subgroup analysis later in the review process.
Another bias may be introduced at this stage of a review if the reviewers’ knowledge results in the formulation of eligibility criteria potentially biased to meet the experts’ expectations. A further potential pitfall at this stage is random error associated with reading and reviewing records.
The selection process should be described in detail in the protocol. It should specify the qualification and/or training of the reviewers and how the quality of the selection is controlled (usually by independent duplicate review, i.e., requiring that two reviewers independently carry out the selection, with a procedure to resolve disagreements). In addition, it should provide instructions for documenting the selection in a way that allows its reproduction. The selection is usually carried out in two stages. First, all identified records are screened, e.g., on the basis of title and abstract, to exclude obviously irrelevant records. Although reviewers should be conservative and, when in doubt, should not exclude studies, screening substantially reduces the number of records. Rejected studies will either be completely off-topic or fail to meet one or more eligibility criteria. These rejections should be clearly documented, including a justification. The second stage involves retrieving full reports of the remaining records and determining their eligibility. The task of obtaining full study reports poses challenges to the review team, as both the retrieval process and the eligibility assessment may be time consuming. For toxicological issues, reports may be scientific articles, but can also be study reports usually in the possession of the study sponsor or conducting agency, which may be difficult to obtain.
Duplicate records should be identified and excluded, usually before the eligibility is evaluated. While some duplicates may be straightforward to identify (e.g., identical records retrieved from different sources), especially when using reference managing software, or other software designed to facilitate systematic reviews, others may be very difficult to detect. Duplicates may be especially hard to detect when only parts of the data have been duplicated.
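The straightforward cases, i.e., identical records retrieved from different sources, can be caught with simple key matching. The Python sketch below is a minimal illustration, assuming hypothetical records with `title`, `year` and optional `doi` fields; it matches on DOI or on a normalized title plus year, and it will not detect the difficult cases in which only parts of the data have been duplicated.

```python
import re
import unicodedata


def normalize_title(title):
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def deduplicate(records):
    """Keep the first record for each DOI or (normalized title, year) key."""
    seen, unique, duplicates = set(), [], []
    for rec in records:
        keys = {("title", normalize_title(rec["title"]), rec.get("year"))}
        doi = (rec.get("doi") or "").strip().lower()
        if doi:
            keys.add(("doi", doi))
        if keys & seen:
            duplicates.append(rec)
        else:
            unique.append(rec)
        seen |= keys
    return unique, duplicates


# Hypothetical records retrieved from two databases
records = [
    {"title": "Effects of Compound X in Rats.", "year": 2010, "doi": "10.1000/abc"},
    {"title": "Effects of compound X in rats", "year": 2010, "doi": None},
    {"title": "A different study", "year": 2012, "doi": "10.1000/xyz"},
    {"title": "Different title, same DOI", "year": 2010, "doi": "10.1000/ABC"},
]
unique, duplicates = deduplicate(records)
print(len(unique), "unique,", len(duplicates), "duplicates")
```

Flagged duplicates are best checked by a reviewer before removal, since matching keys alone cannot confirm that two records describe the same data.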
Researchers may want to test the appropriateness of the selection process in a pilot exercise. For this purpose, a representative subset of studies identified through the literature search should be selected. To assess reproducibility and appropriateness, two reviewers should independently apply the selection criteria. This practice can identify and remedy ambiguities both in the eligibility criteria themselves and in the way the reviewers interpret them.
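Agreement between the two reviewers in such a pilot can be quantified. The source does not prescribe a statistic; Cohen's kappa, which corrects the observed agreement for agreement expected by chance, is a commonly used choice and is assumed here. The ratings in this Python sketch are hypothetical.

```python
def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers rating the same records."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both reviewers must rate the same, non-empty set of records")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    categories = set(ratings_a) | set(ratings_b)
    expected = sum(
        (ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # both reviewers used one identical category throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical pilot on ten records
reviewer_1 = ["include"] * 4 + ["exclude"] * 6
reviewer_2 = ["include"] * 3 + ["exclude"] * 7
kappa = cohens_kappa(reviewer_1, reviewer_2)
disagreements = [i for i, (a, b) in enumerate(zip(reviewer_1, reviewer_2)) if a != b]
print(f"kappa={kappa:.2f}, records to discuss: {disagreements}")
```

The list of disagreements is often more informative than the coefficient itself, because discussing those records reveals which eligibility criteria are ambiguous.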
Detailed documentation of the decision(s) made in the selection process is essential for the transparency of the review. Reviewers’ assessments should be captured, as well as the solutions in case of disagreements. The reasons for exclusion of records at the screening stage should be documented in a dedicated place, where distinguishing the irrelevant cases from the cases that failed eligibility criteria is considered sufficient. Similarly, exclusion at the full-text level should be documented. All full texts retrieved should be kept in a database.
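A decision log with one entry per record and stage makes this documentation easy to summarize. The Python sketch below assumes a hypothetical log structure and invented exclusion reasons; it tallies exclusions by stage and reason, which yields the numbers typically reported in a study flow diagram.

```python
from collections import Counter

# Hypothetical decision log: one entry per record and selection stage
decisions = [
    {"record": "rec01", "stage": "screening", "decision": "exclude", "reason": "off-topic"},
    {"record": "rec02", "stage": "screening", "decision": "exclude",
     "reason": "failed eligibility criteria"},
    {"record": "rec03", "stage": "full text", "decision": "exclude",
     "reason": "no vehicle control"},
    {"record": "rec04", "stage": "full text", "decision": "include", "reason": ""},
    {"record": "rec05", "stage": "screening", "decision": "exclude", "reason": "off-topic"},
]


def exclusion_summary(log):
    """Count excluded records per (stage, reason)."""
    return Counter((d["stage"], d["reason"]) for d in log if d["decision"] == "exclude")


summary = exclusion_summary(decisions)
included = [d["record"] for d in decisions if d["decision"] == "include"]
for (stage, reason), count in sorted(summary.items()):
    print(f"{stage}: {reason} ({count})")
```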
Data to collect
Data extraction process
Data extraction is the process of collecting relevant information from the full-text version of selected studies for the subsequent data summary and analysis steps of a systematic review. Thorough planning of the extraction is required to minimize biases, reduce human errors, and allow for reproducibility. A priori planning of the tables and figures to be included in the final report can help ensure that all data relevant for the intended analyses will be collected.
Proper documentation, the use of user-friendly extraction tables or software, and a piloting exercise (see below) will maximize reproducibility of the extraction. Striking the right balance between over- and under-extracting is the key, as failing to collect relevant data may require additional reviews of the full texts of all eligible studies, while extraction of irrelevant data will unnecessarily consume valuable time and resources.
The data to collect typically include:
- eligibility, including reasons for both inclusion and exclusion (see above)
- data for quality/risk of bias assessment
- study characteristics (regarding all components, e.g., PECO)
- other information, such as funding source, study authors’ conclusions or whether study authors were contacted.
Collected data need to be augmented with general information, e.g., reviewer ID and date of extraction, and with unambiguous study identifiers, such as the citation and a study ID. For some cases, e.g., when only study types with a well-harmonized way of reporting are considered, it is efficient to also extract the data required for the study quality/risk of bias assessment, as described in the next step, during this step. For fields with lesser standardization such as toxicology, it may be more prudent to conduct three different extractions—one for the study characteristics, one for the quality/risk of bias assessment, and one for the results—as each of them requires a different focus and level of interpretation by the extractor.
Usually, most data should be extracted from study reports that allow the data source to be traced. Some journals include the raw data as supplementary material, and some funding agencies require that raw data be submitted to a central database. In some cases, the review team may decide to seek clarification or missing data by contacting study authors. Because information obtained in this way will not be directly accessible to others, it needs to be explicitly annotated. Moreover, this procedure creates the risk of introducing bias, as only some authors will respond. The author response rate should be reported in the final review.
The data extraction process is greatly facilitated using electronic data extraction forms. Various nonspecific (e.g., spreadsheet or database application) or specific (e.g., DRAGON, HAWC, RevMan, or DistillerSR) software solutions are available. The extraction form, a sample of which should be included in the study protocol, compiles the relevant data for the review in a clear and unambiguous manner. A clear structure and features such as pre-specified entries increase efficiency as well as user-friendliness, which in turn will reduce human error. It is important to design an efficient extraction process, e.g., using pre-defined lists of values for specific information to be extracted, and by minimizing free text fields.
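The idea of pre-defined lists of values can be sketched in a few lines of Python. This is a hypothetical, heavily reduced extraction form, not the data model of any of the tools named above: the field names and the permitted values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class EvidenceStream(Enum):
    HUMAN = "human"
    ANIMAL = "animal"
    IN_VITRO = "in vitro"


class ExposureRoute(Enum):
    ORAL = "oral"
    INHALATION = "inhalation"
    DERMAL = "dermal"
    OTHER = "other"


@dataclass
class ExtractionRecord:
    study_id: str
    citation: str
    reviewer_id: str
    extraction_date: date
    evidence_stream: EvidenceStream
    exposure_route: ExposureRoute
    funding_source: str = "not reported"
    authors_contacted: bool = False

    def __post_init__(self):
        # Coerce to the pre-defined lists; unknown values raise ValueError
        self.evidence_stream = EvidenceStream(self.evidence_stream)
        self.exposure_route = ExposureRoute(self.exposure_route)


# Hypothetical study entered with values from the pre-defined lists
record = ExtractionRecord("S001", "Hypothetical et al. 2015", "R1",
                          date(2016, 3, 1), "animal", "oral")

# Free text that is not on the list is rejected at entry time
try:
    ExtractionRecord("S002", "Hypothetical et al. 2016", "R2",
                     date(2016, 3, 2), "animal", "by mouth")
    rejected = False
except ValueError:
    rejected = True
print(record.exposure_route.value, rejected)
```

Rejecting unlisted values at entry time keeps the extracted data consistent across reviewers, at the cost of requiring an explicit "other" category for cases the form did not anticipate.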
The extraction itself is a time-consuming process, and standardized approaches will help to reduce the burden. Certain document formats, e.g., PDFs, can facilitate the process by allowing electronic searching. Dual independent extraction by trained review authors is strongly recommended, while extraction by one reviewer and quality control by a second may be acceptable. Consistency and reproducibility of extraction can be evaluated by a piloting exercise with some representative studies; experience shows that this often leads to modifications of the form. A process to resolve reviewer disagreements should be specified at the outset.
- 7. Assessing the evidence
Internal validity/risk of bias
In a systematic review, the quality of the individual pieces of evidence is assessed systematically. In this context, some authorities, especially Cochrane, purposefully avoid use of the term ‘quality,’ owing to the potential for misunderstanding. Consequently, this section briefly introduces the terminology used in this context.
The term ‘methodological quality’ or ‘study quality’ can refer to study validity as well as to other methodological criteria such as ethical approval and reporting or lack of power (Krauth et al. 2013). An individual study’s validity is composed of its external validity, or relevance, i.e., the extent to which a study provides a correct basis to be generalized to other circumstances, and its internal validity, which is concerned with the reliability of the study itself, regardless of whether it is relevant to other circumstances. A study is internally valid if the differences in results observed between the experimental groups can, apart from random error, be attributed to the intervention under investigation. Certain characteristics of a study may threaten its internal validity, namely if these characteristics introduce systematic differences between the experimental groups other than the intervention of interest. These differences may result in either systematic over- or underestimation of the true effect size, i.e., bias. However, the actual bias (magnitude and direction) in a study can usually not be assessed. Therefore, the term ‘risk of bias’ (RoB) is now widely used in the clinical field to assess the degree of bias susceptibility of a study.
Given that we do not have space to discuss all aspects of methodological quality and given that internal validity/RoB is considered to be a crucial element of quality assessments in toxicological reviews (Krauth et al. 2013; Lam et al. 2014), we will focus in the remainder of this section on internal validity/RoB. However, external validity and possibly other quality aspects of toxicological studies are also important and should be evaluated systematically. Samuel et al. (2016), who provided a systematic compilation of available approaches for assessing methodological and reporting quality of toxicologically relevant studies, present a good starting point for exploring study quality more broadly.
Selection bias refers to systematic differences between baseline characteristics of the groups that are compared. In experimental studies, this type of bias can be reduced or prevented by randomized allocation and allocation concealment.
Performance bias refers to systematic differences between groups in the care that is provided, or in exposure to factors other than the interventions of interest. It can be minimized by, for example, blinding researchers and caretakers or by randomizing the order in which the groups receive the experimental exposure.
Detection bias refers to systematic differences in the way the outcomes are assessed, e.g., when outcome assessors are aware to which experimental group the subject/specimen being assessed belongs. It can be avoided by appropriate blinding and randomization of the outcome assessment.
Attrition bias refers to systematic differences between the experimental groups in withdrawals or drop-outs from the study. Withdrawals or drop-outs lead to incomplete outcome data. Because the outcomes of the study can only be based on the available data, the reported outcomes may not reflect the true effect of the intervention. Attrition bias can be taken into account by detailed reporting of the number of withdrawals/drop-outs per experimental group and the reason for withdrawing/dropping out.
Reporting bias refers to systematic differences between reported and unreported findings. For instance, in a published report, analyses with statistically significant differences between intervention groups are more likely to be reported than those with non-significant differences. This sort of ‘within-study publication bias’ is usually known as outcome-reporting bias or selective-reporting bias. The risk of this type of bias can only be assessed if protocols for primary studies are registered or made publicly available before the data analysis. For animal studies, for example, such registration is still highly uncommon.
Other biases: there may be other sources of bias that are relevant only in certain circumstances or for particular study designs. It is up to the review authors to judge whether for the studies that will be included in their systematic review other factors are likely to cause structural underestimation or overestimation. However, to prevent bias in the application of the criteria for this sixth type of bias, these criteria should be prespecified and clearly defined in the protocol, preferably with an explanation why the criterion in question is likely to reflect an actual risk of bias.
It must be kept in mind that the method for appraising study validity by assessing risk of bias was initially developed in the clinical field, at first for randomized controlled trials. While some of the concepts can directly be transferred to toxicological systematic reviews, e.g., reporting bias, others need to be adapted. Furthermore, for some (aspects of) study types, e.g., in vitro studies on toxicological mechanisms, potential “threats” to internal validity need to be identified.
The RoB assessment is based on specific questions that are defined in advance to address the various bias types. The utility of these questions strongly depends on the kind of evidence to be reviewed. When dealing with human toxicological data, it is possible to adopt the clinical approaches (see e.g., Johnson et al. 2014). With some modifications, the approaches used in clinical systematic reviews have been used for animal experiments, including pre-clinical (see e.g., Wever et al. 2012) and toxicological studies (see e.g., Koustas et al. 2014). For pre-clinical studies, a risk of bias tool has been proposed (Hooijmans et al. 2014a) that focuses on one or more domains for each bias type, e.g., performance bias is addressed by the domains of ‘random housing’ and ‘blinding of caregivers.’ The proposed tool comprises ten specific questions, such as ‘Was the outcome assessor blinded?’ or ‘Are reports of the study free of selective outcome reporting?’ Half of the tool’s questions were in agreement with the Cochrane risk of bias tool (Cochrane handbook) stressing that much can be learned from the role model of clinical systematic review practice. This tool could also be applicable to systematic review of toxicological animal studies. Guidance or examples of risk of bias assessment for in vitro studies are, to our knowledge, not yet available, but the NRC report (2014) addresses this aspect in some detail. Ongoing efforts in toxicology to improve existing scoring systems may be helpful in the future (Segal et al. 2015).
It is important to predefine answers to the questions used to evaluate the risk of bias for each outcome, e.g., low, high, no or unknown risk of bias. This produces results that lend themselves to straightforward summary and improves the consistency and reproducibility of answers from different reviewers. The answer spectrum should be harmonized across questions as much as possible and should be transparent and clearly described to foster reproducibility. In this process, guidance and/or examples for the answer choices for each question further reduce reviewer disagreement. In addition, answers should be justified, which will help to resolve reviewer disagreement.
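Predefined answers and mandatory justifications are straightforward to enforce in an electronic form. The Python sketch below is a hypothetical illustration: it uses a subset of the answer options mentioned above, one question quoted from the proposed tool for animal studies, and invented study identifiers and justifications.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class RobAnswer(Enum):
    LOW = "low"
    HIGH = "high"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class RobJudgement:
    study_id: str
    question: str
    answer: RobAnswer
    justification: str

    def __post_init__(self):
        if not isinstance(self.answer, RobAnswer):
            raise TypeError("answer must be one of the predefined RobAnswer values")
        if not self.justification.strip():
            raise ValueError("every answer needs a justification")


def summarize(judgements):
    """Count answers per question across studies."""
    summary = {}
    for j in judgements:
        summary.setdefault(j.question, Counter())[j.answer.value] += 1
    return summary


BLINDING = "Was the outcome assessor blinded?"
judgements = [
    RobJudgement("S001", BLINDING, RobAnswer.LOW, "Blinded scoring described in methods"),
    RobJudgement("S002", BLINDING, RobAnswer.UNKNOWN, "Blinding not mentioned"),
    RobJudgement("S003", BLINDING, RobAnswer.UNKNOWN, "Blinding not mentioned"),
]
summary = summarize(judgements)
print(dict(summary[BLINDING]))
```

Keeping the summary per question rather than collapsing it into a single score per study preserves the information on which bias domains are affected.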
The importance of the various bias domains or questions varies depending on the outcome. For example, detection bias is likely to be less important, i.e., less likely to distort the results, for animal studies with death as the primary outcome. Detailed discussions of the applicability of certain domains to toxicological animal studies are available [see e.g., National Research Council (NRC) (2014); National Toxicology Program (2015)].
For all potential sources of bias, it is important to consider the likely magnitude and the likely direction of the bias. For example, if all methodological limitations of studies were expected to bias the results towards a lack of effect, and the evidence indicates that the intervention is effective, then it may be concluded that the intervention is effective even in the presence of these potential biases.
A general lack of awareness of the risk of bias concept in the toxicological community results in reporting that omits details important for risk of bias assessment. This may seriously hamper assessing the actual risk of bias of the included studies and therefore of the systematic review. Guidance to improve reporting quality is available for toxicological animal studies, notably the ARRIVE guidelines (Kilkenny et al. 2010) and others (Hooijmans et al. 2010; Landis et al. 2012). However, harmonization of guidance and a focus on potential use in systematic reviews are required. As more toxicological systematic reviews are conducted, their authors will identify which aspects need to be reported (or not), possibly creating an iterative feedback loop motivating authors and even journals.
With regard to use in a systematic review, inadequate reporting per se does not reduce internal validity, but is an obstacle to the assessment of this validity. It remains to be seen if the solution of contacting authors to retrieve missing information is a feasible and useful approach in toxicology. This solution is often applied in the clinical fields, but is generally not very successful because of low response rates. An approach that is likely to work better is education in and adherence to reporting standards, with proper enforcement by journals.
- 8. Analyzing data
Planning the analysis
Heterogeneity, sensitivity analysis and reporting bias
It is important that the analysis is planned a priori, in the context of the review question. To prevent selective outcome reporting, the protocol should describe the planned analysis as precisely as possible, notably the outcome measures to be analyzed as well as the ways to deal with heterogeneity (statistical model, subgroups). The data type (binary, ordinal or continuous) of the measured effects to be analyzed will inform the analytical methods. Because it is not always easy to anticipate all details of the analysis steps, the analysis section of a systematic review protocol sometimes has to be revised once the analysis has been started. It is important that these revisions are clearly indicated and justified both in the protocol and the published systematic review. Consultation with a statistician is strongly recommended, particularly if the data analysis takes the form of a meta-analysis.
A narrative analysis (or narrative synthesis) is a descriptive summary of the included studies’ results. It is an essential ingredient of any systematic review that should provide sufficient detail, usually using tables.
A meta-analysis is a structured quantitative analysis of outcome data from comparable studies leading to a quantitative summary of results. Note that a systematic review does not have to contain a meta-analysis. If, for instance, the number of studies is too low, the outcome measures vary substantially between the included studies or the studies are too dissimilar in design, a meta-analysis does not make sense. The benefits of a meta-analysis include increased statistical power and improved precision in the estimation of an effect. However, the reliability of the results of a meta-analysis depends on the reliability of the included studies. If a meta-analysis contains many low-quality studies (e.g., studies with a high risk of bias), the results can be misleading.
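The gain in precision can be illustrated with the simplest pooling method. The text does not specify a statistical model; the Python sketch below assumes a fixed-effect, inverse-variance approach, in which each study is weighted by the inverse of its squared standard error, and uses hypothetical effect estimates.

```python
import math


def fixed_effect_pool(effects, std_errors):
    """Inverse-variance weighted mean, its standard error and a 95% interval."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    interval = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    return pooled, pooled_se, interval


# Hypothetical effect estimates (e.g., standardized mean differences)
effects = [0.2, 0.5, 0.8]
std_errors = [0.1, 0.2, 0.4]
pooled, pooled_se, interval = fixed_effect_pool(effects, std_errors)
print(f"pooled={pooled:.3f}, SE={pooled_se:.3f}")
```

In this example the pooled standard error is smaller than that of the most precise single study, which is the improved precision referred to above. The pooled estimate inherits any bias of the included studies, so it is only as reliable as its inputs.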
While meta-analysis is the predominant approach in the clinical field, it still has to be explored to what extent and to which types of toxicological review questions meta-analytical techniques can be applied (Goodman et al. 2015). Possibly, lessons can be learned from the experience gained in adapting and applying meta-analyses to pre-clinical animal studies (Vesterinen et al. 2014; Hooijmans et al. 2014b). Meta-analysis of human epidemiological studies has been applied to assess neurobehavioral effects of metal or organic solvent exposure in an occupational setting (e.g., Goodman et al. 2002; Meyer-Baron 2005). Also unique to toxicology is that many assessments involve quantitative characterization of hazard via toxicological benchmark values, such as daily reference doses or occupational limit values. The process of systematic review can support such values, but the approach for doing so is not yet clear.
Heterogeneity in a meta-analytical context refers to statistical heterogeneity, i.e., variation in the results of studies greater than would be expected from chance alone. This heterogeneity may be caused by several types of differences between the included studies. Such differences may be present in study characteristics such as the population, e.g., when different species, strains or cell lines are used; the exposure, e.g., when various exposure routes or durations have been used; the control, e.g., when different solvent vehicles were employed; or the outcome. Furthermore, study designs and methodological factors, as addressed in the risk of bias assessment may vary between included studies, potentially leading to heterogeneous effect sizes.
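Statistical heterogeneity is commonly quantified with Cochran's Q and the derived I² statistic; these specific measures are not named in the text and are assumed here as a conventional choice. The Python sketch below uses hypothetical effect estimates and a fixed-effect pooled mean as the reference value.

```python
def heterogeneity(effects, std_errors):
    """Return Cochran's Q, its degrees of freedom, and I-squared in percent."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Q: weighted squared deviations of study effects from the pooled effect
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I-squared: share of total variation beyond what chance would explain
    i_squared = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, df, i_squared


# Hypothetical effect estimates and standard errors
q, df, i_squared = heterogeneity([0.2, 0.5, 0.8], [0.1, 0.2, 0.4])
print(f"Q={q:.2f} on {df} df, I2={i_squared:.1f}%")
```

With only a few studies, both statistics are imprecise, so a low value does not demonstrate that the studies are homogeneous.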
The protocol should describe how heterogeneity will be identified and dealt with. An important method of exploring the sources of heterogeneity is performing subgroup analyses, in which the studies included in the meta-analysis are split according to basic characteristics that might lead to differences in effect. For example, it might be of toxicological interest to study in a subgroup analysis whether studies that only used a negative control had different results compared to those employing a solvent control and whether these differences (partly) explain the heterogeneity found in the analysis. Subgroup analyses need to be pre-specified in the protocol to prevent selective reporting and should be limited in numbers to allow meaningful interpretation and to keep control of the multiple testing.
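The control-type example can be sketched as follows. This Python illustration assumes a fixed-effect, inverse-variance pooled estimate per subgroup and uses hypothetical study data; the grouping variable `control` and its values are invented for the example.

```python
from collections import defaultdict


def pool(effects, std_errors):
    """Fixed-effect inverse-variance pooled estimate (an assumed model)."""
    weights = [1.0 / se ** 2 for se in std_errors]
    return sum(w * y for w, y in zip(weights, effects)) / sum(weights)


def subgroup_pool(studies, key):
    """Pool the studies separately for each value of a pre-specified characteristic."""
    groups = defaultdict(list)
    for study in studies:
        groups[study[key]].append(study)
    return {
        name: pool([s["effect"] for s in members], [s["se"] for s in members])
        for name, members in groups.items()
    }


# Hypothetical studies, split by the type of control used
studies = [
    {"id": "A", "effect": 0.2, "se": 0.1, "control": "negative"},
    {"id": "B", "effect": 0.4, "se": 0.1, "control": "negative"},
    {"id": "C", "effect": 0.9, "se": 0.2, "control": "solvent"},
    {"id": "D", "effect": 0.7, "se": 0.2, "control": "solvent"},
]
by_control = subgroup_pool(studies, "control")
for name, estimate in by_control.items():
    print(f"{name} control: pooled effect {estimate:.2f}")
```

A difference between subgroup estimates is observational in nature and should be interpreted with caution, particularly when the number of studies per subgroup is small.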
To assess the robustness of the results of the systematic review/meta-analysis, i.e., the extent to which these results depend on the decisions made during the review process, a sensitivity analysis can be performed. Although a major aim of systematic reviews is to be as objective as possible, in the conduct of a review decisions have to be made that to some extent depend on subjective preferences. Examples of such ‘subjective’ decisions are the definition of a numerical value (e.g., NOEL vs LOEL) or the choice of statistical methods (e.g., Chi-square test vs Fisher’s exact test). Documentation and justification of these decisions will ensure transparency. In a sensitivity analysis, however, the impact of these decisions can be shown. Such an analysis studies whether the overall results would have been different, if another choice had been made, for example, if a LOEL rather than a NOEL had been used. If the overall results are the same, the conclusions of the meta-analysis are more robust.
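A sensitivity analysis amounts to repeating the analysis under an alternative decision and comparing the results. The Python sketch below is a hypothetical illustration: the decision examined is whether to retain studies rated as high risk of bias, the pooling model (fixed-effect, inverse-variance) is an assumption, and all data are invented.

```python
def pool(studies):
    """Fixed-effect inverse-variance pooled estimate (an assumed model)."""
    weights = [1.0 / s["se"] ** 2 for s in studies]
    return sum(w * s["effect"] for w, s in zip(weights, studies)) / sum(weights)


# Hypothetical studies with a risk-of-bias rating from the assessment step
studies = [
    {"id": "A", "effect": 0.2, "se": 0.1, "rob": "low"},
    {"id": "B", "effect": 0.5, "se": 0.2, "rob": "high"},
    {"id": "C", "effect": 0.8, "se": 0.4, "rob": "low"},
]

primary = pool(studies)                                        # all studies
alternative = pool([s for s in studies if s["rob"] != "high"])  # low RoB only
shift = alternative - primary
print(f"primary={primary:.3f}, alternative={alternative:.3f}, shift={shift:+.3f}")
```

What size of shift counts as a meaningful change is itself a judgment, so it is best stated in the protocol before the analysis is run.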
In the clinical field, publication bias, i.e., bias in the published literature because studies with neutral or negative results are less likely to be published, is a frequently suspected problem. Tools that provide an indication of the presence and/or impact of publication bias, such as funnel plots and trim-and-fill analysis, have been developed. Moreover, the practice of prospective registration of trials, which makes studies traceable, has been established as a countermeasure.
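Funnel plot asymmetry can also be summarized numerically. The Python sketch below computes the intercept of Egger's regression, a technique not mentioned in the text and swapped in here because it is much simpler to show than trim-and-fill: the standardized effect (effect divided by its standard error) is regressed on precision (the inverse of the standard error), and an intercept far from zero hints at asymmetry. The data are hypothetical, and a formal test would additionally require the standard error of the intercept, which is not computed here.

```python
def egger_intercept(effects, std_errors):
    """Intercept of the regression of effect/SE on 1/SE (ordinary least squares)."""
    if len(effects) < 3:
        raise ValueError("at least three studies are needed")
    x = [1.0 / se for se in std_errors]                  # precision
    y = [e / se for e, se in zip(effects, std_errors)]   # standardized effect
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("standard errors must differ between studies")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x


# Hypothetical data in which smaller (less precise) studies report larger effects
asymmetric = egger_intercept([0.1, 0.3, 0.6, 0.9], [0.05, 0.1, 0.2, 0.4])
# Hypothetical data in which all studies report the same effect
symmetric = egger_intercept([0.5, 0.5, 0.5, 0.5], [0.05, 0.1, 0.2, 0.4])
print(f"asymmetric={asymmetric:.2f}, symmetric={symmetric:.2f}")
```

Asymmetry can have causes other than publication bias, such as genuine heterogeneity between small and large studies, so the intercept is an indication rather than proof.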
- 9. Interpreting the results
Confidence in a body of evidence
Integration of evidence streams
The interpretation (or synthesis) of the results found and the conclusion of the systematic review should be clear, precise, and comprehensive in light of the review question. The components of this final section of the systematic review should be presented in a way that can be easily understood by scientists, the public, and decision makers.
In toxicology, data from the various so-called evidence streams, i.e., sets of studies representing the same type or level of evidence, e.g., human (observational) studies, animal studies, in vitro or mechanistic studies, often need to be integrated. For methodological reasons, it is advisable to conduct systematic reviews for each evidence stream separately. Consequently, interpretation will have to be done on two or more levels: for the individual systematic reviews/evidence streams and for the combination of all evidence streams.
Within each systematic review, the interpretation should be carried out for the so-called bodies of evidence, i.e., sets of studies of the same type or level of evidence grouped by outcome measure. The interpretation is qualitative and aims at determining the confidence (or certainty) in the evidence. The confidence in the evidence expresses the level of certainty that the findings from a group of studies reflect the true relationship between exposure to a substance and the outcome measure in question.
There is no consensus yet on the details of how the confidence in a body of evidence should be determined in the field of toxicology. For systematic reviews for environmental health assessments, Rooney et al. (2014) proposed a system for rating confidence in the body of evidence based on GRADE (Grading of Recommendations Assessment, Development and Evaluation; Guyatt et al. 2011; Balshem et al. 2011). The GRADE approach was developed for healthcare systematic reviews. It starts by setting an initial level of confidence depending on study type, e.g., randomized controlled trials start high and observational studies start low. This initial level of confidence may be decreased or increased if certain attributes are present. Attributes that can reduce the confidence in the body of evidence include overall risk of bias, publication bias, imprecision, inconsistency and indirectness, whereas characteristics such as large effect sizes and a dose–response gradient may increase the confidence. The outcomes of this grading process inform a rating of the final confidence in the body of evidence, which will guide the conclusions. It is acknowledged that this rating may be subjective. However, the rating is considered helpful as it increases the transparency of the final conclusions.
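The logic of starting at an initial level and moving up or down can be sketched in Python. This is a deliberately simplified illustration, not the GRADE procedure itself: it assumes the four GRADE confidence levels, moves exactly one level per attribute (in practice reviewers judge how far to move), and treats the rating as mechanical although it involves judgment.

```python
LEVELS = ["very low", "low", "moderate", "high"]
INITIAL_LEVEL = {"randomized controlled trial": "high", "observational study": "low"}
DOWNGRADE = {"risk of bias", "publication bias", "imprecision",
             "inconsistency", "indirectness"}
UPGRADE = {"large effect size", "dose-response gradient"}


def rate_confidence(study_type, attributes_present):
    """Return the final confidence level for one body of evidence."""
    attributes = set(attributes_present)
    unknown = attributes - DOWNGRADE - UPGRADE
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    index = LEVELS.index(INITIAL_LEVEL[study_type])
    index -= len(attributes & DOWNGRADE)  # simplification: one level each
    index += len(attributes & UPGRADE)
    return LEVELS[max(0, min(index, len(LEVELS) - 1))]


# Hypothetical bodies of evidence
observational = rate_confidence("observational study", ["dose-response gradient"])
trials = rate_confidence("randomized controlled trial", ["risk of bias", "imprecision"])
print(observational, "|", trials)
```

Recording which attributes were judged present, as the function arguments do here, is what makes the final rating transparent even where the judgments themselves are subjective.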
The approach developed by the National Toxicology Program (2015) and the approach used by the Navigation Guide (Woodruff and Sutton 2014) are similar to the original GRADE approach, but use slightly different criteria for setting initial levels of confidence and for upgrading and downgrading. For example, in the Navigation Guide, observational studies start at a moderate rather than a low level of confidence, and NTP uses consistency across species as an additional upgrading criterion. However, it is re-emphasized that there is not yet consensus in the toxicology field regarding this approach.
Ideally, the confidence in the different bodies of evidence (i.e., per outcome) should be integrated into a confidence across outcomes for the evidence stream in question.
As indicated above, a complexity of toxicology, in contrast to healthcare, is the potential need for integration of evidence streams. It has been proposed to distinguish the following three evidence streams: ‘human’; ‘animal’; and ‘mechanistic/in vitro’ (National Toxicology Program 2015). It remains to be seen if this categorization of evidence based on the test species/system is sufficiently coherent, as mechanistic evidence may also be derived from animal studies, and in vitro studies may provide information other than mechanistic. However, mechanistic understanding is essential in the assessment of external validity, especially of non-human evidence. Although the integration of these evidence streams is not part of the systematic review process (it is more about the integration of the results of different systematic reviews), it is desirable that this process is conducted in a structured, transparent and pre-specified way.
The methods to integrate different evidence streams are even less established than the methods to interpret and rate bodies of evidence. An important reason for this is that, at this level, not only the confidence in each evidence stream may be important but also its relevance, e.g., for the human exposure of interest, or external validity.
Questions that arise at this level include: Should all evidence streams receive the same weight, or should they be weighted by their external validity? For example, should evidence from mechanistic in vitro studies, by definition, receive a lower weight because these data might have lower relevance for the human situation? This is especially relevant as more Adverse Outcome Pathways become available, along with data from programs like US EPA ToxCast™ and Tox21, which provide a wealth of data on a molecular level. For example, if the human evidence stream has a low confidence level (e.g., because these studies contain many confounding factors) and the mechanistic stream has a high level (e.g., because it consists of well-designed and well-conducted experimental studies with very similar results), should the latter get more weight than the former?
NTP has developed a framework that translates confidence ratings per evidence stream into evidence of health effects, which is then used to integrate evidence from human and animal studies into a hazard identification conclusion. The mechanistic evidence is used to support the decision to turn the initial hazard conclusion into a final hazard conclusion. Rhomberg (2015) has suggested a different approach for integrating data from different evidence streams, called a hypothesis-based weight-of-evidence approach. However, a standardized approach does not yet exist.
The conclusions of the systematic review should detail the implications of the findings. These may vary depending on the study question(s), but should not go beyond the review scope. For example, in cases where a specific hazardous property of a substance was investigated, the conclusions should make a clear statement about that hazard. Other potential topics to address in the conclusions section include the implications for research, e.g., how to address identified data gaps or how to solve methodological issues, and the reflection on limitations in design and conduct of the review itself.
Presentation of findings
Regardless of the specific form of a systematic review report, some basic requirements exist that are applicable to the reporting of toxicological findings in general. General authorship rules apply, such as identifying a corresponding or lead author. These decisions are likely to be made as part of protocol development and establishment of team members and roles. Of particular importance to systematic reviews, conflict of interest statements for all authors are required. Most, if not all, systematic reviews will be reported in English. Correct spelling and grammar, and clear, concise language should be used, keeping in mind that audiences will often include non-toxicologists. Toxicological systematic reviews should undergo peer review and should be made publicly available.
Systematic reviews can be written up in the form of a stand-alone report (perhaps published on a website) and/or a publication in a peer-reviewed journal. While these two reporting forms may differ in length due to journal restrictions, both should cover essential reporting elements. If important information cannot be included in a journal publication, this should be made available elsewhere, e.g., as supplementary documentation, with a level of detail that allows reproduction of the review, or through data repositories.
Structured abstract/executive summary
Introduction including the rationale and the review question
Methods according to the review protocol including evidence search and selection, data extraction, quality assessment and data analysis
Discussion summarizing results and including a conclusion
This checklist, with minor amendments, has been adopted by others (Sena et al. 2014; Whaley et al. 2016). In addition, commissioning bodies or organizations conducting systematic reviews may have or may develop specific reporting requirements.
The results should present findings in a clear and structured manner, using tables and figures. For example, two tables may be derived from the extracted study data: one for the characteristics of included studies (such as authors and year of publication, source, study design aspects, essential components (e.g., PECO)), as well as one for the study data and results. The most important findings should be summarized narratively, and if possible, complemented by a tabular summary. A flow chart on the study selection process (Fig. 2) should be included.
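The numbers shown in a study selection flow chart must be internally consistent, which is easy to check when they are derived from a few recorded counts. The following Python sketch illustrates this under stated assumptions: the class name `SelectionFlow`, the four recorded stages and all counts are hypothetical, and a real review may need additional boxes (e.g., records from gray literature sources or reasons for full-text exclusion).

```python
from dataclasses import dataclass


@dataclass
class SelectionFlow:
    """Recorded counts from a (hypothetical) study selection process."""

    records_identified: int
    duplicates_removed: int
    excluded_title_abstract: int
    excluded_full_text: int

    def __post_init__(self):
        # Every derived count must be non-negative, otherwise the
        # recorded numbers contradict each other.
        if min(self.records_screened, self.full_texts_assessed,
               self.studies_included) < 0:
            raise ValueError("inconsistent selection counts")

    @property
    def records_screened(self):
        return self.records_identified - self.duplicates_removed

    @property
    def full_texts_assessed(self):
        return self.records_screened - self.excluded_title_abstract

    @property
    def studies_included(self):
        return self.full_texts_assessed - self.excluded_full_text

    def report(self):
        """Return one line of text per box of the flow chart."""
        return [
            f"Records identified: {self.records_identified}",
            f"Records screened after removing duplicates: {self.records_screened}",
            f"Full texts assessed for eligibility: {self.full_texts_assessed}",
            f"Studies included: {self.studies_included}",
        ]


# Illustrative numbers only.
flow = SelectionFlow(
    records_identified=1200,
    duplicates_removed=200,
    excluded_title_abstract=900,
    excluded_full_text=70,
)
print("\n".join(flow.report()))
```

Deriving the later boxes from the recorded exclusions, rather than typing each number by hand, means that a counting error surfaces as an exception instead of as an inconsistent figure in the report.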
The challenge in reporting toxicological systematic reviews will be to include all information that is relevant to allow independent replication of the review. Meeting such requirements may be facilitated by the use of online repositories for protocols, as well as other reporting materials (e.g., extraction and quality tables), particularly for large assessments.
Framing the question
Developing and publishing of the protocol
Searching for evidence
Selecting the evidence
Extracting (the data)
Assessing the evidence
Interpreting the results
Conducting a systematic review is not a trivial task; such reviews are often specifically funded as independent studies, especially for healthcare interventions. The process differs substantially from that involved in narrative reviews, which are commonly used in toxicology. Although systematic reviews have clear advantages, such as their explicit methodology and transparency, they require diverse expertise, substantial resources, time and, for most questions, the availability of sufficient data. Therefore, it is advisable to carefully consider the intentions and aims of a toxicological review before deciding on the type of review to be conducted. When a systematic review is not feasible, a narrative review may be justifiable. However, even in such cases, reviewers should consider the feasibility of implementing at least some of the individual systematic review steps. For example, an explicit, clear and unambiguous statement of the review question should always be provided. In addition, some basic elements of a systematic literature search, such as specifying the databases searched, the search date and the search terms and strategy, can be easily implemented. Even on their own, these steps increase the transparency of the search and study selection process, and would ultimately improve the reproducibility and quality of narrative reviews.
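As a minimal illustration of how the basic search elements named above (database, search date, search terms and strategy) could be documented, consider the following Python sketch. The helper `build_query`, the class `SearchRecord`, the search terms and the date are all hypothetical, and the generic Boolean syntax is an assumption: each bibliographic database has its own query syntax and field tags, which would need to be applied in practice.

```python
from dataclasses import dataclass
from datetime import date


def build_query(concepts):
    """Combine synonyms with OR within a concept and AND across concepts."""
    groups = []
    for terms in concepts.values():
        # Quote multi-word terms so they are searched as phrases.
        quoted = [f'"{term}"' if " " in term else term for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


@dataclass(frozen=True)
class SearchRecord:
    """What was searched, where and when, kept for reporting."""

    database: str
    search_date: date
    query: str


# Invented example terms for an exposure and an outcome concept.
concepts = {
    "exposure": ["bisphenol A", "BPA"],
    "outcome": ["obesity", "body weight"],
}

record = SearchRecord(
    database="PubMed",
    search_date=date(2016, 1, 15),  # illustrative date only
    query=build_query(concepts),
)
print(record.query)
# ("bisphenol A" OR BPA) AND (obesity OR "body weight")
```

Storing one such record per database searched gives a narrative review the same minimal audit trail of its search that a systematic review would report in full.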
Challenges in adapting systematic review methodology to toxicology
Systematic review step
Challenges specific to toxicology
Composition of a skilled review team covering all fields of expertise required, especially systematic review experience
Definition of the role of the systematic review sponsor
Framing the question
Framing of the question in a way that it is amenable to systematic reviews
Developing and publishing of the protocol
Publication of protocols so that they are highly and timely visible to stakeholders and interested parties
Searching for evidence
Identification of sources to be searched, including gray literature sources
Provision of means to conduct appropriately balanced searches, e.g., by better annotation of the toxicological literature
Familiarization of information specialists with toxicological evidence and databases
Selecting the evidence
Handling of the possibly vast amount of identified records and appropriate documentation of the selection process
Extracting (the data)
Efficient and transparent data management
Assessing the evidence
Determination of the importance of the various potential quality aspects, e.g., by empirical evidence
Determination of how best to integrate quality appraisal results into developing and supporting conclusions
Exploration of the role of publication bias in toxicology, i.e., the frequency, the direction and the causes
Exploration of the use of investigating external validity of subgroups
Interpreting the results
Determination of the confidence in a body of evidence
Exploration of how to integrate bodies of evidence within an evidence stream and across evidence streams
Making available all information relevant to allow for independent replication of the review
We recognize that any brief introduction to such a complicated topic will have limitations. Many of the topics addressed above are quite complex, and will require further guidance for complete understanding and practical tools for their implementation in the systematic review methodology. Some experts may question the systematic review steps proposed here, preferring instead an alternate framework. Such issues are only natural, as the application of systematic review to toxicological issues is just emerging. Consequently, there is little direct experience and empirical evidence available to guide these types of systematic reviews, although the knowledge base is rapidly increasing. It is the aim of this primer, and in particular of the challenges highlighted, to stimulate development of tools that facilitate the application of systematic review in toxicology, and to encourage the application of the methodology. The systematic review process will be instrumental in guiding toxicology to a more evidence-based science that is rooted in transparency, objectivity and consistency.
We thank Richard Judson (US Environmental Protection Agency, Washington, DC, USA) and Johanna Rochester (The Endocrine Disruption Exchange, Washington, DC, USA) for their review of an early draft version of the manuscript.
Compliance with ethical standards
Conflict of interest
The contents of this manuscript are solely the responsibility of the authors and do not necessarily reflect the views or policies of their employers. Sebastian Hoffmann, a self-employed consultant, was paid by the Evidence-Based Toxicology Collaboration (EBTC) for his work on this manuscript. Nancy B. Beck was employed by the American Chemistry Council, a trade association of U.S. chemical manufacturers. Sebastian Hoffmann, Martin L. Stephens, John R. Fowle III, Ian Kimber, Nancy B. Beck, Rob de Vries and Katya Tsaioun are members of the Board of Trustees of the EBTC, which is funded by the Center for Alternatives to Animal Testing at the Johns Hopkins Bloomberg School of Public Health and the ExxonMobil Foundation. Hubert Dirven, Julie E. Goodman, Manoj M. Lalu, Kristina Thayer, Paul Whaley and Daniele Wikoff are members of the Scientific Advisory Council of the EBTC. Manoj M. Lalu was supported by The Ottawa Hospital Anesthesia Alternate Funds Association.
A group of people including representation of relevant stakeholders, with relevant interests, skills and commitment to support the review team, e.g., in key decisions, interpretation and dissemination
Bias due to systematic differences between treatment and comparison groups in withdrawals or exclusions from the results of a study.
A systematic error or deviation from the truth in results or inferences. A common classification scheme for biases of individual studies includes selection bias, performance bias, attrition bias and detection bias, whereas publication bias affects a set of studies
Cochrane organizes medical research information in a systematic way to facilitate the choices that health professionals, patients, policy makers and others face in health interventions according to the principles of evidence-based medicine
The exact terms and their combinations (including Boolean operators) used to search a bibliographic database; the strategy is designed to be sensitive, i.e., not to miss relevant evidence
A situation in which a measure of the effect is distorted because of an association between the intervention (or exposure) with other factor(s) that influence the outcome under investigation
Data extraction is the process of retrieving information, primarily study characteristics and outcome data, from included evidence sources for further data processing
Detection bias refers to bias due to systematic differences between groups in how outcomes are determined.
Pre-defined criteria are derived from the review question and used to select eligible studies, i.e., those to be included in the systematic review
Evidence-based medicine is the conscientious, explicit, and judicious use of current best evidence in making decisions about the care of individual patients. The practice of evidence-based medicine means integrating individual clinical expertise with the best available external clinical evidence from systematic research
The discipline of evidence-based toxicology is a process for transparently, consistently, and objectively assessing available scientific evidence to answer questions in toxicology
The Evidence-Based Toxicology Collaboration is a collaboration of science, regulatory and industry leaders, united in their vision to improve public health outcomes and reduce human impact on the environment by bringing evidence-based approaches to safety sciences
Toxicological evidence can be assigned to evidence streams, i.e., sets of studies representing the same type or level of evidence—e.g., human (observational) studies, animal studies, in vitro or mechanistic studies
The degree to which the results of a study hold true in other settings (generalisability).
Gray literature is material that is either unpublished or not controlled by commercial publishers, often differing in form (e.g., not a scientific article or report) and in the way it is made available (i.e., not in journals or databases)
Used in a general sense to describe the variation in, or diversity of, participants, interventions, and measurement of outcomes across a set of studies, or the variation in internal validity of those studies. Used specifically, as statistical heterogeneity, to describe the degree of variation in the effect estimates from a set of studies. Also used to indicate the presence of variability among studies beyond the amount expected due solely to the play of chance
The degree to which a result of a study is likely to be free of bias (systematic errors), i.e., has measured what it had intended to measure
The process of synthesizing outcome data from a number of independent studies using statistical methods.
The extent to which a study is likely to be free of features that reduce trust in the results including internal validity and aspects such as ethical approval, reporting and statistical power
Expert summary of a set of publications from which conclusions are drawn without a detailed and transparent description of what was done to allow reproduction
The components of population/participants, exposure, control/comparison and outcome that the question of a systematic review of an exposure should cover.
Performance bias refers to bias due to systematic differences between groups in the care that is provided or in exposure to factors other than the interventions of interest
The components of population/participants, intervention, control/comparison and outcome that the question of a systematic review of an intervention should cover
The plan or set of steps to be followed in a study. A protocol for a systematic review should describe the rationale for the review, the objectives, and the methods that will be used to locate, select, and critically appraise studies, and to collect and analyze data from the included studies
Publication bias is caused when only a subset of all relevant studies is published. The publication of research often depends on the nature and direction of the study results.
Reporting bias is introduced when only selected outcomes are reported (applying both to systematic reviews and primary studies).
Reporting quality describes how well and completely a study is reported, ultimately allowing reproduction
The team that conducts the systematic review with ideally all the various required areas of expertise represented
An approach to searching the literature that can be used to support decisions about whether it is possible or worthwhile proceeding with a systematic review.
Selection bias refers to bias due to systematic differences between baseline characteristics of the groups that are compared.
An analysis used to test the robustness of findings and determine how sensitive results are to the data that were included and/or the way that analyses were done.
In the context of a meta-analysis, subgroup analyses compare the effect sizes of different subgroups of studies/experiments within the included evidence. Subgroup analyses may be conducted as a means of investigating heterogeneous results, or to answer specific questions about particular subgroups (e.g., if the effect depends on sex)
A review of a clearly formulated question that uses systematic and explicit methods to identify, select, and critically appraise relevant research, and to collect and analyze data from the studies that are included in the review
Weight-of-evidence is a process of taking into account different types of scientific evidence based on the strength and limitations of individual studies, in assessing the validity of a causal hypothesis, usually referring to both evidence syntheses within the individual evidence streams and to the evidence integration across evidence streams
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.