Introduction

A range of detrimental impacts of air pollution exposure on reproductive and children’s health have been established [1,2,3,4,5]. However, air quality regulatory efforts, and especially those accounting for the specific vulnerabilities inherent to reproductive and children’s health, have yet to be effectively implemented on a larger scale [6,7,8]. Formally assessing the quality of the body of evidence, meaning the collection of available individual studies, has been identified as central to translating research into policy [9]. In fact, grading the quality of the body of evidence has become an integral part of the systematic review process [10], reflected in recent additions to the revised Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines, which recommend that authors explicitly report their approach to rating the body of evidence [11]. Evidence grading approaches were developed predominantly for clinical questions, including well-established guidelines such as the Grading of Recommendations, Assessment, Development, and Evaluations (GRADE) criteria [10, 12].

However, the field of reproductive and children’s environmental health, including research on air pollutant exposure, is affected by characteristics that may complicate the critical evaluation of primary studies and bodies of evidence:

  i) The predominantly observational nature of available studies means that, due to inherent differences in study design compared to experimental studies, a different approach is required for identifying and addressing potential confounding and other biases [13,14,15]. Specific aspects of epidemiologic studies of air pollutant exposure and reproductive/ children’s health outcomes that may result in confounding (e.g., frequent use of spatial rather than temporal comparators, lack of covariate information from birth records or other sources) have been described [16, 17]. However, both the default ranking of experimental studies above observational studies and the practice of rating primary studies based on how well they emulate a “hypothetical target RCT” have been criticized [18,19,20,21,22,23].

  ii) Highly heterogeneous and dynamic population characteristics that define the field of reproductive and children’s health (e.g., vulnerabilities related to developmental stages, rapid changes in health-related behaviors) require a lifestage-specific approach. Profound physiological and developmental differences between children and adults affect the toxicity and adverse biological implications of chemical exposures, based on variations in metabolic rates, (de-)toxification processes, and vulnerability during specific developmental windows [17, 24,25,26].

  iii) Further aspects specific to reproductive and children’s health, including generally longer expected lifespans and long latency periods, life course perspectives (e.g., developmental origins of disease), and trans-generational effects, among others, necessitate a tailored approach [24].

  iv) Challenges related to exposure assessment are a general issue in observational (vs. experimental) studies, where exposures are not controlled by investigators, and in environmental health studies in particular [14, 27]. Exposure assessments for air pollution are characterized by specific challenges (e.g., differences in the availability of air monitoring data, seasonal variations in exposure patterns) [17, 27], potentially increasing misclassification, including with regard to relevant developmental periods, such as gestational trimesters. There are additional considerations with regard to reproductive and children’s health: Due to differences in body size and behaviors, among others, exposure patterns differ for developing fetuses, children, and pregnant persons vs. non-pregnant adults (e.g., relative exposure doses, exposure routes and settings, timing and duration of exposure in relation to windows of susceptibility) [24, 27,28,29]. For example, children have different breathing zones (due to shorter stature) and oxygen consumption patterns, affecting their individual exposure to air pollution [25].

  v) Co-exposure to mixtures of pollutants reflects the real-world risks faced by the global population, which may include additive/ synergistic effects between chemicals; and while modeling the impacts of multiple pollutants jointly could provide more valid results, doing so poses challenges such as collinearity and high dimensionality, among others [17, 27, 30,31,32].

  vi) Further, the context of decision-making in environmental health research differs: Unlike in the clinical setting, environmental exposures are often assessed for risks only after exposure (often widespread and long-term) has already occurred in the population [28]. Also, environmental health studies are focused on protecting, rather than improving, health [28]. Therefore, while clinical research is primarily concerned with demonstrating a desired treatment effect, reproductive/ children’s environmental health should, arguably, be concerned with demonstrating the absence of adverse effects: For the former, the burden of proof lies in demonstrating an association or effect, while for the latter, it would lie in demonstrating no association or effect, in essence, safety [33,34,35,36]. Statistical methods for testing for the absence of effects (e.g., equivalence tests) are available, and in addition to providing evidence regarding the equivalence of different exposure scenarios, may also help to reduce publication bias [37,38,39].
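To illustrate the equivalence-testing idea above, the following is a minimal sketch of a two one-sided tests (TOST) procedure for a single effect estimate. The effect estimate, standard error, and equivalence bounds are hypothetical values chosen for illustration only, not taken from any cited study.

```python
import math

def norm_cdf(z):
    """Standard normal CDF computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def tost_equivalence(estimate, se, lower, upper):
    """Two one-sided tests (TOST) for equivalence.

    Tests H1: lower < effect < upper against the composite null that the
    effect lies outside the bounds. Returns the TOST p-value, i.e., the
    larger of the two one-sided p-values.
    """
    p_lower = 1.0 - norm_cdf((estimate - lower) / se)  # H0: effect <= lower
    p_upper = norm_cdf((estimate - upper) / se)        # H0: effect >= upper
    return max(p_lower, p_upper)

# Hypothetical example: estimated log relative risk of 0.02 (SE 0.03),
# with pre-specified equivalence bounds of +/- 0.10 on the log scale.
p = tost_equivalence(0.02, 0.03, -0.10, 0.10)
# p < 0.05 here, so the estimate would be declared statistically
# equivalent to "no effect" within the pre-specified bounds.
```

Note that, as discussed above, the burden of proof is reversed relative to a conventional significance test: a small TOST p-value supports the absence of a meaningful effect, which only holds relative to the chosen bounds.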

Methodological weaknesses specific to assessing evidence related to environmental exposures [40, 41], and specifically ambient air pollution [42], and pregnancy outcomes [43, 44], have previously been identified among systematic reviews, particularly related to assessing internal validity and a lack of transparent evidence grading methodologies. Because systematic review methodologies were primarily developed for clinical trials, their suitability for evaluating evidence from observational/ environmental health research, and how these methods can best be adapted, has been debated [14]. Further, certain aspects of existing approaches, including the aforementioned default ranking of evidence from randomized controlled trials (RCT) above that from observational studies, have previously been criticized in the context of environmental health [18, 22, 23].

In this methodological survey, we aimed to evaluate frameworks for critically assessing bodies of evidence, as applied in systematic reviews of epidemiological studies of environmental exposures and adverse reproductive/ child health outcomes, using research on air pollution exposure as a case study. Air pollutant exposure was chosen based on the comparability of approaches within this research area and the large body of available systematic reviews [45]. Based on this, we exemplify and discuss challenges and recommendations for evidence grading in the context of reproductive/ children’s environmental health.

Methods

As the unit of analysis of this work was systematic reviews, we adhered to the Preferred Reporting Items for Overviews of Reviews (PRIOR) guidelines (Supplemental Material S1, PRIOR checklist) [46], and further relevant guidance [47,48,49,50,51]. Two reviewers (SM and AA) independently completed all steps of the systematic process, including screening for eligible references, extracting data, and assessing risk of bias. Discrepancies were resolved through discussion or by consulting the third reviewer (OVE).

Eligibility criteria and review selection

The inclusion criteria are presented and explained in Table 1.

Table 1 Inclusion criteria

As highlighted in Table 1, we identified systematic reviews explicitly employing published criteria or guidelines for assessing or rating the quality of the body of evidence, among the collection of systematic reviews of studies of air pollutant exposure and adverse reproductive and child health outcomes.

Titles, abstracts, and full texts of the identified publications were screened consecutively; a publication was carried forward to the subsequent screening step unless there was explicit indication that it did not meet our inclusion criteria.

Data sources and search strategy

For identifying systematic reviews, PubMed and Epistemonikos have been identified as the database combination with the highest inclusion rate [54]; we additionally searched the database Embase. Rather than relying on built-in filters alone, we developed a search hedge combining text words, filters, and publication types, based on current recommendations for achieving maximum sensitivity [54,55,56,57].

Controlled vocabulary terms and keywords were employed to combine the concepts “air pollution”, “childhood”, and “systematic review” (Supplemental Material S2: Full electronic search strategies). We used the PubMed PubReMiner tool [58], and the SearchRefiner tool from the Systematic Review Accelerator website [59], to develop and assess the sensitivity and specificity of our search strategy.

On December 9, 2020, we conducted the initial systematic search of the electronic databases, without language or publication status restrictions. All searches of electronic databases were performed by SM and updated until April 07, 2023.

In addition, supplementary searches were performed using the search engines Google and Google Scholar. Search engines were used only supplementarily, as they allow limited insight into how search results are produced [60]. Further, we manually performed backward and forward citation searching.

Data extraction

Data on systematic review characteristics were extracted using a standardized data extraction form. For extracting information pertaining to the evidence grading systems, descriptions reported in the original articles, as well as cited guidance documents and further related references (e.g., organization websites), were consulted. We considered the versions of the approaches used within the identified systematic reviews, although in some cases newer versions exist. Where necessary, we attempted to contact systematic review authors to identify or clarify missing or unclear information.

Risk of bias assessment (ROBIS)

Risk of bias in systematic reviews was evaluated using the ROBIS tool, based on (1) the appropriateness of study eligibility criteria, (2) methods for identifying and selecting studies, (3) data extraction and quality appraisal methods, and (4) appropriateness of data synthesis, followed by (5) an overall risk of bias judgment [61].

Qualitative analysis/synthesis

We calculated the proportion of systematic reviews explicitly employing formal evidence grading frameworks. The main characteristics of these reviews, including the main objectives and findings as well as the systematic review methods, were synthesized in descriptive and tabular format. Methodological characteristics, specifically the guidelines and approaches used for grading bodies of evidence, were reviewed. Notably, because approaches for assessing a body of evidence are partially based on preceding assessments of the quality or risk of bias of primary/ individual studies, both types of assessment within the systematic review process were considered separately herein.

With regard to individual studies, quality and risk of bias (or internal validity) are related but distinct concepts concerned with critical assessment. Risk of bias refers to aspects of study design, conduct, or analysis that could give rise to systematic error in study results, and can be used synonymously with internal validity, the extent to which bias has been prevented through methodological safeguards [62]. Study quality, on the other hand, may refer to (a) reporting quality; (b) internal validity or risk of bias; and (c) external validity or directness and applicability, among others [15]. However, although risk of bias and study quality are truly distinct concepts, the corresponding assessments are often interchanged or merged in research practice [63]. For this reason, these approaches were considered jointly in this methodological survey.

The quality of/ certainty in the body of evidence, on the other hand, is assessed based on strengths and limitations of a collection of individual studies, and incorporates results from preceding risk of bias assessments, as well as aspects of directness/ applicability of the identified primary studies with regard to the review question, heterogeneity/ inconsistency across studies, the magnitude and precision of effect estimates, potential publication biases, and further criteria [12, 15]. Sometimes this step is followed by subsequent ratings regarding the strength or levels of evidence, or hazard identification, across study types, outcomes, or species [15, 64].

Certain criteria are applied differently when assessing the internal validity of individual studies versus the body of evidence. For example, while an identified risk of confounding will result in a lower internal validity score for an individual study, a body of evidence may receive a higher quality rating, if all plausible confounding “would reduce a demonstrated effect, or suggest a spurious effect when results show no effect”, as noted by multiple guidelines [64,65,66].

We considered characteristics of frameworks for rating risk of bias in individual studies, and for grading the body of evidence, specifically as they relate to reproductive/children’s environmental health, as discussed earlier (see Table 2).

Table 2 Considered characteristics regarding reproductive/children’s environmental health

Results

Review selection process

The selection process of systematic reviews is shown in Fig. 1. After screening 10,241 titles, 1,030 abstracts, and 423 full texts, 177 systematic reviews were found to assess the association between exposure to air pollution and adverse reproductive/ children’s health outcomes. The most common reasons for exclusion at the full-text stage were that the reviews considered adult or general populations (n = 62) or were non-systematic (n = 61). Out of the 177 eligible systematic reviews, 18 articles (10.2%) explicitly reported using evidence grading systems [5, 68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84]. The proportion of systematic reviews using evidence grading systems appeared to increase over time (see Fig. 2).

Fig. 1
figure 1

Screening process for systematic reviews

Fig. 2
figure 2

Systematic reviews per year, and proportion using evidence grading approach (designed using R software). Includes publications up until April 2023

Systematic review characteristics

General characteristics of the 18 systematic reviews that used formal evidence grading systems are summarized in Table 3. These reviews were published between 2015 and 2023; and outcomes assessed were: spontaneous abortion [80], gestational diabetes mellitus [83], fetal growth [72], preterm birth [73, 79], birth weight [76], term birth weight [77], congenital anomalies [74], upper respiratory tract infections [81], bronchiolitis in infants [71], sleep-disordered breathing [70], blood pressure in children and adolescents [84], neuropsychological development [68, 78], autism spectrum disorder [5, 69], academic performance [75], or all child health outcomes [82].

Table 3 Characteristics of included systematic reviews (sorted by publication year)

None of the included reviews specified inclusion criteria related to the method of exposure assessment (e.g., modeling vs. monitoring approaches) (see Table 3). Two reviews considered both intervention and observational studies [70, 81], while the others included only observational studies. Between 7 and 84 studies were included by the individual reviews (Table 3) [79, 80]. One review included only studies using data from air monitoring stations [74], while others reported a variety of exposure assessment methods and data sources. Individual-level measures of exposure (e.g., adducts in cord blood, personal backpack monitors) were reported for few of the studies included in the systematic reviews [68, 72, 76, 77].

The majority of systematic reviews included fixed- or random-effects meta-analyses, while five refrained from statistical pooling and synthesized their findings in narrative form [68,69,70,71, 75]. All meta-analyses included adjusted effect estimates; several reported considering only single-pollutant models.

ROBIS assessment results

Four of the included systematic reviews were rated at a low risk of bias [5, 71, 75, 76], four at a high risk of bias [68,69,70, 74], and the remaining ten at an unclear risk of bias. The most critical concerns related to the methods used to search for primary studies, the synthesis approaches, and insufficient reporting (Fig. 3). Between one and eight databases were searched by the various review teams [69, 75]. Six groups made no additional efforts to identify published or unpublished literature [68, 69, 71, 79, 81, 82], while eight additionally screened the reference lists of included studies and/ or those of relevant reviews [70, 72, 74,75,76, 80, 83, 84]; some of these additionally searched relevant reports [73] or used web search engines [78], and one further searched grey literature databases and relevant websites, performed forward citation searches, and contacted experts in the field (Supplemental Materials S3 and S4: Details of ROBIS assessment) [5]. Methods used for primary study appraisal, synthesis, and evidence grading are described further below.

Fig. 3
figure 3

Summary of risk of bias assessment. Designed using the robvis tool [85]

Methodological characteristics: Methods for assessing risk of bias/quality in primary studies

The 18 included systematic reviews used 15 distinct approaches for assessing risk of bias/ quality/ internal validity among primary studies (Tables 3 and 4). The Newcastle–Ottawa Scale (NOS) was the most commonly cited tool (n = 9 reviews) [86], with an additional four reviews using modified NOS versions, followed by the Office of Health Assessment and Translation (OHAT) approach (n = 4 reviews) [66]. Notably, several reviews used multiple tools, in order to assess quality and risk of bias separately, or to address the various study designs (e.g., cohort vs. cross-sectional studies) included within the reviews [74, 78, 79, 81, 84]. Further, six reviews modified/ tailored the selected tools themselves [69, 71,72,73, 75, 81], while four reviews used tools as modified by preceding systematic reviews [74, 79, 81, 84].

Table 4 Risk of bias/ quality assessment tools for primary studies used by included systematic reviews

The tools originated from a wide range of research fields (see Table 4), and only the Navigation Guide and OHAT approaches, used by six reviews, were developed specifically for environmental health research [64, 66]. The Risk of Bias in Non-randomized Studies of Exposure (ROBINS-E) tool was developed for non-randomized studies of exposures and was used in one review in its preliminary version [94]. Two reviews newly developed their own criteria for assessing primary study quality/ risk of bias [5, 68]. Notably, Lam et al. further developed the Navigation Guide risk of bias tool with expert input, as part of their application of the Navigation Guide methodology. This included developing an approach for rating exposure assessment methods for different air pollutants/ chemical classes [5]. This approach was subsequently adopted by other identified reviews [76, 81].

Exposure assessment methods in general were evaluated in all but two of the fifteen approaches [92, 93], although we considered only four tools applicable to environmental/ air pollution exposures in this regard [64, 66, 87, 94]. Co-exposures were explicitly considered by five tools [15, 62, 87, 91, 94], while all but one tool assessed confounder control [93]. However, review authors modified existing tools in some cases, for example adding considerations of sample size, selection bias, exposure assessment method, and confounder adjustment [69, 71]. Another review group used subgroup analyses to explore the effect of different exposure assessment methods [77].

Methodological characteristics: Evidence assessment methods

As stated above, 18 out of 177 systematic reviews used formal systems for assessing the quality/certainty of the body of evidence, and nine different approaches, including published modifications of existing tools, were used by these 18 reviews (see Table 5). The GRADE system was the most frequently used (n = 8 reviews) [12], followed by modified versions of GRADE, namely the Navigation Guide (n = 4) [64], an approach developed by the World Health Organization (WHO) for air pollution research (n = 1) [98], and a modified version for environmental health research (n = 1) [99]. Other approaches were adopted from OHAT (n = 3) [15] and the International Agency for Research on Cancer’s (IARC) preamble for monographs (2006) [100], among others (see Table 5) [95,96,97, 101]. Modifications to and deviations from the frameworks were noted [68, 69, 71, 81].

Table 5 Evidence grading tools used by included systematic reviews

The identified approaches for evidence grading were originally developed either for clinical practice [95,96,97, 101] or for research on environmental exposures [15, 64, 66, 98, 99, 104], including air pollution [98], and were characterized by highly heterogeneous methodologies. The original GRADE system assigns an initial rating based on study type, where RCTs begin at a “high” quality rating while observational studies begin at “low”, before various criteria (e.g., consistency between studies) are considered to reach a final rating of the body of evidence [106, 107]. The GRADE system as modified by the WHO (for air pollution studies) and the Navigation Guide (developed for environmental health studies, partly based on the U.S. Environmental Protection Agency’s (EPA) criteria for reproductive and developmental toxicity [28]) differ from the original version in that observational studies are initially rated as “moderate” quality, rather than “low”, among other distinguishing features (e.g., additionally calculating 80% prediction intervals to assess heterogeneity) [64, 98, 108].
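The 80% prediction interval mentioned above can be sketched concretely. The following minimal example uses hypothetical study estimates (not from any cited review) to compute a DerSimonian-Laird random-effects pooled estimate and the corresponding prediction interval, mu ± t(k-2, 0.90) × sqrt(tau² + SE²), following standard random-effects practice; the t critical value is supplied by the caller.

```python
import math

def dl_random_effects(estimates, variances):
    """DerSimonian-Laird random-effects pooling.

    Returns (pooled_estimate, pooled_se, tau2) for lists of study effect
    estimates and their within-study variances.
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    mu_fe = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    q = sum(wi * (y - mu_fe) ** 2 for wi, y in zip(w, estimates))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)  # between-study variance, truncated at 0
    w_re = [1.0 / (v + tau2) for v in variances]
    mu_re = sum(wi * y for wi, y in zip(w_re, estimates)) / sum(w_re)
    se_re = math.sqrt(1.0 / sum(w_re))
    return mu_re, se_re, tau2

def prediction_interval(mu, se, tau2, t_crit):
    """80% prediction interval: mu +/- t_{k-2, 0.90} * sqrt(tau2 + se^2)."""
    half_width = t_crit * math.sqrt(tau2 + se ** 2)
    return mu - half_width, mu + half_width

# Hypothetical log relative risks and variances from four studies;
# t_crit = 1.886 is the 90th percentile of the t distribution with k - 2 = 2 df.
mu, se, tau2 = dl_random_effects([0.10, 0.25, 0.05, 0.30],
                                 [0.010, 0.020, 0.015, 0.010])
lo, hi = prediction_interval(mu, se, tau2, t_crit=1.886)
```

Unlike a confidence interval for the pooled mean, the prediction interval describes where the effect in a new, comparable setting is expected to fall, which is why it is useful for conveying heterogeneity across studies.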

While the OHAT approach is based on the GRADE system, the initial rating is based on the number of study-design features present, rather than on the study type. These features include: controlled exposure, exposure prior to outcome, individual outcome data, and use of a comparison group. Therefore, evidence from observational studies, due to the lack of controlled exposure, will never start higher than “moderate”. Unlike in the GRADE system, upgrades may additionally be given for consistency across different study designs, species, or dissimilar populations, and for “other” reasons [15]. Guidance for subsequently considering the quality of evidence across multiple exposures encourages consideration of the entire body of evidence.
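The OHAT-style initial rating logic described above can be sketched as a simple feature count. The mapping from feature count to rating below (4 features = “high”, 3 = “moderate”, etc.) is an illustrative reading of the OHAT guidance, not an official implementation.

```python
# Key study-design features counted by the OHAT-style initial rating.
OHAT_FEATURES = (
    "controlled exposure",
    "exposure prior to outcome",
    "individual outcome data",
    "comparison group used",
)

# Assumed mapping from number of features present to initial confidence.
INITIAL_RATING = {4: "high", 3: "moderate", 2: "low", 1: "very low", 0: "very low"}

def initial_confidence(features_present):
    """Map the number of key study-design features to an initial rating."""
    n = sum(1 for f in OHAT_FEATURES if f in features_present)
    return INITIAL_RATING[n]

# A prospective cohort study lacks controlled exposure, so under this
# scheme it can never start higher than "moderate":
cohort = {"exposure prior to outcome", "individual outcome data",
          "comparison group used"}
rating = initial_confidence(cohort)  # "moderate"
```

This contrasts with the original GRADE logic, where the study type alone (RCT vs. observational) sets the starting rating.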

In the IARC approach, no initial rating is assigned based on study type, although the appropriateness of different study designs in relation to the research question is considered [100]. Further criteria include study quantity and quality, statistical power, and consistency of findings. This is preceded by considerations including exposure assessment methods, temporality, use of biomarkers, and Hill’s criteria for causality [105]. In the most recent version, the latter is replaced by “considerations for assessing the body of epidemiological evidence” [13, 21, 105].

The Centre for Evidence-Based Medicine (CEBM) and Scottish Intercollegiate Guidelines Network (SIGN) systems again use previously assigned ratings of each included primary study, based on study type and quality, in addition to a subset of the same criteria as GRADE, but with markedly less specific guidance and explanation compared to the aforementioned systems. The updated version of the SIGN handbook from 2019 now recommends using the GRADE system for grading evidence. The Best Evidence Synthesis (BES) system, developed for research on lower back problems, does not explicitly rate study type as a criterion, instead presenting a highly abbreviated approach that considers merely the number, relevance, and quality of available studies [101].

In terms of considering aspects of reproductive/ children’s environmental health research in the “indirectness”, “heterogeneity”, or “confounding/ bias” domains, the Navigation Guide, GRADE approach, and OHAT framework all provide brief commentary, in the form of examples or general guidance, while the other tools make no specific reference to reproductive/ children’s health (see Table 5). Besides the SIGN and the BES systems, all tools consider the timing of exposure and/ or outcome assessment, although only the Navigation Guide and OHAT approach explicitly address this aspect with regard to reproductive/ children’s health research (e.g., developmental stages). Finally, only the Navigation Guide, OHAT approach, and IARC framework provide guidance on assessing “evidence for no effect”. Notably, systematic review authors addressed some of these aspects outside of their application of the evidence grading frameworks, in their methods (e.g., by applying relevant inclusion criteria, or by conducting subgroup analyses of different pregnancy trimesters or age groups [73, 76, 78, 83, 84]), or in their discussions.

Discussion

This is, to our knowledge, the first methodological survey to systematically identify and describe evidence grading systems used in the area of air pollution exposure and adverse reproductive/ child outcomes. Of note, this is an overview not of recommended, but of practiced methods in the field. Only 18 out of 177 systematic reviews (10.2%) were found to explicitly utilize formal rating systems for bodies of evidence. Such a small proportion suggests that this process is still not common in the field, although an increase was observed after 2015 (see Fig. 2), which is in line with previous findings on evidence grading approaches used in systematic reviews of air pollution exposure [42]. The inconsistency in the approaches used (15 different risk of bias assessment tools and 9 different evidence grading tools across 18 reviews), plus the numerous modifications applied, reflects a lack of consensus. The NOS and the GRADE system were the most commonly used tools for assessing internal validity and for grading evidence, respectively, as discussed further below. It is noteworthy that multiple reviews “borrowed” tools originating from rather unrelated fields (e.g., clinical research on lower back problems), and there was marked heterogeneity in the comprehensiveness and relevance of the employed tools.

Further, numerous systematic reviews cited preceding reviews using the same approach, in reference to their own approach [5, 74, 76,77,78,79,80,81, 83]. This suggests a “propagated” adoption of methods, whereby systematic review authors use preceding reviews for guidance, possibly leading to the uptake of inappropriate methods [109]. This implies that the publication of worked examples, such as those provided by the Navigation Guide group [110], is essential for further improving the methodological quality of systematic reviews.

Risk of bias assessment

Our findings indicate that systematic review authors use a wide range of approaches for assessing risk of bias/ quality among individual studies, in many cases originating from clinical or other less related fields. Thirteen reviews were found to use the original or a modified NOS version. The widespread use of the contested NOS may be one of the most “spectacular” examples of the risks of quotation errors and citation copying [109, 111]. Vandenberg et al. recently outlined how flawed exposure assessment methods put public health at risk [27], and this extends to a lack of appropriate and comprehensive evaluations of exposure assessment methods. The NOS includes only a cursory evaluation of exposure assessment methods that is arguably not applicable to environmental exposures. In general, risk of bias/ quality assessment tools have been criticized for focusing on mechanically determining the potential presence of biases, often based on how closely a study emulates a hypothetical “target” RCT, rather than on the likely direction, magnitude, and relative importance of those biases [18, 112]. Rather than assigning ratings based on study design, assessments should identify the most probable and important biases in relation to the particular population, exposure, and outcome under investigation, and rate each study on how effectively it addresses each potential bias; differences in results across studies should then be considered in relation to susceptibility to each bias [14, 112,113,114].

The iterative development of the ROBINS-E tool [94, 115], which in its preliminary version was criticized for being based on comparisons to the “ideal” RCT, among other limitations [116], but in its final version addressed many of these concerns, including a more nuanced approach to causal inference [117], demonstrates that continuous collaboration between experts and critical appraisal of developing tools is effective and desirable. Also, the WHO has introduced a risk of bias assessment tool for air pollution exposure studies in systematic reviews [118]. In addition, informative evaluations of additional risk of bias tools available for environmental health studies have been presented [119]. Useful interactive data visualization tools exist to facilitate comparison and selection of risk of bias/ methodological quality tools for observational studies of exposures [120], collated on the basis of a preceding systematic review [63].

Evidence grading approaches

In this methodological survey, 16 out of 18 reviews used evidence grading systems that provided higher scores to experimental (vs. non-experimental) studies or related study features. The practice of ranking evidence based on a crude hierarchy of study designs has been criticized [18,19,20,21, 23]. For one, experimental studies may be no better at reducing “intractable” confounding, and other approaches (e.g., difference-in-differences) may be much more effective in addressing particular confounding scenarios [23]. Pluralistic approaches to causal inference that extend beyond counterfactual and interventionist approaches have been proposed [21, 22].

Six reviews were found to use the original GRADE system for rating bodies of evidence, for which we noted a lack of consideration with regard to heterogeneities across different developmental stages, a paucity of attention paid to the timing of exposure to environmental risks, and a lack of discussion of evidence for no association or effect, in addition to the default ranking of experimental studies above observational ones. The applicability of the GRADE approach to observational studies has previously been discussed [121, 122], and challenges with rating the body of evidence from observational studies have been reported [123,124,125,126], including rating evidence from non-randomized studies as “low” by default, difficulties in assessing complex bodies of evidence consisting of different study designs, and limited applicability regarding research on etiology, among others [124, 127].

The GRADE working group has proposed the possibility of initially rating evidence from non-randomized studies as “high” when used in conjunction with risk of bias assessment tools like ROBINS-I [94, 115, 128, 129]. The reasoning is that the lack of randomization will usually lead to rating down by at least two levels, so ultimately, evidence from observational studies will be rated as “low” with either method [115, 129]. Hence, this approach again rests on the premise that non-randomized studies are inherently inferior. Other suggestions have been made to start observational studies at “moderate”, as done in the Navigation Guide’s and WHO’s modified versions [64, 98], and to expand the criteria for upgrading [124]. In prognosis research, the GRADE system has been adapted to start observational studies at “high” [130]. Further developments of the GRADE system for environmental health research, including a recent exploration of how considerations of biological plausibility can be integrated into evidence grading [131], are in progress [99, 132].

Reproductive and children’s environmental health: specific guidance needed

While some of the identified frameworks address selected aspects, concerns persist regarding reproductive/ children's environmental health research: Firstly, the risk of bias assessment and evidence grading frameworks frequently used by existing systematic reviews often do not explicitly or comprehensively address important aspects, such as vulnerabilities related to developmental stages or considerations of exposure timing and relative dose [24, 25]. Also, only three evidence grading systems provide any guidance on assessing evidence for the absence of effects. Addressing these points would require reconsidering how domains of current evidence grading frameworks are operationalized, including indirectness (e.g., timing of exposures, “worst-case” exposure scenarios [27]), heterogeneity (e.g., disparities related to social determinants, diverse etiological mechanisms), and biases specific to research on pregnancy and childhood (e.g., live-birth bias). Some of the identified methodologies offer insights into how existing frameworks may be adapted [66, 98, 104]: For example, the Navigation Guide advises considering null findings, in addition to positive ones, when assessing publication bias, meaning that an excess of null findings, especially from small or industry-sponsored studies, is also of concern [104]. With regard to subsequent assignments of levels of evidence, the OHAT approach notes that, due to the intrinsic challenges of proving a negative, concluding “evidence for no effect” requires high confidence in the evidence; low or moderate confidence should be considered “inadequate evidence” for absence of effects [15].

Failing to explicitly address the defining features of reproductive and children’s environmental health described above renders nonspecific tools such as the NOS and GRADE inadequate for comprehensively evaluating the unique risks posed by environmental exposures during vulnerable developmental stages and across the lifespan. This lack of specification may yield invalid assessment results both at the level of primary studies and at the level of bodies of evidence, and thereby lead to erroneous conclusions about the certainty of the assessed evidence. This, in turn, may undermine the formulation of effective policies for protecting reproductive and children’s health. Therefore, emphasizing the use of more specialized frameworks (e.g., ROBINS-E, Navigation Guide, OHAT) for assessing studies on reproductive and children’s environmental health is paramount for ensuring accurate findings and interpretations and, ultimately, safeguarding the health of future generations. Altogether, while new tools or domains may not be needed, further consensus and published direction on how exactly existing ones can be operationalized in the context of reproductive/ children’s environmental health would be useful. Providing explicit guidance and clear definitions, promoting the use of more applicable frameworks, and continued refinement and tailoring of existing frameworks towards reproductive/ children’s environmental health research are critical for improving current methodologies [133].

Further evidence grading systems and systematic review frameworks not utilized in the identified reviews

Additional evidence grading systems and systematic review frameworks for environmental health research exist but were not utilized by the identified reviews: In 2006, the EPA published a “Framework for Assessing Health Risks of Environmental Exposures to Children” [134], which adopts a “lifestage” perspective and recommends developing specific assessment criteria during problem formulation. It uses a weight-of-evidence approach that places emphasis on higher-quality studies for evidence grading [134]. Further systematic review frameworks developed for observational studies of etiology or for environmental health and toxicology research include the COSMOS-E, COSTER, and SYRINA frameworks, among others. Notably, while the guidance they provide on evidence grading generally reflects principles of the GRADE system, they avoid specific recommendations as to which tool or approach to use [113, 135], or whether to assign an initial rating based on study type [136].

The existence of the approaches described above, as well as of those with clear relevance to reproductive/ children’s environmental health presented earlier (e.g., ROBINS-E, Navigation Guide, OHAT), together with the limited uptake we identified, suggests that the problem lies less in an absence of appropriate methods than in their accessibility and implementation. Promoting simple, but not oversimplified, practicable, and specific guidance should be prioritized [109].

Also, calls for child-relevant extensions to the PRISMA checklist (“PRISMA-C”) have been made [26, 137, 138], and such extensions are currently under development [139]. Specific recommendations regarding risk of bias and evidence assessments could be integrated herein.

Beyond evidence grading: linking evidence and triangulation

Different types of evidence (i.e., human and non-human studies) may be combined into integrated networks of evidence within systematic reviews of environmental health risks [18, 113]. In fact, the Navigation Guide, OHAT, and IARC methodologies provide guidance on integrating evidence from human, animal, and mechanistic studies [15, 64, 66, 100, 104].

Further, triangulation (i.e., leveraging differences in evidence from diverse methodological approaches with different biases to strengthen causal inference) has been encouraged for environmental health research [22, 140]. However, guidelines are needed to help researchers integrate triangulation processes into systematic reviews effectively [140].

Implications for policy

Systematic review methods for environmental health research continue to evolve, including at the U.S. federal level, which may have a direct impact on policies to protect reproductive and children’s health: Within the EPA, current systematic review methodologies are being revised [141, 142]; proposed changes to the existing “weight-of-evidence” approach, which considers a plethora of different types of evidence, in favor of a “manipulative causation” framework are being heavily contested [18, 143, 144]; and probabilistic risk-specific dose distribution analyses are being piloted to expand beyond previous threshold-based approaches [145]. This highlights that considerations of evidence assessment methodologies span scientific, political, and legal realms, and carry substantial public health implications.

We hope this work can provide a comprehensive overview of the current state of practice in the field, and serve as a starting point for those working on the further refinement or promotion of evidence grading systems for reproductive/ children’s environmental health research.