Machine learning-based health environmental-clinical risk scores in European children

Guimbaud, Jean-Baptiste; Siskos, Alexandros P.; Sakhi, Amrit Kaur; Heude, Barbara; Sabidó, Eduard; Borràs, Eva; Keun, Hector; Wright, John; Julvez, Jordi; Urquiza, Jose; Gützkow, Kristine Bjerve; Chatzi, Leda; Casas, Maribel; Bustamante, Mariona; Nieuwenhuijsen, Mark; Vrijheid, Martine; López-Vicente, Mónica; de Castro Pascual, Montserrat; Stratakis, Nikos; Robinson, Oliver; Grazuleviciene, Regina; Slama, Remy; Alemany, Silvia; Basagaña, Xavier; Plantevit, Marc; Cazabet, Rémy; Maitre, Léa

doi:10.1038/s43856-024-00513-y

Machine learning-based health environmental-clinical risk scores in European children

Article
Open access
Published: 23 May 2024

Volume 4, article number 98, (2024)
Cite this article

Download PDF

You have full access to this open access article

Communications Medicine

Machine learning-based health environmental-clinical risk scores in European children

Download PDF

Jean-Baptiste Guimbaud ORCID: orcid.org/0000-0003-3748-1453^1,2,3,4,
Alexandros P. Siskos ORCID: orcid.org/0000-0002-5635-7426⁵,
Amrit Kaur Sakhi⁶,
Barbara Heude⁷,
Eduard Sabidó^3,8,
Eva Borràs^3,8,
Hector Keun⁵,
John Wright^9,10,
Jordi Julvez^1,11,12,
Jose Urquiza ORCID: orcid.org/0000-0003-2220-607X^1,3,11,
Kristine Bjerve Gützkow⁶,
Leda Chatzi¹³,
Maribel Casas^1,3,11,
Mariona Bustamante ORCID: orcid.org/0000-0003-0127-2860^1,3,11,
Mark Nieuwenhuijsen¹,
Martine Vrijheid^1,3,11,
Mónica López-Vicente ORCID: orcid.org/0000-0002-5090-7538^1,3,11,
Montserrat de Castro Pascual ORCID: orcid.org/0000-0001-6547-5953^1,3,11,
Nikos Stratakis¹³,
Oliver Robinson^14,15,
Regina Grazuleviciene¹⁶,
Remy Slama¹⁷,
Silvia Alemany ORCID: orcid.org/0000-0002-7925-6767^18,19,20,
Xavier Basagaña ORCID: orcid.org/0000-0002-8457-1489^1,3,11,
Marc Plantevit²¹,
Rémy Cazabet ORCID: orcid.org/0000-0002-9429-3865² &
…
Léa Maitre^1,3,11

1682 Accesses
4 Altmetric
Explore all metrics

Abstract

Background

Early life environmental stressors play an important role in the development of multiple chronic disorders. Previous studies that used environmental risk scores (ERS) to assess the cumulative impact of environmental exposures on health are limited by the diversity of exposures included, especially for early life determinants. We used machine learning methods to build early life exposome risk scores for three health outcomes using environmental, molecular, and clinical data.

Methods

In this study, we analyzed data from 1622 mother-child pairs from the HELIX European birth cohorts, using over 300 environmental, 100 child peripheral, and 18 mother-child clinical markers to compute environmental-clinical risk scores (ECRS) for child behavioral difficulties, metabolic syndrome, and lung function. ECRS were computed using LASSO, Random Forest and XGBoost. XGBoost ECRS were selected to extract local feature contributions using Shapley values and derive feature importance and interactions.

Results

ECRS captured 13%, 50% and 4% of the variance in mental, cardiometabolic, and respiratory health, respectively. We observed no significant differences in predictive performances between the above-mentioned methods.The most important predictive features were maternal stress, noise, and lifestyle exposures for mental health; proteome (mainly IL1B) and metabolome features for cardiometabolic health; child BMI and urine metabolites for respiratory health.

Conclusions

Besides their usefulness for epidemiological research, our risk scores show great potential to capture holistic individual level non-hereditary risk associations that can inform practitioners about actionable factors of high-risk children. As in the post-genetic era personalized prevention medicine will focus more and more on modifiable factors, we believe that such integrative approaches will be instrumental in shaping future healthcare paradigms.

Plain language summary

Growing up in different environments can greatly affect children’s health later in life. This research looked at how living in cities, being exposed to chemicals, and other experiences before birth and during childhood, work together to influence children’s mental, cardiovascular and respiratory health. We used advanced computer programs to help us understand these effects and estimate health risk scores. These scores are simple numerical measures that help us quantify the likelihood of children developing health issues based on their environmental exposures. Using those scores, the study identified key factors impacting children’s health, in particular psycho-social, perceived environmental and prenatal pollutant exposures for mental health. It also revealed complex patterns and interactions between environmental factors. The results highlighted the potential of such risk scores to support the identification of actionable factors in high-risk children, informing tailored prevention measures in healthcare.

Environmental exposures in early-life and general health in childhood

Article Open access 21 July 2023

Hokkaido birth cohort study on environment and children’s health: cohort profile 2021

Article Open access 22 May 2021

Multi-omics signatures of the human early life exposome

Article Open access 21 November 2022

Introduction

The environment (i.e., non-genetic factors), is a key driver of physical and mental health, studied for a long time in the literature in both adults and children^1,2. It has been shown that early exposure to adverse environmental agents during sensitive periods such as pregnancy, birth, and childhood can have a long-term impact on both animal³ and human health^4,5. For instance, while most mental disorders begin during adolescence and early adulthood (e.g., schizophrenia, psychosis), their onsets are preceded by a prodromal phase with deviance from the typical neurodevelopmental trajectory taking place before clinical psychiatric symptoms. Not only important early life events such as obstetric complications at birth but also the environment in which a child grows up, including factors such as lifestyle, family social capital and pollutant exposure, can influence their development as we showed in a recent exposome study of child behavioural problems⁶.

In recent decades, the study of the impact of environmental exposure on health has evolved from analyzing individual factors to the broader analysis of the exposome. Described as “the totality of human environmental exposures from conception onwards”⁷, the exposome recognizes that individuals are simultaneously exposed to a multitude of environmental factors and takes a holistic approach to the discovery of etiological factors for disease. Its main advantage over traditional “one-exposure-one-disease” approaches is that it provides a conceptual framework for investigating multiple environmental hazards (e.g., urban, chemical, lifestyle, social) and their combined effects. Classical single exposure analyses may be limited as the studied exposure association could arise from another correlated factor not taken into account and are, moreover, unable to capture interactions or cumulative effects from the exposure mixture. Exposures are not isolated, can be correlated with one another, and are likely to interact both among themselves (ExE) and with genetics (GxE) to drive health and non-communicable diseases^8,9. Therefore, there is a need to collect a wide range of environmental data coupled with clinical biomarkers to comprehensively capture the main domains of the exposome before the apparition of clinical symptoms and apply more advanced modeling approaches suitable for complex mixtures of exposures that can leverage their interactions to facilitate its integration in observational and clinical studies.

Inspired by risk prediction models, such as the Framingham risk score for coronary heart disease¹⁰ or polygenic risk scores (PRS)¹¹, environmental risk scores (ERS) are summary measures of the effects of multiple exposures used to estimate exposomic liability for a particular health outcome at an individual level¹². Unlike genetics, some environmental factors are actionable, which gives ERS a broader potential for shaping public health policies by identifying actionable key factors and facilitating the implementation of preventive and personalized medicine measures. Combined with PRS, these scores could also serve as an initial step in identifying at-risk populations, who can then be directed to more specific clinical diagnostic tests, thereby serving as a complementary tool in the healthcare decision-making process^13,14. ERS are usually built as a weighted sum of the individual exposure estimates, obtained either from meta-analysis or derived through linear regression models (from single multivariate models or several models, adjusted for multiple testing)¹⁵. This scheme, however, assumes that each environmental stressor individually acts in a linear dose-response relationship, while previous research has shown that their combined effect does not necessarily follow this rule^16,17.

The availability of rich data on multiple levels of environmental exposures in birth cohorts presents an opportunity to address the gap in large-scale studies on the association between the exposome and child and adolescent development. Previous ERS studies have been limited by the number of exposure variables or domains included^10,18,19. In contrast, our study identify predictive environmental-clinical risk scores (ECRS) based on a wide array of pregnancy and childhood environmental exposures related to both external (e.g., air quality, lifestyle, psychosocial) and internal (e.g., blood metals, pesticides) factors and (pre)clinical factors (metabolites, proteins, co-morbidities), and link these to a range of physical and mental symptoms in the large European Human Early-Life Exposome (HELIX) cohort^20,21. In the context of this study, we use the term prediction to refer to the inference of diagnostical risk scores that include pregnancy and childhood cross-sectional epidemiological factors to predict childhood liabilities at a single point in time.

In this work, we provide ECRS for three general health outcomes in children: a P-factor score derived from the Child Behavior CheckList (CBCL) for mental health, a metabolic syndrome severity score (MetS) for cardiometabolic health, and a lung function score (spirometry test) for respiratory health. Those ECRS are computed using recent non-parametric machine learning models (i.e., ensemble of trees), able to capture complex exposome-health relationships and interactions. Obtained ECRS explain a substantial portion of the variance, in particular for mental and cardiometabolic ECRS, and their performances generalize well across all six cohorts. We identify predictors with an overall high impact on the predicted risk, such as maternal stress, child BMI, and noise exposure for mental health. We also extract non-linear dose-response relationships. Our approach’s main benefit lies in its ability to capture complex associations and extract insights at both a global and personal level for each exposure or groups of exposures. Overall, this study highlights the potential of such approaches to compute risk scores able to inform practitioners about actionable factors of high-risk children.

Methods

Study participants

This study uses data from the HELIX project. This project includes data from six different European longitudinal birth cohorts, namely: (1) Born in Bradford alias BiB; UK²², (2) Etude des Déterminants pré et postnatals du Développement et de la santé de l’Enfant, alias EDEN; France²³, (3) Infancia y Medio Ambiente alias INMA, Spain²⁴, (4) Kaunas Cohort alias KANC; Lithuania²⁵, (5) Norwegian Mother, Father and Child Cohort Study alias MoBa; Norway^26,27, and (6) Mother-Child Cohort in Crete alias RHEA; Greece²⁸. Children were born at different periods depending on the cohort (RHEA 2007–2008, EDEN 2003–2005, INMA and MoBa 2005–2007 and KANC 2007–2009). In total nearly 32,000 mother-child pairs were initially followed during pregnancy and a subset into childhood from 6 to 12 years old depending on the cohort. From these, we used data from 1622 pairs for which biological samples, environmental exposures, clinical biomarkers and health outcomes were assessed with common standardized protocols. All six cohorts in which HELIX is based had undergone the required evaluation by national ethics committees (prior the start of the project) and confirmed that relevant informed consent was given for secondary use of the data²⁰.

Data

Health outcomes

Health outcomes during childhood were either directly measured or derived from variables collected between December 2013 and 2016 in the Helix subcohort follow-up visit²⁰. Hence, health outcomes and childhood environmental factors are cross-sectional.

Mental health

We modelled mental health using the P-factor, a reliable measure of psychopathology in youth populations^29,30 that represents life course vulnerability to psychiatric disorders³¹ and is predictive of long-term psychiatric and functional outcomes³². P-factor in childhood has been found to predict the course and severity of a multitude of psychiatric outcomes in adolescence³³. It was computed using confirmatory factor analysis (CFA) to fit a hierarchical general psychopathology model with the Lavaan statistical package³⁴ with data from the 99-item Child Behavior Checklist³⁵, a questionnaire filled out by the parents.

Cardiometabolic health

We used an aggregated metabolic syndrome score as a summary score for cardiometabolic health³⁶. It was calculated using the z scores of waist circumference, systolic and diastolic blood pressures, levels of triglyceride, high-density lipoprotein cholesterol, and insulin with the following formula: metabolic syndrome = z waist circumference + (–z HDL cholesterol level + z triglyceride level)/2 + z insulin + (z systolic blood pressure + z diastolic blood pressure)/2. A higher metabolic syndrome score indicated a poorer metabolic profile.

Respiratory health

Finally, we assessed the lung function using the child forced expiratory (air) volume in one second (FEV₁) percent predicted value as in a previous HELIX study ³⁷. FEV₁ was measured with a spirometry test (EasyOne spirometer; NDD [New Diagnostic Design], Zurich, Switzerland) using a standard standardized protocol. Then the Global Lung Initiative reference equations³⁸ were used to compute FEV1 percent predicted values (i.e., values standardized by age, height, sex, and ethnicity of the patient).

Environmental data

We used a wide variety of environmental exposures from both mothers (during pregnancy) and children (between age 6 to 12 depending on the cohort of inclusion) that participated in the HELIX follow-up visit organized between December 2013 and 2016²⁰. Measurements on pregnant mothers were collected between 1999 and 2010. Information about the methods used to estimate those exposures is available in Supplementary Notes, Supplementary Tables 1–9.

In previous HELIX exposome wide studies (including chemical, outdoor, psychosocial exposures) a sub selection of variables was made among the questionnaire data, and they did not include together external and internal exposome. Due to the exploratory nature of our study, we included new variables previously unexplored in the HELIX studies. In total, we selected 63 prenatal and 240 postnatal exposures grouped in 18 exposure families. An overview of those families is given in Fig. 1. Most of the exposures used were already described in previous publications, specifically measurement available and baseline data²⁰. New exposures, previously not described in detail, were extracted from the HELIX subcohort main questionnaire, more precisely about children’s time spent outside (during weekends and holidays), noise disturbance, and house cleaning products. A full list and a description of the selected variables are available in Supplementary Data 1.

**Fig. 1: Proportions of exposures grouped into families.**

Outdoor and indoor exposures

Outdoor exposures were estimated using geographic information systems (GIS), remote sensing and spatiotemporal modeling³⁹. Considered exposures include air pollutants (e.g., particulate matter), meteorological factors (temperature, humidity, UV exposure), traffic noise, traffic indicators, natural space (green spaces, blue spaces) and built environment (e.g., building density, public transport, facilities, etc.). More details are provided in Supplementary Notes, Supplementary Tables 1–4.

Indoor air pollution exposure to NO2 and to volatile organic compounds benzene, toluene, ethylbenzene, meta-xylene, para-xylene and ortho-xylene, was measured through passive samplers installed in the homes of 150 individuals in the panel studies and extrapolated for the whole HELIX subcohort using prediction models²⁰.

Lifestyle

Lifestyle exposures were collected using a standardized questionnaire developed for HELIX and included the child’s diet, physical activity, sleeping patterns, socioeconomic variables (e.g., subjective wealth, social capital of the family), exposure to environmental tobacco smoke, water consumption habits, cleaning products, noise perception and time outdoors. Water disinfection by-product measurements were collected from water companies for the entire cohorts in each HELIX center. More details are provided in Supplementary Notes, Supplementary Table 5.

Biomonitored chemical pollutants

Pollutant biomarkers were assessed during pregnancy and childhood using blood and urine samples. They include organophosphate pesticides, phenols, phthalates, metals, perfluoroalkyl (PFAS) substances, polybrominated diphenyl ethers (PBDEs), organochlorines and creatine. These measurements were adjusted for lipids and creatinine when appropriate. More details are provided in Supplementary Notes, Supplementary Tables 6–9.

Metabolites and proteins

We included 122 protein and metabolite measurements in the study. More specifically, we included 36 proteins that were assessed from plasma using Luminex immunoassay kits (cytokines 30-plex, apoliprotein 5-plex and adipokine 15-plex). Forty-two blood serum metabolite indicators were assessed using the targeted Biocrates’ AbsoluteIDQ p180 kit and the MetIDQ^TM RatioExplorer software that calculated sums and ratios of metabolites, termed metabolism indicators, to improve biological interpretation. Forty-four urine metabolites were assessed using proton nuclear magnetic resonance (¹H NMR) spectroscopy⁴⁰. Urine metabolites were normalized using the median fold change normalization method⁴¹, which takes into account the distribution of relative levels of all metabolites compared to the reference sample in determining the most probable dilution factor. The full list and the description of selected metabolites and proteins are available in Supplementary Data 1.

Parental and child clinical factors

Clinical factors were collected during childhood or pregnancy from the HELIX subcohort follow-up clinical examination (between December 2013 and 2016)²⁰ or initial cohort assessments on pregnant mothers (between 1999 and 2010). Childhood clinical factors included maternal mental and cognitive states (e.g., maternal perceived stress (short form version)⁴², maternal working memory⁴³), child respiratory factors (e.g. diagnosed asthma, self-reported rhinitis) and cardiometabolic factors (e.g., systolic and diastolic blood pressure, blood lipids). Pregnancy clinical factors only include maternal blood lipids collected during the initial cohorts’ assessments. The complete list and the description of included clinical variables are available in Supplementary Data 1.

Covariates

Covariates were used as predictors in the ECRS. As covariates, we used both children (e.g., age at examination, sex, asthma medication, the season of birth) and parents’ characteristics (e.g., parents’ nativity, paternal and maternal education, mother age at birth, mother’s parity). The full list and description are available in Supplementary Data 1.

Statistical analysis

All data processing was performed in Python 3.9.7.

Data preparation

Figure 2 provides a brief description of the data selection process and the study workflow. This study aims to agnostically discover exposure-health associations while minimizing the likelihood of overfitting and maximizing ECRS models’ interpretability. Hence, we performed several minimal data selection steps, from the initial selection of data to the filtering of strongly correlated and noisy data.

First, as detailed in the Method—Data section, we used a wide selection of previously described variables enriched with new exposures that were not assessed in previous HELIX studies. In this step, similarly to previous studies on the HELIX sub-cohort data^37,44, we selected single representatives from groups of related and correlated variables to reduce multicollinearity and increase interpretability. For instance, we only kept the 300 meter buffers for the number of road intersections per km² and removed the 100 meter buffers; or we considered only home-based air pollutants measurements and discarded those from schools or other places. The full description of the preselection steps is available in the annex (Supplementary Methods—part 1).

Then, we further refined this selection by filtering among strongly correlated variables (r > 0.9), discarding 28 variables in total. The rules applied for this step are also described in the annex (Supplementary Methods—part 2).

To reduce the amount of missingness in the data, we discarded records of individuals with >50% (102 records discarded) and variables with >60% missing data (3 variables discarded). More information about each selection step is provided in Supplementary Table 10. The mean percentage of missing values per exposure was 14% (first quartile 0.66% and third quartile 20.59%). Percentages of imputed missing values for each selected variable are available in Supplementary Data 1 and computed for each cohort in Supplementary Data 2. Missing values were imputed once using MissForest⁴⁵, a single iterative imputation algorithm that can handle both categorical and continuous variables and capture nonlinear relationships. We compared the performance of this method with two classical single imputation methods, namely KNN and mean imputation on manually generated missing values. The number of missing values to add on top of the originals was settled to be proportional to 15% of the total sample size for each variable (i.e., 243). MissForest considerably outperformed KNN and mean imputation with a mean squared error of 0.56, 1.00 and 1.25 respectively. From this imputed dataset, individuals with non-missing outcomes were selected for the mental, cardiometabolic, and respiratory risk scores resulting in three distinct datasets.

Finally, depending on the outcome for each dataset, we excluded clinical factors closely related to the outcome to prevent data leakage, which could over-inflate the model’s predictive power (e.g., blood pressure for MetS). The selection made on the basis of the outcome is available in annex (Supplementary Data 1).

After selection, the data were prepared for the analysis. Depending on their nature, categorical data were one hot encoded (i.e., changed into dummy variables), labeled, or encoded with floating values (for frequencies, binned continuous variables, etc.). Out of the 478 total selected variables, 75 were categorical (11 one-hot encoded, 43 labeled). Then each phenotype was standardized into z-scores.

Modeling

We computed ECRS with supervised machine learning methods, predicting simple scalar measures as outcomes (P-factor, MetS and lung function) and using multiple environmental and clinical variables, including metabolites and proteins as predictors. Training is first performed in a 10-fold cross-validation (CV) procedure, where hyperparameters of the methods are optimized and performances measured. A 10-fold CV was chosen over, for instance, a leave-one-out or a 5-fold CV to balance the computational efficiency and the robustness of our performance estimates. Hyperparameters optimization of each method was achieved using the Tree-structured Parzen Estimator⁴⁶ from the Optuna library, which optimizes the path through the hyperparameters space. Supplementary Tables 11 and 12 show the selected hyperparameters for each model.

We compared predictive performances of ECRS computed with three methods:

The first method is LASSO, a penalized linear regression method widely used in the field. Note that technically, while the model itself is linear (it models the relationships between the input features and the output using a linear equation), the optimization process due to the L1 penalty is non-linear. One of its main advantages is its ability to handle high-dimensional data through regularization. For this method, data was standardized before training.

The second and third methods are nonlinear and non-parametric ensembles of trees (i.e., Random Forests and eXtreme Gradient Boosting aka XGBoost⁴⁷). Ensemble methods are able to handle small datasets and high dimensionality⁴⁸ while interaction effects can be captured and extracted from tree models^49,50. In terms of prediction power, tree-based methods are still competitive with Deep Neural Networks (DNN) on tabular data⁵¹ due to (1) their robustness to uninformative features, (2) their ability to preserve the data orientation, and (3) their ability to easily learn irregular functions. Machine learning, with systematic out-of-sample testing, employs a rigorous approach to test how well relationships between targets and variables are captured. The better a model performs, the more accurate associations it captures, and very poor performances may indicate unreliably captured information.

One of the HELIX project particularities is that it aggregates data from six different cohorts, with some variables having very distinct distributions across them⁵². Those variables are likely to be biased by cohort-related effects. Thus, we chose to penalize contributions of features strongly associated with the cohorts proportionally to their importance in the prediction. Each ECRS was computed using the following modeling (including those in the CV procedures):

$$Y={f}_{0}\left(X\right)+{g}_{0}\left(Z\right)+R,\,E\left[R{{{{{\rm{|}}}}}}Z,X\right]=0$$

Where Y is the phenotype, X the environmental/clinical factors and usual covariates, Z the one hot encoded cohorts of inclusion, and R the residuals. f0 and g0 were estimated using our regressive methods (LASSO, RF, XGB) in two sequential steps with separate models.

$${{{{{\rm{Step}}}}}}1:Y\,=\,{g}_{0}\left(Z\right)+U,\,E\left[U{{{{{\rm{|}}}}}}Z\right]=0$$

(2)

$${{{{{\rm{Step}}}}}}2:{U=f}_{0}\left(X\right)+V,\,E\left[V{{{{{\rm{|}}}}}}X\right]=0$$

(3)

Models’ explanations

Unlike LASSO, tree-based approaches can capture complex relationships that are not limited to a single coefficient per feature. We used SHAP⁵³, a local explanation method that uses Shapley values to extract different contribution coefficients for each individual. This allowed us to keep nonlinear relationships in the explanations. Shapley values were initially used in cooperative game theory to estimate the contribution of each player to the overall cooperation with desirable properties⁵⁴. Adapted to machine learning, it gives the contribution of each feature to the overall prediction at the local (individual) level and can be aggregated (taking the average absolute Shapley values) to give a global measure of feature importance. Concretely, a Shapley value gives, for a given individual, how much the given associated variable impacted the model prediction from the mean predicted value (negatively or positively). They are additives and sum up to the mean predicted value. We used them to explore captured associations (e.g., their directions, nonlinearities) and to compute measures of the global importance of a feature (or a family/group of features) obtained by averaging its absolute (Shapley) values across individuals. Because Shapley values are additives, the contribution of a group of features is easily computed (taking the sum of Shapley values at the individual level). We also computed SHAP interactions⁵⁰ to explore potential (pairwise) cocktail effects.

Measures of global feature importance were computed in the 10-fold CV loop to estimate confidence intervals. Features importances were computed using XGBoost, the best performing nonlinear tree-based method. Local explanations and stratification were conducted on final ECRS obtained by training the XGBoost models on the whole dataset, with the hyperparameters selected by the 10-fold CV procedure. We also computed ECRS with Lasso on the whole dataset to compare extracted Shapley values.

Sensitivity analysis

Finally, for each ECRS, we tested the robustness of our method when applied to different populations. Leveraging the six different cohorts’ data available in our study, we applied a leave-one-cohort-out cross-validation procedure, recursively training our XGBoost models on five cohorts to predict the sixth. Before training, we standardized both features and targets (e.g., P-factor, MetS and lung function) across each cohort. The hyperparameters used were the same as before and were not optimized for this task.

Ethics approval

Local ethical committees approved the studies that were conducted according to the guidelines laid down in the Declaration of Helsinki. The ethical committees for each cohort were the following: BIB, Bradford Teaching Hospitals NHS Foundation Trust; EDEN, Agence nationale de sécurité du médicament et des produits de santé; INMA, Comité Ético de Inverticación Clínica Parc de Salut MAR; KANC, LIETUVOS BIOETIKOS KOMITETAS; MoBa, Regional komité for medisinsk og helsefaglig forskningsetikk; Rhea, Ethical committee of the general university hospital of Heraklion, Crete. Informed consent was obtained from a parent and/or legal guardian of all participants in the study.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Results

Population characteristics

To investigate the predictive potential of early-life external and internal exposome associated with mental, cardiometabolic, and lung health in children, we selected a study cohort of 1622 mother-child pairs who participated in the HELIX study. This cohort was composed of approximately half females (46.1%), mainly of European ancestry (82.9%), from highly educated families (40.1% with high maternal education), and the majority residing in urban areas (75.3% in areas with a density of population > 1500 inhabitants/km²) (see Supplementary Fig. 1). At the time of the health assessment, children were on average 8 years old (range: 5.5–12 years), 3.9% regularly visited the psychologist, and 7.3% had a neuropsychiatric diagnosis at the time of the visit (according to parent’s reports, besides the CBCL screening). Based on the World Health Organization (WHO) international standards for BMI cut-offs (normal: 18.5–25 kg m-2, overweight: 25–30 kg m-2, obese: ≥ 30 kg m-2), while 69.2% of participants fell within the normal category (n = 1122), 10.6% of participants were categorized as overweight (n = 172), and 20.2% of participants were identified as obese (n = 328). Additionally, 10.2% of children were reported to have asthma (ever diagnosed).

Leveraging data from the pregnancy (p = 62) and childhood (p = 241) exposome, preclinical biomarkers (p = 112), and clinical factors (p = 18) along with covariates (p = 14) (full list available in Supplementary Data 1), we trained machine learning models to predict the P-factor, MetS, and lung function. We used 445 features from 1513 individuals for mental health, 432 from 1151 for cardiometabolic health, and 442 from 1176 for respiratory health (Fig. 2). P-factor and metabolic syndrome were square root transformed to be normally distributed. All three health outcomes were standardized to have a mean of 0 and a standard deviation of 1, with ranges of −2.27–3.46, −3.25–3.68, and −4.80–5.60 for mental, cardiometabolic, and respiratory health, respectively (Fig. 3). After transformation and standardization, all outcomes were normally distributed with two-sided Kolmogorov-Smirnov test p-values of 0.08, 0.11 and 0.24 for P-factor, MetS and lung function respectively. For P-factor and MetS, higher scores indicate an increased risk and for lung function a decreased risk.

**Fig. 3: Standardized health outcome distributions measured in 6–12 year-old HELIX children.**

Predictive performances: environmental-clinical features captured 13%, 50%, and 4% of the variance in mental, cardiovascular, and respiratory health

To address overfitting, improve stability, and compare model performances and generalizability for all children, we implemented a ten-fold iteration scheme for tested algorithms (XGBoost, RF, and LASSO) with cross-validation (CV) (Fig. 2; see Method). This approach generated, for each algorithm, ten fitted sparse models for each outcome. The comparative analysis of all methods’ predictive performances is presented in Fig. 4. These performances were obtained after cohort adjustment (see Method, part 2: Modeling). Cohorts accounted for 5 to 14% of the out-of-sample variance (see Supplementary Fig. 2).

**Fig. 4: Models’ performance comparison obtained after cohort adjustment.**

LASSO models explained around 12% of the variance in the P-factor, 51% in MetS, and 2% in lung function. In contrast, RF explained 11% of the variance in the P-factor, 41% in MetS, and 3% in lung function. Finally, XGBoost explained 13% of the variance in the P-factor, 50% in MetS, and 4% in lung function (Fig. 4). Across the three outcomes, the differences between LASSO and XGBoost were not statistically significant, as confirmed by 10-fold cross-validated two-sided paired student t-tests with p-values of 0.686, 0.656, and 0.216 for P-factor, MetS, and lung function, respectively. Information about the residuals obtained during the cross-validation procedure is summarized in Supplementary Table 13. They were, on average, centered around 0 and normally distributed across all CV folds unless for the respiratory health scores that were considered normal less consistently.

Global feature importance: ECRS captures the influence on health of both external and internal exposome factors

From the three ECRS computed with XGBoost, the best performing non-linear method, we computed Shapley values for each feature and each individual using SHAP. Averaging the absolute values of Shapley values across all individuals gives a measure of the global impact of each variable (or groups of variables) on health outcomes. Fig. 5 shows the feature importance for the top 20 variables at the level of each feature and of exposure families for all phenotypes obtained from the 10 fitted cross validated XGBoost models. More exhaustive feature importance lists (top 100) are available in Supplementary Data 3 for mental, cardiometabolic and respiratory ERS. In addition, the overall importance of all metabolites and proteins compared to exposomic variables, clinical factors and covariates is displayed, for each ECRS, in Supplementary Fig. 3.

**Fig. 5: Global feature contributions to the three environmental-clinical risk scores in the HELIX mother-child pairs.**

For the P-factor, maternal stress was by large the most important feature, with a mean SHAP of 0.16, followed by noise disturbance from other children with a mean of 0.05 and zBMI with a mean of 0.04. Apart from parental clinical factors with a mean SHAP of 0.16 (mostly driven by the impact of maternal stress), noise disturbance, and lifestyle exposures (such as skipping breakfast, dairy intake and processed meat consumption) were the most important families of exposures, with mean of 0.07 and 0.06, respectively. Other factors not belonging to these families such as tyrosine (urine metabolite) with a mean SHAP of 0.02 and prenatal bisphenol A (phenols) with a mean SHAP of 0.02 were also noteworthy.

For MetS, the interleukin-1 beta (IL1B) protein was the most prominent feature with a mean SHAP of 0.29. Proteins, serum and urine metabolites, as families of variables, exhibited the most impact on the predicted phenotype. They were largely driven by IL1B, Apolipoprotein A1 (APOA1), and the ratio of short chain acylcarnitines to free carnitine ((C2 + C3)/C0). Overall, for this ECRS, metabolites and proteins combined had a mean SHAP of 0.50 while exposures had 0.14 and clinical factors 0.03 (Supplementary Fig. 3).

Finally, for lung function, although the XGBoost model could explain only a minor part of the outcome (4%) and therefore warrants precaution in the interpretation of the feature importance, no individual features were standing out compared to the other health outcomes. The most important features were child zBMI, N-acetylneuraminic acid (Neu5Ac) and the inverse distance to the nearest road during pregnancy (Fig. 5).

On average, childhood measurements were 3.03–8.95 times more important than prenatal variables for all risk scores. We found that childhood factors had a mean contribution of 0.23 (standard deviation: 0.01), in comparison with a prenatal mean contribution of 0.06 (standard deviation: 0.01). For cardiometabolic health, the postnatal mean contribution was 0.51 (standard deviation: 0.03) while prenatal factors had a mean contribution of only 0.06 (standard deviation: 0.00). For respiratory health, the postnatal mean contribution was 0.13 (standard deviation: 0.01), while the prenatal mean contribution was only 0.04 (standard deviation: 0.01).

Local feature importance: Local explanations reveal complex nonlinear relationships between exposures and health outcomes

Unlike linear models where feature coefficients are identical for all individuals, SHAP extracts contributions that are specific to each individual, allowing to assess more complex exposome-health relationships than regression coefficients. The ECRS captured both linear and nonlinear relationships, as shown by SHAP dependence plots (Fig. 6 and Supplementary Fig. 4). For instance, the relationship between maternal stress and noise disturbance from neighbors followed a linear trajectory, while the impact of child zBMI on the P-factor displayed a more complex pattern with a threshold effect. These plots also allowed us to visualize the directions of the distinct associations for each outcome.

**Fig. 6: Local explanations (SHAP) from the three environmental-clinical risk scores in HELIX mother-child pairs.**

A high value in maternal stress was related to an increase in the predicted P-factor, indicating an increased risk for mental health issues. Low levels of noise disturbances and child zBMI slightly reduced the risk of mental health problems, while high values had a particular harmful impact, especially in the case of child zBMI.

For cardiometabolic health, IL1B was positively associated with MetS, with high values having an important impact on the risk. We observed similar relationships, to a lesser magnitude for AAA, apolipoprotein-E (APOE) and C-peptide. Conversely, high values of APOA1 and the (C2 + C3)/C0 ratio showed a strong protective impact.

The results for the respiratory health scores should be interpreted with caution because of the low variance explained by the model after cohort adjustment (4%). We observed a protective impact on lung function for child BMI and inverse distance from the nearest road during pregnancy. Low BMI values substantially increased the risk, while moderate to high values slightly reduced it. On the contrary, high values of Neu5Ac, facility density near schools (300 m), CRP, Monocyte Chemoattractant Protein-1 (MCP1), sugar and oil intake were associated with decreased lung function.

Compared with the linear associations obtained with Lasso (Supplementary Fig. 5), directions of associations were consistent with XGBoost, with some exceptions (e.g., leptin for cardiometabolic health and bus line accessibility for respiratory health). Overall, predictions obtained with XGBoost were more conservative for extreme values of the predictors (e.g., maternal stress for mental health or child zBMI for respiratory health, etc.).

Pairwise input interaction effects were extracted from captured relationships

Pairwise interaction effects were derived from Shapley values using SHAP. Supplementary Figs. 6 and 7 show plots for the top ten interactions (according to the mean absolute value of Shapley values) derived from the mental and cardiometabolic risk scores. For lung function, the predictive power of the risk score was insufficient to extract meaningful information from its captured interactions. Overall, interaction effects on predicted risk were relatively small compared to the marginal effects. We observed a 7.4–8.8 ratio between the mean top ten marginal effects and the mean top ten interaction effects, depending on the outcome of interest (respectively, 0.042 and 0.005 for mental health, and 0.093 and 0.013 for cardiometabolic health). This indicates that overall, pairwise interactions had a much smaller impact than marginal relationships on the predicted risk for the two scores.

For mental health, the most important captured interactions were between perceived maternal stress during follow-up assessment and factors from diverse exposure families (clinical factors, lifestyle, noise disturbance, etc.). Specifically, the most important interactions were between the following factors: maternal stress and allergic rhinitis, with a mean SHAP value of 0.0070; maternal stress and insulin, with a mean SHAP value of 0.006 and maternal stress with dairy intake, with a mean of 0.006 (Supplementary Fig. 6). For cardiometabolic health, the top ten most relevant captured interactions were between clinical biomarkers (proteins and metabolites) or between IL1B and other factors, such as temperature or child’s age. Specifically, the most important captured interactions were between the following factors: IL1B and ratio of short chain acylcarnitines to free carnitine, with a mean SHAP value of 0.025; APOA1 and APOE, with a mean of 0.016 and APOA1 and the ratio of short chain acylcarnitines to free carnitine with a mean of 0.013 (Supplementary Fig. 7).

Figure 7 shows two arbitrary interactions, selected from the top ten most important ones, derived from the mental and the cardiometabolic risk scores. The first interaction is between maternal stress and the insulin measured in children and impacts mental health. It indicates a harmful impact of high insulin combined with maternal stress. The second interaction for cardiometabolic health is betweenIL1B and the ratio of short chain acylcarnitines to free carnitine. It indicates a significative impact of this ratio on children with high values of IL1beta, with a high ratio value associated with higher risk and vice-versa.

**Fig. 7: SHAP selected interaction effects derived from the mental (P-factor) and the cardiometabolic (MetS) environmental-clinical risk scores.**

Overall ECRS performances generalize well on different populations with discrepancies across cohorts

For each of the three ECRS, performances obtained from the leave-one-cohort-out cross-validation were consistent with those obtained using the ten-fold cross-validation. The explained variance was 13.4% (4% std), 46.8% (12% std) and 2.4% (2% std) for mental, cardiometabolic and respiratory health respectively. Differences in predictive performances were observed for all ERCS depending on the left-out cohort. For instance, the ECRS for the P-Factor was less predictive of the outcome when the KANC cohort was predicted based on all the other cohorts (R² = 6.3% versus 13.4 on average). Table 1 shows the variance explained within each cohort.

Table 1 Variance explained (R²) by each ECRS in the leave-one-cohort-out cross validation procedure

Full size table

Discussion

This is the first study that computed children’s ECRS for mental, cardiometabolic, and respiratory health, and covered a wide range of pre- and post-natal factors (including air pollution, noise, urban and social environment, lifestyle, chemical exposures, metabolites, proteins, and clinical factors). The inclusion of data from six different cohorts across different countries allowed us, by adjusting our models for cohorts, to extract non cohort-specific relationships, thereby increasing their likelihood of being generalizable. Predictive performances were superior for the cardiometabolic risk score (R²: 50%), in particular driven by the (pre)clinical biomarkers compared to the other two health domains, R²: 13% for general mental health and 4% for lung function. The most important variables were the following: parental clinical factors (mainly maternal stress), noise disturbance (mainly from neighbors and other children) and lifestyle exposure for mental health; protein and metabolites (mainly IL1B) for cardiometabolic health; and child BMI and urine metabolites for respiratory health. While our results need to be validated on an external population (e.g., different from the countries available in HELIX), the cohorts-based sensitivity analysis (leave-one-out cross-validation) showed promising results regarding the generalizability of our ECRS on European populations. Our approach’s main benefit lies in its ability to capture complex associations and extract insights at both a global and personal level for each exposure or groups of exposures. Results showed several important captured relationships that were nonlinear.

Our study was performed in a context where current clinical tests often fail to identify children at risk of developing chronic diseases, notably regarding mental and cardiovascular diseases, which is a key challenge to the development of effective prevention and treatment policies. This limitation is largely attributed to the fact that many of these tests were developed and validated primarily in adult populations and to the paucity of longitudinal data on children. In the case of cardiovascular risk, the role of adipokines is a pertinent example. While the association between elevated inflammatory biomarkers and increased cardiovascular risk is well established in adults, there is a lack of comprehensive studies on the onset and progression of these biomarkers in children⁵⁵. Therefore, more prospective studies focused on children are necessary to provide guidance to the pediatric medical community for more effective interventions and prevention policies.

Polygenic risk scores have substantially improved predictions in comparison with single genetic factors and its potentials for screening, prevention, treatment and disease management start to become apparent suggesting that their environmental analogue will be equally valuable in public health prevention¹⁸, for both the identification of individuals and environmental factors of risk. A recent study from the UK Biobank⁵⁶ showed that, in addition to standard clinical risk factors such as sex, age, blood pressure or BMI, ERS provides a greater increase in predictive performances compared with PRS. Furthermore, ERS captures holistic individual-level non-hereditary risk associations, providing clinicians with actionable factors of high-risk patients that are independent of genetics and provide guidance for prevention and treatment. In this case the environmental factors include, among others: noise disturbance, prenatal bisphenol A and ambient humidity for mental health; bus lines accessibility, child’s dichlorodiphenyldichloroethylene and oily fish intake for cardiometabolic health; facility density, bus lines accessibility and sugar intake for respiratory health.

A known limitation of most nonlinear machine learning approaches is the model identifiability which can lead to complications such as overfitting or ambiguity in interpreting the model’s parameters^57,58. However, we believe our approach has several strengths that mitigate these concerns. First, our modeling strategy, backed by rigorous cross-validation, limits overfitting and prioritizes generalizability across different data subsets. Second, both nonlinear methods used in this study aggregate results from multiple models, whether from boosting or bagging, which enhances the stability of results. Finally, it is worth remembering that identifiable models such as linear regression or LASSO are also limited. They cannot capture nonlinear relationships and their interpretation are correct only if all relationships are linear.

Our tree-based approach has shown comparable predictive performances to LASSO, which could indicate that the relationships to capture are mostly linear in nature and that there are no interactions to be captured. However, the difference in prediction is likely to be due to the ability of simpler models to perform better in situations of small training sample size⁵⁷. More data would be needed in order to confirm the nature of those relationships and, even if LASSO performs well, it may not capture all the complexities within the data, especially nonlinear relationships and interactions, which other algorithms could reveal. In this study, we explored such relationships using, to our knowledge, a novel approach in this field. We did not aim to assess nor confirm the causality of extracted relationships, but rather to identify new associations that previous methods would have missed or refine the potential nature of known ones (e.g., nonlinearities, interactions). Thus, the causality of those findings needs to be validated in a causal inference framework.

We compared our findings to those previously obtained in the literature and highlighted novel relationships. Besides nonlinearities that are rarely assessed in the field, we found our results, when comparing only directions of associations, to be mainly consistent with those observed in other studies, which indirectly validates our approach.

For mental health, previous studies have found that maternal stress can lead to an increased risk of anxiety, depression, and behavioral problems in children⁵⁹, while a higher BMI has been associated with depression, anxiety, and low self-esteem⁶⁰. Similarly, exposure to noise has been linked to both internalizing and externalizing behavioral problems in children⁶¹. Overall, while the majority of factors identified in this study have been linked with mental health outcomes in the literature, the exact causal pathways, the potential interactions between these factors, and the nuances of their impacts during specific developmental windows would benefit from further investigation. This is special true for the following environmental factors that are rarely studied in clinical mental health research: perceived noise disturbance, pollutant exposure to bisphenol A (endocrine disruptor chemical), and certain metabolic markers such as tyrosine levels and the Kynurenine/Tryptophan ratio.

For cardiometabolic health, several of the identified markers are well established (e.g., IL1beta⁶², APOA1⁶³, Leptin⁶⁴) and their known relationships are consistent with our findings. Other markers, such as aromatic amino acids or plasmalogens have been linked to several aspects of metabolic and cardiovascular health^65,66,67, but their causal pathways are still areas of ongoing research. Finally, 4-deoxyerythronic acid is not well covered in the literature.

At last, concerning respiratory health, we mainly identified a positive (nonlinear) association between FEV1 and child BMI, which supports the obesity paradox in chronic obstructive disease^68,69. Unexpectedly, the inverse distance to the nearest road during pregnancy was associated with an increased FEV1. This association was already reported in a previous HELIX study³⁷ and is driven by the RHEA cohort. Overall, while our study identified relationships already well established (e.g., air quality through facility density or accessibility), others might be less direct and require further investigations (e.g., N-acetyl neuraminic acid, sugar and vegetable oil intake).

This study benefited from the richness of the HELIX project data, which used standardized outcomes, clinical biomarkers, and exposure measurement methods across six different European countries. The HELIX project used a wide range of exposure measurement techniques to collect both internal and external exposome data. In contrast to previous studies where ERS are usually derived from weighted sums of a limited number of exposures, our approach simultaneously assessed the impact of a wide range of exposures, metabolites, and clinical factors on several health outcomes. To address multicollinearities and nonlinearities, we used penalization and recent AI modeling techniques. SHAP allowed us to decompose the complex relationships captured by our ECRS for each feature, at a global and personal level, extracting interactions and marginal effects. Our risk scores were adjusted for the cohorts of inclusion, which is a particularity of the HELIX project. We acknowledge that in the case of highly correlated variables, our approach exposes those with the estimated higher impact on the predicted risk without implying causality. Correlations between exposures are known to present a challenge for exposome research, especially in the ability to differentiate true predictors from correlated covariates⁷⁰.

The main limitation of our study is the lack of external validation using an independent cohort. While our study benefited from the richness of exposome data not found in any other early-life cohort studies or pre-clinical studies, our sample size was relatively small (n ~ 1500), especially for applying certain machine learning methods (e.g., deep neural networks). This sample size limited our ability to capture and extract complex relationships, such as interaction effects, and favored our choice towards tree-based ensemble machine learning methods. Nevertheless, we observed a similar predictive value with LASSO. The other outcomes were both precomputed composite risk scores (P-factor and MetS), possibly suggesting better performances of machine learning methods in predicting raw outcomes. Additionally, as our study is exploratory in nature, we did not assess the causality of extracted exposome-health relationships. Further work is required to validate these relationships using a proper causal inference methodology. For instance, our scores might capture bidirectional cause-and-effect relationships, such as, potentially, the association between maternal stress and child behavior. The inclusion of internal peripheral markers that reflect the body’s response to the measured health outcomes (e.g., IL1B and obesity) could also reinforce this phenomenon. In the case of MetS, the array of proteins and blood metabolites primarily covered biological pathways related to cardiometabolic outcomes, such as blood lipids. By decomposing the importance attributable to metabolites/proteins and the other factors (Supplementary Fig. 3), and further refined into families of factors, we limit this issue. The integration of these preclinical biomarkers aimed to enhance the predictive power of our ECRS and to investigate their relationships and interactions when combined with a wide variety of factors.

Another limitation is that our study does not account for the risk evolution over time since we used pregnancy and cross-sectional epidemiological data to assess childhood exposures at a single point in time, without considering their impact throughout an individual’s lifetime. This limitation will be addressed in ATHLETE, the HELIX follow-up project in which the same children were regularly monitored into adolescence with repeated health outcome measures⁷¹.

Finally, our study participants are mostly representative of the European population since the data was collected from six different European countries. Therefore, caution must be taken when extrapolating our findings to different populations. As more exposome datasets become available in the future, we will be able to further validate the scores obtained in this study and generalize our findings beyond the populations here analyzed. Beyond the validation of our scores, the associations extracted in this study, which are yet rarely studied in exposome research, could be validated more independently without such a high-dimensional dataset. Some of those factors are easily collectable through questionnaires (e.g., noise disturbance, maternal stress for mental health). Social and perceived environmental factors are yet poorly covered in exposome research⁷² and this study provides more evidence of their importance.

The combined use of complex machine learning techniques and explainable artificial intelligence methods remains uncommon in the environmental epidemiology field, despite its excellent fit with the exposome paradigm, which aims to capture complex associations within mixtures of environmental factors. Our study revealed results mostly consistent with previous studies, while at the same time, exposing individual level relationships with nonlinearities. Furthermore, we believe that the development of bigger databases and federated analysis tools such as dataShield⁷³ can unlock the true potential of these approaches to more accurately capture exposome-health associations.

Conclusion

In this large exposomic study, environmental clinical risk scores were computed using linear (LASSO) and nonlinear methods (XGBoost, Random Forest). No significant differences in predictive performances among these methods were found across the computed risks. From the nonlinear risk scores, we extracted exposome-health relationships from Shapley values, which is, to our knowledge, a novelty in the field, allowing us to derive feature importance at a local and a global level and uncover interactions. The most important predictors included maternal stress, child BMI and noise exposure for mental health; biomarkers such as IL1B and APOA1 for cardiometabolic health; and child BMI and sialic acid (Neu5Ac) for respiratory health.

Besides their usefulness for epidemiological research, our risk scores showed great potential to capture holistic individual-level nonhereditary risk associations that can inform practitioners about actionable factors of high-risk children. As in the post-genetic era, personalized prevention medicine will focus more and more on modifiable factors, we believe that such integrative approaches will be instrumental in shaping future healthcare paradigms.

Data availability

The raw data supporting the current study, including the proteomics and metabolomics data, are not publicly available due to ethical considerations and legislative restrictions. These data include sensitive information that requires careful handling to protect participant confidentiality and comply with data protection regulations. The access to the raw data is available from the corresponding author on request subject to ethical and legislative review. The “HELIX Data External Data Request Procedures” are available with the data inventory at this website: http://www.projecthelix.eu/data-inventory. The document describes who can apply to the data and how, the timings for approval, and the conditions for data access and publication. Additionally, the data were made available for the exposome data challenge without the feature annotations to fullfill the privacy requirements of the birth cohorts: The datasets are available in the github repository. The source data supporting the study figures can be accessed here⁷⁴. All other data are available, if applicable, from the corresponding author on reasonable request.

Code availability

Python code is publicly available on GitHub⁷⁵.

References

Koppe, J. G. et al. Exposure to multiple environmental agents and their effect. Acta Paediatr. 95, 106–113 (2006).
Article Google Scholar
Rauh, V. A. & Margolis, A. E. Research review: environmental exposures, neurodevelopment, and child mental health – new paradigms for the study of brain and behavioral effects. J. Child Psychol. Psychiatry 57, 775–793 (2016).
Article PubMed PubMed Central Google Scholar
Pryce, C. R. et al. Long-term effects of early-life environmental manipulations in rodents and primates: potential animal models in depression research. Neurosci. Biobehav. Rev. 29, 649–674 (2005).
Article PubMed Google Scholar
Needleman, H. L., Schell, A., Bellinger, D., Leviton, A. & Allred, E. N. The long-term effects of exposure to low doses of lead in childhood. N. Engl. J. Med. 322, 83–88 (1990).
Article CAS PubMed Google Scholar
Weihrauch-Blüher, S., Schwarz, P. & Klusmann, J.-H. Childhood obesity: increased risk for cardiometabolic disease and cancer in adulthood. Metabolism 92, 147–152 (2019).
Article PubMed Google Scholar
Maitre, L. et al. Early-life environmental exposure determinants of child behavior in Europe: a longitudinal, population-based study. Environ. Int. 153, 106523 (2021).
Article CAS PubMed PubMed Central Google Scholar
Wild, C. P. Complementing the genome with an ‘exposome’: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol. Biomark. Prev. Publ. Am. Assoc. Cancer Res. Cosponsored Am. Soc. Prev. Oncol 14, 1847–1850 (2005).
Article CAS Google Scholar
Jaffee, S. R. & Price, T. S. Genotype–environment correlations: implications for determining the relationship between environmental exposures and psychiatric illness. Psychiatry 7, 496–499 (2008).
Article PubMed Google Scholar
Johns, D. O. et al. Practical advancement of multipollutant scientific and risk assessment approaches for ambient air pollution. Environ. Health Perspect. 120, 1238–1242 (2012).
Article PubMed PubMed Central Google Scholar
D’Agostino, R. B. et al. General cardiovascular risk profile for use in primary care: the framingham heart study. Circulation 117, 743–753 (2008).
Article PubMed Google Scholar
Khera, A. V. et al. Genome-wide polygenic scores for common diseases identify individuals with risk equivalent to monogenic mutations. Nat. Genet. 50, 1219–1224 (2018).
Article CAS PubMed PubMed Central Google Scholar
Park, S. K., Tao, Y., Meeker, J. D., Harlow, S. D. & Mukherjee, B. Environmental risk score as a new tool to examine multi-pollutants in epidemiologic research: an example from the NHANES study using serum lipid levels. PLOS ONE 9, e98632 (2014).
Article PubMed PubMed Central Google Scholar
Murray, G. K. et al. Could polygenic risk scores be useful in psychiatry?: a review. JAMA Psychiatry 78, 210–219 (2021).
Article PubMed Google Scholar
Wray, N. R. et al. From basic science to clinical application of polygenic risk scores: a primer. JAMA Psychiatry 78, 101–109 (2021).
Article PubMed Google Scholar
Pries, L.-K., Erzin, G., Rutten, B. P. F., van Os, J. & Guloksuz, S. Estimating aggregate environmental risk score in psychiatry: the exposome score for schizophrenia. Front. Psychiatry. 12, 671334 (2021).
Gao, P. & Snyder, M. Exposome-wide association study for metabolic syndrome. Front. Genet. 12, 783930 (2021).
Article CAS PubMed PubMed Central Google Scholar
Le Magueresse-Battistoni, B., Vidal, H. & Naville, D. Environmental pollutants and metabolic disorders: the multi-exposure scenario of life. Front. Endocrinol. 9, 582 (2018).
Vassos, E. et al. The Maudsley environmental risk score for psychosis. Psychol. Med. 50, 1–8 (2019).
Google Scholar
Padmanabhan, J. L., Shah, J. L., Tandon, N. & Keshavan, M. S. The ‘polyenviromic risk score’: aggregating environmental risk factors predicts conversion to psychosis in familial high-risk subjects. Schizophr. Res. 181, 17–22 (2017).
Article PubMed Google Scholar
Maitre, L. et al. Human early life exposome (HELIX) study: a european population-based exposome cohort. BMJ Open 8, e021311 (2018).
Article PubMed PubMed Central Google Scholar
Vrijheid, M. et al. The human early-life exposome (HELIX): project rationale and design. Environ. Health Perspect. 122, 535–544 (2014).
Article PubMed PubMed Central Google Scholar
Wright, J. et al. Cohort profile: the born in bradford multi-ethnic family cohort study. Int. J. Epidemiol. 42, 978–991 (2013).
Article PubMed Google Scholar
Heude, B. et al. Cohort profile: the EDEN mother-child cohort on the prenatal and early postnatal determinants of child health and development. Int. J. Epidemiol. 45, 353–363 (2016).
Article PubMed Google Scholar
Guxens, M. et al. Cohort profile: the INMA—INfancia y medio ambiente—(Environment and childhood) project. Int. J. Epidemiol. 41, 930–940 (2012).
Article PubMed Google Scholar
Grazuleviciene, R. et al. Surrounding greenness, proximity to city parks and pregnancy outcomes in kaunas cohort study. Int. J. Hyg. Environ. Health 218, 358–365 (2015).
Article PubMed PubMed Central Google Scholar
Magnus, P. et al. Cohort profile update: the Norwegian mother and child cohort study (MoBa). Int. J. Epidemiol. 45, 382–388 (2016).
Article PubMed Google Scholar
Paltiel, L. et al. The biobank of the Norwegian mother and child cohort study – present status. Nor. Epidemiol. 24, 29–35 (2014).
Google Scholar
Chatzi, L. et al. Cohort profile: the mother-child cohort in crete, greece (Rhea study). Int. J. Epidemiol. 46, 1392–1393k (2017).
Article PubMed Google Scholar
Constantinou, M. P. et al. Changes in general and specific psychopathology factors over a pychosocial intervention. J. Am. Acad. Child Adolesc. Psychiatry 58, 776–786 (2019).
Article PubMed Google Scholar
Haltigan, J. D. et al. “P” and “DP:” examining symptom-level bifactor models of psychopathology and dysregulation in clinically referred children and adolescents. J. Am. Acad. Child Adolesc. Psychiatry 57, 384–396 (2018).
Article PubMed Google Scholar
Caspi, A. et al. Longitudinal assessment of mental health disorders and comorbidities across 4 decades among participants in the dunedin birth cohort study. JAMA Netw. Open 3, e203221 (2020).
Article PubMed PubMed Central Google Scholar
Cervin, M. et al. The p factor consistently predicts long-term psychiatric and functional outcomes in anxiety-disordered youth. J. Am. Acad. Child Adolesc. Psychiatry 60, 902–912.e5 (2021).
Article PubMed Google Scholar
Rijlaarsdam, J. et al. Genome-wide DNA methylation patterns associated with general psychopathology in children. J. Psychiatr. Res. 140, 214–220 (2021).
Article PubMed PubMed Central Google Scholar
Rosseel, Y. lavaan: An R package for structural equation modeling. J. Stat. Softw. 48, 1–36 (2012).
Article Google Scholar
Achenbach, T. M. Integrative Guide for the 1991 CBCL/4-18, YSR, and TRF Profiles. (Univ Vermont/Dept Psychiatry, 1991).
Stratakis, N. et al. Association of fish consumption and mercury exposure during pregnancy with metabolic health and inflammatory biomarkers in children. JAMA Netw. Open 3, e201007 (2020).
Article PubMed PubMed Central Google Scholar
Agier, L. et al. Early-life exposome and lung function in children in Europe: an analysis of data from the longitudinal, population-based HELIX cohort. Lancet Planet. Health 3, e81–e92 (2019).
Article PubMed Google Scholar
Quanjer, P. H. et al. Multi-ethnic reference values for spirometry for the 3-95-yr age range: the global lung function 2012 equations. Eur. Respir. J. 40, 1324–1343 (2012).
Article PubMed PubMed Central Google Scholar
Robinson, O. et al. The urban exposome during pregnancy and its socioeconomic determinants. Environ. Health Perspect. 126, 077005 (2018).
Article PubMed PubMed Central Google Scholar
Lau, C.-H. E. et al. Determinants of the urinary and serum metabolome in children from six European populations. BMC Med. 16, 202 (2018).
Article CAS PubMed PubMed Central Google Scholar
Dieterle, F., Ross, A., Schlotterbeck, G. & Senn, H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal. Chem. 78, 4281–4290 (2006).
Article CAS PubMed Google Scholar
Cohen, S. Perceived stress in a probability sample of the United States. In The Social Psychology of Health. 31–67 (Sage Publications, Inc, Thousand Oaks, CA, US, 1988).
Sweet, L. H. N-Back Paradigm. In Encyclopedia of Clinical Neuropsychology (eds. Kreutzer, J. S., DeLuca, J. & Caplan, B.) 1718–1719 (Springer, New York, NY, 2011).
Maitre, L. et al. Multi-omics signatures of the human early life exposome. Nat. Commun. 13, 7024 (2022).
Article CAS PubMed PubMed Central Google Scholar
Stekhoven, D. J. & Bühlmann, P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics 28, 112–118 (2012).
Article CAS PubMed Google Scholar
Bergstra, J., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyper-parameter optimization. In Advances In Neural Information Processing Systems. 24, 2546–2554 (Curran Associates, Inc., 2011).
Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794 (Association for Computing Machinery, New York, NY, USA, 2016).
Yang, P., Hwa Yang, Y., Zhou, B. B. & Zomaya, Y. A. A review of ensemble methods in bioinformatics. Curr. Bioinforma. 5, 296–308 (2010).
Article CAS Google Scholar
Lundberg, S. M. et al. From local explanations to global understanding with explainable AI for trees. Nat. Mach. Intell 2, 56–67 (2020).
Article PubMed PubMed Central Google Scholar
Lundberg, S. M., Erion, G. G. & Lee, S.-I. Consistent individualized feature attribution for tree ensembles. arXiv http://arxiv.org/abs/1802.03888 (2019).
Grinsztajn, L., Oyallon, E. & Varoquaux, G. Why do tree-based models still outperform deep learning on tabular data? Adv. Neural. Inf. Process. Syst. 35, 507–520 (2022).
Tamayo-Uria, I. et al. The early-life exposome: description and patterns in six European countries. Environ. Int. 123, 189–200 (2019).
Article CAS PubMed Google Scholar
Lundberg, S. & Lee, S.-I. A Unified approach to interpreting model predictions. Adv. Neural. Inf. Process. Syst. 30 (2017)
Hart, S. Shapley value. in Game Theory (eds. Eatwell, J., Milgate, M. & Newman, P.) 210–216 (Palgrave Macmillan UK, London, 1989).
Balagopal, P. B. et al. Nontraditional risk factors and biomarkers for cardiovascular disease: mechanistic, research, and clinical considerations for youth. Circulation 123, 2749–2769 (2011).
Article PubMed Google Scholar
He, Y. et al. Comparisons of polyexposure, polygenic, and clinical risk scores in risk prediction of type 2 diabetes. Diabetes Care 44, 935–943 (2021).
Article PubMed PubMed Central Google Scholar
Hastie, T., Tibshirani, R. & Friedman, J. Overview of supervised learning. In The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Vol. 2 (eds. Hastie, T., Tibshirani, R. & Friedman, J.) 9–41 (Springer, New York, NY, 2009).
Hastie, T., Tibshirani, R. & Friedman, J. Model assessment and selection. In The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Vol. 2 (eds. Hastie, T., Tibshirani, R. & Friedman, J.) 219–259 (Springer, New York, NY, 2009).
Farewell, C. V., Melnick, E. & Leiferman, J. Maternal mental health and early childhood development: Exploring critical periods and unique sources of support. Infant Ment. Health J. 42, 603–615 (2021).
Article PubMed Google Scholar
Wang, F. & Veugelers, P. J. Self-esteem and cognitive development in the era of the childhood obesity epidemic. Obes. Rev. 9, 615–623 (2008).
Article CAS PubMed Google Scholar
Lim, J. et al. Negative impact of noise and noise sensitivity on mental health in childhood. Noise Health 20, 199–211 (2018).
PubMed PubMed Central Google Scholar
Esser, N., Legrand-Poels, S., Piette, J., Scheen, A. J. & Paquot, N. Inflammation as a link between obesity, metabolic syndrome and type 2 diabetes. Diabetes Res. Clin. Pract. 105, 141–150 (2014).
Article CAS PubMed Google Scholar
Wilkins, J. T. et al. Spectrum of apolipoprotein AI and apolipoprotein aII proteoforms and their associations with indices of cardiometabolic health: the CARDIA study. J. Am. Heart Assoc. 10, e019890 (2021).
Article CAS PubMed PubMed Central Google Scholar
Tsai, J.-P. The association of serum leptin levels with metabolic diseases. Tzu-Chi Med. J. 29, 192–196 (2017).
Article PubMed Central Google Scholar
Sun, S. et al. Metabolic syndrome and its components are associated with altered amino acid profile in Chinese han population. Front. Endocrinol. 12, 795044 (2022).
Article Google Scholar
Ding, Y., Wang, S. & Lu, J. Unlocking the potential: amino acids’ role in predicting and exploring therapeutic avenues for type 2 diabetes mellitus. Metabolites 13, 1017 (2023).
Article CAS PubMed PubMed Central Google Scholar
Novgorodtseva, T. P. et al. Composition of fatty acids in plasma and erythrocytes and eicosanoids level in patients with metabolic syndrome. Lipids Health Dis 10, 82 (2011).
Article CAS PubMed PubMed Central Google Scholar
Sun, Y. et al. BMI is associated with FEV1 decline in chronic obstructive pulmonary disease: a meta-analysis of clinical trials. Respir. Res. 20, 236 (2019).
Article PubMed PubMed Central Google Scholar
Köchli, S. et al. Lung function, obesity and physical fitness in young children: the EXAMIN YOUTH study. Respir. Med. 159, 105813 (2019).
Article PubMed Google Scholar
Agier, L. et al. A systematic comparison of linear regression–based statistical methods to assess exposome-health associations. Environ. Health Perspect. 124, 1848–1856 (2016).
Article PubMed PubMed Central Google Scholar
Vrijheid, M. et al. Advancing tools for human early lifecourse exposome research and translation (ATHLETE). Environ. Epidemiol 5, e166 (2021).
Google Scholar
Neufcourt, L. et al. Assessing how social exposures are integrated in exposome research: a scoping review. Environ. Health Perspect. 130, 116001 (2022).
Article PubMed PubMed Central Google Scholar
Gaye, A. et al. DataSHIELD: taking the analysis to the data, not the data to the analysis. Int. J. Epidemiol. 43, 1929–1944 (2014).
Article PubMed PubMed Central Google Scholar
Guimbaud, J.-B. ML based health ECRS in European children - figure source data. https://doi.org/10.6084/m9.figshare.25625109.
Guimbaud, J.-B. ML based ECRS for European children - python code. https://doi.org/10.5281/zenodo.10519296.

Download references

Acknowledgements

The authors would like to thank all the participating children, parents, practitioners, and researchers in the six countries who took part in this study. The Norwegian Mother, Father and Child Cohort Study is supported by the Norwegian Ministry of Health and Care Services and the Ministry of Education and Research. We also acknowledge the support of the Spanish Ministry of Science and Innovation to the EMBL partnership, the Centro de Excelencia Severo Ochoa, and the CERCA Programme / Generalitat de Catalunya. The CRG/UPF Proteomics Unit is part of the Spanish Infrastructure for Omics Technologies (ICTS OmicsTech) and it is supported by “Secretaria d’Universitats i Recerca del Departament d’Economia i Coneixement de la Generalitat de Catalunya” (2021SGR01225 and 2021SGR01563). This project was funded by the H2020-EU.3.1.2.—Preventing Disease Programme under grant agreement no 874583 (ATHLETE project). JB Guimbaud was supported by a CIFRE PhD fellowship (#2020/1297). Léa Maitre is funded by a Juan de la Cierva-Incorporación fellowship (IJC2018-035394-I) awarded by the Spanish Ministerio de Economía, Industria y Competitividad. ISGlobal and the Exposome hub acknowledges support from the Spanish Ministry of Science and Innovation through the “Centro de Excelencia Severo Ochoa 2019–2023” Program (CEX2018-000806-S), and support from the Generalitat de Catalunya through the CERCA Program. Jose Urquiza is supported by Catalan program PERIS (Ref.: SLT017/20/000119), granted by Departament de Salut de la Generalitat de Catalunya (Spain). Oliver Robinson was funded by the UK Research and Innovation Future Leaders Fellowship (MR/S03532X/1). Silvia Alemany holds a Miquel Servet-I contract (CP22/00026) awarded by the Instituto de Salud Carlos III co-funded by the European Union Found: Fondo Social Europeo Plus, FSE + . Mónica López-Vicente is funded by a Juan de la Cierva-Incorporación fellowship (project IJC2020-045355-I, funded by MCIN/AEI/10.13039/501100011033 and for the European Union NextGenerationEU/PRTR). The data of the cohorts (BiB, EDEN, INMA, KANC, MoBa and RHEA) provided to this research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007‐2013) under grant agreement no 308333—the HELIX project. Fig. 2 was created with BioRender.com.

Author information

Authors and Affiliations

ISGlobal, Barcelona, Spain
Jean-Baptiste Guimbaud, Jordi Julvez, Jose Urquiza, Maribel Casas, Mariona Bustamante, Mark Nieuwenhuijsen, Martine Vrijheid, Mónica López-Vicente, Montserrat de Castro Pascual, Xavier Basagaña & Léa Maitre
Univ Lyon, UCBL, CNRS, INSA Lyon, LIRIS, UMR5205, F-69622, Villeurbanne, France
Jean-Baptiste Guimbaud & Rémy Cazabet
Universitat Pompeu Fabra (UPF), Barcelona, Spain
Jean-Baptiste Guimbaud, Eduard Sabidó, Eva Borràs, Jose Urquiza, Maribel Casas, Mariona Bustamante, Martine Vrijheid, Mónica López-Vicente, Montserrat de Castro Pascual, Xavier Basagaña & Léa Maitre
Meersens, Lyon, France
Jean-Baptiste Guimbaud
Imperial College London, Cancer Metabolism & Systems Toxicology Group, Division of Cancer, Department of Surgery & Cancer, London, UK
Alexandros P. Siskos & Hector Keun
Norwegian Institute of Public Health, Oslo, Norway
Amrit Kaur Sakhi & Kristine Bjerve Gützkow
Université Paris Cité, Inserm, INRAE, Centre for Research in Epidemiology and StatisticS (CRESS), Paris, France
Barbara Heude
Centre de Regulació Genòmica, Barcelona Institute of Science and Technology (BIST), Barcelona, Spain
Eduard Sabidó & Eva Borràs
Bradford Institute for Health Research, Bradford, UK
John Wright
Bradford Teaching Hospitals NHS Foundation Trust, Bradford, UK
John Wright
CIBER Epidemiología Y Salud Pública (CIBERESP), Madrid, Spain
Jordi Julvez, Jose Urquiza, Maribel Casas, Mariona Bustamante, Martine Vrijheid, Mónica López-Vicente, Montserrat de Castro Pascual, Xavier Basagaña & Léa Maitre
Institut d’Investigació Sanitària Pere Virgili, Hospital Universitari Sant Joan de Reus, Reus, Spain
Jordi Julvez
Department of Preventive Medicine, University of Southern Los Angeles, Los Angeles, CA, USA
Leda Chatzi & Nikos Stratakis
Μedical Research Council Centre for Environment and Health, School of Public Health, Imperial College London, London, UK
Oliver Robinson
Mohn Centre for Children’s Health and Well-being, School of Public Health, Imperial College London, London, UK
Oliver Robinson
Department of Environmental Sciences, Vytautas Magnus University, Kaunas, Lithuania
Regina Grazuleviciene
Team of Environmental Epidemiology, IAB, Institute for Advanced Biosciences, Inserm, CNRS, CHU-Grenoble-Alpes, University Grenoble-Alpes, Grenoble, France
Remy Slama
Psychiatric Genetics Unit, Group of Psychiatry Mental Health and Addiction, Vall d’Hebron Research Institute (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
Silvia Alemany
Department of Mental Health, Hospital Universitari Vall d’Hebron, Barcelona, Spain
Silvia Alemany
Biomedical Network Research Centre on Mental Health (CIBERSAM), Instituto de Salud Carlos III, Madrid, Spain
Silvia Alemany
EPITA Research Laboratory (LRE), Kremlin-Bicêtre, France
Marc Plantevit

Authors

Jean-Baptiste Guimbaud
View author publications
You can also search for this author in PubMed Google Scholar
Alexandros P. Siskos
View author publications
You can also search for this author in PubMed Google Scholar
Amrit Kaur Sakhi
View author publications
You can also search for this author in PubMed Google Scholar
Barbara Heude
View author publications
You can also search for this author in PubMed Google Scholar
Eduard Sabidó
View author publications
You can also search for this author in PubMed Google Scholar
Eva Borràs
View author publications
You can also search for this author in PubMed Google Scholar
Hector Keun
View author publications
You can also search for this author in PubMed Google Scholar
John Wright
View author publications
You can also search for this author in PubMed Google Scholar
Jordi Julvez
View author publications
You can also search for this author in PubMed Google Scholar
Jose Urquiza
View author publications
You can also search for this author in PubMed Google Scholar
Kristine Bjerve Gützkow
View author publications
You can also search for this author in PubMed Google Scholar
Leda Chatzi
View author publications
You can also search for this author in PubMed Google Scholar
Maribel Casas
View author publications
You can also search for this author in PubMed Google Scholar
Mariona Bustamante
View author publications
You can also search for this author in PubMed Google Scholar
Mark Nieuwenhuijsen
View author publications
You can also search for this author in PubMed Google Scholar
Martine Vrijheid
View author publications
You can also search for this author in PubMed Google Scholar
Mónica López-Vicente
View author publications
You can also search for this author in PubMed Google Scholar
Montserrat de Castro Pascual
View author publications
You can also search for this author in PubMed Google Scholar
Nikos Stratakis
View author publications
You can also search for this author in PubMed Google Scholar
Oliver Robinson
View author publications
You can also search for this author in PubMed Google Scholar
Regina Grazuleviciene
View author publications
You can also search for this author in PubMed Google Scholar
Remy Slama
View author publications
You can also search for this author in PubMed Google Scholar
Silvia Alemany
View author publications
You can also search for this author in PubMed Google Scholar
Xavier Basagaña
View author publications
You can also search for this author in PubMed Google Scholar
Marc Plantevit
View author publications
You can also search for this author in PubMed Google Scholar
Rémy Cazabet
View author publications
You can also search for this author in PubMed Google Scholar
Léa Maitre
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

J.B. Guimbaud: Conceptualization, Methodology, Software, Formal analysis, Data Curation, Writing—original draft, Visualization; A.P. Siskos: Methodology, Resources, Writing—review and editing; A.K. Sakhi: Investigation, Writing—review and rditing; B. Heude: Data Curation; E. Sabidó: Methodology, Writing—review and editing; E. Borràs: Methodology, Formal Analysis, Writing—review and rditing; H. Keun: Writing—Review and Editing; J. Wright: Investigation, Writing—review and editing, Funding Acquisition, Project Administration; J. Julvez: Writing—review and editing; J. Urquiza: Writing—review and editing, Data Curation; K.B. Gützkow: Investigation, Writing—review and editing; L. Chatzi: Writing—Review and Editing; M. Casas: Writing—review and rditing, Data Curation; M. Bustamante: Methodology, Writing—review and editing; M. Nieuwenhuijsen: Writing—Review and Editing; M. Vrijheid: Writing—Review and Editing, Project Administration; M. López-Vicente: Methodology, Data Curation, Writing—Review and Editing; M.C. Pascual: Methodology, Software, Resources, Data curation, Writing—review and editing; N. Stratakis: Writing—review and editing; O. Robinson: Investigation, Writing—Review and Editing; R. Grazuleviciene: Writing—review and editing; R. Slama: Writing—review and editing; S. Alemany: Conceptualization, Methodology, Formal Analysis, Writing—review and editing; X. Basagaña: Writing—review and editing; M. Plantevit: Conceptualization, Methodology, Writing—review and editing; R. Cazabet: Conceptualization, Methodology, Writing—review and editing; L. Maitre: Conceptualization, Methodology, Writing—review and editing, Supervision.

Corresponding author

Correspondence to Léa Maitre.

Ethics declarations

Competing interests

The authors declare the following competing interests: JB Guimbaud has received PhD fellowship from an industry partnership CIFRE PhD fellowship (#2020/1297, CIFRE are fully financed by the French Ministry of Higher Education, Research and Innovation) and contracted by the private company Meersens. Meersens did not participate in any way in this study. All other authors declare no competing interests.

Peer review

Peer review information

Communications Medicine thanks the anonymous reviewers for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer Review File

Supplementary Information

Description of Additional Supplementary Files

Supplementary Data 1

Supplementary Data 2

Supplementary Data 3

Reporting Summary

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Guimbaud, JB., Siskos, A.P., Sakhi, A.K. et al. Machine learning-based health environmental-clinical risk scores in European children. Commun Med 4, 98 (2024). https://doi.org/10.1038/s43856-024-00513-y

Download citation

Received: 31 March 2023
Accepted: 26 April 2024
Published: 23 May 2024
DOI: https://doi.org/10.1038/s43856-024-00513-y
Springer Nature Limited

Associated content

Child and Adolescent Health

Collection 10 August 2022

Machine learning-based health environmental-clinical risk scores in European children

Abstract

Background

Methods

Results

Conclusions

Plain language summary

Similar content being viewed by others

Introduction

Methods

Study participants

Data

Health outcomes

Mental health

Cardiometabolic health

Respiratory health

Environmental data

Outdoor and indoor exposures

Lifestyle

Biomonitored chemical pollutants

Metabolites and proteins

Parental and child clinical factors

Covariates

Statistical analysis

Data preparation

Modeling

Models’ explanations

Sensitivity analysis

Ethics approval

Reporting summary

Results

Population characteristics

Predictive performances: environmental-clinical features captured 13%, 50%, and 4% of the variance in mental, cardiovascular, and respiratory health

Global feature importance: ECRS captures the influence on health of both external and internal exposome factors

Local feature importance: Local explanations reveal complex nonlinear relationships between exposures and health outcomes

Pairwise input interaction effects were extracted from captured relationships

Overall ECRS performances generalize well on different populations with discrepancies across cohorts

Discussion

Conclusion

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

Search

Navigation