An interactive atlas of genomic, proteomic, and metabolomic biomarkers promotes the potential of proteins to predict complex diseases

Smelik, Martin; Zhao, Yelin; Li, Xinxiu; Loscalzo, Joseph; Sysoev, Oleg; Mahmud, Firoj; Mansour Aly, Dina; Benson, Mikael

doi:10.1038/s41598-024-63399-9

Article
Open access
Published: 03 June 2024

Volume 14, article number 12710, (2024)
Scientific Reports

Martin Smelik¹^na1,
Yelin Zhao¹^na1,
Xinxiu Li¹,
Joseph Loscalzo²,
Oleg Sysoev³,
Firoj Mahmud¹^na1,
Dina Mansour Aly¹^na1 &
Mikael Benson¹^na1

Abstract

Multiomics analyses have identified multiple potential biomarkers of the incidence and prevalence of complex diseases. However, it is not known which type of biomarker is optimal for clinical purposes. Here, we make a systematic comparison of 90 million genetic variants, 1453 proteins, and 325 metabolites from 500,000 individuals with complex diseases from the UK Biobank. A machine learning pipeline consisting of data cleaning, data imputation, feature selection, and model training using cross-validation and comparison of the results on holdout test sets showed that proteins were most predictive, followed by metabolites, and genetic variants. Only five proteins per disease resulted in median (min–max) areas under the receiver operating characteristic curves for incidence of 0.79 (0.65–0.86) and 0.84 (0.70–0.91) for prevalence. In summary, our work suggests the potential of predicting complex diseases based on a limited number of proteins. We provide an interactive atlas (macd.shinyapps.io/ShinyApp/) to find genomic, proteomic, or metabolomic biomarkers for different complex diseases.

Introduction

The shifting landscape of global healthcare towards complex diseases affecting the immune, metabolic, respiratory, and vascular systems has underscored the need for accurate biomarkers for early prediction or diagnosis. Such biomarkers are often prioritised based on the scientific literature, clinical experience, or analyses of different omics data. This prioritisation may be confounded by knowledge biases or, in the case of omics data, limited sample numbers. Other problems obstructing clinical implementation include the cost and complexity of analysing potential biomarkers in clinical settings. Moreover, the diagnostic use of any biomarker must take into account context-dependent demands on specificity and sensitivity. For example, biomarkers for a serious disease may require higher sensitivity at the cost of lower specificity compared to biomarkers for a less serious disease. As a result, recent meta-analyses of biomarker studies have identified promising candidates but few examples that have successfully reached the clinic on a wider scale^1,2.

Promising examples include longitudinal studies of proteomes and metabolomes in blood samples. These have shown associations with disease-associated traits³, as well as various immunological, cardiovascular or metabolic diseases^4,5,6. UK Biobank (UKBB) is a prospective study of some 500,000 individuals, which makes extensive phenotypic and multiomics data available to researchers across the globe. Since it has longitudinal data, it is possible to identify biomarkers for both patients who are already diagnosed and patients who will be diagnosed in the future (henceforth referred to as prevalent and incident disease, respectively)⁷. The UKBB proteomics data have already been used to construct proteomics-based scores to predict incident cases and mortality of common diseases^8,9,10,11. While these studies found promising associations of some proteins or groups of proteins with the future outcome of the individuals, several questions remain unanswered. Namely, (1) How do predictions using other omics layers, such as genomics or metabolomics, perform compared to proteomics-based prediction models? (2) How does prediction focused on incident cases relate to prediction based on prevalent cases? Are the same molecules good markers for earlier and later stages of the diseases? (3) Could a single molecule or a limited number of molecules suffice to predict complex diseases that, in contrast to monogenic diseases, are caused by multiple interacting molecules^6,12?

This problem is illustrated by a recent study, which identified multiple disease-associated metabolites, each of which could vary greatly among both healthy and sick individuals¹³.

To systematically answer these questions, we performed a comparison of genomic, proteomic, and metabolomic data from the UKBB. We used machine learning to build predictive models for different combinations of genetic variants, proteins, and metabolites, and these models were utilized to search for potential biomarkers for the incidence and prevalence of nine complex diseases (Fig. 1).

Results

Patient cohort

Our analyses were based on 92,916 patients with rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), ulcerative colitis (UC), Crohn's disease (CD), psoriasis (PSO), type 2 diabetes (T2D), obesity, atherosclerotic vascular disease (ASVD), and chronic obstructive pulmonary disease (COPD), as well as their age- and sex-matched controls. These diseases were selected because they had enough samples to perform statistically robust predictions. For each disease, we divided patients into those who were diagnosed after the assessment visit and those who were already diagnosed (incident and prevalent disease, respectively). The patient characteristics are presented in Supplementary Table 1.

Interactive web-based atlas enables the search for biomarkers

The UKBB data that we analysed consist of genotype data for 90 million genetic variants, 1453 proteins, and 325 metabolite measurements. To find the optimal type and number of potential biomarkers for incidence and prevalence, we used a machine learning pipeline, which consisted of data cleaning, data imputation, feature selection, and model training, with comparisons of the results on holdout test sets. In all models, the data were divided into training and testing datasets. The classification model was trained using tenfold cross-validation. The results are presented in an interactive web-based atlas (macd.shinyapps.io/ShinyApp/), which generates receiver operating characteristic (ROC) curves for the incidence or prevalence of each disease based on user-selected numbers of either proteins or metabolites. In the case of genomics, the ROC curves are generated from published polygenic risk scores (PRSs) for the studied diseases, which were obtained from the Polygenic Score Catalog¹⁴.

Proteomic biomarkers outperform biomarkers from other omics

To evaluate the predictive performance of the different molecular types for incidence and prevalence of each disease, we started by computing test ROC curves based on only five proteins or metabolites, while for genomics, we used the individual scaled PRSs. We found that there was a significant difference between incident or prevalent cases and healthy controls for all diseases in proteins and metabolites, while the differences in PRSs were sometimes not significant (Fig. 2A). The boxplots for proteins and metabolites also indicate that healthy controls can be separated well from incident and prevalent cases. To further evaluate the models, we computed the areas under the ROC curves (AUCs). We aimed for AUCs of 0.8 or greater, which may be of clinical significance¹⁵. Overall, proteins yielded the highest AUCs. The AUCs reached 0.8 or more for all diseases except CD and UC. The median (min–max) AUC for incidence was 0.79 (0.65–0.86), and for prevalence 0.84 (0.70–0.91). Metabolites yielded median (min–max) AUCs for incidence and prevalence of 0.70 (0.62–0.80) and 0.86 (0.65–0.90), respectively. T2D, obesity and ASVD had the highest AUCs. Genetic variants resulted in median (min–max) AUCs for incidence and prevalence of 0.57 (0.53–0.67) and 0.60 (0.49–0.70), respectively. The most clinically significant AUCs were found in CD, PSO and T2D. These findings suggested that as few as five proteins may suffice for both predicting incident and diagnosing prevalent disease. However, the optimal number could be context dependent. For example, a serious disease may motivate a larger number of biomarkers than a less serious one. To address this question, we used the atlas to compute AUCs for different numbers of proteins and metabolites (Fig. 2B–D and Supplementary Table 2). For most of the diseases, five or fewer proteins sufficed to achieve AUCs of 0.8 or more. For example, in ASVD only three proteins resulted in an AUC of 0.88 for prevalence, namely, matrix metalloproteinase 12 (MMP12), TNF Receptor Superfamily Member 10b (TNFRSF10B), and Hepatitis A Virus Cellular Receptor 1 (HAVCR1), consistent with extant knowledge on the role of inflammation and matrix degradation in atherogenesis. However, for incidence, 18 proteins were needed to achieve an AUC of 0.8 (Fig. 2B).
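To make the AUC values reported here concrete, the following minimal sketch computes an AUC directly from its probabilistic definition: the chance that a randomly chosen case receives a higher predicted score than a randomly chosen control, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are assumptions for illustration only and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random case > score of a random control).

    Ties between a case and a control contribute 0.5.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical predicted probabilities for three cases (1) and three controls (0).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 case-control pairs are ranked correctly
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the 0.8 target above is a statement about how often cases are ranked above controls rather than about any single decision threshold.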

Gene ontology analysis of potential protein biomarkers showed significant enrichment of a wide variety of pathways

To examine whether the 30 most discriminating proteins for each disease were functionally connected, we performed a Gene Ontology analysis (Fig. 3). Indeed, this showed significant enrichment of the proteins in a wide variety of pathways. In agreement with the general importance of the immune system¹⁶, the term “inflammatory response” was enriched in all the diseases. However, the pathophysiological diversity of the diseases was also reflected by enrichment of pathways regulating highly diverse immunological, structural, proliferative, and metabolic functions.

Discussion

Our comparison of genomic, proteomic, and metabolomic data provides a systematic solution for the prioritisation of the type and number of potential biomarkers for the incidence and prevalence of nine common complex diseases. The clinical relevance lies in the fact that prioritisation of biomarkers is complicated by each disease involving thousands of genes and gene products that can vary within the same patient before and after diagnosis, as well as between patients with the same diagnosis. Recent organome-, cellulome-, and genome-wide studies show that the same complex diseases can involve variable cellular and molecular changes across multiple organs, and not only in the organ that shows symptoms or signs of disease^17,18. Moreover, those changes can vary greatly between patients with the same diagnosis¹⁹. Thus, biomarker prioritisation based on literature, clinical experience, or omics data involves formidable challenges. The main finding of our study is that a limited number of proteins have potential for both prediction and diagnosis, representing substantial dimensionality reduction of the ever-expanding pool of big data acquired from patients with these complex disorders.

From a clinical perspective, an advantage of proteins is that they can be measured with routine clinical methods. For any biomarker and disease, the optimal number of proteins is a trade-off between cost, sensitivity and specificity^15,20. For example, the prediction or diagnosis of a serious disease may motivate a larger number of proteins than a less serious disease. We make all the molecule combinations and their AUCs available to facilitate systematic and context-dependent prioritisation of biomarkers for clinical studies.
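The sensitivity and specificity trade-off mentioned above can be illustrated with a short sketch that evaluates one set of predicted scores at two decision thresholds. The function name, the scores, and the thresholds are hypothetical and serve only to show how lowering the threshold raises sensitivity at the cost of specificity.

```python
def sensitivity_specificity(labels, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical model outputs for four cases (1) and four controls (0).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

balanced = sensitivity_specificity(labels, scores, threshold=0.5)
sensitive = sensitivity_specificity(labels, scores, threshold=0.25)
print(balanced, sensitive)
```

For a serious disease, a lower threshold such as the second one might be preferred because it misses fewer cases, whereas a less serious disease may tolerate the first.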

It may be difficult to discern whether different omics layers have causal effects in relation to the disease mechanisms or rather reflect the consequences of those mechanisms. For example, we have previously found that different subtypes of diabetes have variable genetic and environmental associations²¹. This heterogeneity is particularly evident in type 2 diabetes, which has a strong genetic component. However, environmental factors, such as diet, also play a large role. In agreement with this, our analyses of type 2 diabetes showed that metabolites and proteins had higher AUCs than genetic variants. While the genetic variants likely have causal roles, it is difficult to determine whether the metabolites or proteins change because of environmental factors or secondary to intrinsic disease mechanisms. Further studies are warranted to offer a better understanding of the role of biomarkers in disease progression. Because of its longitudinal design, the UKBB provides a unique opportunity to identify incidence biomarkers that are potentially associated with early disease mechanisms. Proteins in blood may be particularly suitable because they reflect changes in tissues and mediate a wide range of disease-relevant functions, such as interactions between cells, immune responses, vascular functions, and tissue remodelling^17,22.

Indeed, our pathway analyses of the 30 most discriminating potential protein biomarkers showed a wide variety of pathways. Some of these were shared between diseases. For example, the general pathway term “inflammatory response” was shared between all the diseases. By contrast, more specific immunological pathways were associated with immune-mediated inflammatory diseases (UC, RA, ASVD and SLE). Indeed, in agreement with previous studies, more specific pathways, such as type 2 interferons, were enriched only in SLE²³. Thus, the incidence biomarkers may help to discover, or prioritise among previously known, early disease mechanisms and thereby identify targets for preventive treatment.

As one example, ASVD is an important cause of morbidity and mortality worldwide and is associated with myocardial infarction, stroke, vascular dementia, and peripheral arterial occlusive disease. Early prediction of subclinical disease and prevention or treatment are, therefore, key health care objectives with the potential to greatly reduce patient suffering²⁴. The incidence biomarker proteins identified by our analysis are mechanistically rational (although they need not be a priori) and are responsible for inflammatory responses in the early plaque (CXCL17, PLAUR), stress responses after (hypoxic or inflammatory) injury (GDF15), innate immunity (WFDC2), and angiogenesis (PLAUR, WFDC2). Similarly, as discussed above, the prevalent biomarker proteins include proteins critical for matrix remodelling (MMP12, KLK4), cytokine-mediated inflammatory responses (TNFRSF10B), and Hif-1alpha-dependent angiogenesis (ADM).

Limitations of our study include the fact that proteins and metabolites in blood may not reflect disease-associated changes in tissues or may vary for reasons other than disease. Moreover, a limited number of proteins and metabolites were analysed, with technologies that could have method-dependent variations. Variable numbers of patients and controls were used for different omics layers, and thus, the statistical power of proteomics is lower than that of the other layers. However, to reduce the risk of spurious findings or overfitting, our analytical strategy was based on dividing each diagnostic group into a training and a test set. In support of the pathophysiological relevance of the analysed proteins, Gene Ontology analyses showed that they were significantly enriched for a wide variety of pathways. This supported that they were functionally related rather than random findings. Another limitation is that the UKBB mainly consists of European participants with restricted age ranges. Additionally, the UKBB participants have been shown to be potentially biased due to selective participation²⁵. Given all these limitations, the prediction potential of each biomarker combination should be evaluated by independent studies. However, the focus of our study, comparing different omics layers, should be less affected by the abovementioned limitations. These limitations also point to how our atlas can be exploited to facilitate future studies of (1) more diverse populations, starting with targeted analyses of prioritised biomarkers rather than with more costly omics technologies; (2) how to combine and integrate prioritised biomarkers with routine laboratory measurements as well as clinical symptoms and signs (as mentioned above, proteins may be particularly suitable to integrate in clinical laboratory analyses because they can be measured with routine methods, and physicians are trained to make diagnostic and therapeutic decisions by combining biomarkers with routine clinical variables); (3) data integration methods to construct classifiers with the capacity to incorporate different omics layers; and (4) incidence biomarkers to predict and potentially prevent complex diseases. The importance of the latter lies in the fact that many complex diseases have vague or no symptoms at early stages but are easier to treat at that point than at later stages. A well-known example is how biomarkers for early diagnosis have greatly improved the treatment of RA. However, finding targets for early treatment is complicated by systematic studies of early mechanisms being difficult to perform in human subjects before diagnosis. Thus, transformative changes of clinical practice based on biomarkers will require concerted multi-disciplinary efforts across multiple diseases, including analyses of larger numbers of different types of molecules.

In summary, we found that a limited number of proteins in blood may suffice for the early prediction and diagnosis of complex diseases. We make those proteins, as well as metabolites and genetic variants associated with the PRS of the diseases, available in the form of an interactive atlas for future studies to evaluate their potential.

Methods

Data source and participants

Participants of this study were a part of the UKBB dataset, a large prospective cohort study consisting of more than 500,000 participants recruited in the United Kingdom⁷. Full details of the UKBB study can be found on the UKBB website (https://biobank.ndph.ox.ac.uk/showcase/). UKBB received ethical approval from the National Information Governance Board for Health and Social Care and the National Health Service Northwest Multi-Center Research Ethics Committee⁷. All participants gave informed consent through electronic signatures before enrolment in the study. This research has been conducted under approved UKB Project ID 102162. The follow up of the individuals was until the 31st of October 2022.

The specific data fields used in this analysis were nuclear magnetic resonance (NMR) metabolomics, proteomics, imputed genomic data, date of recruitment, age, sex, date of diagnosis, and Diagnostic Codes-ICD10.

All methods were performed in accordance with the relevant guidelines and regulations or the Declaration of Helsinki.

Data processing

NMR spectroscopy measurements took place between June 2019 and April 2020 (Phase 1) and between April 2020 and June 2022 (Phase 2), using eight spectrometers at Nightingale Health based in Finland. The metabolic biomarkers are involved in multiple metabolic pathways, including lipoprotein lipids in 14 subclasses, fatty acids, and fatty acid composition, as well as various low-molecular-weight metabolites, such as amino acids, ketone bodies, and glycolysis metabolites quantified in molar concentration units. The dataset comprised 249 NMR metabolite measurements along with their associated quality control (QC) matrices. Of these measurements, 168 were absolute, and 81 were ratios. Data preprocessing, technical variation removal, and computation of an additional 76 biomarkers from the post-QC dataset were conducted using the ukbnmr package (version 2.0)²⁶. Consequently, 325 metabolite measurements (249 + 76) were utilised in subsequent analyses. For individuals with both repeat assessments (2012–2013) and baseline assessments (2006–2010), only the baseline data were retained. Repeated measures were identified based on the Eid and visit_index columns from the UKBB dataset. Technical variations were removed using the updated Algorithm version 2 in the ukbnmr V2 package, where well positions within each batch were separately considered and adjusted. Further details on this approach can be found at the ukbnmr GitHub repository.

Proteomic profiling was performed on blood plasma samples collected during participant visits between 2006 and 2010 (UKBB dataset field: 53) using the Olink Explore 1536 platform, measuring 1472 protein analytes and capturing 1463 unique proteins. The criteria for participant inclusion in the UK Biobank Pharma Proteomics Project (UKB-PPP) and the specifics of the proteomics assays and normalisation processes are detailed in an earlier study⁸.

Genetic data were downloaded from UKBB. The genotyping, imputation, and quality control were performed by the UKBB²⁷. Genome-wide data were obtained from the UK Biobank v3 imputed data in BGEN v1.2 format.

The NMR measurement, proteomic profiling, and genomic data were processed using code 3, 143, and code 87, which enabled the decoding of the data.

Patient stratification

The 10th revision of the International Classification of Diseases (ICD10) codes was used to assess the diagnosis of the patients. We identified patient groups based on ICD10 codes in the hospital inpatient data (UKBB dataset field: 41270), which we used as curated and provided by UKBB. In the analysis, we included Crohn’s disease (CD) (K50), ulcerative colitis (UC) (K51), psoriasis (PSO) (L40), systemic lupus erythematosus (SLE) (M32), chronic obstructive pulmonary disease (COPD) (J449), obesity (E66), type 2 diabetes (T2D) (E11), atherosclerotic vascular disease (ASVD) (I70), and rheumatoid arthritis (RA) (M05, M06). In the case of genomics, we only selected patients of European origin. We distinguished incident and prevalent cases based on the earliest reported date across the respective date-of-first-inpatient-diagnosis columns (fields 41262 and 41280) from the UKBB. Thus, individuals receiving a diagnosis after the time of sampling were labelled incident cases, and patients diagnosed before or at the time of sampling were classified as prevalent cases. For each group of patients, we identified a group of healthy controls, which were defined as all participants without any disease code. We used the MatchIt package²⁸ to match the healthy controls with the prevalent and incident cases based on age and sex. The match was performed by the nearest neighbour method using Euclidean distance as a measure of similarity. We used exact matching for sex.
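The stratification and matching steps described above can be sketched as follows. This is a simplified stand-in written in plain Python, not the MatchIt implementation: it assumes greedy 1:1 matching without replacement in input order, and, because sex is matched exactly, the Euclidean distance reduces to the absolute age difference. The function names, record layout, and toy participants are hypothetical.

```python
from datetime import date


def label_case(diagnosis_date, sampling_date):
    """Incident if diagnosed after sampling; prevalent if diagnosed before or at sampling."""
    return "incident" if diagnosis_date > sampling_date else "prevalent"


def match_controls(cases, controls):
    """Greedy 1:1 nearest-neighbour matching on age, exact on sex, without replacement."""
    available = list(controls)
    pairs = []
    for case in cases:
        candidates = [c for c in available if c["sex"] == case["sex"]]
        if not candidates:
            continue  # no control of the same sex is left for this case
        best = min(candidates, key=lambda c: abs(c["age"] - case["age"]))
        available.remove(best)
        pairs.append((case["id"], best["id"]))
    return pairs


# Hypothetical participants.
cases = [
    {"id": "c1", "sex": "F", "age": 60},
    {"id": "c2", "sex": "M", "age": 45},
]
controls = [
    {"id": "h1", "sex": "F", "age": 58},
    {"id": "h2", "sex": "F", "age": 61},
    {"id": "h3", "sex": "M", "age": 50},
    {"id": "h4", "sex": "M", "age": 44},
]

pairs = match_controls(cases, controls)
status_later = label_case(date(2015, 3, 1), date(2008, 6, 1))
status_same_day = label_case(date(2008, 6, 1), date(2008, 6, 1))
print(pairs, status_later, status_same_day)
```

Matching without replacement keeps each control in at most one pair, so the order in which cases are processed can change the result; a production analysis would rely on the matching package's own ordering and diagnostics.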

Weighted PRS analyses

The known PRS from the Polygenic Score Catalog for each disease was used to determine the genetic contribution to the probability of developing the disease¹⁴. The known PRS and genetic variants for each disease can be found in Supplementary Table 1. The immune-mediated inflammatory diseases (IMIDs) were RA (PGS000194), SLE (PGS000328), UC (PGS001306), CD (PGS001331), and PSO (PGS002293). Chronic diseases included COPD (PGS001332), T2D (PGS000864), obesity (PGS000848), and ASVD (PGS000863). Imputed genetic data (93,095,623 genetic variants) from UKBB were used to calculate the corresponding PRSs (1595 genetic variants in PRSs) for every participant, separately for each disease. Genetic variants for each PRS were pruned for linkage disequilibrium (r² = 0.5, 250-kb window in PLINK). The weighted PRSs were calculated for all participants included in the analysis of each disease using the standard formula in PLINK v1.9 software²⁹. All analyses of the PRSs included participants of European origin only and were adjusted for the first three genetic principal components supplied by the UKBB quality control files²¹. The principal components were generated by UKBB²⁷. The list of SNPs for each disease is available in Supplementary Table 3.
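As a rough illustration of what a weighted PRS is, the sketch below sums effect-allele dosages multiplied by per-variant weights and then standardises the scores across participants. It is written in plain Python rather than PLINK, uses the sum form of the score (PLINK can report either a sum or a per-allele average), and all variant identifiers, weights, and dosages are hypothetical.

```python
from statistics import mean, pstdev


def weighted_prs(dosages, weights):
    """Sum of weight x effect-allele dosage over scored variants; missing dosages are skipped."""
    return sum(
        weights[variant] * dosage
        for variant, dosage in dosages.items()
        if variant in weights and dosage is not None
    )


def scale(values):
    """Standardise scores to mean 0 and standard deviation 1 across participants."""
    centre, spread = mean(values), pstdev(values)
    if spread == 0:
        raise ValueError("all scores are identical; cannot scale")
    return [(v - centre) / spread for v in values]


# Hypothetical per-variant effect weights, as would be listed in a score file.
weights = {"rsA": 0.2, "rsB": -0.1, "rsC": 0.5}

# Hypothetical effect-allele dosages (0 to 2) for three participants.
participants = {
    "p1": {"rsA": 2, "rsB": 1, "rsC": 0},
    "p2": {"rsA": 1, "rsB": 0, "rsC": 2},
    "p3": {"rsA": 0, "rsB": 2, "rsC": 0},
}

raw_scores = [weighted_prs(d, weights) for d in participants.values()]
scaled_scores = scale(raw_scores)
print(raw_scores, scaled_scores)
```

Scaling puts the PRS of different diseases on a comparable footing before it enters a downstream model together with age, sex, and principal components.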

Modelling the probability of health status

To construct the classifiers, we created a pipeline consisting of data cleaning, data imputation, feature selection, training of the classifier and summary of results. The pipeline was designed around the following key ideas: (1) Interpretability: we aimed to construct a pipeline that would be understandable by interdisciplinary researchers as well as clinicians. (2) Scalability: we aimed to use methods that would be applicable to diseases with relatively few patients (such as ASVD) as well as diseases with relatively many patients (such as T2D). (3) Robustness: we aimed to develop a pipeline that would be robust in terms of replicability of the results. In detail, we first removed all molecules with more than 10% NA values and divided the data into training (70%) and testing (30%) groups. We then trained a KNNImputer³⁰ method to impute NA values. The choice of the KNN method was based on its computational efficiency and simplicity, and on a recent study³¹ suggesting that KNN has comparable performance to other, more complex methods for continuous data (in our case proteomics and metabolomics) with a low proportion of missing values. Next, we applied the extremely randomised trees (ERT)³² method for feature selection. We used 10,000 trees, and the main motivation for using ERT was the reduced impact of multicollinearity, which is handled by using multiple trees and utilising random splits. The feature selection ranked molecules from the most important to the least important. For the downstream analysis, we chose only the N most important features, where N (1–30) was selected by the user. To train the prediction model, we used a logistic regression with a ridge (L2) penalty³³, with the N most important molecules as features and binary disease status as the response variable. 
Even though it may slightly reduce the performance of the model, the use of logistic regression with the L2 penalty over more complex nonlinear methods is motivated by the interpretability of the method, which we consider essential in clinical research. Furthermore, logistic regression is a tool with which many clinicians are familiar, and it allows us to assess whether a feature has a positive or negative association with the disease. The disease status was dependent on the user input. In the case of ‘incident’, the aim of the model was to discriminate the incident cases from healthy controls matched by age and sex. Analogously, when setting ‘prevalent’, the aim was to discriminate prevalent cases from their respective healthy controls. We used cross-validation as implemented in the LogisticRegressionCV³⁴ method from the sklearn package to optimise the L2 penalty factor. All the above was performed on the training dataset only. To measure the performance of the whole pipeline, we applied the trained imputation model, reduced the feature set based on the trained ERT model, applied the trained logistic model to the test datasets, and presented the receiver operating characteristic (ROC) curve together with its corresponding area under the ROC curve (AUC). The same pipeline was used for proteomics and metabolomics. For PRSs, we used the same pipeline but omitted the feature selection step. Instead, we used sex, age, scaled PRS and three genetic principal components (these are generated from the QC standard pipeline of the genotyping data³⁵ and are supplied by the UKBB for all individuals)²¹ as features and patient group as the response variable. We chose these features because they are commonly used in the scientific literature. 
We used the ggpubr³⁶ package to perform a two-sided t test with a false discovery rate adjustment to determine whether the probability of developing disease was significantly different between healthy controls and incident or prevalent cases. More specifically, we tested whether the mean of the test predictions from the logistic regression for incident or prevalent cases differed significantly from the mean of the test predictions from the logistic regression for healthy controls. The analysis was performed using R (v4.1.1) and Python (v3.7.9), and we used default parameters and random seed 42 for all analyses unless otherwise stated.
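The pipeline described in this section can be sketched with scikit-learn as below. This is an illustrative reconstruction under stated assumptions, not the authors' code (which is in their repository): the function name `run_pipeline` is hypothetical, the data are synthetic, the number of trees is reduced from 10,000 to 200 so the example runs quickly, and all other settings are library defaults apart from tenfold cross-validation and seed 42.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def run_pipeline(X, y, n_features=5, n_trees=200, seed=42):
    """Clean, split, impute, rank features, fit ridge logistic regression, score on test."""
    # 1. Drop molecules (columns) with more than 10% missing values.
    keep = np.isnan(X).mean(axis=0) <= 0.10
    X = X[:, keep]
    # 2. Split into 70% training and 30% testing data.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=seed)
    # 3. Fit the KNN imputer on the training data only, then apply it to both sets.
    imputer = KNNImputer().fit(X_tr)
    X_tr, X_te = imputer.transform(X_tr), imputer.transform(X_te)
    # 4. Rank features with extremely randomised trees and keep the top N.
    ert = ExtraTreesClassifier(n_estimators=n_trees, random_state=seed).fit(X_tr, y_tr)
    top = np.argsort(ert.feature_importances_)[::-1][:n_features]
    # 5. Logistic regression (L2 penalty is the default), strength tuned by tenfold CV.
    clf = LogisticRegressionCV(cv=10, max_iter=1000).fit(X_tr[:, top], y_tr)
    # 6. Evaluate the whole trained chain on the held-out test set.
    auc = roc_auc_score(y_te, clf.predict_proba(X_te[:, top])[:, 1])
    return np.flatnonzero(keep)[top], auc


# Synthetic stand-in for an omics matrix: 400 samples x 20 molecules.
rng = np.random.default_rng(42)
n, p = 400, 20
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[:, 0] += 1.5 * y                      # informative molecule
X[:, 3] += 1.0 * y                      # weaker informative molecule
X[rng.random((n, p)) < 0.05] = np.nan   # sporadic missing values
X[rng.random(n) < 0.5, 7] = np.nan      # molecule 7 exceeds the 10% missingness cut-off

selected, auc = run_pipeline(X, y)
print(selected, round(auc, 3))
```

Fitting the imputer and the feature ranking on the training split only, and merely applying them to the test split, is what keeps the reported test AUC free of information leakage.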

Gene ontology enrichment

For each disease, we picked the 30 most disease-related proteins for prevalent cases as predicted by the methods above. We used the gseapy³⁷ package to compute the enrichment of these proteins in the ‘GO_Biological_Process_2023’ database using the ‘human’ organism and reported the adjusted p values. Otherwise, default parameters were used to compute the enrichment.
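To illustrate the general idea behind an over-representation analysis with adjusted p values, the sketch below computes a one-sided hypergeometric p value for the overlap between a selected protein list and a pathway, followed by a Benjamini-Hochberg adjustment. It is not gseapy's implementation and does not query any database; the function names, background size, pathway size, overlap, and p values are all hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, selected, background):
    """P(X >= overlap) when `selected` genes are drawn from `background`,
    of which `set_size` belong to the pathway."""
    upper = min(set_size, selected)
    total = comb(background, selected)
    tail = sum(
        comb(set_size, k) * comb(background - set_size, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical numbers: 8 of 30 selected proteins fall in a 500-gene pathway,
# against a background of 20,000 genes.
p_enriched = hypergeom_pvalue(overlap=8, set_size=500, selected=30, background=20000)

# Hypothetical raw p values for four pathways.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print(p_enriched, adjusted)
```

The choice of background matters: with a targeted panel, using the panel's proteins rather than the whole genome as background would typically give more conservative p values.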

Construction of the multiomics atlas

To simplify the interpretation of the results, we created a shiny app³⁸. The app can be found at macd.shinyapps.io/ShinyApp/. All the results shown in the atlas are based on the predictive models explained above. For each disease, the atlas provides context-dependent options for adjustments, namely, the type of (1) predictive model, (2) omics layer, and (3) number of molecules. The atlas predicts a set of biomarkers based on those settings. There are three omics layers, namely, genomics, proteomics, and metabolomics, from which a user can choose. To assess the discriminative performance, a receiver operating characteristic (ROC) curve for the test dataset is presented. If the predictive model is set to “incident”, the atlas generates AUCs for incident cases versus healthy controls for different types and numbers of molecules. Analogously, the setting “prevalent” generates a test set for prevalent cases. We provide a choice of one to thirty biomarkers.

Ethics approval and consent to participate

UK Biobank has approval from the Northwest Multi-centre Research Ethics Committee (MREC) as a Research Tissue Bank (RTB). This approval means that researchers do not require separate ethical clearance and can operate under the RTB approval (there are certain exceptions to this which are set out in the Access Procedures, such as re-contact applications).

Data availability

An interactive, web-based atlas for translational researchers to find optimal biomarkers is available at macd.shinyapps.io/ShinyApp/. All data used in this study are available to access from the UK Biobank at https://www.ukbiobank.ac.uk/ for approved researchers through the UK Biobank data-access protocol.

Code availability

All code used in this manuscript is available at https://github.com/SDTC-CPMed/MACD.

References

Glaab, E. et al. Biomarker discovery studies for patient stratification using machine learning analysis of omics data: A scoping review. BMJ Open. 11(12), e053674. https://doi.org/10.1136/bmjopen-2021-053674 (2021).
Article PubMed PubMed Central Google Scholar
Savva, K. V. et al. The Biomarker Toolkit—An evidence-based guideline to predict cancer biomarker success and guide development. BMC Med. 21(1), 383. https://doi.org/10.1186/s12916-023-03075-3 (2023).
Article PubMed PubMed Central Google Scholar
Williams, S. A. et al. Plasma protein patterns as comprehensive indicators of health. Nat. Med. 25(12), 1851–1857. https://doi.org/10.1038/s41591-019-0665-2 (2019).
Article CAS PubMed PubMed Central Google Scholar
Slieker, R. C. et al. Identification of biomarkers for glycaemic deterioration in type 2 diabetes. Nat. Commun. 14(1), 2533. https://doi.org/10.1038/s41467-023-38148-7 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Chen, Y. et al. Genomic atlas of the plasma metabolome prioritizes metabolites implicated in human diseases. Nat. Genet. 55(1), 44–53. https://doi.org/10.1038/s41588-022-01270-1 (2023).
Emilsson, V. et al. Co-regulatory networks of human serum proteins link genetics to disease. Science. 361(6404), 769–773. https://doi.org/10.1126/science.aaq1327 (2018).
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12(3), e1001779. https://doi.org/10.1371/journal.pmed.1001779 (2015).
Sun, B. B. et al. Plasma proteomic associations with genetics and health in the UK Biobank. Nature. https://doi.org/10.1038/s41586-023-06592-6 (2023).
Sethi, A., Raj, A., Wright, K., Melamud, E. Plasma proteomic determinants of common causes of mortality. (2023).
Papier, K. et al. Identifying proteomic risk factors for cancer using prospective and exome analyses: 1463 circulating proteins and risk of 19 cancers in the UK Biobank. medRxiv. 2023.07.28.23293330 (2023).
Gadd, D. A. et al. Blood protein levels predict leading incident diseases and mortality in UK Biobank. medRxiv. 2023.05.01.23288879 (2023).
Gustafsson, M. et al. Modules, networks and systems medicine for understanding disease and aiding diagnosis. Genome Med. 6(10), 82. https://doi.org/10.1186/s13073-014-0082-6 (2014).
Julkunen, H. et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nat. Commun. 14(1), 604. https://doi.org/10.1038/s41467-023-36231-7 (2023).
Lambert, S. A. et al. The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nat. Genet. 53(4), 420–425. https://doi.org/10.1038/s41588-021-00783-5 (2021).
de Hond, A. A. H., Steyerberg, E. W. & van Calster, B. Interpreting area under the receiver operating characteristic curve. Lancet Digit Health. 4(12), e853–e855. https://doi.org/10.1016/S2589-7500(22)00188-1 (2022).
Gawel, D. R. et al. A validated single-cell-based strategy to identify diagnostic and therapeutic targets in complex diseases. Genome Med. 11(1), 47. https://doi.org/10.1186/s13073-019-0657-3 (2019).
Lilja, S. et al. Multi-organ single-cell analysis reveals an on/off switch system with potential for personalized treatment of immunological diseases. Cell Rep. Med. 4(3), 100956. https://doi.org/10.1016/j.xcrm.2023.100956 (2023).
Benson, M. Digital twins for predictive, preventive personalized, and participatory treatment of immune-mediated diseases. Arterioscler. Thromb. Vasc. Biol. 43(3), 410–416. https://doi.org/10.1161/ATVBAHA.122.318331 (2023).
Schafer, S. et al. scDrugPrio: A framework for the analysis of single-cell transcriptomics to address multiple problems in precision medicine in immune-mediated inflammatory diseases. Genome Med. 16(1), 42. https://doi.org/10.1186/s13073-024-01314-7 (2024).
Barata, C. et al. A reinforcement learning model for AI-based decision support in skin cancer. Nat. Med. 29(8), 1941–1946. https://doi.org/10.1038/s41591-023-02475-5 (2023).
Mansour Aly, D. et al. Genome-wide association analyses highlight etiological differences underlying newly defined subtypes of diabetes. Nat. Genet. 53(11), 1534–1542. https://doi.org/10.1038/s41588-021-00948-2 (2021).
Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease. Nat. Genet. 53(12), 1712–1721. https://doi.org/10.1038/s41588-021-00978-w (2021).
Oke, V. et al. High levels of circulating interferons type I, type II and type III associate with distinct clinical features of active systemic lupus erythematosus. Arthritis Res. Ther. 21(1), 107. https://doi.org/10.1186/s13075-019-1878-y (2019).
Tang, W. H. & Hazen, S. L. Atherosclerosis in 2016: Advances in new therapeutic targets for atherosclerosis. Nat. Rev. Cardiol. 14(2), 71–72. https://doi.org/10.1038/nrcardio.2016.216 (2017).
Schoeler, T. et al. Participation bias in the UK Biobank distorts genetic associations and downstream analyses. Nat. Hum. Behav. 7(7), 1216–1227. https://doi.org/10.1038/s41562-023-01579-9 (2023).
Ritchie, S. C. et al. Quality control and removal of technical variation of NMR metabolic biomarker data in ~ 120,000 UK Biobank participants. Sci. Data. 10(1), 64. https://doi.org/10.1038/s41597-023-01949-y (2023).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 562(7726), 203–209. https://doi.org/10.1038/s41586-018-0579-z (2018).
Stuart, E. A., King, G., Imai, K. & Ho, D. Nonparametric preprocessing for parametric causal inference. J. Stat. Softw. 42(8), 1–28. https://doi.org/10.18637/jss.v042.i08 (2011).
Chang, C. C. et al. Second-generation PLINK: Rising to the challenge of larger and richer datasets. Gigascience. 4, 7. https://doi.org/10.1186/s13742-015-0047-8 (2015).
Troyanskaya, O. et al. Missing value estimation methods for DNA microarrays. Bioinformatics. 17(6), 520–525 (2001).
Ge, Y., Li, Z. & Zhang, J. A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods. Sci. Rep. 13(1), 9432. https://doi.org/10.1038/s41598-023-36509-2 (2023).
Geurts, P., Ernst, D. & Wehenkel, L. Extremely randomized trees. Mach. Learn. 63(1), 3–42. https://doi.org/10.1007/s10994-006-6226-1 (2006).
Hoerl, A. E. & Kennard, R. W. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 12(1), 55–67. https://doi.org/10.1080/00401706.1970.10488634 (1970).
Pedregosa, F. et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Anderson, C. A. et al. Data quality control in genetic case–control association studies. Nat. Protoc. 5(9), 1564–1573. https://doi.org/10.1038/nprot.2010.116 (2010).
Kassambara, A. ggpubr: 'ggplot2' Based Publication Ready Plots. R package version 0.5.0. https://CRAN.R-project.org/package=ggpubr (2022).
Fang, Z., Liu, X. & Peltz, G. GSEApy: A comprehensive package for performing gene set enrichment analysis in Python. Bioinformatics. https://doi.org/10.1093/bioinformatics/btac757 (2023).
Chang, W., Cheng, J., Allaire, J., Sievert, C., Schloerke, B., Xie, Y., Allen, J., McPherson, J., Dipert, A. & Borges, B. shiny: Web Application Framework for R. https://shiny.posit.co/, https://github.com/rstudio/shiny (2023).

Acknowledgements

The authors are thankful to the participants and the team of the UK Biobank study.

Funding

Open access funding provided by Karolinska Institute. This work was supported by funding from the European Union’s H2020 research and innovation program under Grant agreement No 848028 (DoCTIS; Decision On Optimal Combinatorial Therapies In Imids Using Systems Approaches); Swedish Cancer Society CAN 2017/411; The Swedish Research Council; US National Institutes of Health Grants HL155107 and HL155096; American Heart Association Grant 957729; European Union’s Horizon 2021 Research and Innovation Programme Grant 101057619.

Author information

These authors contributed equally: Martin Smelik, Yelin Zhao, Firoj Mahmud, Dina Mansour Aly and Mikael Benson.

Authors and Affiliations

Medical Digital Twin Research Group, Department of Clinical Science, Intervention and Technology (CLINTEC), Karolinska Institute, Stockholm, Sweden
Martin Smelik, Yelin Zhao, Xinxiu Li, Firoj Mahmud, Dina Mansour Aly & Mikael Benson
Division of Cardiovascular Medicine, Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA
Joseph Loscalzo
Division of Statistics and Machine Learning, Department of Computer and Information Science, Linköping University, Linköping, Sweden
Oleg Sysoev

Contributions

MS and YZ had a primary role in the analysis and interpretation of the data. XL and OS contributed to those analyses, and JL contributed translational expertise. DA, FM and MB supervised these studies. All authors contributed to the writing of the manuscript and read and approved the final manuscript.

Corresponding author

Correspondence to Mikael Benson.

Ethics declarations

Competing interests

MB is the scientific founder of Mavatar, Inc. JL is co-scientific founder of Scipher Medicine, Inc. The other authors declare that they have no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Table 1.

Supplementary Table 2.

Supplementary Table 3.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Smelik, M., Zhao, Y., Li, X. et al. An interactive atlas of genomic, proteomic, and metabolomic biomarkers promotes the potential of proteins to predict complex diseases. Sci Rep 14, 12710 (2024). https://doi.org/10.1038/s41598-024-63399-9

Received: 02 February 2024
Accepted: 28 May 2024
Published: 03 June 2024
DOI: https://doi.org/10.1038/s41598-024-63399-9
Springer Nature Limited