Human Multi-omics Data Pre-processing for Predictive Purposes Using Machine Learning: A Case Study in Childhood Obesity

Torres-Martos, Álvaro; Anguita-Ruiz, Augusto; Bustos-Aibar, Mireia; Cámara-Sánchez, Sofia; Alcalá, Rafael; Aguilera, Concepción M.; Alcalá-Fdez, Jesús

doi:10.1007/978-3-031-07802-6_31

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 13347)

Included in the following conference series:

International Work-Conference on Bioinformatics and Biomedical Engineering

Abstract

Machine learning applications using omics data in the medical field are numerous and promising, notably the possibility of building long-term predictive models for highly prevalent diseases. Nevertheless, to take advantage of the strengths of omics data and machine learning tools, we first need to pre-process the data adequately and take certain considerations into account when constructing the models. The present paper is an example of how to face the main challenges encountered when constructing machine learning predictive models with human multi-omics data. Topics covered include a description of the main particularities of each omics data layer, the most appropriate pre-processing approaches for each source, and a collection of good practices and tips for applying machine learning to this kind of data for predictive purposes. Using real data examples (blood samples), we illustrate how some of the key issues in this kind of research are addressed (technical noise, biological heterogeneity, class imbalance, high dimensionality, and the presence of missing values, among others). Additionally, we lay the groundwork for future work by incorporating some proposals to improve the models, justifying their need on the basis of the insights encountered.

Supported in part by the ERDF/Regional Government of Andalusia/Ministry of Economic Transformation, Industry, Knowledge and Universities (grant numbers P18-RT-2248 and B-CTS-536-UGR20) and by the ERDF/Health Institute Carlos III/Spanish Ministry of Science, Innovation and Universities (grant numbers PI20/00711, PI16/00871 and PI20/00563).

Á. Torres-Martos and A. Anguita-Ruiz—Equal contributors.

Author information

Authors and Affiliations

Department of Biochemistry and Molecular Biology II, School of Pharmacy, University of Granada, 18071, Granada, Spain
Álvaro Torres-Martos, Augusto Anguita-Ruiz, Mireia Bustos-Aibar, Sofia Cámara-Sánchez & Concepción M. Aguilera
Institute of Nutrition and Food Technology “José Mataix” Center of Biomedical Research, Instituto de Investigación Biosanitaria IBS.GRANADA, Complejo Hospitalario Universitario de Granada, University of Granada, Avda. del Conocimiento s/n., 18016, 18012, Granada, Spain
Augusto Anguita-Ruiz & Concepción M. Aguilera
CIBEROBN (CIBER Physiopathology of Obesity and Nutrition), Instituto de Salud Carlos III, 28029, Madrid, Spain
Augusto Anguita-Ruiz & Concepción M. Aguilera
Barcelona Institute for Global Health (ISGlobal), Doctor Aiguader 88, 08003, Barcelona, Spain
Augusto Anguita-Ruiz
Department of Computer Science and Artificial Intelligence, Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Granada, 18071, Granada, Spain
Rafael Alcalá & Jesús Alcalá-Fdez

Corresponding authors

Correspondence to Álvaro Torres-Martos, Augusto Anguita-Ruiz, Mireia Bustos-Aibar, Sofia Cámara-Sánchez or Concepción M. Aguilera.

Editor information

Editors and Affiliations

University of Granada, Granada, Spain
Ignacio Rojas
Faculty of Sciences, University of Granada, Granada, Spain
Olga Valenzuela
ETSIIT. CITIC-UGR, University of Granada, Granada, Spain
Fernando Rojas
ETSIIT, University of Granada, Granada, Spain
Luis Javier Herrera
University of Granada, Granada, Spain
Francisco Ortuño

About this paper

Cite this paper

Torres-Martos, Á. et al. (2022). Human Multi-omics Data Pre-processing for Predictive Purposes Using Machine Learning: A Case Study in Childhood Obesity. In: Rojas, I., Valenzuela, O., Rojas, F., Herrera, L.J., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2022. Lecture Notes in Computer Science, vol 13347. Springer, Cham. https://doi.org/10.1007/978-3-031-07802-6_31

DOI: https://doi.org/10.1007/978-3-031-07802-6_31
Published: 08 June 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-07801-9
Online ISBN: 978-3-031-07802-6
eBook Packages: Computer Science, Computer Science (R0)
