Abstract
One of the key challenges in metagenomic forensics is to establish a microbial fingerprint that can help predict the geographical origin of metagenomic samples. A comprehensive analysis of metagenomic data depends on understanding several aspects together: microbiome sample sources, sequencing technologies, bioinformatics processing, and statistical and machine learning methods. We demonstrate the analysis of metagenomic samples from 23 cities around the world to construct classification models that can be used to predict the geographical location of unknown samples obtained from these regions. We also describe the bioinformatics pre-processing of the raw sequencing data and estimate the abundance profiles of microbes in the samples using multiple tools. To predict the geographical origin of samples, we trained and evaluated a variety of supervised learning classifiers, including an adaptive optimal ensemble classifier that performs well across different data-generating procedures.
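The idea of predicting a sample's city from its microbial abundance profile can be illustrated with a minimal sketch. It is not the chapter's method: it substitutes a simple nearest-centroid classifier for the supervised classifiers and ensemble mentioned above, and the taxon counts, the city labels (`city_A`, `city_B`), and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import sqrt


def to_relative_abundance(counts):
    """Scale raw taxon counts so that each sample's values sum to 1."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]


def fit_centroids(profiles, cities):
    """Average the relative-abundance profiles of the samples from each city."""
    sums = {}
    n_samples = defaultdict(int)
    for profile, city in zip(profiles, cities):
        if city not in sums:
            sums[city] = [0.0] * len(profile)
        sums[city] = [s + p for s, p in zip(sums[city], profile)]
        n_samples[city] += 1
    return {
        city: [s / n_samples[city] for s in total]
        for city, total in sums.items()
    }


def predict_city(centroids, profile):
    """Return the city whose centroid is closest in Euclidean distance."""
    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(centroids, key=lambda city: distance(centroids[city], profile))


# Hypothetical toy data: rows are samples, columns are read counts for three taxa.
raw_counts = [
    [90, 5, 5], [80, 15, 5],    # samples labelled "city_A", dominated by taxon 0
    [10, 85, 5], [5, 80, 15],   # samples labelled "city_B", dominated by taxon 1
]
labels = ["city_A", "city_A", "city_B", "city_B"]

train = [to_relative_abundance(row) for row in raw_counts]
centroids = fit_centroids(train, labels)

# Classify a new, unlabelled sample from its abundance profile.
predicted = predict_city(centroids, to_relative_abundance([70, 20, 10]))
print(predicted)
```

With real data the feature vectors would be the abundance profiles estimated during pre-processing, and the classifier would be chosen and evaluated on held-out samples; the toy example only shows the shape of the input and output.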
Acknowledgements
This work was partially supported by the National Science Foundation under Award DMS-1461948 to SG.
Copyright information
© 2021 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Anyaso-Samuel, S., Sachdeva, A., Guha, S., Datta, S. (2021). Bioinformatics Pre-Processing of Microbiome Data with An Application to Metagenomic Forensics. In: Datta, S., Guha, S. (eds) Statistical Analysis of Microbiome Data. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-73351-3_3
DOI: https://doi.org/10.1007/978-3-030-73351-3_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73350-6
Online ISBN: 978-3-030-73351-3
eBook Packages: Mathematics and Statistics (R0)