Bioinformatics Pre-Processing of Microbiome Data with An Application to Metagenomic Forensics

Statistical Analysis of Microbiome Data

Part of the book series: Frontiers in Probability and the Statistical Sciences (FROPROSTAS)

Abstract

One of the key challenges in metagenomic forensics is to establish a microbial fingerprint that can help predict the geographical origin of metagenomic samples. A comprehensive analysis of metagenomic data depends on understanding several aspects together: microbiome sample sources, sequencing technologies, bioinformatics processing, and statistical and machine learning methods. We demonstrate the analysis of metagenomic samples from 23 cities around the world and construct classification models that can be used to predict the geographical location of unknown samples obtained from these regions. We also describe the bioinformatics pre-processing of the raw sequencing data and estimate the abundance profiles of microbes in the samples using multiple tools. To predict the geographical origin of samples, we trained and evaluated a variety of supervised learning classifiers, including an adaptive optimal ensemble classifier that performs well under different data-generating procedures.
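
As a rough illustration of the workflow summarized above (abundance profiles in, predicted city out), the following minimal Python sketch normalizes raw taxon counts to relative abundances and assigns an unknown sample to the nearest city centroid. The nearest-centroid rule is a simple stand-in chosen for brevity, not the chapter's ensemble classifier, and the counts, taxa, and city labels are hypothetical toy values.

```python
from collections import defaultdict
from math import sqrt


def relative_abundance(counts):
    """Scale raw taxon counts for one sample so they sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no classified reads")
    return [c / total for c in counts]


def fit_centroids(profiles, cities):
    """Average the relative-abundance profiles of each city."""
    grouped = defaultdict(list)
    for profile, city in zip(profiles, cities):
        grouped[city].append(profile)
    return {
        city: [sum(col) / len(rows) for col in zip(*rows)]
        for city, rows in grouped.items()
    }


def predict_city(profile, centroids):
    """Return the city whose centroid is closest in Euclidean distance."""
    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda city: distance(profile, centroids[city]))


# Hypothetical toy data: rows are samples, columns are three taxa.
raw_counts = [[90, 5, 5], [80, 15, 5], [10, 70, 20], [5, 75, 20]]
cities = ["city_A", "city_A", "city_B", "city_B"]

profiles = [relative_abundance(row) for row in raw_counts]
centroids = fit_centroids(profiles, cities)

# Classify an unseen sample after the same normalization step.
predicted = predict_city(relative_abundance([60, 30, 10]), centroids)
```

In practice the feature matrix would come from a taxonomic profiler run on quality-controlled reads, and a stronger classifier with cross-validation would replace the centroid rule; the sketch only shows how the normalization and prediction steps fit together.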

Acknowledgements

This work was partially supported by the National Science Foundation under Award DMS-1461948 to SG.

Correspondence to Somnath Datta.

© 2021 Springer Nature Switzerland AG

Cite this chapter

Anyaso-Samuel, S., Sachdeva, A., Guha, S., Datta, S. (2021). Bioinformatics Pre-Processing of Microbiome Data with An Application to Metagenomic Forensics. In: Datta, S., Guha, S. (eds) Statistical Analysis of Microbiome Data. Frontiers in Probability and the Statistical Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-73351-3_3
