Predicting the Metagenomics Content with Multiple CART Trees



Metagenomics is a technique for characterizing and identifying microbial genomes by isolating genomic DNA directly from the environment, without cultivation. One of the key steps in this process is the taxonomic classification and clustering of the DNA fragments, a process also known as binning. To date, the most common practice is to classify reads through alignments to public databases. When a representative species is present in the database, the process is simple and successful; if not, taxonomic abundances are underestimated. In this work we propose an alignment-free method capable of assigning a taxon to each read in the sample by analyzing the statistical properties of the reads. Given an environment, we collect genomes from publicly available databases and generate libraries of genomic fragments. Then, statistics of k-mer frequencies, GC ratio and GC skew are computed for each read and stored in an environment-associated dataset, which is used to build a robust machine learning procedure based on multiple CART trees. Finally, for each read, every CART tree is asked for a taxon and the most voted taxa are selected. The method was tested using simulated and public human gut microbiome data sets. The database was constructed using 98 genera present in the gastrointestinal tract and available at the Human Microbiome Project. A multiple-CART predictor with 558 trees was generated, capable of estimating the genera and their abundances in the sample with 47 % accuracy in read assignments. Performance rates are comparable with those of semi-supervised methods, and computation times are reduced thanks to the alignment-free methodology. When restricted to 17 genera considered earlier, the accuracy of our method increases to 77 %.
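
The per-read statistics and the voting step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `read_features` and `majority_vote`, the choice k = 2, the GC skew formula (G - C) / (G + C), and the toy "trees" (plain functions with made-up thresholds and placeholder genus labels) are all assumptions made for illustration. The sketch also returns only the single most voted taxon, whereas the abstract speaks of selecting the most voted ones.

```python
from collections import Counter
from itertools import product


def read_features(read, k=2):
    """Return GC ratio, GC skew and k-mer frequencies for one read."""
    seq = read.upper()
    g, c = seq.count("G"), seq.count("C")
    gc_ratio = (g + c) / len(seq) if seq else 0.0
    # Assumed convention: GC skew = (G - C) / (G + C), 0 if no G or C.
    gc_skew = (g - c) / (g + c) if (g + c) else 0.0
    # Enumerate all 4**k k-mers over ACGT; k-mers with other symbols are ignored.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    freqs = {m: (counts[m] / total if total else 0.0) for m in kmers}
    return {"gc_ratio": gc_ratio, "gc_skew": gc_skew, **freqs}


def majority_vote(trees, features):
    """Ask every tree for a taxon and return the most voted one."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical stand-ins for trained CART trees: each maps features to a label.
toy_trees = [
    lambda f: "genus_A" if f["gc_ratio"] > 0.5 else "genus_B",
    lambda f: "genus_A" if f["gc_skew"] > 0 else "genus_B",
    lambda f: "genus_B",
]

predicted = majority_vote(toy_trees, read_features("GGCA"))
```

In a real pipeline the toy functions would be replaced by decision trees trained on the environment-associated dataset, and the per-read votes would be aggregated over all reads to estimate abundances.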


Metagenomics content prediction; Human gut microbiome; CART trees; K-mer frequencies



This work was financed by ICBM U. Chile, Project CIRIC-INRIA Chile, Fondap Grant 15090007, Fondecyt 3130762, and a Basal Grant to the Center for Mathematical Modeling (grant no. PFB 03). We are also grateful for the excellence PhD fellowships of U. Adolfo Ibañez. We acknowledge the support of the National Laboratory for High Performance Computing at the Center for Mathematical Modeling (PIA ECM-02-CONICYT).



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Departamento de Ingeniería Matemática, Center for Mathematical Modeling, Universidad de Chile, Santiago, Chile
  2. Instituto de Ciencias Biomédicas, Escuela de Medicina, Universidad de Chile, Santiago, Chile
