Predicting the Metagenomics Content with Multiple CART Trees
Metagenomics is a technique for characterizing and identifying microbial genomes by isolating genomic DNA directly from the environment, without cultivation. One of the key steps in this process is the taxonomic classification and clustering of the DNA fragments, a process also known as binning. To date, the most common practice is to classify reads by aligning them to public databases. When a representative species is present in the database, the process is simple and successful; when it is not, taxonomic abundances are underestimated. In this work we propose an alignment-free method that assigns a taxon to each read in the sample by analyzing the statistical properties of the reads. Given an environment, we collect genomes from publicly available databases and generate libraries of genomic fragments. Then, statistics of k-mer frequencies, GC ratio and GC skew are computed for each read and stored in an environment-associated dataset, which is used to build a robust machine learning procedure based on multiple CART trees. Finally, for each read, the CART trees are asked for its taxon and the most voted taxa are selected. The method was tested using simulated and public human gut microbiome data sets. The database was constructed from 98 genera present in the gastrointestinal tract and available from the Human Microbiome Project. A predictor comprising 558 CART trees was generated, capable of estimating the genera and their abundances in the sample with 47% accuracy in read assignments. Performance is comparable to that of semi-supervised methods, and computation times are reduced because the method is alignment-free. When restricted to 17 genera considered earlier, the accuracy of our method increases to 77%.
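As an illustration of the per-read statistics and the voting step described above, here is a minimal Python sketch. It is not the authors' implementation: the function names, the default k, the definitions GC ratio = (G + C) / length and GC skew = (G - C) / (G + C), and the single-winner vote are all assumptions made for the example, and the trees themselves are represented only by a list of their predicted labels.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(read, k=2):
    """Relative frequency of every DNA k-mer in a read, in fixed ACGT order."""
    read = read.upper()
    windows = [read[i:i + k] for i in range(len(read) - k + 1)]
    # Skip windows containing ambiguous bases such as N (an assumption).
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts[km] / total if total else 0.0 for km in kmers]


def gc_ratio(read):
    """Fraction of G and C bases in the read (assumed definition)."""
    read = read.upper()
    return (read.count("G") + read.count("C")) / len(read) if read else 0.0


def gc_skew(read):
    """(G - C) / (G + C), or 0.0 when the read has no G or C."""
    read = read.upper()
    g, c = read.count("G"), read.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def read_features(read, k=2):
    """Feature vector for one read: 4**k k-mer frequencies, GC ratio, GC skew."""
    return kmer_frequencies(read, k) + [gc_ratio(read), gc_skew(read)]


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]
```

In this sketch each read becomes a vector of 4^k + 2 numbers that a set of trained trees could consume, and `majority_vote` keeps only the single most voted label, which simplifies the abstract's "most voted ones".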
Keywords: Metagenomics content prediction; Human gut microbiome; CART trees; K-mer frequencies
This work was financed by ICBM U. Chile, Project CIRIC-INRIA Chile, Fondap Grant 15090007, Fondecyt 3130762, and the Basal Grant to the Center for Mathematical Modeling (grant no. PFB 03). We are also grateful for the excellence PhD fellowships of U. Adolfo Ibañez. We acknowledge the support of the National Laboratory for High Performance Computing at the Center for Mathematical Modeling (PIA ECM-02-CONICYT).