A Concurrent Subtractive Assembly Approach for Identification of Disease Associated Sub-metagenomes

  • Wontack Han
  • Mingjie Wang
  • Yuzhen Ye
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10229)


Comparative analysis of metagenomes can be used to detect sub-metagenomes (species or gene sets) that are associated with specific phenotypes (e.g., host status). The typical workflow is to assemble and annotate metagenomic datasets, individually or as a whole, and then apply statistical tests to identify differentially abundant species or genes. We previously developed subtractive assembly (SA), a de novo assembly approach for comparative metagenomics that first detects differential reads that distinguish two groups of metagenomes and then assembles only those reads. Application of SA to type 2 diabetes (T2D) microbiomes revealed new microbial genes associated with T2D. Here we present Concurrent Subtractive Assembly (CoSA), a further development of SA that uses a Wilcoxon rank-sum (WRS) test to detect k-mers that are differentially abundant between two groups of microbiomes (by contrast, SA only checks ratios of k-mer counts in one pooled sample versus the other). CoSA then uses the identified differential k-mers to extract reads that are likely sequenced from the sub-metagenome with consistent abundance differences between the groups of microbiomes. Further, CoSA attempts to reduce the redundancy of reads (from abundant common species) by excluding reads that contain abundant k-mers. Using simulated microbiome datasets and T2D datasets, we show that CoSA performs markedly better than SA in detecting consistent changes, and that it enables the detection and assembly of genomes and genes with minor abundance differences. An SVM classifier built upon the microbial genes detected by CoSA from the T2D datasets accurately discriminates patients from healthy controls, with an AUC of 0.94 (10-fold cross-validation); these differential genes (207 genes) may therefore serve as potential microbial marker genes for T2D.
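The k-mer selection and read extraction steps described above can be illustrated with a small sketch. It is not the CoSA implementation: the function names, the significance cutoff `alpha`, the `min_fraction` voting threshold, and the way abundant k-mers are supplied are all assumptions made for illustration. The rank-sum test uses a normal approximation with average ranks for ties and no tie correction of the variance, and raw k-mer counts are compared without normalizing for sequencing depth.

```python
import math
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across a list of read strings."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def rank_sum_pvalue(xs, ys):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    pooled = sorted((v, g) for g, vals in enumerate((xs, ys)) for v in vals)
    n1, n2 = len(xs), len(ys)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the tied ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for _, g in pooled[i:j] if g == 0)
        i = j
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (rank_sum_x - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def differential_kmers(group_a, group_b, k, alpha=0.05):
    """Return k-mers more abundant in group_a than in group_b.

    Each group is a list of samples; each sample is a list of reads.
    alpha is an illustrative cutoff, not a value taken from the paper.
    """
    counts_a = [kmer_counts(sample, k) for sample in group_a]
    counts_b = [kmer_counts(sample, k) for sample in group_b]
    selected = set()
    for kmer in set().union(*counts_a, *counts_b):
        xs = [c[kmer] for c in counts_a]
        ys = [c[kmer] for c in counts_b]
        higher_in_a = sum(xs) / len(xs) > sum(ys) / len(ys)
        if higher_in_a and rank_sum_pvalue(xs, ys) < alpha:
            selected.add(kmer)
    return selected


def extract_reads(reads, diff_kmers, abundant_kmers, k, min_fraction=0.5):
    """Keep reads rich in differential k-mers and free of abundant k-mers."""
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers or any(km in abundant_kmers for km in kmers):
            continue  # too short, or likely from an abundant common species
        hits = sum(km in diff_kmers for km in kmers)
        if hits / len(kmers) >= min_fraction:
            kept.append(read)
    return kept


# Toy data: one read is present only in group A, one is common to all samples.
marker, common = "ACGTACGTAC", "TTTTTTTTTT"
group_a = [[marker] * (i + 3) + [common] * 2 for i in range(5)]
group_b = [[common] * 2 for _ in range(5)]
diff = differential_kmers(group_a, group_b, k=4)
kept = extract_reads([marker, common, "ACGTTTTTTT"], diff, {"TTTT"}, k=4)
```

Because the test is run per k-mer across samples, a k-mer is selected only when its counts differ consistently between the groups, which is the contrast with the pooled ratio check used by SA. A real application would also need multiple-testing control and an efficient k-mer counter, neither of which is shown here.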


Keywords: Metagenome, Concurrent Subtractive Assembly, Wilcoxon rank-sum test, Comparative metagenomics



This work was supported by the NIH grant 1R01AI108888 to Ye.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  1. Indiana University, Bloomington, USA
