Higher Classification Accuracy of Short Metagenomic Reads by Discriminative Spaced k-mers

  • Rachid Ounit
  • Stefano Lonardi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9289)


The growing number of metagenomic studies in medicine and environmental sciences is creating new computational demands for the analysis of these very large datasets. We have recently proposed a time-efficient algorithm called Clark that can accurately classify metagenomic sequences against a set of reference genomes. The competitive advantage of Clark depends on the use of discriminative contiguous k-mers. In default mode, Clark’s speed is currently unmatched and its precision is comparable to the state of the art; however, its sensitivity still does not match that of the most sensitive (but slowest) metagenomic classifier. In this paper, we introduce an algorithmic improvement that allows Clark’s classification sensitivity to match that of the best metagenomic classifier, without a significant loss of speed or precision compared to the original version. Finally, on real metagenomes, Clark can assign with high accuracy a much higher proportion of short reads than its closest competitor. The improved version of Clark, based on discriminative spaced k-mers, is freely available at
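To make the idea of discriminative spaced k-mers concrete, here is a minimal sketch. It is not Clark's implementation: the mask `1101`, the toy reference sequences, and the function names are hypothetical, and the sketch ignores reverse complements, memory layout, and confidence scoring. A spaced k-mer keeps only the positions marked `1` in a mask, and a k-mer is treated as discriminative here if it occurs in exactly one reference.

```python
from collections import defaultdict


def spaced_kmers(seq, mask):
    """Yield the spaced k-mers of seq; mask uses '1' = keep, '0' = ignore."""
    keep = [i for i, c in enumerate(mask) if c == "1"]
    for start in range(len(seq) - len(mask) + 1):
        window = seq[start:start + len(mask)]
        yield "".join(window[i] for i in keep)


def discriminative_index(references, mask):
    """Map each spaced k-mer seen in exactly one reference to that reference."""
    owners = defaultdict(set)
    for name, seq in references.items():
        for kmer in spaced_kmers(seq, mask):
            owners[kmer].add(name)
    # k-mers shared by two or more references are discarded.
    return {k: next(iter(v)) for k, v in owners.items() if len(v) == 1}


def classify(read, index, mask):
    """Assign the read to the reference with the most discriminative hits."""
    hits = defaultdict(int)
    for kmer in spaced_kmers(read, mask):
        if kmer in index:
            hits[index[kmer]] += 1
    return max(hits, key=hits.get) if hits else None


# Toy data (hypothetical): the read differs from reference "A" by one base.
refs = {"A": "ACGTAC", "B": "TTGGCC"}
read = "ACTTAC"

spaced = classify(read, discriminative_index(refs, "1101"), "1101")
contiguous = classify(read, discriminative_index(refs, "1111"), "1111")
print(spaced, contiguous)
```

In this toy case the mismatch falls on an ignored position of one window, so the spaced mask still produces a hit for reference "A", while the contiguous mask of the same length produces none. This illustrates, on a very small scale, why spaced k-mers can raise sensitivity.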


Metagenomics, Microbiome, Classification, Discriminative spaced k-mers, Short metagenomic reads



This work was supported in part by the U.S. National Science Foundation [IIS-1302134]. We are thankful to the anonymous reviewers for their constructive feedback.



Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  1. Department of Computer Science and Engineering, University of California, Riverside, USA
