Abstract
The growing number of metagenomic studies in medicine and environmental sciences is creating new computational demands in the analysis of these very large datasets. We have recently proposed a time-efficient algorithm called Clark that can accurately classify metagenomic sequences against a set of reference genomes. The competitive advantage of Clark depends on the use of discriminative contiguous k-mers. In default mode, Clark’s speed is currently unmatched and its precision is comparable to the state-of-the-art, however, its sensitivity still does not match the level of the most sensitive (but slowest) metagenomic classifier. In this paper, we introduce an algorithmic improvement that allows Clark’s classification sensitivity to match the best metagenomic classifier, without a significant loss of speed or precision compared to the original version. Finally, on real metagenomes, Clark can assign with high accuracy a much higher proportion of short reads than its closest competitor. The improved version of Clark, based on discriminative spaced k-mers, is freely available at http://clark.cs.ucr.edu/Spaced/.
Keywords
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Mol. Biol. 215(3), 403–410 (1990)
Bao, E., Jiang, T., Kaloshian, I., Girke, T.: Seed: efficient clustering of next-generation sequences. Bioinformatics 27(18), 2502–2509 (2011)
Bazinet, A.L., Cummings, M.P.: A comparative evaluation of sequence classification programs. BMC Bioinformatics 13(1), 92 (2012)
Brady, A., Salzberg, S.: PhymmBL expanded: confidence scores, custom databases, parallelization and more. Nat. Methods 8(5), 367–367 (2011)
Brown, D.G., Li, M., Ma, B.: A tutorial of recent developments in the seeding of local alignment. J. Bioinform. Comput. Biol. 2(04), 819–842 (2004)
Choi, K.P., Zeng, F., Zhang, L.: Good spaced seeds for homology search. In: Proceedings of Fourth IEEE Symposium on Bioinformatics and Bioengineering, BIBE 2004, pp. 379–386. IEEE (2004)
Human Microbiome Project Consortium: A framework for human microbiome research. Nature 486(7402), 215–221 (2012)
Felczykowska, A., Bloch, S.K., Nejman-Falenczyk, B., Baranska, S.: Metagenomic approach in the investigation of new bioactive compounds in the marine environment. Acta Biochim. Pol. 59, 501–505 (2012)
Huson, D.H., Auch, A.F., Qi, J., Schuster, S.C.: MEGAN analysis of metagenomic data. Genome Res. 17(3), 377–386 (2007)
Huttenhower, C., Gevers, D., Knight, R., Abubucker, S., Badger, J., Chinwalla, A., et al.: Structure, function and diversity of the healthy human microbiome. Nature 486(7402), 207–214 (2012)
Ilie, L., Ilie, S.: Multiple spaced seeds for homology search. Bioinformatics 23(22), 2969–2977 (2007)
Ilie, L., Ilie, S., Bigvand, A.M.: Speed: fast computation of sensitive spaced seeds. Bioinformatics 27(17), 2433–2434 (2011)
Li, M., Ma, B., Kisman, D., Tromp, J.: Patternhunter ii: highly sensitive and fast homology search. J. Bioinform. Comput. Biol. 2(03), 417–439 (2004)
Li, M., Ma, B., Zhang, L.: Superiority and complexity of the spaced seeds. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm. Society for Industrial and Applied Mathematics, pp. 444–453 (2006)
Lindgreen, S., Adair, K.L., Gardner, P.: An Evaluation of the Accuracy and Speed of Metagenome Analysis Tools. Cold Spring Harbor Laboratory Press (2015). doi:10.1101/017830
Liu, B., Gibbons, T., Ghodsi, M., Treangen, T., Pop, M.: Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics 12(Suppl 2), S4 (2011)
Ma, B., Tromp, J., Li, M.: Patternhunter: faster and more sensitive homology search. Bioinformatics 18(3), 440–445 (2002)
Mueller, R.S., Bryson, S., Kieft, B., Li, Z., Pett-Ridge, J., Chavez, F., Hettich, R.L., Pan, C., Mayali, X.: Metagenome sequencing of a coastal marine microbial community from Monterey Bay, California. Genome Announc. 3(2), e00341-15 (2015)
Ounit, R., Wanamaker, S., Close, T.J., Lonardi, S.: Clark: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 16(1), 236 (2015)
Pace, N.R.: Mapping the tree of life: progress and prospects. Microbiol. Mol. Biol. Rev. 73(4), 565–576 (2009)
Rosen, G.L., Reichenberger, E.R., Rosenfeld, A.M.: NBC: the naive bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics 27(1), 127–129 (2011)
Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., Huttenhower, C.: Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 9(8), 811–814 (2012)
Sunagawa, S., Mende, D.R., Zeller, G., Izquierdo-Carrasco, F., Berger, S.A., Kultima, J.R., Coelho, L.P., Arumugam, M., Tap, J., Nielsen, H.B., et al.: Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods 10(12), 1196–1199 (2013)
Venter, J.C., Remington, K., Heidelberg, J.F., Halpern, A.L., Rusch, D., Eisen, J.A., Wu, D., Paulsen, I., Nelson, K.E., Nelson, W., et al.: Environmental genome shotgun sequencing of the Sargasso Sea. Science 304(5667), 66–74 (2004)
Wood, D., Salzberg, S.: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15(3), R46 (2014)
Zhang, Z., Schwartz, S., Wagner, L., Miller, W.: A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7(1–2), 203–214 (2000)
Acknowledgments
This work was supported in part by the U.S. National Science Foundation [IIS-1302134]. We are thankful to the anonymous reviewers for their constructive feedback.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ounit, R., Lonardi, S. (2015). Higher Classification Accuracy of Short Metagenomic Reads by Discriminative Spaced k-mers. In: Pop, M., Touzet, H. (eds) Algorithms in Bioinformatics. WABI 2015. Lecture Notes in Computer Science(), vol 9289. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48221-6_21
Download citation
DOI: https://doi.org/10.1007/978-3-662-48221-6_21
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-48220-9
Online ISBN: 978-3-662-48221-6
eBook Packages: Computer ScienceComputer Science (R0)