Higher Classification Accuracy of Short Metagenomic Reads by Discriminative Spaced k-mers
The growing number of metagenomic studies in medicine and environmental sciences is creating new computational demands for the analysis of these very large datasets. We have recently proposed a time-efficient algorithm called Clark that can accurately classify metagenomic sequences against a set of reference genomes. The competitive advantage of Clark depends on the use of discriminative contiguous k-mers. In default mode, Clark's speed is currently unmatched and its precision is comparable to the state of the art; however, its sensitivity still does not match that of the most sensitive (but slowest) metagenomic classifier. In this paper, we introduce an algorithmic improvement that allows Clark's classification sensitivity to match that of the best metagenomic classifier, without a significant loss of speed or precision compared to the original version. Finally, on real metagenomes, Clark can assign with high accuracy a much higher proportion of short reads than its closest competitor. The improved version of Clark, based on discriminative spaced k-mers, is freely available at http://clark.cs.ucr.edu/Spaced/.
Keywords: Metagenomics, Microbiome, Classification, Discriminative spaced k-mers, Short metagenomic reads
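To make the central notion of the abstract concrete, the following is a minimal sketch of what "discriminative spaced k-mers" could look like in code. It is not Clark's implementation: the function names, the mask `1101`, the toy reference sequences, and the majority-vote rule are all assumptions chosen for illustration. A spaced k-mer is read through a 0/1 mask (positions marked 0 are ignored), and a spaced k-mer is treated here as discriminative when it occurs in exactly one reference.

```python
from collections import defaultdict

def spaced_kmers(seq, mask):
    """Yield the spaced k-mers of seq under a 0/1 mask (1 = keep, 0 = ignore)."""
    span = len(mask)
    keep = [i for i, m in enumerate(mask) if m == "1"]
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        yield "".join(window[i] for i in keep)

def discriminative_index(references, mask):
    """Map each spaced k-mer found in exactly one reference to that reference's label."""
    owners = defaultdict(set)
    for label, genome in references.items():
        for kmer in spaced_kmers(genome, mask):
            owners[kmer].add(label)
    return {k: next(iter(v)) for k, v in owners.items() if len(v) == 1}

def classify(read, index, mask):
    """Return the label with the most discriminative hits, or None if there are none."""
    hits = defaultdict(int)
    for kmer in spaced_kmers(read, mask):
        label = index.get(kmer)
        if label is not None:
            hits[label] += 1
    return max(hits, key=hits.get) if hits else None

# Hypothetical toy data: two short "genomes" and a short mask.
MASK = "1101"
references = {"A": "ACGTACGTTTGACCA", "B": "TTGGCCAAGGTTCCA"}
index = discriminative_index(references, MASK)

print(classify("ACGTACGT", index, MASK))
```

Because the mask ignores its third position, a read window such as `ACTT` yields the same spaced k-mer as `ACGT`, which suggests how spacing can tolerate isolated mismatches that would break a contiguous k-mer match.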
This work was supported in part by the U.S. National Science Foundation [IIS-1302134]. We are thankful to the anonymous reviewers for their constructive feedback.