Fast and Sensitive Classification of Short Metagenomic Reads with SKraken

  • Jia Qian
  • Davide Marchiori
  • Matteo Comin
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 881)


The major problem when analyzing a metagenomic sample is to taxonomically annotate its reads in order to identify the species present and their relative abundances. Many tools have been developed recently; however, they are not always adequate for the increasing database volume. In this paper we propose an efficient method, called SKraken, that combines the taxonomic tree with k-mer frequency counting. SKraken extracts the most representative k-mers for each species and filters out the less representative ones. SKraken is inspired by Kraken, one of the state-of-the-art methods. We compare SKraken with Kraken on both real and synthetic datasets; SKraken exhibits higher classification precision and faster processing speed. Availability:
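The filtering idea described above can be illustrated with a minimal sketch. The abstract does not state the scoring rule, so everything below is an assumption made for illustration: each k-mer is mapped to the lowest common ancestor (LCA) of the species that contain it, in the style of Kraken, and its score is the fraction of species below that LCA that actually contain the k-mer. The function names, the `min_score` threshold, and the toy taxonomy are hypothetical and are not taken from the SKraken implementation.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def path_to_root(node, parent):
    """List the nodes from `node` up to the root of the taxonomy."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lca(nodes, parent):
    """Lowest common ancestor of a non-empty collection of taxonomy nodes."""
    nodes = list(nodes)
    common = path_to_root(nodes[0], parent)
    for n in nodes[1:]:
        other = set(path_to_root(n, parent))
        common = [x for x in common if x in other]
    return common[0]


def build_filtered_index(genomes, parent, k, min_score):
    """Map each k-mer to its LCA, keeping only k-mers whose score
    (assumed here: share of species under the LCA that contain the
    k-mer) is at least min_score."""
    species_with = defaultdict(set)
    for species, seq in genomes.items():
        for km in kmers(seq, k):
            species_with[km].add(species)

    index = {}
    for km, species_set in species_with.items():
        node = lca(species_set, parent)
        # All species in the database that lie in the subtree of the LCA.
        below = [s for s in genomes if node in path_to_root(s, parent)]
        score = len(species_set) / len(below)
        if score >= min_score:
            index[km] = node
    return index


# Hypothetical toy taxonomy: two species in genusA, one in genusB.
parent = {"sp1": "genusA", "sp2": "genusA", "sp3": "genusB",
          "genusA": "root", "genusB": "root"}
genomes = {"sp1": "ACGTT", "sp2": "ACGAA", "sp3": "CGTCC"}
index = build_filtered_index(genomes, parent, k=3, min_score=0.7)
```

In this toy example the k-mer `CGT` occurs in `sp1` and `sp3` only, so its LCA is the root and it covers two of the three species below it; with a threshold of 0.7 it is discarded, while k-mers shared by a whole genus or private to one species are retained.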



The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by the Italian MIUR project PRIN20122F87B2.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Department of Information Engineering, University of Padova, Padua, Italy
