Accurate Reconstruction of Microbial Strains from Metagenomic Sequencing Using Representative Reference Genomes

  • Zhemin Zhou
  • Nina Luhmann
  • Nabil-Fareed Alikhan
  • Christopher Quince
  • Mark Achtman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10812)


Exploring the genetic diversity of microbes in the environment through metagenomic sequencing first requires classifying the sequencing reads into taxonomic groups. Current methods compare these reads against existing reference databases, which are biased and limited. Several recent evaluation studies demonstrate that current methods either lack sufficient sensitivity for species-level assignments or suffer from false positives, overestimating the number of species in the metagenome. Both problems are especially serious when identifying low-abundance microbial species, e.g., when detecting pathogens in ancient metagenomic samples. We present a new method, SPARSE, which improves taxonomic assignments of metagenomic reads. SPARSE balances existing biased reference databases by grouping reference genomes into similarity-based hierarchical clusters, implemented as an efficient incremental data structure. SPARSE assigns reads to these clusters using a probabilistic model that specifically penalizes non-specific mappings of reads from unknown sources and hence reduces false-positive assignments. Our evaluation on simulated datasets from two recent evaluation studies demonstrated the improved precision of SPARSE in comparison to other methods for species-level classification. In a third simulation, our method successfully differentiated multiple co-existing Escherichia coli strains from the same sample. In real archaeological datasets, SPARSE identified ancient pathogens with \(\le 0.02\%\) abundance, consistent with published findings that required additional sequencing data. In these datasets, other methods either missed targeted pathogens or reported non-existent ones.
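
The idea of grouping reference genomes into similarity-based hierarchical clusters that can be extended one genome at a time can be illustrated with a toy sketch. It is not the SPARSE implementation: the k-mer Jaccard similarity, the three cut-off values in `LEVELS`, and the names `kmers`, `jaccard` and `IncrementalClusters` are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical similarity cut-offs, ordered from coarsest to finest level.
LEVELS: Tuple[float, ...] = (0.5, 0.8, 0.95)


def kmers(seq: str, k: int = 4) -> Set[str]:
    """Return the set of k-mers in a sequence (toy stand-in for a genome sketch)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two k-mer sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class IncrementalClusters:
    """Give each genome one cluster label per level, adding genomes one by one."""

    def __init__(self, levels: Tuple[float, ...] = LEVELS) -> None:
        self.levels = levels
        self.sketches: Dict[str, Set[str]] = {}
        self.labels: Dict[str, List[str]] = {}

    def add(self, name: str, seq: str) -> List[str]:
        sketch = kmers(seq)

        # Find the most similar genome that is already in the database.
        best_name: Optional[str] = None
        best_sim = 0.0
        for other, other_sketch in self.sketches.items():
            sim = jaccard(sketch, other_sketch)
            if sim > best_sim:
                best_name, best_sim = other, sim

        # Inherit the neighbour's label at every level whose cut-off is met;
        # found a new cluster (named after this genome) at every finer level.
        labels: List[str] = []
        for i, threshold in enumerate(self.levels):
            if best_name is not None and best_sim >= threshold:
                labels.append(self.labels[best_name][i])
            else:
                labels.append(name)

        self.sketches[name] = sketch
        self.labels[name] = labels
        return labels


if __name__ == "__main__":
    db = IncrementalClusters()
    print(db.add("g1", "ACGTTGCAAGGCTA"))
    print(db.add("g2", "ACGTTGCAAGGCTA"))  # identical to g1
    print(db.add("g3", "ACGTTGCAAGGCCC"))  # partially similar to g1
```

Because the cut-offs increase from coarse to fine, a new genome that fails one level also fails every finer level, so the labels stay nested and existing assignments never need to be recomputed when a genome is added. A real database would use a scalable genome-distance estimate rather than exact k-mer sets.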

SPARSE and all evaluation scripts are available at


Keywords: Metagenomic reads; Species-level assignment; Metagenomic samples; Taxonomic binning; Core genome



M.A., Z.Z., N.L. and N-F.A. were supported by the Wellcome Trust (202792/Z/16/Z). Additional initial grant support was from the BBSRC (BB/L020319/1).



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Warwick Medical School, University of Warwick, Coventry, UK
