Unbiased Taxonomic Annotation of Metagenomic Samples

Part of the Lecture Notes in Computer Science book series (LNBI, volume 10330)

Abstract

The classification of reads from a metagenomic sample using a reference taxonomy is usually based on first mapping the reads to the reference sequences and then classifying each read at a node, under the lowest common ancestor of its candidate sequences in the reference taxonomy, with the least classification error. However, this taxonomic annotation can be biased by an imbalanced taxonomy and also by the presence of multiple nodes in the taxonomy with the least classification error for a given read. In this paper, we show that the Rand index is a better indicator of classification error than the often-used area under the ROC curve and F-measure, for both balanced and imbalanced reference taxonomies. We also address the second source of bias by reducing the taxonomic annotation problem for a whole metagenomic sample to a set cover problem, for which a logarithmic approximation can be obtained in linear time.
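To make the set cover reduction concrete, the following is a minimal sketch of the classical greedy heuristic for set cover, which achieves a logarithmic approximation ratio. It is an illustration under stated assumptions, not the paper's implementation: the function name `greedy_set_cover`, the node names, and the read identifiers are hypothetical, and each taxonomy node is assumed to come with the set of reads for which it attains the least classification error. This simple version rescans all candidates on every round, so it does not run in linear time; a linear-time variant needs additional bookkeeping.

```python
from typing import Dict, Hashable, List, Set


def greedy_set_cover(candidates: Dict[Hashable, Set[Hashable]]) -> List[Hashable]:
    """Greedily choose taxonomy nodes until every read is covered.

    `candidates` maps a node to the set of reads it can annotate
    (an assumed input format, chosen for illustration only).
    """
    # Every read that appears under at least one candidate node.
    uncovered: Set[Hashable] = set().union(*candidates.values())
    # Fixed iteration order so that ties are broken deterministically.
    nodes = sorted(candidates, key=str)
    chosen: List[Hashable] = []
    while uncovered:
        # Pick the node that covers the most reads still uncovered.
        best = max(nodes, key=lambda node: len(candidates[node] & uncovered))
        chosen.append(best)
        uncovered -= candidates[best]
    return chosen


# Hypothetical sample: five reads and four candidate nodes.
example = {
    "genus_A": {"r1", "r2", "r3"},
    "species_A1": {"r1", "r2"},
    "species_B1": {"r3", "r4"},
    "genus_B": {"r4", "r5"},
}
annotation = greedy_set_cover(example)
print(annotation)
```

In this toy input, two nodes suffice to annotate all five reads. Annotating the whole sample at once, rather than one read at a time, is what lets ties between equally good nodes for a single read be resolved in favor of nodes that also explain other reads.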

Keywords

  • Metagenomics
  • Classification
  • Taxonomic annotation
  • Correlation
  • Set cover

Acknowledgements

Partially supported by the Spanish Ministry of Economy and Competitiveness and the European Regional Development Fund, project DPI2015-67082-P (MINECO/FEDER).

Corresponding author

Correspondence to Gabriel Valiente.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Fosso, B., Pesole, G., Rosselló, F., Valiente, G. (2017). Unbiased Taxonomic Annotation of Metagenomic Samples. In: Cai, Z., Daescu, O., Li, M. (eds) Bioinformatics Research and Applications. ISBRA 2017. Lecture Notes in Computer Science, vol 10330. Springer, Cham. https://doi.org/10.1007/978-3-319-59575-7_15

  • DOI: https://doi.org/10.1007/978-3-319-59575-7_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-59574-0

  • Online ISBN: 978-3-319-59575-7

  • eBook Packages: Computer Science, Computer Science (R0)