Unbiased Taxonomic Annotation of Metagenomic Samples

  • Bruno Fosso
  • Graziano Pesole
  • Francesc Rosselló
  • Gabriel Valiente
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10330)

Abstract

The classification of reads from a metagenomic sample using a reference taxonomy is usually based on first mapping the reads to the reference sequences and then classifying each read at a node with the least classification error among those under the lowest common ancestor of the candidate sequences in the reference taxonomy. However, this taxonomic annotation can be biased by an imbalanced taxonomy and by the presence of multiple nodes in the taxonomy with the least classification error for a given read. In this paper, we show that the Rand index is a better indicator of classification error than the often-used area under the ROC curve and F-measure, for both balanced and imbalanced reference taxonomies. We also address the second source of bias by reducing the taxonomic annotation problem for a whole metagenomic sample to a set cover problem, for which a logarithmic approximation can be obtained in linear time.
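As a rough illustration of the two ingredients named in the abstract, the following Python sketch computes the Rand index and the F-measure from confusion-matrix counts and runs the classical greedy heuristic for set cover, which achieves a logarithmic approximation ratio. The abstract gives no formulas or pseudocode, so everything here is an assumption: the standard binary-classification definitions are used for both measures, the function names and the toy reads and nodes are hypothetical, and the simple greedy loop shown is not the linear-time implementation the abstract refers to.

```python
from typing import Dict, Hashable, Iterable, List, Set


def rand_index(tp: int, fp: int, tn: int, fn: int) -> float:
    """Rand index for a binary outcome: (TP + TN) / (TP + FP + TN + FN)."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def f_measure(tp: int, fp: int, fn: int) -> float:
    """F-measure (harmonic mean of precision and recall): 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def greedy_set_cover(
    universe: Iterable[Hashable],
    candidates: Dict[Hashable, Set[Hashable]],
) -> List[Hashable]:
    """Greedy set cover: repeatedly pick the candidate covering the most
    still-uncovered elements. Here the elements stand for reads and the
    candidates for taxonomy nodes (an assumed modelling, for illustration).
    """
    uncovered = set(universe)
    chosen: List[Hashable] = []
    while uncovered:
        # Candidate that covers the largest number of uncovered reads.
        best = max(candidates, key=lambda k: len(candidates[k] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("some reads are not covered by any candidate node")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical toy sample: five reads and four candidate taxonomy nodes.
reads = ["r1", "r2", "r3", "r4", "r5"]
nodes = {
    "A": {"r1", "r2", "r3"},
    "B": {"r3", "r4"},
    "C": {"r4", "r5"},
    "D": {"r1"},
}
print(greedy_set_cover(reads, nodes))  # ['A', 'C']
print(rand_index(tp=40, fp=10, tn=45, fn=5))
print(f_measure(tp=40, fp=10, fn=5))
```

In the toy sample, node A is chosen first because it covers three reads, and node C then covers the remaining two. Note that the Rand index counts true negatives while the F-measure ignores them, so the two measures can diverge when the classes are imbalanced.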

Keywords

Metagenomics, Classification, Taxonomic annotation, Correlation, Set cover

Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Bruno Fosso (1)
  • Graziano Pesole (1)
  • Francesc Rosselló (2)
  • Gabriel Valiente (3)

  1. Institute of Biomembranes and Bioenergetics, Consiglio Nazionale delle Ricerche, Bari, Italy
  2. Department of Mathematics and Computer Science, Research Institute of Health Science, University of the Balearic Islands, Palma de Mallorca, Spain
  3. Algorithms, Bioinformatics, Complexity and Formal Methods Research Group, Technical University of Catalonia, Barcelona, Spain
