MetaVW: Large-Scale Machine Learning for Metagenomics Sequence Classification

  • Kévin Vervier
  • Pierre Mahé
  • Jean-Philippe Vert
Part of the Methods in Molecular Biology book series (MIMB, volume 1807)


Metagenomics is the study of microbial community diversity, especially of uncultured microorganisms, by shotgun sequencing of environmental samples. As sequencer throughput and data volume increase, it becomes challenging to develop scalable bioinformatics tools that reconstruct microbiome structure by binning sequencing reads to reference genomes. Standard alignment-based methods, such as BWA-MEM, provide state-of-the-art performance, but we demonstrated in Vervier et al. (2016) that compositional approaches using nucleotide motifs achieve faster analysis times with comparable accuracy. In this work, we describe how to use MetaVW, a scalable machine learning implementation for binning short sequencing reads based on their k-mer profiles. We provide a step-by-step guide to how we trained the classification models and how the approach can easily be generalized to user-defined reference genomes and specific applications. We also give additional details on how the algorithm's parameters affect performance.
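To make the notion of a k-mer profile concrete, here is a minimal Python sketch that counts the overlapping k-mers in a single read. It is an illustration only, not MetaVW's own code: the function name `kmer_profile`, the choice of k = 4, and the decision to skip k-mers containing ambiguous bases are all assumptions made for this example.

```python
from collections import Counter


def kmer_profile(read: str, k: int = 4) -> Counter:
    """Count overlapping k-mers in a read (hypothetical helper, not part of MetaVW).

    k-mers containing any character other than A, C, G, or T are skipped;
    this handling of ambiguous bases is an assumption of the sketch.
    """
    read = read.upper()
    valid = set("ACGT")
    counts = Counter()
    # Slide a window of width k across the read, one base at a time.
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if set(kmer) <= valid:
            counts[kmer] += 1
    return counts


# Toy read, chosen only for illustration.
profile = kmer_profile("ACGTACGTNA", k=4)
for kmer, count in sorted(profile.items()):
    print(kmer, count)
```

A classifier of the kind described above would take such counts, computed over all possible k-mers, as the feature vector for each read.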

Key words

Metagenomics, Machine learning, Classification, Next-generation sequencing, Microbiology, Binning



This work was supported by the European Research Council (SMAC-ERC-280032 to J-P.V.).


References

  1. Handelsman J (2004) Metagenomics: application of genomics to uncultured microorganisms. Microbiol Mol Biol Rev 68(4):669–685
  2. Quince C et al (2017) Shotgun metagenomics, from sampling to analysis. Nat Biotechnol 35(9):833–844
  3. Vervier K et al (2016) Large-scale machine learning for metagenomics sequence classification. Bioinformatics 32(7):1023–1032
  4. Wood DE, Salzberg SL (2014) Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol 15:R46
  5. Simner PJ et al (2018) Understanding the promises and hurdles of metagenomic next-generation sequencing as a diagnostic tool for infectious diseases. Clin Infect Dis 66(5):778–788
  6. Sonnenburg S et al (2006) Large scale learning with string kernels. J Mach Learn Res 7:1531–1565
  7. Gammerman A, Vovk V (2007) Hedging predictions in machine learning. Comp J 50(2):151–163
  8. Parks D et al (2011) Classifying short genomic fragments from novel lineages using composition and homology. BMC Bioinformatics 12:328–344

Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Kévin Vervier (1)
  • Pierre Mahé (2)
  • Jean-Philippe Vert (3, 4, 5, 6), corresponding author

  1. Department of Psychiatry, University of Iowa Hospital and Clinics, Iowa, USA
  2. Bioinformatics Research Department, BioMérieux, Marcy-l’Étoile, France
  3. MINES ParisTech, PSL Research University, CBIO-Centre for Computational Biology, Fontainebleau, France
  4. Institut Curie, Paris Cedex, France
  5. INSERM U900, Paris Cedex, France
  6. Département de Mathématiques et Applications, École Normale Supérieure, CNRS, PSL Research University, Paris, France
