Encyclopedia of Metagenomics

2015 Edition
Editors: Sarah K. Highlander, Francisco Rodriguez-Valera, Bryan A. White

Clustering-Based HMP Sequence Comparison

Reference work entry
DOI: https://doi.org/10.1007/978-1-4899-7475-4_90


Comparison of sequences from human microbiome projects using sequence clustering methods


Sequence clustering is a computational method that groups similar sequences into families. Clustering the sequences from multiple samples of human microbiome projects or other metagenomic projects together is an effective way to compare those samples.
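As a rough illustration of the definition above, the sketch below pools sequences from several samples, clusters them greedily, and then counts how many members of each cluster came from each sample. It is a toy example under stated assumptions: the k-mer Jaccard similarity is a crude stand-in for alignment-based identity, and the function names, the threshold, and the sample data are all hypothetical. It is not the algorithm of any particular clustering program.

```python
from collections import Counter


def kmer_set(seq, k=4):
    """Return the set of k-mers found in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=4):
    """Jaccard similarity of k-mer sets (assumed proxy for sequence identity)."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)


def greedy_cluster(records, threshold=0.8, k=4):
    """Greedy incremental clustering of pooled (sample_id, sequence) pairs.

    Sequences are visited longest first. Each one joins the first cluster
    whose representative is similar enough; otherwise it starts a new
    cluster and becomes that cluster's representative.
    """
    clusters = []
    for sample_id, seq in sorted(records, key=lambda r: -len(r[1])):
        for cluster in clusters:
            if similarity(seq, cluster["rep"], k) >= threshold:
                cluster["members"].append((sample_id, seq))
                break
        else:
            clusters.append({"rep": seq, "members": [(sample_id, seq)]})
    return clusters


def cluster_by_sample_counts(clusters):
    """For each cluster, count the member sequences contributed by each sample."""
    return [Counter(sample for sample, _ in c["members"]) for c in clusters]


# Hypothetical pooled input: two samples, two underlying sequence families.
pooled = [
    ("sample_A", "ACGTACGTACGTACGTACGT"),
    ("sample_A", "ACGTACGTACGTACGTACGA"),
    ("sample_B", "ACGTACGTACGTACGTACGT"),
    ("sample_B", "TTGGCCAATTGGCCAATTGG"),
    ("sample_B", "TTGGCCAATTGGCCAATTGC"),
]

clusters = greedy_cluster(pooled, threshold=0.6)
counts = cluster_by_sample_counts(clusters)
for index, per_sample in enumerate(counts):
    print(index, dict(per_sample))
```

Each row of the resulting table describes one cluster and each column one sample, so samples can be compared by which clusters they share and in what proportions, without consulting a reference database.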


Numerous human microbiome projects and other metagenomic projects have sequenced many microbiome samples using high-throughput sequencing platforms. A key goal of these projects is to compare samples or groups of samples according to their composition and abundance profiles by taxon, gene, function, and pathway. These profiles are often calculated by comparing the sequences against various reference databases. However, reference-based methods cannot analyze the large number of novel sequences that are frequently found in metagenomic samples.
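To make the idea of comparing abundance profiles concrete, the following sketch normalizes per-sample feature counts to relative abundances and computes a dissimilarity between two samples. The Bray-Curtis measure is chosen here only as a common, simple option and is an assumption, not something the entry prescribes. The feature labels and counts are invented for illustration.

```python
def relative_abundance(counts):
    """Convert raw feature counts into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {feature: n / total for feature, n in counts.items()}


def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two profiles (0 = identical, 1 = disjoint)."""
    features = set(p) | set(q)
    numerator = sum(abs(p.get(f, 0.0) - q.get(f, 0.0)) for f in features)
    denominator = sum(p.get(f, 0.0) + q.get(f, 0.0) for f in features)
    return numerator / denominator if denominator else 0.0


# Hypothetical read counts per reference taxon; "unassigned" collects reads
# that matched nothing in the reference database.
sample_a = {"taxon_1": 60, "taxon_2": 30, "taxon_3": 10}
sample_b = {"taxon_1": 20, "taxon_2": 20, "unassigned": 60}

profile_a = relative_abundance(sample_a)
profile_b = relative_abundance(sample_b)
distance = bray_curtis(profile_a, profile_b)
print(round(distance, 3))
```

In this invented case most of the difference between the two samples sits in the `unassigned` bin, which a reference-based profile cannot resolve any further.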

Clustering analysis is a data mining and classification method that assigns similar...




Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Center for Research in Biological Systems (CRBS), University of California, La Jolla, USA