The Journal of Microbiology, Volume 50, Issue 2, pp 181–185

TBC: A clustering algorithm based on prokaryotic taxonomy

  • Jae-Hak Lee
  • Hana Yi
  • Yoon-Seong Jeon
  • Sungho Won
  • Jongsik Chun


High-throughput DNA sequencing technologies have revolutionized the study of microbial ecology. Massive sequencing of PCR amplicons of the 16S rRNA gene has been widely used to understand the microbial community structure of a variety of environmental samples. The resulting sequencing reads are clustered into operational taxonomic units that are then used to calculate various statistical indices that represent the degree of species diversity in a given sample. Several algorithms have been developed to perform this task, but they tend to produce different outcomes. Herein, we propose a novel sequence clustering algorithm, namely Taxonomy-Based Clustering (TBC). This algorithm incorporates the basic concept of prokaryotic taxonomy in which only comparisons to the type strain are made and used to form species while omitting full-scale multiple sequence alignment. The clustering quality of the proposed method was compared with those of MOTHUR, BLASTClust, ESPRIT-Tree, CD-HIT, and UCLUST. A comprehensive comparison using three different experimental datasets produced by pyrosequencing demonstrated that the clustering obtained using TBC is comparable to those obtained using MOTHUR and ESPRIT-Tree and is computationally efficient. The program was written in JAVA and is available from
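The idea of comparing each read only to one reference sequence per cluster, rather than building a full multiple sequence alignment, can be illustrated with a minimal sketch. This is not the TBC program itself: the function names, the 0.97 default threshold, and the use of Python's `difflib` ratio as a crude stand-in for pairwise sequence similarity are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude stand-in for pairwise sequence similarity (an assumption,
    not the measure used by TBC or any of the tools named above)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_by_representative(reads, threshold=0.97):
    """Greedy clustering: each read is compared only to the first member
    (the representative) of every existing cluster, never to all reads."""
    clusters = []  # each cluster is a list; element 0 is its representative
    for read in reads:
        for cluster in clusters:
            if similarity(read, cluster[0]) >= threshold:
                cluster.append(read)  # close enough to this representative
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append([read])
    return clusters


if __name__ == "__main__":
    # Hypothetical toy reads, far shorter than real 16S rRNA amplicons.
    toy_reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC"]
    for members in cluster_by_representative(toy_reads, threshold=0.8):
        print(len(members), members[0])
```

Each resulting cluster plays the role of one operational taxonomic unit, and the cluster sizes are the kind of input from which diversity indices are then calculated. The result of a greedy scheme like this depends on the order in which reads are processed, which is one reason different clustering tools can give different outcomes.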


Keywords: TBC, clustering algorithm, OTU, CD-HIT, UCLUST, MOTHUR, ESPRIT-Tree, BLASTClust, pyrosequencing, metagenome



Supplementary material

12275_2012_1214_MOESM1_ESM.pdf (68 kb)



Copyright information

© The Microbiological Society of Korea and Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Jae-Hak Lee (1)
  • Hana Yi (2)
  • Yoon-Seong Jeon (1, 5)
  • Sungho Won (3)
  • Jongsik Chun (1, 2, 4, 5)

  1. Interdisciplinary Graduate Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
  2. Inst. of Molecular Biology and Genetics, Seoul National University, Seoul, Republic of Korea
  3. Department of Statistics, Chung-Ang University, Seoul, Republic of Korea
  4. School of Biological Sciences and Advanced Inst. of Convergence Tech., Seoul National University, Seoul, Republic of Korea
  5. Chunlab, Inc., Seoul National University, Seoul, Republic of Korea
