Journal of Biosciences, 44:148

A critical analysis of state-of-the-art metagenomics OTU clustering algorithms

  • Ashaq Hussain Bhat
  • Puniethaa Prabhu
  • Kalpana Balakrishnan


Taxonomic profiling using hypervariable regions of the 16S rRNA gene is one of the important goals of metagenomic analysis. Operational taxonomic unit (OTU) clustering algorithms are important tools for taxonomic profiling: they group 16S rRNA sequence reads into OTU clusters. Various OTU clustering algorithms are currently available within different pipelines, and some pipelines implement more than one clustering algorithm, but little literature is available on the relative performance and features of these algorithms. This makes the choice among these methods unclear. In this study, five current state-of-the-art OTU clustering algorithms (CDHIT, Mothur’s Average Neighbour, SUMACLUST, Swarm, and UCLUST) were comprehensively evaluated on metagenomic sequencing data. In all the datasets, Mothur’s Average Neighbour and Swarm created more OTU clusters than the other algorithms. Based on normalized mutual information (NMI) and normalized information difference (NID), Swarm and Mothur’s Average Neighbour showed better clustering quality than the others. In terms of time complexity, however, the greedy algorithms (SUMACLUST, CDHIT, and UCLUST) performed well. There is therefore a trade-off between clustering quality and time, which must be considered when analysing large 16S rRNA gene sequencing datasets.
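To make the quality measures named above concrete, the following is a minimal Python sketch of how two OTU assignments of the same reads might be compared. It is an illustration under stated assumptions, not the evaluation code used in the study: the function names are hypothetical, NMI is normalized here by the arithmetic mean of the two entropies (one common choice among several), and NID is taken as one common form of the normalized information distance, 1 − I(U;V)/max(H(U), H(V)), which may differ from the exact definition used by the authors.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a cluster labelling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labellings of the same reads."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    ha, hb = entropy(a), entropy(b)
    if ha + hb == 0:
        return 1.0  # both labellings put every read in a single cluster
    return 2 * mutual_information(a, b) / (ha + hb)


def nid(a, b):
    """Assumed NID form: 1 - I(U;V) / max(H(U), H(V))."""
    m = max(entropy(a), entropy(b))
    if m == 0:
        return 0.0
    return 1 - mutual_information(a, b) / m


# Hypothetical example: ground-truth taxa versus predicted OTU ids for six reads.
truth = ["A", "A", "A", "B", "B", "C"]
predicted = [1, 1, 2, 3, 3, 4]
print(round(nmi(truth, predicted), 3), round(nid(truth, predicted), 3))
```

Under these conventions a higher NMI and a lower NID both indicate closer agreement between the predicted OTUs and the reference labelling, and both measures are unaffected by how the clusters are named.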


Bioinformatics pipelines; clustering; metagenomics algorithms; microbiome; next generation sequencing; operational taxonomic units



The authors are grateful to the Management and Principal of K.S. Rangasamy College of Technology, Tiruchengode, Namakkal, Tamil Nadu, India, and to DST FIST (Fund for Infrastructure for Science and Technology; FIST No: 368 (SR/FST/College – 235/2014)) and the DBT-STAR College Scheme (BT/HRD/11/09/2018) for providing laboratory support in the Institution. We also thank Prof. M. Umaiarasi for proofreading the manuscript.



Copyright information

© Indian Academy of Sciences 2019

Authors and Affiliations

  • Ashaq Hussain Bhat (1)
  • Puniethaa Prabhu (1, corresponding author)
  • Kalpana Balakrishnan (2)

  1. K.S. Rangasamy College of Technology, Tiruchengode, India
  2. Indian Institute of Technology Madras, Chennai, India
