A Greedy Algorithm for Hierarchical Complete Linkage Clustering

  • Ernst Althaus
  • Andreas Hildebrandt
  • Anna Katharina Hildebrandt
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8542)

Abstract

We are interested in the greedy method for computing a hierarchical complete linkage clustering. There are two known methods for this problem: one has a running time of \({\mathcal O}(n^3)\) with a space requirement of \({\mathcal O}(n)\), and the other has a running time of \({\mathcal O}(n^2 \log n)\) with a space requirement of \(\Theta(n^2)\), where \(n\) is the number of points to be clustered. Neither method is capable of handling large point sets. In this paper, we give an algorithm with a space requirement of \({\mathcal O}(n)\) that is able to cluster one million points in a day on current commodity hardware.
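To make the problem statement concrete, the following minimal Python sketch illustrates the greedy method in its most naive form: it repeatedly merges the two clusters whose complete linkage distance (the largest distance between a point of one cluster and a point of the other) is smallest. It is not the algorithm proposed in the paper. It recomputes all linkage distances in every step, so it stays within linear extra memory at the cost of cubic running time in the worst case. The function names, the use of Euclidean distance, and the scheme of numbering merged clusters from n upwards are assumptions made for illustration.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def complete_linkage(a, b):
    """Largest distance between a point of cluster a and a point of cluster b."""
    return max(dist(p, q) for p in a for q in b)


def greedy_complete_linkage(points):
    """Return the merges as (id_a, id_b, distance) in greedy order.

    Input points get ids 0..n-1; the cluster created by the k-th merge
    gets id n + k (an illustrative convention, not taken from the paper).
    """
    clusters = {i: [tuple(p)] for i, p in enumerate(points)}
    next_id = len(clusters)
    merges = []
    while len(clusters) > 1:
        # Scan all cluster pairs and pick the one with the smallest
        # complete linkage distance; nothing is cached between steps.
        a, b = min(
            combinations(clusters, 2),
            key=lambda ab: complete_linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        d = complete_linkage(clusters[a], clusters[b])
        # Replace the two clusters by their union.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges
```

For four points on a line, `greedy_complete_linkage([(0, 0), (1, 0), (5, 0), (7, 0)])` first merges the two leftmost points, then the two rightmost points, and finally the two resulting clusters. Caching all pairwise cluster distances would avoid the repeated scans, but requires quadratic memory, which is the trade-off the abstract describes.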

Keywords

bioinformatics, algorithm-engineering, clustering, unsupervised machine learning

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Ernst Althaus (1)
  • Andreas Hildebrandt (1)
  • Anna Katharina Hildebrandt (2)
  1. Institut für Informatik, Johannes Gutenberg-Universität, Mainz, Germany
  2. Max-Planck Institute for Informatics, Saarbrücken, Germany
