A New, Fast and Accurate Algorithm for Hierarchical Clustering on Euclidean Distances

  • Elio Masciari
  • Giuseppe Massimiliano Mazzeo
  • Carlo Zaniolo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7819)

Abstract

A simple hierarchical clustering algorithm called CLUBS (for CLustering Using Binary Splitting) is proposed. CLUBS is faster and more accurate than existing algorithms, including k-means and its recently proposed refinements. The algorithm consists of a divisive phase and an agglomerative phase; during these two phases, the samples are repartitioned using a least quadratic distance criterion possessing unique analytical properties that we exploit to achieve a very fast computation. CLUBS derives good clusters without requiring input from users, and it is robust and impervious to noise, while providing better speed and accuracy than methods, such as BIRCH, that are endowed with the same critical properties.
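The abstract does not spell out the analytical properties of the least quadratic distance criterion. One well-known identity for the sum of squared distances (SSQ) is that splitting a cluster into two parts with sizes n1, n2 and centroids c1, c2 reduces the SSQ by exactly n1·n2/(n1+n2)·‖c1 − c2‖², so a candidate split can be scored from counts and running sums alone. The sketch below illustrates that identity for a single binary split. It is not the authors' implementation: the function names (`ssq`, `merge_cost`, `best_axis_split`) are hypothetical, and restricting candidates to axis-parallel cuts is an assumption made only to keep the example small.

```python
def ssq(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    return sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dims))


def merge_cost(n1, c1, n2, c2):
    """SSQ increase when two clusters merge (equally, the reduction when they split)."""
    dist2 = sum((a - b) ** 2 for a, b in zip(c1, c2))
    return n1 * n2 / (n1 + n2) * dist2


def best_axis_split(points):
    """Return (gain, dimension, threshold) for the best axis-parallel cut.

    Assumption for illustration: only cuts orthogonal to one coordinate axis
    are considered. Running sums give both centroids at each candidate cut
    without rescanning the points.
    """
    n = len(points)
    dims = len(points[0])
    total = [sum(p[d] for p in points) for d in range(dims)]
    best = (0.0, None, None)
    for d in range(dims):
        ordered = sorted(points, key=lambda p: p[d])
        left = [0.0] * dims  # coordinate sums of the points left of the cut
        for i in range(1, n):
            for k in range(dims):
                left[k] += ordered[i - 1][k]
            if ordered[i - 1][d] == ordered[i][d]:
                continue  # no cut fits between equal coordinates
            c1 = [s / i for s in left]
            c2 = [(t - s) / (n - i) for t, s in zip(total, left)]
            gain = merge_cost(i, c1, n - i, c2)
            if gain > best[0]:
                threshold = (ordered[i - 1][d] + ordered[i][d]) / 2
                best = (gain, d, threshold)
    return best
```

A divisive phase could apply such a split repeatedly while the gain stays large, and an agglomerative phase could use `merge_cost` to decide which neighbouring clusters to merge back; the stopping rules for both are outside what the abstract states.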

References

  1. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: Ordering points to identify the clustering structure. SIGMOD Record 28(2), 49–60 (1999)
  2. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: SODA, pp. 1027–1035 (2007)
  3. Ben-David, S., Ackerman, M.: Measures of clustering quality: A working set of axioms for clustering. In: NIPS, pp. 121–128 (2008)
  4. Cheung, Y.M.: k*-means: A new generalized k-means clustering algorithm. Pattern Recognition Letters 24(15), 2883–2893 (2003)
  5. Einbond, L.S., Su, T., Wu, H., Friedman, R., Wang, X., Ramirez, A., Kronenberg, F., Weinstein, I.B.: The growth inhibitory effect of actein on human breast cancer cells is associated with activation of stress response pathways. I. J. of Cancer 121(9), 2073–2083 (2007)
  6. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD (1996)
  7. Flesca, S., Manco, G., Masciari, E., Pontieri, L., Pugliese, A.: Fast detection of XML structural similarity. TKDE 17(2), 160–175 (2005)
  8. Graham, K., De Las Morenas, A., Tripathi, A., King, C., Kavanah, M., Mendez, J., Stone, M., Slama, J., Miller, M., Antoine, G., Willers, H., Sebastiani, P., Rosenberg, C.L.: Gene expression in histologically normal epithelium from breast cancer patients and from cancer-free prophylactic mastectomy patients shares a similar profile. Br. J. Cancer 102(8), 1284–1293 (2010)
  9. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann (2000)
  10. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: 5-th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297 (1967)
  11. Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: FOCS (2006)
  12. Veenman, C.J., Reinders, M.J.T.: The nearest subclass classifier: A compromise between the nearest mean and nearest neighbor classifier. IEEE Trans. Pattern Anal. Mach. Intell. 27(9), 1417–1429 (2005)
  13. Wang, W., Yang, J., Muntz, R.R.: STING: A statistical information grid approach to spatial data mining. In: VLDB, pp. 186–195 (1997)
  14. Zhang, T., Ramakrishnan, R., Livny, M.: BIRCH: An efficient data clustering method for very large databases. In: SIGMOD, pp. 103–114 (1996)

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Elio Masciari (1)
  • Giuseppe Massimiliano Mazzeo (1)
  • Carlo Zaniolo (2)
  1. ICAR-CNR, Italy
  2. UCLA, USA
