SAHN Clustering in Arbitrary Metric Spaces Using Heuristic Nearest Neighbor Search

  • Nils Kriege
  • Petra Mutzel
  • Till Schäfer
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8344)

Abstract

Sequential agglomerative hierarchical non-overlapping (SAHN) clustering techniques belong to the classical clustering methods that are applied heavily in many application domains, e.g., in cheminformatics. Asymptotically optimal SAHN clustering algorithms are known for arbitrary dissimilarity measures, but their quadratic time and space complexity, even in the best case, restricts their applicability to small data sets. We present a new pivot-based heuristic SAHN clustering algorithm that exploits the properties of metric distance measures to achieve a best-case running time of \(\mathcal{O}(n\log n)\) for input size n. Our approach requires only linear space and supports median and centroid linkage. It is especially suitable for expensive distance measures, as it needs only a linear number of exact distance computations. In extensive experimental evaluations on real-world and synthetic data sets, we compare our approach to exact state-of-the-art SAHN algorithms in terms of quality and running time. The evaluations show a subquadratic running time in practice and a very low memory footprint.
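
The paper's own algorithm is not reproduced on this page, but the pivot idea the abstract alludes to rests on a standard property of metric spaces: for any pivot p, the triangle inequality gives the lower bound |d(q, p) − d(x, p)| ≤ d(q, x), so precomputed point-to-pivot distances can rule out candidates without computing the expensive exact distance. A minimal sketch of this filtering principle (function and variable names are illustrative, not from the paper):

```python
def dist(a, b):
    """Euclidean distance in the plane; stands in for an expensive metric."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def pivot_nn(query, points, pivots, pivot_dists):
    """Exact nearest-neighbor search that skips candidates whose
    triangle-inequality lower bound already exceeds the best distance.

    pivot_dists[i][j] holds the precomputed distance d(points[i], pivots[j]).
    """
    q_to_p = [dist(query, p) for p in pivots]  # query-to-pivot distances
    best, best_d = None, float("inf")
    for i, x in enumerate(points):
        # Lower bound on d(query, x): max over pivots of |d(q,p) - d(x,p)|.
        lb = max(abs(q_to_p[j] - pivot_dists[i][j]) for j in range(len(pivots)))
        if lb >= best_d:
            continue  # cannot beat the current best; skip the exact distance
        d = dist(query, x)
        if d < best_d:
            best, best_d = x, d
    return best, best_d
```

Because the lower bound never exceeds the true distance, pruning preserves exactness here; the heuristic element in the paper's approach lies in how such bounds are used to avoid exact computations altogether, which this sketch does not attempt to reproduce.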

Keywords

SAHN clustering, nearest neighbor, heuristic, data mining

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Nils Kriege (1)
  • Petra Mutzel (1)
  • Till Schäfer (1)

  1. Dept. of Computer Science, Technische Universität Dortmund, Germany
