Abstract
Clustering non-Euclidean data is difficult, and one of the most widely used algorithms besides hierarchical clustering is Partitioning Around Medoids (PAM), also simply referred to as k-medoids.
In Euclidean geometry the mean, as used in k-means, is a good estimator for the cluster center, but a mean does not exist for arbitrary dissimilarities. PAM uses the medoid instead: the object with the smallest total dissimilarity to all other objects in the cluster. This notion of centrality can be used with any (dis-)similarity, and it is therefore highly relevant to many domains and applications.
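To make this notion concrete, here is a minimal Python sketch (illustrative only, not the paper's implementation) of selecting the medoid of one cluster from a precomputed dissimilarity matrix; the names `dissim` and `cluster` are assumptions for this sketch.

```python
import numpy as np

def medoid(dissim: np.ndarray, cluster: np.ndarray) -> int:
    """Return the index of the cluster member with the smallest
    sum of dissimilarities to all other members (the medoid).

    dissim  -- precomputed n x n dissimilarity matrix
    cluster -- indices of the objects in one cluster
    """
    # Pairwise dissimilarities restricted to the cluster members.
    sub = dissim[np.ix_(cluster, cluster)]
    # Total dissimilarity of each member to the rest of the cluster.
    totals = sub.sum(axis=1)
    return int(cluster[np.argmin(totals)])
```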
A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second (“SWAP”) phase of the algorithm, while still finding the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. With the substantially faster SWAP, we can now explore faster initialization strategies. We also show how the CLARA and CLARANS algorithms benefit from the proposed modifications.
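For orientation, the sketch below shows the classic SWAP step that the paper accelerates; it is a naive illustration under assumed names, not the proposed FastPAM variant. Classic PAM evaluates all k·(n−k) candidate swaps per iteration, which is where the factor k that the paper removes comes from.

```python
import numpy as np

def swap_cost(dissim, medoids, i, x):
    """Total deviation TD if the medoid at position i is replaced by
    object x. Naive version, O(n*k) per call; classic PAM evaluates a
    swap incrementally in O(n), and FastPAM additionally shares this
    work across all k medoids, removing another factor of k.
    """
    cand = np.array(medoids)
    cand[i] = x
    # Assign every object to its nearest candidate medoid and sum up.
    return dissim[:, cand].min(axis=1).sum()

def best_swap(dissim, medoids):
    """One SWAP step: try all k*(n-k) candidate swaps, keep the best."""
    n = dissim.shape[0]
    base = swap_cost(dissim, medoids, 0, medoids[0])  # current TD
    best = (0.0, None, None)
    for i in range(len(medoids)):
        for x in range(n):
            if x in medoids:
                continue
            delta = swap_cost(dissim, medoids, i, x) - base
            if delta < best[0]:
                best = (delta, i, x)
    return best  # (improvement, medoid position, new medoid)
```

The relaxed variant described in the abstract would instead perform the best improving swap for each of the k medoids in one pass, rather than only the single best swap overall.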
Notes
- 1. Our O(k)-fold speedup should be immediately measurable, not merely asymptotic, because the constant overhead of maintaining the fixed array cache is small.
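As a hedged illustration of such a cache (assumed names, requires k ≥ 2 medoids): for every object one stores the distance to its nearest and second-nearest medoid, which is what makes shared swap evaluation cheap.

```python
import numpy as np

def nearest_cache(dissim, medoids):
    """Build fixed cache arrays: for every object, the position of its
    nearest medoid and the distances to the nearest and second-nearest
    medoid. Requires at least two medoids.
    """
    d = dissim[:, medoids]            # n x k distances to the medoids
    order = np.argsort(d, axis=1)     # medoids sorted by distance, per object
    rows = np.arange(d.shape[0])
    nearest = order[:, 0]             # position (in medoids) of the nearest
    d_nearest = d[rows, order[:, 0]]  # distance to nearest medoid
    d_second = d[rows, order[:, 1]]   # distance to second-nearest medoid
    return nearest, d_nearest, d_second
```

With these arrays, the change in total deviation caused by removing any one medoid can be accumulated in a single pass over the data, which is where the O(k)-fold saving in SWAP comes from.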
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Schubert, E., Rousseeuw, P.J. (2019). Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms. In: Amato, G., Gennaro, C., Oria, V., Radovanović, M. (eds.) Similarity Search and Applications. SISAP 2019. Lecture Notes in Computer Science, vol. 11807. Springer, Cham. https://doi.org/10.1007/978-3-030-32047-8_16
DOI: https://doi.org/10.1007/978-3-030-32047-8_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32046-1
Online ISBN: 978-3-030-32047-8
eBook Packages: Computer Science (R0)