Abstract
Clustering has been one of the most widely studied topics in data mining, and k-means has been one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on approximate versions of k-means that require only one or a small number of passes over the entire dataset.
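To make the cost concrete, the following is a minimal sketch of the standard (Lloyd-style) k-means iteration, written to count how many full scans of the data it performs. It is the textbook baseline, not FEKM; the function names `kmeans` and `nearest` and the pass counter are illustrative assumptions, not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centres: List[Point]) -> int:
    """Index of the centre closest to `point` (ties go to the lowest index)."""
    return min(range(len(centres)), key=lambda j: squared_distance(point, centres[j]))


def kmeans(points: List[Point], centres: List[Point], max_passes: int = 100):
    """Baseline k-means. Returns (centres, number of full passes over `points`)."""
    centres = [tuple(c) for c in centres]
    dim = len(centres[0])
    for passes in range(1, max_passes + 1):
        sums = [[0.0] * dim for _ in centres]
        counts = [0] * len(centres)
        # One full scan of the dataset: for disk-resident data this is the
        # expensive step, and it is repeated on every iteration.
        for p in points:
            j = nearest(p, centres)
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        # Recompute each centre as the mean of its assigned points;
        # an empty cluster keeps its previous centre.
        new_centres = [
            tuple(s / counts[j] for s in sums[j]) if counts[j] else centres[j]
            for j in range(len(centres))
        ]
        if new_centres == centres:  # converged: no centre moved
            return centres, passes
        centres = new_centres
    return centres, max_passes
```

Even on a trivially small input the loop needs one scan to move the centres and a second scan to confirm that they have stopped moving, which is the behaviour that motivates single-pass and few-pass variants.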
In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes over the entire dataset and provably produces the same cluster centres as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide theoretical analysis to show that the cluster centres thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedups of between 2 and 4.5 times compared with k-means.
This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data: downloading all the data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
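As background for the comparison with parallel k-means, the sketch below shows why a k-means centre update can be computed exactly on partitioned data: each site reports only per-cluster coordinate sums and counts, and a coordinator adds them before dividing. This illustrates the generic parallel k-means update, not DFEKM itself; the names `local_stats` and `merge_and_update` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Stats = Tuple[List[List[float]], List[int]]


def local_stats(points: Sequence[Point], centres: Sequence[Point]) -> Stats:
    """Run at each site: per-cluster coordinate sums and point counts."""
    dim = len(centres[0])
    sums = [[0.0] * dim for _ in centres]
    counts = [0] * len(centres)
    for p in points:
        # Assign the point to its nearest centre (squared Euclidean distance).
        j = min(
            range(len(centres)),
            key=lambda i: sum((x - y) ** 2 for x, y in zip(p, centres[i])),
        )
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def merge_and_update(stats: Sequence[Stats], centres: Sequence[Point]) -> List[Point]:
    """Run at the coordinator: add the sites' statistics, then take the means."""
    dim = len(centres[0])
    total_sums = [[0.0] * dim for _ in centres]
    total_counts = [0] * len(centres)
    for sums, counts in stats:
        for j in range(len(centres)):
            total_counts[j] += counts[j]
            for d in range(dim):
                total_sums[j][d] += sums[j][d]
    # An empty cluster keeps its previous centre.
    return [
        tuple(s / total_counts[j] for s in total_sums[j]) if total_counts[j] else tuple(centres[j])
        for j in range(len(centres))
    ]


# Two sites holding disjoint partitions of the same dataset.
site_a = [(0.0, 0.0), (10.0, 0.0)]
site_b = [(0.0, 2.0), (10.0, 2.0)]
centres = [(0.0, 0.0), (10.0, 0.0)]

distributed = merge_and_update(
    [local_stats(site_a, centres), local_stats(site_b, centres)], centres
)
centralised = merge_and_update([local_stats(site_a + site_b, centres)], centres)
```

Only k sums and k counts cross the network per site per iteration, but one such exchange is still needed for every iteration, which is the communication cost that matters on loosely coupled machines. In general the two results agree up to floating-point summation order.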
Author information
Additional information
Ruoming Jin is currently an assistant professor in the Computer Science Department at Kent State University. He received BE and ME degrees in computer engineering from Beihang University (BUAA), China, in 1996 and 1999, respectively. He earned his MS degree in computer science from the University of Delaware in 2001, and his Ph.D. degree in computer science from the Ohio State University in 2005. His research interests include data mining, databases, processing of streaming data, bioinformatics, and high performance computing. He has published more than 30 papers in these areas. He is a member of ACM and SIGKDD.
Anjan Goswami studied robotics at the Indian Institute of Technology at Kanpur. While working with IBM, he became interested in studying computer science. He then obtained a master's degree from the University of South Florida, where he worked on computer vision problems. He then transferred to the PhD program in computer science at OSU, where he wrote a master's thesis on efficient clustering algorithms for massive, distributed and streaming data. On successful completion of this, he decided to join a web-service-provider company to do research in designing and developing high-performance search solutions for very large structured data. Anjan's favourite recreations are studying and predicting technology trends, nature photography, hiking, literature and soccer.
Gagan Agrawal is an Associate Professor of Computer Science and Engineering at the Ohio State University. He received his B.Tech. degree from the Indian Institute of Technology, Kanpur, in 1991, and M.S. and Ph.D. degrees from the University of Maryland, College Park, in 1994 and 1996, respectively. His research interests include parallel and distributed computing, compilers, data mining, grid computing, and data integration. He has published more than 110 refereed papers in these areas. He is a member of ACM and IEEE Computer Society. He received a National Science Foundation CAREER award in 1998.
About this article
Cite this article
Jin, R., Goswami, A. & Agrawal, G. Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10, 17–40 (2006). https://doi.org/10.1007/s10115-005-0210-0
DOI: https://doi.org/10.1007/s10115-005-0210-0