Fast and exact out-of-core and distributed k-means clustering

  • Regular Paper
Knowledge and Information Systems

Abstract

Clustering is one of the most widely studied topics in data mining, and k-means is one of the most popular clustering algorithms. K-means requires several passes over the entire dataset, which can make it very expensive for large disk-resident datasets. For this reason, much work has been done on approximate versions of k-means that require only one or a small number of passes over the entire dataset.
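To make the cost of "several passes" concrete, the following is a minimal sketch of the standard (Lloyd-style) k-means iteration in plain Python. It is not taken from the paper; the function names and the tiny one-dimensional dataset are illustrative assumptions. Each loop iteration scans every point once, so the number of iterations equals the number of full passes over the data.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], centres: Sequence[Point],
           max_passes: int = 100) -> Tuple[List[Point], int]:
    """Standard k-means; returns the final centres and the number of full passes."""
    centres = [tuple(c) for c in centres]
    dim = len(centres[0])
    for passes in range(1, max_passes + 1):
        sums = [[0.0] * dim for _ in centres]
        counts = [0] * len(centres)
        # One full scan of the dataset: assign each point to its nearest centre.
        for p in points:
            j = min(range(len(centres)),
                    key=lambda c: squared_distance(p, centres[c]))
            counts[j] += 1
            for d in range(dim):
                sums[j][d] += p[d]
        # Recompute each centre as the mean of its points (keep empty clusters as-is).
        updated = [tuple(s / n for s in row) if n else old
                   for row, n, old in zip(sums, counts, centres)]
        if updated == centres:  # no centre moved: converged
            return centres, passes
        centres = updated
    return centres, max_passes


# Illustrative data (an assumption, not from the paper).
data = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
final_centres, passes = kmeans(data, [(1.0,), (2.0,)])
print(final_centres, passes)
```

On a disk-resident dataset, every one of these passes means reading the whole file again, which is the cost the approximate and exact variants discussed here try to avoid.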

In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes over the entire dataset and provably produces the same cluster centres as the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide theoretical analysis to show that the cluster centres thus reported are the same as the ones computed by the original k-means algorithm. Experimental results on a number of real and synthetic datasets show speedups of a factor of 2 to 4.5 over k-means.
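The two-phase structure described above (cluster a sample first, then adjust on the full data) can be illustrated with a rough sketch. This is not FEKM itself: the abstract does not describe the bookkeeping that yields the exactness guarantee, so the sketch below simply runs k-means on a sample held in memory and uses the resulting centres to seed k-means on the full data. All names and the toy data are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def one_pass(points: Sequence[Point], centres: Sequence[Point]) -> List[Point]:
    """Scan `points` once and return the recomputed centres."""
    dim = len(centres[0])
    sums = [[0.0] * dim for _ in centres]
    counts = [0] * len(centres)
    for p in points:
        j = min(range(len(centres)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centres[c])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return [tuple(s / n for s in row) if n else old
            for row, n, old in zip(sums, counts, centres)]


def run_to_convergence(points, centres, max_passes=100):
    """Repeat `one_pass` until the centres stop moving; return (centres, passes)."""
    centres = [tuple(c) for c in centres]
    for passes in range(1, max_passes + 1):
        updated = one_pass(points, centres)
        if updated == centres:
            return centres, passes
        centres = updated
    return centres, max_passes


def sample_then_refine(points, sample, initial_centres):
    """Phase 1: cluster the in-memory sample. Phase 2: refine on the full data."""
    sample_centres, _ = run_to_convergence(sample, initial_centres)
    return run_to_convergence(points, sample_centres)


# Toy data; a real sample would be drawn at random from the full dataset.
full = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
sample = [(1.0,), (3.0,), (10.0,), (12.0,)]
centres, full_passes = sample_then_refine(full, sample, [(1.0,), (3.0,)])
print(centres, full_passes)
```

In this toy case the sample centres are already a fixed point for the full data, so a single full pass suffices. In general, seeding from a sample gives the result of k-means started from the sample centres, which need not match k-means started from the original initial centres; closing that gap is what the paper's analysis addresses.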

This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than the two other possible options for exact clustering on distributed data: downloading all the data and running sequential k-means, or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.
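For context, the parallel k-means baseline mentioned above can be sketched as follows. This is the well-known scheme in which each site computes per-cluster sums and counts for its local data and a coordinator merges them; it is not DFEKM, whose details are not given in the abstract. The site layout, function names, and data are illustrative assumptions, and the sites are simulated as in-memory lists rather than real machines.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Stats = Tuple[List[List[float]], List[int]]  # (per-cluster sums, per-cluster counts)


def local_stats(points: Sequence[Point], centres: Sequence[Point]) -> Stats:
    """Runs at one site: one scan of the local data for the current centres."""
    dim = len(centres[0])
    sums = [[0.0] * dim for _ in centres]
    counts = [0] * len(centres)
    for p in points:
        j = min(range(len(centres)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centres[c])))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def merge_and_update(stats: Sequence[Stats], centres: Sequence[Point]) -> List[Point]:
    """Runs at the coordinator: combine the sites' statistics into new centres."""
    dim = len(centres[0])
    updated = []
    for j, old in enumerate(centres):
        total = sum(counts[j] for _, counts in stats)
        if total == 0:
            updated.append(old)  # empty cluster: keep the previous centre
        else:
            updated.append(tuple(sum(sums[j][d] for sums, _ in stats) / total
                                 for d in range(dim)))
    return updated


def distributed_kmeans(sites, centres, max_rounds=100):
    """Each round is one local scan per site plus one exchange of statistics."""
    centres = [tuple(c) for c in centres]
    for rounds in range(1, max_rounds + 1):
        stats = [local_stats(site, centres) for site in sites]
        updated = merge_and_update(stats, centres)
        if updated == centres:
            return centres, rounds
        centres = updated
    return centres, max_rounds


site_a = [(1.0,), (10.0,), (11.0,)]
site_b = [(2.0,), (3.0,), (12.0,)]
centres, rounds = distributed_kmeans([site_a, site_b], [(1.0,), (2.0,)])
print(centres, rounds)
```

Because sums and counts are additive, the merged update matches what sequential k-means would compute on the pooled data, up to floating-point rounding. The price is one communication round per iteration, which is what makes this baseline costly on loosely coupled machines.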

Author information

Corresponding author

Correspondence to Ruoming Jin.

Additional information

Ruoming Jin is currently an assistant professor in the Computer Science Department at Kent State University. He received BE and ME degrees in computer engineering from Beihang University (BUAA), China, in 1996 and 1999, respectively. He earned his MS degree in computer science from the University of Delaware in 2001, and his Ph.D. degree in computer science from the Ohio State University in 2005. His research interests include data mining, databases, processing of streaming data, bioinformatics, and high performance computing. He has published more than 30 papers in these areas. He is a member of ACM and SIGKDD.

Anjan Goswami studied robotics at the Indian Institute of Technology at Kanpur. While working with IBM, he became interested in studying computer science. He then obtained a master's degree from the University of South Florida, where he worked on computer vision problems. He then transferred to the PhD program in computer science at OSU, where he did a master's thesis on efficient clustering algorithms for massive, distributed and streaming data. On successful completion of this, he decided to join a web-service-provider company to do research in designing and developing high-performance search solutions for very large structured data. Anjan's favourite recreations are studying and predicting technology trends, nature photography, hiking, literature and soccer.

Gagan Agrawal is an Associate Professor of Computer Science and Engineering at the Ohio State University. He received his B.Tech degree from the Indian Institute of Technology, Kanpur, in 1991, and M.S. and Ph.D. degrees from the University of Maryland, College Park, in 1994 and 1996, respectively. His research interests include parallel and distributed computing, compilers, data mining, grid computing, and data integration. He has published more than 110 refereed papers in these areas. He is a member of ACM and IEEE Computer Society. He received a National Science Foundation CAREER award in 1998.

About this article

Cite this article

Jin, R., Goswami, A. & Agrawal, G. Fast and exact out-of-core and distributed k-means clustering. Knowl Inf Syst 10, 17–40 (2006). https://doi.org/10.1007/s10115-005-0210-0

  • DOI: https://doi.org/10.1007/s10115-005-0210-0
