Distributed Data Clustering: A Comparative Analysis

  • Chapter
Foundations of Computational Intelligence Volume 6

Part of the book series: Studies in Computational Intelligence (SCI, volume 206)

Abstract

Owing to the explosion in the number of autonomous data sources, there is a growing need for effective approaches to distributed clustering. This paper compares the performance of two distributed clustering algorithms, namely the Improved Distributed Combining Algorithm and the Distributed K-Means algorithm, against a traditional centralized clustering algorithm. Both distributed algorithms use cluster centroids to form a cluster ensemble, which is required to perform global clustering. The centroid-based partitional clustering algorithms K-Means, Fuzzy K-Means and Rough K-Means are used with each distributed clustering algorithm in order to analyze the performance of both hard and soft clusters in a distributed environment. The experiments are carried out on an artificial dataset and four benchmark datasets from the UCI machine learning data repository.
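
The abstract does not give the algorithms themselves, so the following is only a minimal sketch of the general idea of a centroid-based cluster ensemble, not the chapter's Improved Distributed Combining Algorithm or its Distributed K-Means variant. It assumes each site runs plain (hard) K-Means locally, shares only its centroids, and a coordinator runs K-Means again on the pooled centroids. The function names (`kmeans`, `distributed_kmeans`), the deterministic farthest-point initialization, and the unweighted pooling of centroids are all illustrative assumptions.

```python
from typing import List, Sequence

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty collection of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def init_farthest(points: Sequence[Point], k: int) -> List[List[float]]:
    """Deterministic farthest-point initialization (an assumption of this sketch)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(points: Sequence[Point], k: int, iters: int = 50) -> List[List[float]]:
    """Plain Lloyd-style K-Means; returns the k centroids."""
    centroids = init_farthest(points, k)
    for _ in range(iters):
        groups: List[List[Point]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        new = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
        if new == centroids:
            break
        centroids = new
    return centroids


def distributed_kmeans(sites: Sequence[Sequence[Point]], k: int) -> List[List[float]]:
    """Each site clusters locally; only centroids travel to the coordinator."""
    ensemble: List[List[float]] = []
    for data in sites:
        ensemble.extend(kmeans(data, k))  # local clustering at one site
    # Global clustering on the pooled centroids. Pooling is unweighted,
    # which ignores local cluster sizes (a simplification of this sketch).
    return kmeans(ensemble, k)


if __name__ == "__main__":
    site_a = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    site_b = [(1, 1), (1, 2), (2, 1), (11, 11), (11, 12), (12, 11)]
    print(distributed_kmeans([site_a, site_b], k=2))
```

Only k centroids per site cross the network here, rather than the raw records, which is the usual motivation for centroid-based ensembles. A soft variant would replace the local `kmeans` call with a fuzzy or rough clustering routine that also returns centroids.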

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Visalakshi, N.K., Thangavel, K. (2009). Distributed Data Clustering: A Comparative Analysis. In: Abraham, A., Hassanien, AE., de Leon F. de Carvalho, A.P., Snášel, V. (eds) Foundations of Computational Intelligence Volume 6. Studies in Computational Intelligence, vol 206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01091-0_16

  • DOI: https://doi.org/10.1007/978-3-642-01091-0_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-01090-3

  • Online ISBN: 978-3-642-01091-0

  • eBook Packages: Engineering, Engineering (R0)