Distributed Data Clustering: A Comparative Analysis

Visalakshi, N. Karthikeyani; Thangavel, K.

doi:10.1007/978-3-642-01091-0_16

Part of the book series: Studies in Computational Intelligence (SCI, volume 206)

Abstract

Due to the explosion in the number of autonomous data sources, there is a growing need for effective approaches to distributed clustering. This paper compares the performance of two distributed clustering algorithms, namely the Improved Distributed Combining Algorithm and the Distributed K-Means algorithm, against a traditional Centralized Clustering Algorithm. Both algorithms use cluster centroids to form a cluster ensemble, which is required to perform global clustering. The centroid-based partitioned clustering algorithms K-Means, Fuzzy K-Means and Rough K-Means are used with each distributed clustering algorithm in order to analyze the performance of both hard and soft clusters in a distributed environment. The experiments are carried out on an artificial dataset and four benchmark datasets from the UCI machine learning data repository.
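
The general idea of forming a cluster ensemble from centroids can be illustrated with a minimal sketch. It is not the Improved Distributed Combining Algorithm or the Distributed K-Means algorithm evaluated in the chapter, whose details are not given in the abstract; it only assumes that each site runs hard K-Means locally, sends its centroids and cluster sizes to a coordinator, and that the coordinator clusters the pooled centroids, weighted by size. The function names (`farthest_first`, `nearest`, `kmeans`, `global_centroids`) and the farthest-first initialisation are hypothetical choices made for this illustration, and the fuzzy and rough variants are not covered.

```python
import numpy as np


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    pick the point farthest from all seeds chosen so far."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(1, k):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen].copy()


def nearest(points, centroids):
    """Index of the closest centroid for every point."""
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)


def kmeans(points, k, weights=None, iters=100):
    """Hard K-Means (Lloyd's iterations) with optional per-point weights."""
    points = np.asarray(points, dtype=float)
    if weights is None:
        weights = np.ones(len(points))
    else:
        weights = np.asarray(weights, dtype=float)
    centroids = farthest_first(points, k)
    for _ in range(iters):
        labels = nearest(points, centroids)
        updated = centroids.copy()
        for j in range(k):
            mask = labels == j
            if mask.any():  # an empty cluster keeps its previous centroid
                updated[j] = np.average(points[mask], axis=0, weights=weights[mask])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, nearest(points, centroids)


def global_centroids(sites, k):
    """Assumed scheme: cluster each site locally, pool the local centroids
    into an ensemble, then cluster the ensemble weighted by cluster size."""
    ensemble, sizes = [], []
    for data in sites:
        cents, labels = kmeans(data, k)
        ensemble.append(cents)
        sizes.append(np.bincount(labels, minlength=k))
    ensemble = np.vstack(ensemble)
    sizes = np.concatenate(sizes).astype(float)
    keep = sizes > 0  # drop local clusters that ended up empty
    cents, _ = kmeans(ensemble[keep], k, weights=sizes[keep])
    return cents
```

In this sketch only the centroids and cluster sizes leave each site, so the raw records stay local. A call such as `global_centroids([site_a, site_b], k=3)`, with `site_a` and `site_b` as two-dimensional NumPy arrays of the same width, returns a `(3, d)` array of global centroids.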

Author information

Authors and Affiliations

Department of Computer Science, Vellalar College for Women, Erode-12, Tamil Nadu, India
N. Karthikeyani Visalakshi
Department of Computer Science, Periyar University, Salem-11, Tamil Nadu, India
K. Thangavel

Editor information

Editors and Affiliations

Machine Intelligence Research Labs (MIR Labs), Scientific Network for Innovation and Research Excellence, Auburn, P.O. Box 2259, 98071-2259, Washington, USA
Ajith Abraham
College of Business Administration, Quantitative and Information System Department, Kuwait University, P.O. Box 5486, 13055, Safat, Kuwait
Aboul-Ella Hassanien
Department of Computer Science, University of São Paulo, Caixa Postal 668, 13560-970, São Carlos, SP, Brazil
André Ponce de Leon F. de Carvalho
Dept. Computer Science, Technical University Ostrava, Tr. 17. Listopadu 15, 708 33, Ostrava, Czech Republic
Václav Snášel

About this chapter

Cite this chapter

Visalakshi, N.K., Thangavel, K. (2009). Distributed Data Clustering: A Comparative Analysis. In: Abraham, A., Hassanien, AE., de Leon F. de Carvalho, A.P., Snášel, V. (eds) Foundations of Computational Intelligence Volume 6. Studies in Computational Intelligence, vol 206. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01091-0_16

DOI: https://doi.org/10.1007/978-3-642-01091-0_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01090-3
Online ISBN: 978-3-642-01091-0
eBook Packages: Engineering, Engineering (R0)
