Optimized Distributed Text Document Clustering Algorithm

  • J. E. Judith
  • J. Jayakumari
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 325)

Abstract

Scientific progress has created a variety of challenges in the field of information retrieval (IR). These challenges stem from the growing use of large volumes of data, which are available from large-scale distributed networks and are difficult to centralize for analysis. There is therefore a need for distributed text document clustering algorithms that overcome the challenges in clustering, the two main ones being clustering accuracy and clustering quality. In this paper, an optimized distributed text document clustering algorithm is proposed that uses a distributed particle swarm optimization (DPSO) algorithm to optimize and generate the initial centroids for the distributed K-means (DKMeans) clustering algorithm, which improves the quality of clustering. Similarity is determined using the Jaccard coefficient, which generates coherent clusters and thus improves the accuracy of the proposed algorithm. Extensive simulation-based evaluations are carried out on the Reuters-21578 and 20 Newsgroups data sets to demonstrate the effectiveness of the algorithm.
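
The pipeline described in the abstract (PSO-generated initial centroids, followed by K-means with a Jaccard similarity measure) might look roughly like the single-process sketch below. The abstract gives neither the fitness function, the PSO parameters, nor the way work is split across nodes, so everything here is an assumption made for illustration: the extended (Tanimoto) form of the Jaccard coefficient on term-weight vectors, the average-similarity fitness, the values of `w`, `c1` and `c2`, and the function names `jaccard`, `pso_seed` and `kmeans`. The distributed aspect of DPSO and DKMeans is not modelled.

```python
import random

def jaccard(a, b):
    """Extended Jaccard (Tanimoto) coefficient of two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 0.0

def assign(docs, centroids):
    """Label each document with the index of its most similar centroid."""
    return [max(range(len(centroids)), key=lambda k: jaccard(d, centroids[k]))
            for d in docs]

def fitness(docs, centroids):
    """Assumed fitness: mean similarity of each document to its best centroid."""
    return sum(max(jaccard(d, c) for c in centroids) for d in docs) / len(docs)

def pso_seed(docs, k, particles=10, iters=30, w=0.72, c1=1.49, c2=1.49, rng=None):
    """Search for k initial centroids; each particle is one candidate centroid set."""
    rng = rng or random.Random(0)
    dim = len(docs[0])
    pos = [[list(rng.choice(docs)) for _ in range(k)] for _ in range(particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(particles)]
    pbest = [[c[:] for c in p] for p in pos]
    pbest_fit = [fitness(docs, p) for p in pos]
    g = max(range(particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = [c[:] for c in pbest[g]], pbest_fit[g]
    for _ in range(iters):
        for i in range(particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    # Standard velocity update: inertia + personal pull + global pull.
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (pbest[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (gbest[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            f = fitness(docs, pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = [c[:] for c in pos[i]], f
                if f > gbest_fit:
                    gbest, gbest_fit = [c[:] for c in pos[i]], f
    return gbest

def kmeans(docs, centroids, iters=20):
    """K-means refinement that assigns by Jaccard similarity instead of distance."""
    for _ in range(iters):
        labels = assign(docs, centroids)
        new = []
        for k, c in enumerate(centroids):
            members = [d for d, l in zip(docs, labels) if l == k]
            # Keep the old centroid if a cluster ends up empty.
            new.append([sum(col) / len(members) for col in zip(*members)]
                       if members else c)
        if new == centroids:
            break
        centroids = new
    return assign(docs, centroids), centroids

# Toy term-weight vectors: two documents groups with disjoint vocabularies.
docs = [[1, 1, 0, 0], [2, 1, 0, 0], [1, 2, 0, 0],
        [0, 0, 1, 1], [0, 0, 2, 1], [0, 0, 1, 2]]
seeds = pso_seed(docs, k=2)
labels, centroids = kmeans(docs, seeds)
```

In this sketch the swarm only supplies the starting centroids; K-means then refines them, which mirrors the division of roles stated in the abstract. A distributed variant would additionally need each node to run these steps on its local documents and exchange centroid information, which is outside the scope of this example.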

Keywords

Distributed document clustering; Distributed particle swarm optimization (DPSO); Distributed K-means (DKMeans); Similarity measure; Jaccard coefficient; Information retrieval (IR)

Copyright information

© Springer India 2015

Authors and Affiliations

  1. Department of Computer Science and Engineering, Noorul Islam Centre for Higher Education, Kumaracoil, Kanyakumari, India
  2. Department of Electronics and Communication Engineering, Noorul Islam Centre for Higher Education, Kumaracoil, Kanyakumari, India
