CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce

  • Giannakouris-Salalidis Victor
  • Plerou Antonia
  • Sioutas Spyros
Conference paper

DOI: 10.1007/978-3-662-44722-2_23

Volume 437 of the book series IFIP Advances in Information and Communication Technology (IFIPAICT)
Cite this paper as:
Victor GS., Antonia P., Spyros S. (2014) CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce. In: Iliadis L., Maglogiannis I., Papadopoulos H., Sioutas S., Makris C. (eds) Artificial Intelligence Applications and Innovations. AIAI 2014. IFIP Advances in Information and Communication Technology, vol 437. Springer, Berlin, Heidelberg

Abstract

As the Internet grows rapidly, huge amounts of text need to be processed in a short time, which calls for fast, scalable text-processing methods. This paper proposes a method for computing pairwise text similarity on massive data sets, using the cosine similarity metric and the tf-idf (Term Frequency-Inverse Document Frequency) normalization method. The approach is built on the MapReduce paradigm, a model for processing large data sets in parallel with a distributed algorithm on computer clusters. Applying the MapReduce model to each step of the proposed method improves text-processing speed and scalability relative to traditional methods. The CSMR (Cosine Similarity with MapReduce) method is currently at the implementation stage. Precise, analytical conclusions about its efficiency will be reached once all project phases have been completed and reviewed.
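
The pairwise similarity computation described in the abstract can be illustrated with a minimal single-machine sketch in Python. This is not the authors' CSMR implementation, which targets MapReduce on a cluster. The function names (`tf_idf_vectors`, `cosine_similarity`, `pairwise_similarities`), the whitespace tokenizer, and the particular tf-idf variant (relative term frequency times the natural log of N/df) are all assumptions made for illustration. The comments mark which steps would loosely correspond to map and reduce phases.

```python
import math
from collections import Counter
from itertools import combinations


def tf_idf_vectors(docs):
    """Build sparse tf-idf vectors. docs maps doc_id -> raw text."""
    # Map-like step: tokenize each document independently and count terms.
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }

    # Reduce-like step: aggregate, per term, the number of documents containing it.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    vectors = {}
    for doc_id, counts in term_counts.items():
        total_terms = sum(counts.values())
        # Assumed weighting: (count / document length) * ln(N / document frequency).
        vectors[doc_id] = {
            term: (count / total_terms) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a vector with no weight is treated as similar to nothing
    return dot / (norm_u * norm_v)


def pairwise_similarities(docs):
    """Return {(id_a, id_b): similarity} for every unordered document pair."""
    vectors = tf_idf_vectors(docs)
    return {
        (a, b): cosine_similarity(vectors[a], vectors[b])
        for a, b in combinations(sorted(vectors), 2)
    }


if __name__ == "__main__":
    sample = {
        "d1": "map reduce hadoop",
        "d2": "map reduce hadoop",
        "d3": "cosine similarity metric",
    }
    for pair, score in pairwise_similarities(sample).items():
        print(pair, round(score, 3))
```

The all-pairs loop is quadratic in the number of documents. That cost is what motivates distributing the work, but this sketch makes no attempt to do so.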

Keywords

MapReduce, Hadoop, TF-IDF, Text Mining, Cosine Similarity

Copyright information

© IFIP International Federation for Information Processing 2014

Authors and Affiliations

  • Giannakouris-Salalidis Victor
    • 1
  • Plerou Antonia
    • 1
  • Sioutas Spyros
    • 1
  1. Department of Informatics, Ionian University, Greece