CSMR: A Scalable Algorithm for Text Clustering with Cosine Similarity and MapReduce

  • Giannakouris-Salalidis Victor
  • Plerou Antonia
  • Sioutas Spyros
Part of the IFIP Advances in Information and Communication Technology book series (IFIPAICT, volume 437)


As the Internet grows rapidly, huge amounts of text need to be processed in a short time, which calls for fast, scalable text-processing methods. This paper proposes a method for computing pairwise text similarity on massive data sets, using the cosine similarity metric and the tf-idf (Term Frequency-Inverse Document Frequency) normalization method. The approach is built on the MapReduce paradigm, a model for processing large data sets in parallel with a distributed algorithm on computer clusters. Applying the MapReduce model at each step of the proposed method enhances text-processing speed and scalability relative to traditional methods. The CSMR (Cosine Similarity with MapReduce) method is currently at the implementation stage. Precise, analytical conclusions about its efficiency will be drawn once all project phases have been completed and reviewed.
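
The idea of combining tf-idf weighting, cosine similarity, and map/reduce steps can be illustrated with a small single-process sketch. It is not the CSMR implementation, which the abstract does not detail. It assumes whitespace tokenization, a length-normalized term frequency, and a natural-log idf. All function names (tfidf_vectors, map_phase, shuffle, reduce_phase, cosine_similarities) are hypothetical, and the map, shuffle, and reduce steps are simulated in memory rather than run on Hadoop.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_vectors(docs):
    """Return a sparse tf-idf vector (term -> weight) for each document.

    Assumption: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    term_counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    vectors = {}
    for doc_id, counts in term_counts.items():
        total = sum(counts.values())
        vectors[doc_id] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def map_phase(vectors):
    """Map: emit (term, (doc_id, weight)) for every non-zero weight."""
    for doc_id, vec in vectors.items():
        for term, weight in vec.items():
            if weight > 0.0:
                yield term, (doc_id, weight)


def shuffle(pairs):
    """Group mapped values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum per-term weight products into one dot product per pair."""
    dots = defaultdict(float)
    for postings in grouped.values():
        for (a, wa), (b, wb) in combinations(sorted(postings), 2):
            dots[(a, b)] += wa * wb
    return dots


def cosine_similarities(docs):
    """Cosine similarity for every document pair sharing a weighted term."""
    vectors = tfidf_vectors(docs)
    norms = {
        d: math.sqrt(sum(w * w for w in vec.values()))
        for d, vec in vectors.items()
    }
    dots = reduce_phase(shuffle(map_phase(vectors)))
    return {
        (a, b): dot / (norms[a] * norms[b]) for (a, b), dot in dots.items()
    }


if __name__ == "__main__":
    corpus = {
        "d1": "map reduce hadoop",
        "d2": "map reduce spark",
        "d3": "cosine similarity text",
    }
    for pair, score in cosine_similarities(corpus).items():
        print(pair, round(score, 4))
```

Grouping by term means only documents that share at least one term are ever compared, which is why the sketch emits no pair involving d3. Terms that occur in every document receive an idf of zero and are dropped in the map step.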


Keywords: MapReduce, Hadoop, TF-IDF, Text Mining, Cosine Similarity





Copyright information

© IFIP International Federation for Information Processing 2014

Authors and Affiliations

  • Giannakouris-Salalidis Victor (1)
  • Plerou Antonia (1)
  • Sioutas Spyros (1)
  1. Department of Informatics, Ionian University, Greece
