A parallel text clustering method using Spark and hashing

Abstract

Clustering textual data has become an important task in data analytics, since many applications require large collections of text documents to be organized automatically into homogeneous topics. The rapid growth of textual data available from the web, social networks and open platforms has made this task more challenging. It is therefore important to design scalable clustering methods that can effectively organize huge amounts of textual data into topics. In this context, we propose a new parallel text clustering method based on the Spark framework and hashing. The proposed method addresses both the problem of clustering a huge number of documents and the problem of the high dimensionality of textual data, by integrating a divide-and-conquer approach for the former and implementing a new document hashing strategy for the latter. Together, these two components yield a substantial improvement in scalability and a good approximation of clustering quality. Experiments on several large document collections show that the proposed method is effective compared with existing methods in terms of running time and clustering accuracy.
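
To make the two ideas in the abstract concrete, the following is a minimal single-machine sketch, not the authors' implementation. It assumes a generic hashing trick (each token is mapped to one of `num_buckets` dimensions by a deterministic 32-bit polynomial string hash) and a generic divide-and-conquer scheme (each partition is clustered independently with plain k-means, then the local centroids are clustered again). All function names and parameters (`string_hash`, `hash_document`, `kmeans`, `divide_and_conquer_clusters`, `num_buckets`, `num_partitions`) are hypothetical; the paper's actual hashing strategy is not described in this abstract.

```python
from collections import Counter


def string_hash(s):
    """Deterministic 32-bit polynomial hash: h = 31 * h + code point (unsigned)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def hash_document(text, num_buckets):
    """Hashing trick: fold token counts into a fixed-size vector.

    Different tokens may collide in the same bucket; that is the price
    paid for a dimensionality that no longer depends on vocabulary size.
    """
    vec = [0.0] * num_buckets
    for token, count in Counter(text.lower().split()).items():
        vec[string_hash(token) % num_buckets] += count
    return vec


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means, seeded with the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        for i, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def divide_and_conquer_clusters(docs, k, num_partitions, num_buckets):
    """Cluster each partition on its own, then cluster the local centroids."""
    vectors = [hash_document(d, num_buckets) for d in docs]
    partitions = [vectors[i::num_partitions] for i in range(num_partitions)]
    local_centroids = []
    for part in partitions:  # independent work: one task per partition
        if part:
            local_centroids.extend(kmeans(part, min(k, len(part))))
    return kmeans(local_centroids, min(k, len(local_centroids)))


docs = ["spark cluster spark", "spark cluster", "hash bucket hash", "hash bucket"]
final_centroids = divide_and_conquer_clusters(docs, k=2, num_partitions=2, num_buckets=16)
```

Because the loop over partitions has no shared state, it is the part that a framework such as Spark could run in parallel (for example as a per-partition operation on an RDD); only the much smaller set of local centroids needs to be gathered for the final clustering step.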

Author information

Corresponding author

Correspondence to Mohamed Aymen Ben HajKacem.

About this article

Cite this article

Ben HajKacem, M.A., Ben N’cir, CE. & Essoussi, N. A parallel text clustering method using Spark and hashing. Computing 103, 2007–2031 (2021). https://doi.org/10.1007/s00607-021-00932-y

Keywords

  • Text clustering
  • Parallel computing
  • Spark framework
  • Hashing
  • High-dimensional data