A parallel text clustering method using Spark and hashing

Ben HajKacem, Mohamed Aymen; Ben N’cir, Chiheb-Eddine; Essoussi, Nadia

doi:10.1007/s00607-021-00932-y


Special Issue Article
Published: 07 April 2021

Computing, volume 103, pages 2007–2031 (2021)

Mohamed Aymen Ben HajKacem¹,
Chiheb-Eddine Ben N’cir¹,² &
Nadia Essoussi¹


Abstract

Clustering textual data has become an important task in data analytics, since several applications need to automatically organize large amounts of textual documents into homogeneous topics. The growth of textual data available from the web, social networks, and open platforms has made this task more challenging, so it has become important to design scalable clustering methods that can effectively organize huge amounts of textual data into topics. In this context, we propose a new parallel text clustering method based on the Spark framework and hashing. The proposed method simultaneously addresses two issues: it handles the huge number of documents by integrating a divide-and-conquer approach, and it handles the high dimensionality of textual data by implementing a new document hashing strategy. Together, these two features yield an important improvement in scalability and a good approximation of clustering quality. Experiments on several large collections of documents show the effectiveness of the proposed method compared to existing ones in terms of running time and clustering accuracy.
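The abstract does not describe the document hashing strategy in detail. As a rough illustration of the general idea only, the sketch below uses the generic hashing trick to map a document of any vocabulary size to a fixed-length count vector. It is not the authors' method: the function name `hash_document`, the parameter `num_buckets`, the choice of MD5, and the sample documents are all assumptions made for this example.

```python
import hashlib
from collections import Counter


def hash_document(text: str, num_buckets: int = 16) -> list[int]:
    """Map a document to a fixed-length count vector (generic hashing trick).

    Hypothetical helper for illustration; not the strategy from the paper.
    """
    vector = [0] * num_buckets
    # Lowercase and split on whitespace: a deliberately naive tokenizer.
    for token, count in Counter(text.lower().split()).items():
        # MD5 is used only because it is stable across runs,
        # unlike Python's built-in hash() for strings.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") % num_buckets
        # Distinct tokens may collide in one bucket; counts then add up.
        vector[bucket] += count
    return vector


# Assumed sample documents, each reduced to a 16-dimensional vector.
docs = [
    "spark makes clustering scalable",
    "hashing reduces text dimensionality",
]
vectors = [hash_document(doc) for doc in docs]
```

Every document ends up with the same dimensionality regardless of vocabulary size, which is what makes this kind of representation attractive for high-dimensional text. Collisions between distinct terms are the price paid, and a larger `num_buckets` reduces them.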



Author information

Authors and Affiliations

LARODEC, Institut Supérieur de Gestion de Tunis, Université de Tunis, Tunis, Tunisia
Mohamed Aymen Ben HajKacem, Chiheb-Eddine Ben N’cir & Nadia Essoussi
College of Business, University of Jeddah, Jeddah, Saudi Arabia
Chiheb-Eddine Ben N’cir


Corresponding author

Correspondence to Mohamed Aymen Ben HajKacem.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Ben HajKacem, M.A., Ben N’cir, CE. & Essoussi, N. A parallel text clustering method using Spark and hashing. Computing 103, 2007–2031 (2021). https://doi.org/10.1007/s00607-021-00932-y


Received: 15 August 2020
Accepted: 26 February 2021
Published: 07 April 2021
Issue Date: September 2021
DOI: https://doi.org/10.1007/s00607-021-00932-y
