A Document Clustering Approach Using Shared Nearest Neighbour Affinity, TF-IDF and Angular Similarity

Goswami, Mausumi

doi:10.1007/978-981-15-9509-7_23

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies ((LNDECT,volume 57))

Abstract

The volume of data is growing exponentially. Clustering is a major task in many text mining applications, including automatic organization of text documents, topic extraction, information retrieval and information filtering. It reveals similar patterns in a collection of documents. Understanding, representing and categorizing documents require a range of techniques, and text clustering draws on both natural language processing and machine learning. This chapter proposes an unsupervised spatial pattern identification approach for text data: a new algorithm, based on shared nearest neighbours, for finding coherent patterns in a large collection of text. Implementation followed by validation confirms that the proposed algorithm can cluster text data to identify coherent patterns. The results are visualized using a graph and show that the methodology works well for different text datasets.
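
The three ingredients named in the title (TF-IDF weighting, angular similarity and shared nearest neighbour affinity) can be combined as in the minimal Python sketch below. It is an illustration under stated assumptions, not the chapter's algorithm, which is not reproduced in this preview. The function names, the parameters `k` and `min_shared`, the definition of angular similarity as 1 - arccos(cosine)/pi, and the grouping rule (link two documents when their k-nearest-neighbour lists share enough members, then take connected components) are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenised document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in docs
    ]


def angular_similarity(u, v):
    """Assumed definition: 1 - arccos(cosine) / pi, so the value lies in [0, 1]."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return 1.0 - math.acos(cosine) / math.pi


def snn_clusters(docs, k=2, min_shared=1):
    """Label documents by linking pairs that share >= min_shared of their k neighbours."""
    vecs = tfidf_vectors(docs)
    n = len(vecs)
    sim = [[angular_similarity(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    # k nearest neighbours of each document, excluding the document itself
    knn = [
        set(sorted((j for j in range(n) if j != i),
                   key=lambda j: sim[i][j], reverse=True)[:k])
        for i in range(n)
    ]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # grow the connected component of `start`
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and len(knn[i] & knn[j]) >= min_shared:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical toy corpus: three documents about pets, three about markets.
docs = [
    "cat dog pet".split(), "cat dog animal".split(), "dog pet animal".split(),
    "stock market trade".split(), "stock market price".split(),
    "market trade price".split(),
]
print(snn_clusters(docs))  # [0, 0, 0, 1, 1, 1]
```

Counting shared neighbours rather than thresholding raw similarity makes the affinity depend on local neighbourhood structure, which is the usual motivation for shared nearest neighbour methods. The values of `k` and `min_shared` here are arbitrary and would need tuning on real data.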

Author information

Authors and Affiliations

CHRIST (Deemed to be University), Bengaluru, India
Mausumi Goswami

Corresponding author

Correspondence to Mausumi Goswami.

Editor information

Editors and Affiliations

Department of Electronics and Communication Engineering, Karunya University, Coimbatore, Tamil Nadu, India
Jude Hemanth
Czech Technical University, Prague, Czech Republic
Robert Bestak
Department of Electrical Engineering, Dayeh University, Changhua, Taiwan
Joy Iong-Zong Chen

About this paper

Cite this paper

Goswami, M. (2021). A Document Clustering Approach Using Shared Nearest Neighbour Affinity, TF-IDF and Angular Similarity. In: Hemanth, J., Bestak, R., Chen, J.IZ. (eds) Intelligent Data Communication Technologies and Internet of Things. Lecture Notes on Data Engineering and Communications Technologies, vol 57. Springer, Singapore. https://doi.org/10.1007/978-981-15-9509-7_23

DOI: https://doi.org/10.1007/978-981-15-9509-7_23
Published: 13 February 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9508-0
Online ISBN: 978-981-15-9509-7
eBook Packages: Engineering, Engineering (R0)