Abstract
The volume of data is growing exponentially. Clustering is a central task in many text mining applications, including the automatic organization of text documents, topic extraction, information retrieval and information filtering. Clustering reveals similar patterns across a collection of documents. Understanding, representing and categorizing documents requires a range of techniques, and text clustering in particular draws on both natural language processing and machine learning. This paper proposes an unsupervised approach to identifying spatial patterns in text data. It introduces a new algorithm, based on shared nearest neighbours, for finding coherent patterns in a large collection of text documents. Implementation and validation confirm that the proposed algorithm can cluster text data to identify coherent patterns. The results are visualized as a graph and show that the methodology works well on different text datasets.
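The abstract does not give the algorithm's details, so the following is only a minimal sketch of how a shared-nearest-neighbour clustering of text might look, combining the three ingredients named in the chapter title (TF-IDF, angular similarity, shared nearest neighbour affinity). Everything specific here is an assumption rather than the author's method: the whitespace tokenizer, the smoothed IDF formula, the definition of angular similarity as 1 - arccos(cosine)/pi, the mutual-neighbour linking rule, the use of connected components as clusters, and the hypothetical names `tfidf_vectors`, `angular_similarity`, `knn_sets`, `snn_clusters`, `k` and `min_shared`.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document (assumed whitespace tokenizer)."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            # Assumed weighting: relative term frequency times a smoothed IDF.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def angular_similarity(u, v):
    """Assumed definition: 1 - arccos(cosine similarity) / pi."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return 1.0 - math.acos(cosine) / math.pi


def knn_sets(vectors, k):
    """For each document, the set of its k most similar other documents."""
    neighbours = []
    for i, u in enumerate(vectors):
        ranked = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: angular_similarity(u, vectors[j]),
            reverse=True,
        )
        neighbours.append(set(ranked[:k]))
    return neighbours


def snn_clusters(docs, k=2, min_shared=1):
    """Link mutual neighbours sharing >= min_shared neighbours; return components."""
    neighbours = knn_sets(tfidf_vectors(docs), k)
    n = len(docs)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            mutual = j in neighbours[i] and i in neighbours[j]
            shared = len(neighbours[i] & neighbours[j])  # SNN affinity
            if mutual and shared >= min_shared:
                adjacency[i].add(j)
                adjacency[j].add(i)
    clusters, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal of one connected component
            node = stack.pop()
            component.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(sorted(component))
    return clusters
```

Calling `snn_clusters(docs, k=2, min_shared=1)` on a list of strings returns a list of clusters, each a sorted list of document indices. The linking rule used here (two documents are joined only if each appears in the other's neighbour list) is one common shared-nearest-neighbour variant, chosen for brevity; the chapter itself may use a different affinity or threshold.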
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Goswami, M. (2021). A Document Clustering Approach Using Shared Nearest Neighbour Affinity, TF-IDF and Angular Similarity. In: Hemanth, J., Bestak, R., Chen, J.IZ. (eds) Intelligent Data Communication Technologies and Internet of Things. Lecture Notes on Data Engineering and Communications Technologies, vol 57. Springer, Singapore. https://doi.org/10.1007/978-981-15-9509-7_23
DOI: https://doi.org/10.1007/978-981-15-9509-7_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-9508-0
Online ISBN: 978-981-15-9509-7
eBook Packages: Engineering, Engineering (R0)