A Document Clustering Approach Using Shared Nearest Neighbour Affinity, TF-IDF and Angular Similarity

  • Conference paper
Intelligent Data Communication Technologies and Internet of Things

Part of the book series: Lecture Notes on Data Engineering and Communications Technologies (LNDECT, volume 57)

Abstract

The volume of data is growing exponentially. Clustering is a central task in many text mining applications, including automatic organization of text documents, topic extraction, information retrieval and information filtering. It reveals similar patterns in a collection of documents. Understanding, representing and categorizing documents each require different techniques, and text clustering draws on both natural language processing and machine learning. This paper proposes an unsupervised spatial pattern identification approach for text data: a new algorithm, based on shared nearest neighbours, for finding coherent patterns in a large collection of text. Implementation and validation confirm that the proposed algorithm can cluster text data to identify coherent patterns. The results are visualized as a graph and show that the methodology works well on different text datasets.
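
The abstract does not give the algorithm itself, so the following is only a minimal sketch of how the three ingredients named in the title could fit together, not the author's method. It assumes whitespace tokenisation, a smoothed TF-IDF weighting, angular similarity defined as 1 - (2/pi) * arccos(cosine), and a Jarvis-Patrick-style shared nearest neighbour rule; the function names and the parameters k and min_shared are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document (whitespace tokenisation, an assumption)."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF, so a term found in every document keeps a non-zero weight.
        vectors.append({t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
                        for t, c in tf.items()})
    return vectors


def angular_similarity(u, v):
    """1 - (2/pi) * angle between two non-negative sparse vectors; lies in [0, 1]."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    if norm == 0.0:
        return 0.0
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return 1.0 - 2.0 * math.acos(cos) / math.pi


def snn_clusters(docs, k=2, min_shared=1):
    """Jarvis-Patrick-style grouping: link two documents when each is among the
    other's k nearest neighbours and they share at least min_shared neighbours."""
    vectors = tfidf_vectors(docs)
    n = len(docs)
    sim = [[angular_similarity(vectors[i], vectors[j]) for j in range(n)]
           for i in range(n)]
    knn = [set(sorted((j for j in range(n) if j != i),
                      key=lambda j: -sim[i][j])[:k])
           for i in range(n)]

    parent = list(range(n))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if i in knn[j] and j in knn[i] and len(knn[i] & knn[j]) >= min_shared:
                parent[find(j)] = find(i)
    # Each document is labelled with the root index of its group.
    return [find(i) for i in range(n)]


docs = [
    "cat dog pet animal",
    "dog cat animal fur",
    "pet animal cat dog",
    "stock market trade price",
    "market price stock share",
    "trade stock market price",
]
labels = snn_clusters(docs, k=2, min_shared=1)
```

On this toy corpus the first three documents and the last three share no vocabulary, so the sketch places them in two separate groups. Larger k and min_shared values make the linking rule stricter or looser; suitable settings depend on the dataset.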

Author information

Corresponding author

Correspondence to Mausumi Goswami.

Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Goswami, M. (2021). A Document Clustering Approach Using Shared Nearest Neighbour Affinity, TF-IDF and Angular Similarity. In: Hemanth, J., Bestak, R., Chen, J.IZ. (eds) Intelligent Data Communication Technologies and Internet of Things. Lecture Notes on Data Engineering and Communications Technologies, vol 57. Springer, Singapore. https://doi.org/10.1007/978-981-15-9509-7_23

  • DOI: https://doi.org/10.1007/978-981-15-9509-7_23

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-15-9508-0

  • Online ISBN: 978-981-15-9509-7

  • eBook Packages: Engineering, Engineering (R0)
