An Analytical Approach to Document Clustering Techniques

  • Vikas Choubey
  • Sanjay Kumar Dubey
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1077)


Clustering is a technique that groups data together based on similarity and separates them based on dissimilarity. When this technique is applied to documents and the terms within them, retrieval of similar documents becomes easy and efficient. Document clustering has been researched and used for many years but is still far from optimal. To study and analyze different document clustering algorithms, a theoretical literature review and analysis was performed, and the results are presented in this paper. Of the 95 papers identified, 30 were selected for review. The techniques, algorithms, and modifications to earlier algorithms that various researchers have proposed for document clustering are compiled and presented with the intent of helping researchers identify the current and future scope of research in information retrieval systems and document clustering technologies.
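To make the idea of grouping documents by term similarity concrete, the following is a minimal sketch and not a method taken from any of the reviewed papers. It assumes plain whitespace tokenization, raw term-frequency vectors, cosine similarity, and a K-Means-style loop whose initial centroids are documents chosen by hand. The function names (`tf_vector`, `cosine`, `mean_vector`, `kmeans`) and the sample documents are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Centroid of a cluster: per-term average over its member vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}


def kmeans(docs, seeds, iterations=10):
    """K-Means-style clustering using cosine similarity.

    `seeds` holds the indices of the documents used as initial centroids
    (an assumption of this sketch; real systems pick them differently).
    """
    vectors = [tf_vector(d) for d in docs]
    centroids = [dict(vectors[i]) for i in seeds]
    labels = None
    for _ in range(iterations):
        # Assignment step: each document joins its most similar centroid.
        new_labels = [
            max(range(len(centroids)), key=lambda k: cosine(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: recompute each non-empty cluster's centroid.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = mean_vector(members)
    return labels


docs = [
    "clustering groups similar documents",
    "document clustering groups documents by similarity",
    "football match ends in a draw",
    "the football team won the match",
]
labels = kmeans(docs, seeds=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

The two clustering-related documents share terms and end up in one cluster, and the two football-related documents end up in the other. Practical systems usually add term weighting such as TF-IDF, stop-word removal, and a more careful choice of initial centroids.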


Document clustering · Information retrieval system · Clustering · K-Means · Precision · Algorithm



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  1. Department of Computer Science & Engineering, Amity University, Noida, India
