Semantic string operation for specializing AHC algorithm for text clustering

  • Taeho Jo


This article proposes a modified AHC (Agglomerative Hierarchical Clustering) algorithm that clusters string vectors, instead of numerical vectors, as an approach to text clustering. Two facts motivate this research: string vector based algorithms were applied successfully to text clustering in previous works, and a synergy effect is expected from combining text clustering with word clustering. We define an operation on string vectors called semantic similarity, and modify the AHC algorithm by adopting it as the similarity metric. The proposed AHC algorithm is empirically validated as the better approach to clustering texts drawn from news articles and opinions. As further work, more operations on string vectors need to be defined and characterized mathematically so that more advanced machine learning algorithms can be modified in the same way.
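The idea summarized above can be illustrated with a minimal sketch. The abstract does not give the definition of semantic similarity, so everything here is an assumption made for illustration: word similarity is taken as the Jaccard overlap of the document sets containing each word (the hypothetical `postings` mapping), the similarity between two string vectors is the mean of position-wise word similarities, and clusters are merged by average link. It is not the article's actual formulation.

```python
from itertools import combinations

def word_similarity(w1, w2, postings):
    """Assumed word-level similarity: Jaccard overlap of the sets of
    document ids in which each word occurs (hypothetical definition)."""
    if w1 == w2:
        return 1.0
    d1, d2 = postings.get(w1, set()), postings.get(w2, set())
    union = d1 | d2
    return len(d1 & d2) / len(union) if union else 0.0

def semantic_similarity(sv1, sv2, postings):
    """Assumed string-vector similarity: mean of position-wise word
    similarities between two equal-length tuples of words."""
    if len(sv1) != len(sv2):
        raise ValueError("string vectors must have the same dimension")
    pairs = zip(sv1, sv2)
    return sum(word_similarity(a, b, postings) for a, b in pairs) / len(sv1)

def cluster_similarity(c1, c2, vectors, postings):
    """Average-link similarity between two clusters of vector indices."""
    total = sum(
        semantic_similarity(vectors[i], vectors[j], postings)
        for i in c1 for j in c2
    )
    return total / (len(c1) * len(c2))

def ahc(vectors, postings, num_clusters):
    """Agglomerative clustering: start from singletons and repeatedly
    merge the most similar pair until num_clusters clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_clusters:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_similarity(
                clusters[p[0]], clusters[p[1]], vectors, postings
            ),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters

# Toy data: each text is encoded as a 2-dimensional string vector.
postings = {
    "stock": {0, 1}, "market": {0, 1}, "bank": {0},
    "goal": {2, 3}, "match": {2, 3}, "team": {2},
}
vectors = [
    ("stock", "market"), ("market", "bank"),
    ("goal", "match"), ("team", "goal"),
]
print(ahc(vectors, postings, num_clusters=2))
```

The only change relative to the numerical-vector version of AHC is the similarity function: the merge loop never looks inside a vector, so any operation defined on string vectors can be substituted for `semantic_similarity`.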


Keywords: String vector, Semantic similarity, String vector based AHC algorithm, Text clustering




This work was supported by the 2019 Hongik University Research Fund.



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Cheongju, South Korea
