Knowledge and Information Systems, Volume 61, Issue 3, pp 1485–1516

Combining semantic and term frequency similarities for text clustering

  • Victor Hugo Andrade Soares
  • Ricardo J. G. B. Campello
  • Seyednaser Nourashrafeddin
  • Evangelos Milios
  • Murilo Coelho Naldi
Regular Paper


A key challenge in document clustering is finding a similarity measure for text documents that enables the generation of cohesive groups. Measures based on the classic bag-of-words model take into account solely the presence (and frequency) of words in documents. As a result, semantically similar documents that use different vocabularies may end up in different clusters. For this reason, semantic similarity measures that use external knowledge, such as word n-gram corpora or thesauri, have been proposed in the literature. In this paper, the Frequency Google Tri-gram Measure is proposed to assess similarity between documents based on the frequencies of terms in the compared documents, using the Google n-gram corpus as an additional source of semantic similarity. Clustering algorithms are applied to several real datasets to experimentally evaluate the quality of the clusters obtained with the proposed measure and to compare it with a number of state-of-the-art measures from the literature. Statistical tests on the experimental results indicate that the proposed measure significantly improves the quality of document clustering. We further demonstrate that clustering results combining bag-of-words and semantic similarity are superior to those obtained with either approach independently.
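The vocabulary limitation of bag-of-words measures described above can be illustrated with a minimal sketch. It is not the Frequency Google Tri-gram Measure proposed in the paper; it only computes a plain cosine similarity over raw term frequencies. The function name `term_frequency_cosine`, the whitespace tokenization, and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def term_frequency_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts."""
    # Assumption: lowercasing plus whitespace splitting stands in for real tokenization.
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document has no direction to compare
    return dot / (norm_a * norm_b)


# Toy sentences with related meaning but no shared words.
sim_disjoint = term_frequency_cosine("physician treats patients", "doctor cures people")
# Toy sentences that share the words "doctor" and "patients".
sim_overlap = term_frequency_cosine("doctor treats patients", "doctor cures patients")
```

In this sketch the first pair scores zero because the two sentences share no terms, even though they describe the same situation. A purely frequency-based measure cannot see that relatedness, which is the gap that external semantic sources such as n-gram corpora are meant to fill.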


Document clustering; Similarity measure; Semantic similarity; Text mining



The authors acknowledge the Brazilian Research Agencies Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)—Finance Code 001, CNPq, FAPEMIG and FAPESP, the Natural Sciences and Engineering Research Council of Canada, the Boeing Company, CALDO, and the International Development Research Centre, Ottawa, Canada, for their financial support to this work.



Copyright information

© Springer-Verlag London Ltd., part of Springer Nature 2019

Authors and Affiliations

  • Victor Hugo Andrade Soares (1, 3)
  • Ricardo J. G. B. Campello (2, 3)
  • Seyednaser Nourashrafeddin (4)
  • Evangelos Milios (4)
  • Murilo Coelho Naldi (1, 5)

  1. Department of Informatics, Federal University of Viçosa (UFV), Viçosa, Brazil
  2. School of Mathematical and Physical Sciences, The University of Newcastle, Callaghan, Australia
  3. Institute of Mathematics and Computer Sciences, University of São Paulo (USP), São Carlos, Brazil
  4. Faculty of Computer Science, Dalhousie University, Halifax, Canada
  5. Department of Computer Science, Federal University of São Carlos (UFSCar), São Carlos, Brazil
