Taxonomy-based information content and wordnet-wiktionary-wikipedia glosses for semantic relatedness

Applied Intelligence

Abstract

Computing the semantic similarity or relatedness between terms is an important research area in several disciplines, including artificial intelligence, cognitive science, linguistics, psychology, biomedicine and information retrieval. Such measures exploit knowledge bases to express the semantics of concepts. Some approaches, such as the information-theoretic ones, rely on the structure of the knowledge base, while others, such as the gloss-based ones, use its content. First, on the structural side, we propose a new intrinsic Information Content (IC) computing method based on quantifying the subgraph formed by the ancestors of the target concept. Taxonomic measures, including the IC-based ones, rely on topological parameters that must be extracted from taxonomies treated as Directed Acyclic Graphs (DAGs). Accordingly, we propose a set of graph algorithms that provide basic parameters such as depth, ancestors, descendants and the Lowest Common Subsumer (LCS). The IC-computing method is assessed on several knowledge structures: the noun and verb WordNet “is a” taxonomies, the Wikipedia Category Graph (WCG) and the MeSH taxonomy. We also propose an aggregation scheme that exploits the WordNet “is a” taxonomy and the WCG in a complementary way through the IC-based measures to improve coverage. Second, on the content side, we propose a gloss-based semantic similarity measure that weights nouns using our IC-computing method and draws on the WordNet, Wiktionary and Wikipedia resources. The evaluation covers nouns, verbs, multiword expressions and biomedical datasets, using well-recognized benchmarks. The results indicate improved accuracy in similarity and relatedness assessment.

Notes

  1. http://www.nlm.nih.gov/mesh.

  2. https://en.wiktionary.org/wiki/Wiktionary:Main_Page.

  3. http://nlp.stanford.edu/software/.

  4. card(Leaves) < n = |E|, where n represents the cardinality of the set of nodes.

  5. http://adapt.seiee.sjtu.edu.cn/similarity/.

  6. http://www.cl.cam.ac.uk/~fh295/simlex.html.

  7. http://www.technion.ac.il/~kirar/Datasets.html.

  8. The Amazon Mechanical Turk (MTurk) is an online labor market where workers are paid small amounts of money to complete small tasks. https://www.mturk.com/mturk/welcome.

  9. http://clic.cimec.unitn.it/elia.bruni/MEN.htm.

  10. http://wacky.sslmit.unibo.it/doku.php?id=corpora.

  11. http://www-nlp.stanford.edu/~lmthang/morphoNLM/.

Acknowledgments

The authors would like to express their gratitude to Mr. Anouar Smaoui from the English Language Unit at the Sfax Faculty of Sciences for his constructive language editing and proofreading services.

Author information

Corresponding author

Correspondence to Mohamed Ali Hadj Taieb.

Appendix: An example of the application of Algorithm 1

figure d

Cite this article

Aouicha, M.B., Hadj Taieb, M.A. & Hamadou, A.B. Taxonomy-based information content and wordnet-wiktionary-wikipedia glosses for semantic relatedness. Appl Intell 45, 475–511 (2016). https://doi.org/10.1007/s10489-015-0755-x

  • DOI: https://doi.org/10.1007/s10489-015-0755-x
