Abstract
Computing the semantic similarity or relatedness between terms is an important research area for several disciplines, including artificial intelligence, cognitive science, linguistics, psychology, biomedicine and information retrieval. Similarity and relatedness measures exploit knowledge bases to express the semantics of concepts. Some approaches, such as the information-theoretic ones, rely on the structure of the knowledge base, while others, such as the gloss-based ones, use its content. First, on the structural side, we propose a new intrinsic Information Content (IC) computing method based on quantifying the subgraph formed by the ancestors of the target concept. Taxonomic measures, including the IC-based ones, depend on topological parameters that must be extracted from taxonomies treated as Directed Acyclic Graphs (DAGs). Accordingly, we propose a set of graph algorithms that provide basic parameters such as depth, ancestors, descendants and the Lowest Common Subsumer (LCS). The IC-computing method is assessed on several knowledge structures: the noun and verb WordNet “is a” taxonomies, the Wikipedia Category Graph (WCG) and the MeSH taxonomy. We also propose an aggregation scheme that uses the WordNet “is a” taxonomy and the WCG in a complementary way through the IC-based measures to improve coverage. Second, on the content side, we propose a gloss-based semantic similarity measure that weights nouns using our IC-computing method and draws on the WordNet, Wiktionary and Wikipedia resources. The evaluation covers nouns, verbs, multiword expressions and biomedical datasets, using well-recognized benchmarks. The results indicate an improvement in the accuracy of similarity and relatedness assessment.
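To make the topological parameters named in the abstract concrete, the following is a minimal Python sketch of how ancestors, depth and the LCS could be extracted from a taxonomy stored as a DAG. It is not the authors' routine: the child-to-parents dictionary, the function names, the toy taxonomy, the choice of longest-path depth and the definition of the LCS as the deepest common subsumer are all assumptions made for illustration.

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its direct parents ("is a" links).
toy_parents = {
    "entity": [],
    "animal": ["entity"],
    "pet": ["entity"],
    "dog": ["animal", "pet"],
    "cat": ["animal", "pet"],
    "puppy": ["dog"],
}


def ancestors(parents, node):
    """Return the set of all strict ancestors of `node` in the DAG."""
    seen = set()
    stack = list(parents.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def depths(parents):
    """Return the longest-path depth of every node (roots have depth 0).

    Nodes are processed in topological order (Kahn's algorithm), so a node's
    depth is final once all of its parents have been visited.
    """
    nodes = set(parents)
    for direct_parents in parents.values():
        nodes.update(direct_parents)
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children[parent].append(child)
            indegree[child] += 1
    depth = {n: 0 for n in nodes}
    queue = deque(n for n in nodes if indegree[n] == 0)
    while queue:
        node = queue.popleft()
        for child in children[node]:
            depth[child] = max(depth[child], depth[node] + 1)
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return depth


def lowest_common_subsumers(parents, a, b, depth):
    """Return the deepest concepts that subsume both `a` and `b`.

    A concept is treated as subsuming itself. In a DAG the result may
    contain more than one concept.
    """
    common = (ancestors(parents, a) | {a}) & (ancestors(parents, b) | {b})
    if not common:
        return set()
    deepest = max(depth[c] for c in common)
    return {c for c in common if depth[c] == deepest}


if __name__ == "__main__":
    d = depths(toy_parents)
    print(sorted(ancestors(toy_parents, "puppy")))
    print(sorted(lowest_common_subsumers(toy_parents, "puppy", "cat", d)))
```

On this toy taxonomy the common subsumers of "puppy" and "cat" are "animal", "pet" and "entity", and the two deepest ones, "animal" and "pet", are returned together, which illustrates why multiple inheritance makes the LCS in a DAG less straightforward than in a tree.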
Notes
card(Leaves) < n = |E|, where n represents the cardinality of the set of nodes.
The Amazon Mechanical Turk (MTurk) is an online labor market where workers are paid small amounts of money to complete small tasks. https://www.mturk.com/mturk/welcome.
Acknowledgments
The authors would like to express their gratitude to Mr. Anouar Smaoui from the English Language Unit at the Sfax Faculty of Sciences for his constructive language editing and proofreading services.
Appendix: An example of the application of Algorithm 1
Cite this article
Aouicha, M.B., Hadj Taieb, M.A. & Hamadou, A.B. Taxonomy-based information content and wordnet-wiktionary-wikipedia glosses for semantic relatedness. Appl Intell 45, 475–511 (2016). https://doi.org/10.1007/s10489-015-0755-x
DOI: https://doi.org/10.1007/s10489-015-0755-x