Applied Intelligence, Volume 45, Issue 2, pp 475–511

Taxonomy-based information content and wordnet-wiktionary-wikipedia glosses for semantic relatedness

  • Mohamed Ben Aouicha
  • Mohamed Ali Hadj Taieb
  • Abdelmajid Ben Hamadou

Abstract

Computing the semantic similarity or relatedness between terms is an important research area for several disciplines, including artificial intelligence, cognitive science, linguistics, psychology, biomedicine and information retrieval. Such measures exploit knowledge bases to express the semantics of concepts. Some approaches, such as the information-theoretic ones, rely on the structure of the knowledge base, while others, such as the gloss-based ones, use its content. First, on the structural side, we propose a new intrinsic Information Content (IC) computing method based on quantifying the subgraph formed by the ancestors of the target concept. Taxonomic measures, including the IC-based ones, rely on topological parameters that must be extracted from taxonomies treated as Directed Acyclic Graphs (DAGs). Accordingly, we propose a set of graph algorithms that provide basic parameters such as depth, ancestors, descendants and the Lowest Common Subsumer (LCS). The IC-computing method is assessed on several knowledge structures: the noun and verb WordNet “is a” taxonomies, the Wikipedia Category Graph (WCG) and the MeSH taxonomy. We also propose an aggregation scheme that exploits the WordNet “is a” taxonomy and the WCG in a complementary way through the IC-based measures to improve coverage. Second, on the content side, we propose a gloss-based semantic similarity measure that weights nouns using our IC-computing method and draws on the WordNet, Wiktionary and Wikipedia resources. Further evaluation is performed on nouns, verbs, multiword expressions and biomedical datasets, using well-recognized benchmarks. The results indicate an improvement in similarity and relatedness assessment accuracy.
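As a rough illustration of the topological parameters named in the abstract (ancestors, descendants, depth, LCS), the sketch below computes them on a small hand-made "is a" DAG. It does not reproduce the paper's IC-computing method, which the abstract does not specify. The IC function used here is the descendant-based intrinsic IC of Seco et al., swapped in as a stand-in, and the toy taxonomy, the function names, the longest-path definition of depth and the "highest-IC common subsumer" definition of the LCS are all assumptions made for this example.

```python
import math
from collections import deque

# Hypothetical toy "is a" taxonomy: child -> list of parents.
# It is a DAG rather than a tree because "dog" has two parents.
PARENTS = {
    "entity": [],
    "animal": ["entity"],
    "pet": ["entity"],
    "mammal": ["animal"],
    "bird": ["animal"],
    "dog": ["mammal", "pet"],
    "cat": ["mammal"],
}


def ancestors(node):
    """Strict ancestors of node, collected by breadth-first search over parent links."""
    seen, queue = set(), deque(PARENTS[node])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(PARENTS[current])
    return seen


def descendants(node):
    """Strict descendants of node: every concept that has node among its ancestors."""
    return {n for n in PARENTS if node in ancestors(n)}


def depth(node):
    """Depth taken as the longest path from a root (an assumption; shortest path is also common)."""
    if not PARENTS[node]:
        return 0
    return 1 + max(depth(parent) for parent in PARENTS[node])


def intrinsic_ic(node):
    """Stand-in intrinsic IC (Seco et al.): 1 - log(|descendants| + 1) / log(N)."""
    return 1.0 - math.log(len(descendants(node)) + 1) / math.log(len(PARENTS))


def lcs(a, b):
    """Lowest Common Subsumer, taken here as the common subsumer with the highest IC."""
    common = (ancestors(a) | {a}) & (ancestors(b) | {b})
    return max(common, key=intrinsic_ic)


def resnik_similarity(a, b):
    """Resnik-style similarity: the IC of the LCS of the two concepts."""
    return intrinsic_ic(lcs(a, b))


if __name__ == "__main__":
    print(sorted(ancestors("dog")))
    print(depth("dog"), lcs("dog", "cat"), lcs("dog", "bird"))
    print(round(resnik_similarity("dog", "cat"), 4))
```

Recomputing ancestors on every call keeps the sketch short but is quadratic in the number of concepts; on a taxonomy the size of WordNet or the WCG, one would precompute these sets once, for example in topological order.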

Keywords

Information content; Gloss; WordNet; Wikipedia; Wiktionary; MeSH; DAG algorithms; Semantic similarity; Semantic relatedness

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Mohamed Ben Aouicha (1)
  • Mohamed Ali Hadj Taieb (1)
  • Abdelmajid Ben Hamadou (1)
  1. Multimedia Information System and Advanced Computing Laboratory, Sfax University, Sfax, Tunisia
