Fast, Accurate, Multilingual Semantic Relatedness Measurement Using Wikipedia Links

  • Dante Degl’Innocenti
  • Dario De Nart
  • M. Helmy
  • C. Tasso
Part of the Studies in Computational Intelligence book series (SCI, volume 740)


In this chapter we present a fast, accurate, and elegant metric for assessing semantic relatedness among entities in a hypertextual corpus, building a novel language-independent Vector Space Model. The technique is based on the Jaccard similarity coefficient, approximated with the MinHash technique to generate a constant-size vector fingerprint for each entity in the corpus. This strategy allows pairwise semantic relatedness to be evaluated in constant time, regardless of how many entities the data contain and how dense the internal link structure is. Because semantic relatedness is a subtle and somewhat subjective matter, we evaluated our approach by running user tests on a crowdsourcing platform. To strengthen the evaluation, we considered two collaboratively built corpora, the English Wikipedia and the Italian Wikipedia, which differ significantly in size, topology, and user base. The evaluation suggests that the proposed technique generates satisfactory results, outperforming commercial baseline systems regardless of the data employed and the cultural differences among the test users.
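The idea of a constant-size fingerprint can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the function names, the linear hash family, the fingerprint length of 128, and the representation of each entity as a set of integer ids of linked pages are all assumptions made for illustration.

```python
import random

# Assumption: a Mersenne prime as modulus; link ids must lie in [0, PRIME).
PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(link_ids, params):
    """Fixed-length fingerprint: the minimum hash value under each function.

    link_ids is a non-empty set of integer ids (hypothetical page ids).
    """
    return [min((a * x + b) % PRIME for x in link_ids) for a, b in params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where two fingerprints agree.

    The cost depends only on the fingerprint length, not on set sizes.
    """
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B|, for comparison."""
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical link sets for two entities (ids are made up).
params = make_hash_params(num_hashes=128, seed=42)
rome = {1, 5, 9, 12, 40, 77}
italy = {1, 5, 9, 33, 40, 90}
sig_rome = minhash_signature(rome, params)
sig_italy = minhash_signature(italy, params)

print("exact:    ", exact_jaccard(rome, italy))
print("estimated:", estimated_jaccard(sig_rome, sig_italy))
```

In this sketch the fingerprints would be computed once per entity. Each pairwise comparison then touches only the two fixed-length lists, which is why its cost does not grow with the number of entities or links.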


Semantic networks · Vector space · Text processing theory · Multilinguality

Mathematics Subject Classification (2010)

Primary 68T30; Secondary 68T50



Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Dante Degl’Innocenti (1, email author)
  • Dario De Nart (1)
  • M. Helmy (1)
  • C. Tasso (1)
  1. Department of Mathematics and Computer Science, University of Udine, Udine, Italy
