Automatic Identification and Disambiguation of Concepts and Named Entities in the Multilingual Wikipedia

  • Federico Scozzafava
  • Alessandro Raganato
  • Andrea Moro
  • Roberto Navigli
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9336)

Abstract

In this paper we present an automatic multilingual annotation of the Wikipedia dumps in two languages, with both word senses (i.e. concepts) and named entities. We use Babelfy 1.0, a state-of-the-art multilingual Word Sense Disambiguation and Entity Linking system. As its reference inventory, Babelfy draws upon BabelNet 3.0, a very large multilingual encyclopedic dictionary and semantic network which connects concepts and named entities in 271 languages from different inventories, such as WordNet, Open Multilingual WordNet, Wikipedia, OmegaWiki, Wiktionary and Wikidata. In addition, we perform both an automatic evaluation of the dataset and a language-specific statistical analysis. In detail, we investigate the word sense distributions by part-of-speech and language, together with the similarity of the annotated entities and concepts for a random sample of interlinked Wikipedia pages in different languages. The annotated corpora are available at http://lcl.uniroma1.it/babelfied-wikipedia/.

Keywords

Semantic annotation, Named entities, Word senses, Disambiguation, Multilinguality, Corpus annotation, Sense annotation, Word sense disambiguation, Entity linking

Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Federico Scozzafava (1)
  • Alessandro Raganato (1)
  • Andrea Moro (1)
  • Roberto Navigli (1)
  1. Dipartimento di Informatica, Sapienza Università di Roma, Roma, Italy
