An Overview of BabelNet and its API for Multilingual Language Processing

The People’s Web Meets NLP

Abstract

In this chapter we present BabelNet, a very large multilingual semantic network. We first describe the two-stage approach used to build it: (a) the automatic integration of lexicographic information from WordNet with encyclopedic knowledge from Wikipedia; and (b) the combination of Wikipedia’s manually edited translations with the output of a state-of-the-art machine translation system. Next, we present detailed statistics for the current version of BabelNet, which provides lexicalizations for six languages (Catalan, English, French, German, Italian and Spanish). The figures indicate that our methodology yields a knowledge repository containing wide-coverage lexical knowledge for many different languages. Finally, we present an overview of the Application Programming Interface (API), which enables easy programmatic access to all levels of information encoded in BabelNet.

Notes

  1. See also [10] for a discussion from a machine learning perspective.

  2. http://www.wikipedia.org

  3. This paper is based on [36] and [37]. We expand our previous work by giving in Sect. 7.4 statistics for the current version of BabelNet, as well as an overview of how to access it programmatically in Sect. 7.5.

  4. Sense-disambiguated glosses are distributed by the Princeton WordNet project at http://wordnet.princeton.edu/glosstag.shtml.

  5. Throughout this chapter, we use Sans Serif for words, Small Caps for Wikipedia pages and CAPITALS for Wikipedia categories.

  6. Throughout the paper, unless otherwise stated, we use the general term concept to denote either a concept or a named entity.

  7. Lexical relations link senses (e.g., dental_a^1 pertains-to tooth_n^1). However, relations between senses can easily be extended to the synsets that contain them, so that all relations connect synsets.

  8. We use the Google Translate API. An initial prototype used a statistical machine translation system based on Moses [18] and trained on Europarl [17]. However, we found such a system unable to cope with many technical names, such as those in the domains of science, literature, history, etc.

  9. “The article should begin with a declarative sentence telling the nonspecialist reader what (or who) the subject is.”, extracted from http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section#First_sentence. This simple, albeit powerful, heuristic has been successfully used in previous work to construct a corpus of definitional sentences [40].

  10. http://lucene.apache.org

  11. http://www.wiktionary.org

  12. http://www.globalwordnet.org

References

  1. Agirre E, Soroa A (2009) Personalizing PageRank for Word Sense Disambiguation. In: Proceedings of the 12th conference of the European chapter of the association for computational linguistics, Athens, Greece, 30 March–3 April 2009, pp 33–41

  2. Atserias J, Villarejo L, Rigau G, Agirre E, Carroll J, Magnini B, Vossen P (2004) The MEANING multilingual central repository. In: Proceedings of the 2nd international global WordNet conference, Brno, Czech Republic, 20–23 Jan 2004, pp 80–210

  3. Baker CF, Fillmore CJ, Lowe JB (1998) The Berkeley FrameNet project. In: Proceedings of the 17th international conference on computational linguistics and 36th annual meeting of the association for computational linguistics, Montréal, Québec, Canada, 10–14 Aug 1998

  4. Banerjee S, Pedersen T (2003) Extended gloss overlap as a measure of semantic relatedness. In: Proceedings of the 18th international joint conference on artificial intelligence, Acapulco, Mexico, 9–15 Aug 2003, pp 805–810

  5. Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the Web. In: Proceedings of the 20th international joint conference on artificial intelligence, Hyderabad, India, 6–12 Jan 2007, pp 2670–2676

  6. Barrón-Cedeño A, Rosso P, Agirre E, Labaka G (2010) Plagiarism detection across distant language pairs. In: Proceedings of the 23rd international conference on computational linguistics, Beijing, China, 23–27 Aug 2010, pp 37–45

  7. Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R, Hellmann S (2009) DBpedia – a crystallization point for the web of data. J Web Semant 7(3):154–165

  8. Buitelaar P, Cimiano P, Magnini B (eds) (2005) Ontology learning from text: methods, evaluation and applications. IOS, Amsterdam

  9. Bunescu R, Paşca M (2006) Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of the 11th conference of the European chapter of the association for computational linguistics, Trento, Italy, 3–7 Apr 2006, pp 9–16

  10. Domingos P (2007) Toward knowledge-rich data mining. Data Min Knowl Disc 15(1):21–28

  11. Fellbaum C (ed) (1998) WordNet: an electronic lexical database. MIT, Cambridge, MA

  12. Gabrilovich E, Markovitch S (2006) Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. In: Proceedings of the 21st national conference on artificial intelligence, Boston, MA, 16–20 July 2006, pp 1301–1306

  13. Gurevych I, Eckle-Kohler J, Hartmann S, Matuschek M, Meyer CM, Wirth C (2012) UBY – a large-scale unified lexical-semantic resource based on LMF. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics, Avignon, France, 23–27 Apr 2012, pp 580–590

  14. Harabagiu SM, Moldovan D, Paşca M, Mihalcea R, Surdeanu M, Bunescu R, Girju R, Rus V, Morarescu P (2000) FALCON: boosting knowledge for answer engines. In: Proceedings of the ninth text REtrieval conference, Gaithersburg, Maryland, 13–16 Nov 2000, pp 479–488

  15. Henrich V, Hinrichs E, Vodolazova T (2011) Semi-automatic extension of GermaNet with sense definitions from Wiktionary. In: Proceedings of 5th language & technology conference, Poznań, Poland, 25–27 Nov 2011, pp 126–130

  16. Hoffart J, Suchanek FM, Berberich K, Weikum G (2012) YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif Intell. doi:10.1016/j.artint.2012.06.001

  17. Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of machine translation summit X, Phuket, Thailand, 2005, pp 79–86

  18. Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Companion volume to the proceedings of the 45th annual meeting of the association for computational linguistics, Prague, Czech Republic, 23–30 June 2007, pp 177–180

  19. Lemnitzer L, Kunze C (2002) GermaNet – representation, visualization, application. In: Proceedings of the 3rd international conference on language resources and evaluation, Las Palmas, Canary Islands, Spain, 29–31 May 2002, pp 1485–1491

  20. Lenat DB (1995) Cyc: a large-scale investment in knowledge infrastructure. Commun ACM 38(11):33–38

  21. Lita LV, Hunt WA, Nyberg E (2004) Resource analysis for question answering. In: Companion volume to the proceedings of the 42nd annual meeting of the association for computational linguistics, Barcelona, Spain, 21–26 July 2004, pp 162–165

  22. Lu B, Tan C, Cardie C, Tsou BK (2011) Joint bilingual sentiment classification with unlabeled parallel corpora. In: Proceedings of the 49th annual meeting of the association for computational linguistics, Portland, OR, 19–24 June 2011, pp 320–330

  23. Medelyan O, Milne D, Legg C, Witten IH (2009) Mining meaning from Wikipedia. Int J Hum-Comput Stud 67(9):716–754. doi:10.1016/j.ijhcs.2009.05.004

  24. Mehdad Y, Negri M, Federico M (2011) Using bilingual parallel corpora for cross-lingual textual entailment. In: Proceedings of the 49th annual meeting of the association for computational linguistics, Portland, OR, 19–24 June 2011, pp 1336–1345

  25. de Melo G, Weikum G (2009) Towards a universal wordnet by learning from combined evidence. In: Proceedings of the eighteenth ACM conference on information and knowledge management, Hong Kong, China, 2–6 Nov 2009, pp 513–522

  26. de Melo G, Weikum G (2010) MENTA: inducing multilingual taxonomies from Wikipedia. In: Proceedings of the nineteenth ACM conference on information and knowledge management, Toronto, Canada, 26–30 Oct 2010, pp 1099–1108

  27. Meyer CM, Gurevych I (2011) What psycholinguists know about chemistry: aligning Wiktionary and WordNet for increased domain coverage. In: Proceedings of the 5th international joint conference on natural language processing, Chiang Mai, Thailand, 8–13 Nov 2011, pp 883–892

  28. Miller GA, Leacock C, Tengi R, Bunker R (1993) A semantic concordance. In: Proceedings of the 3rd DARPA workshop on human language technology, Plainsboro, NJ, pp 303–308

  29. Moro A, Navigli R (2012) WiSeNet: building a Wikipedia-based semantic network with ontologized relations. In: Proceedings of the twenty-first ACM conference on information and knowledge management, Maui, Hawaii, 29 Oct–2 Nov 2012

  30. Nastase V (2008) Topic-driven multi-document summarization with encyclopedic knowledge and activation spreading. In: Proceedings of the conference on empirical methods in natural language processing, Waikiki, Honolulu, HI, 25–27 Oct 2008, pp 763–772

  31. Nastase V, Strube M (2008) Decoding Wikipedia category names for knowledge acquisition. In: Proceedings of the 23rd conference on the advancement of artificial intelligence, Chicago, IL, 13–17 July 2008, pp 1219–1224

  32. Nastase V, Strube M (2012) Transforming Wikipedia into a large scale multilingual concept network. Artif Intell. doi:10.1016/j.artint.2012.06.008

  33. Navigli R (2009) Word Sense Disambiguation: a survey. ACM Comput Surv 41(2):1–69

  34. Navigli R (2012) A quick tour of Word Sense Disambiguation, induction and related approaches. In: Bieliková M, Friedrich G, Gottlob G, Katzenbeisser S, Turán G (eds) SOFSEM 2012: theory and practice of computer science. Lecture notes in computer science, vol 7147. Springer, Heidelberg, pp 115–129

  35. Navigli R, Lapata M (2010) An experimental study on graph connectivity for unsupervised Word Sense Disambiguation. IEEE Trans Pattern Anal Mach Intel 32(4):678–692

  36. Navigli R, Ponzetto SP (2010) BabelNet: building a very large multilingual semantic network. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, 11–16 July 2010, pp 216–225

  37. Navigli R, Ponzetto SP (2012) BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif Intell. doi:10.1016/j.artint.2012.07.001

  38. Navigli R, Ponzetto SP (2012) BabelRelate! A joint multilingual approach to computing semantic relatedness. In: Proceedings of the 26th conference on artificial intelligence, Toronto, ON, Canada, 22–26 July 2012, pp 108–114

  39. Navigli R, Ponzetto SP (2012) Joining forces pays off: multilingual joint Word Sense Disambiguation. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational language learning, Jeju Island, South Korea, 12–14 July 2012, pp 1399–1410

  40. Navigli R, Velardi P (2010) Learning Word-Class Lattices for definition and hypernym extraction. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, 11–16 July 2010, pp 1318–1327

  41. Navigli R, Faralli S, Soroa A, de Lacalle OL, Agirre E (2011) Two birds with one stone: learning semantic models for Text Categorization and Word Sense Disambiguation. In: Proceedings of the twentieth ACM conference on information and knowledge management, Glasgow, Scotland, UK, 24–28 Oct 2011, pp 2317–2320

  42. Ng HT, Lee HB (1996) Integrating multiple knowledge sources to disambiguate word senses: an exemplar-based approach. In: Proceedings of the 34th annual meeting of the association for computational linguistics, Santa Cruz, CA, 24–27 June 1996, pp 40–47

  43. Niemann E, Gurevych I (2011) The people’s web meets linguistic knowledge: automatic sense alignment of Wikipedia and WordNet. In: Proceedings of the 9th international conference on computational semantics, Oxford, UK, pp 205–214

  44. Paşca M (2007) Organizing and searching the World Wide Web of facts – Step two: Harnessing the wisdom of the crowds. In: Proceedings of the 16th world wide web conference, Banff, Canada, 8–12 May 2007, pp 101–110

  45. Paşca M, Lin D, Bigham J, Lifchits A, Jain A (2006) Organizing and searching the world wide web of facts – step one: the one-million fact extraction challenge. In: Proceedings of the 21st national conference on artificial intelligence, Boston, MA, 16–20 July 2006, pp 1400–1405

  46. Pianta E, Bentivogli L, Girardi C (2002) MultiWordNet: developing an aligned multilingual database. In: Proceedings of the 1st international global WordNet conference, Mysore, India, 21–25 Jan 2002, pp 21–25

  47. Ponzetto SP, Navigli R (2009) Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In: Proceedings of the 21st international joint conference on artificial intelligence, Pasadena, CA, 14–17 July 2009, pp 2083–2088

  48. Ponzetto SP, Navigli R (2010) Knowledge-rich Word Sense Disambiguation rivaling supervised systems. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, 11–16 July 2010, pp 1522–1531

  49. Ponzetto SP, Strube M (2007) Knowledge derived from Wikipedia for computing semantic relatedness. J Artif Intell Res 30:181–212

  50. Ponzetto SP, Strube M (2011) Taxonomy induction based on a collaboratively built knowledge repository. Artif Intell 175:1737–1756

  51. Rahman A, Ng V (2011) Narrowing the modeling gap: a cluster-ranking approach to coreference resolution. J Artif Intell Res 40:469–521

  52. Ruiz-Casado M, Alfonseca E, Castells P (2005) Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets. In: Advances in web intelligence. Lecture notes in computer science, vol 3528. Springer, Berlin/New York, pp 380–386

  53. Schubert LK (2006) Turing’s dream and the knowledge challenge. In: Proceedings of the 21st national conference on artificial intelligence, Boston, MA, 16–20 July 2006, pp 1534–1538

  54. Suchanek FM, Kasneci G, Weikum G (2008) Yago: a large ontology from Wikipedia and WordNet. J Web Semant 6(3):203–217

  55. Toral A, Ferrández O, Agirre E, Muñoz R (2009) A study on linking Wikipedia categories to WordNet synsets using text similarity. In: Proceedings of the international conference on recent advances in natural language processing, Borovets, Bulgaria, 14–16 Sept 2009, pp 449–454

  56. Tufiş D, Ion R, Ide N (2004) Fine-grained Word Sense Disambiguation based on parallel corpora, word alignment, word clustering, and aligned wordnets. In: Proceedings of the 20th international conference on computational linguistics, Geneva, Switzerland, 23–27 Aug 2004, pp 1312–1318

  57. Vossen P (ed) (1998) EuroWordNet: a multilingual database with lexical semantic networks. Kluwer, Dordrecht

  58. Wang P, Domeniconi C (2008) Building semantic kernels for text classification using Wikipedia. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, Las Vegas, Nevada, 24–27 Aug 2008, pp 713–721

  59. Wu F, Weld D (2008) Automatically refining the Wikipedia infobox ontology. In: Proceedings of the 17th world wide web conference, Beijing, China, 21–25 Apr 2008, pp 635–644

  60. Yarowsky D, Florian R (2002) Evaluating sense disambiguation across diverse parameter spaces. Nat Lang Eng 8(4):293–310

  61. Yates A, Etzioni O (2009) Unsupervised methods for determining object and relation synonyms on the web. J Artif Intell Res 34:255–296

Acknowledgements

The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234. Thanks go to Google for access to the University Research Program for Google Translate.

Author information

Corresponding author

Correspondence to Simone Paolo Ponzetto.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Navigli, R., Ponzetto, S.P. (2013). An Overview of BabelNet and its API for Multilingual Language Processing. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_7

  • DOI: https://doi.org/10.1007/978-3-642-35085-6_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35084-9

  • Online ISBN: 978-3-642-35085-6

  • eBook Packages: Computer Science, Computer Science (R0)
