An Overview of BabelNet and its API for Multilingual Language Processing

Navigli, Roberto; Ponzetto, Simone Paolo

doi:10.1007/978-3-642-35085-6_7

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

In this chapter we present BabelNet, a very large multilingual semantic network. We first describe the two-stage approach used to build it, namely: (a) the automatic integration of lexicographic information from WordNet with encyclopedic knowledge from Wikipedia; (b) the combination of Wikipedia’s manually-edited translations with the output of a state-of-the-art machine translation system. Next, we present in detail statistics about the current version of BabelNet, which consists of a very large semantic network with lexicalizations for six languages (Catalan, English, French, German, Italian and Spanish). The figures all indicate that, thanks to our methodology, we are able to effectively create a knowledge repository containing wide-coverage lexical knowledge for many different languages. Finally, we present an overview of the Application Programming Interface (API) which enables easy programmatic access to all levels of information encoded in BabelNet.
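The two-stage construction summarized above can be illustrated with a small, purely hypothetical sketch. The class `MultilingualSynset`, the function `build_synset`, the toy translator and the example words are assumptions made for illustration only; they are not part of the BabelNet API or its data format. The sketch also assumes that machine translation is used only where no human-edited translation is available, a policy the abstract does not spell out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional, Set


@dataclass
class MultilingualSynset:
    """Hypothetical container for one concept; not BabelNet's real data model."""
    wordnet_senses: Set[str] = field(default_factory=set)
    wikipedia_page: Optional[str] = None
    # language code -> set of lexicalizations in that language
    lexicalizations: Dict[str, Set[str]] = field(default_factory=dict)


def build_synset(
    wordnet_senses: Iterable[str],
    wikipedia_page: Optional[str],
    interlanguage_links: Dict[str, str],
    translate: Callable[[str, str], str],
    languages: Iterable[str],
) -> MultilingualSynset:
    """Stage (a): merge a WordNet synset with its mapped Wikipedia page.
    Stage (b): add lexicalizations from human-edited links, else from MT."""
    senses = set(wordnet_senses)
    synset = MultilingualSynset(senses, wikipedia_page)

    # Stage (a): English lexicalizations come from both resources.
    english = set(senses)
    if wikipedia_page is not None:
        english.add(wikipedia_page)
    synset.lexicalizations["en"] = english

    # Stage (b): prefer Wikipedia's manually edited translation;
    # fall back to the (assumed) translation function otherwise.
    for lang in languages:
        if lang in interlanguage_links:
            synset.lexicalizations[lang] = {interlanguage_links[lang]}
        else:
            synset.lexicalizations[lang] = {translate(s, lang) for s in senses}
    return synset
```

A call such as `build_synset(["balloon"], "Balloon (aircraft)", {"de": "Ballon"}, toy_mt, ["de", "it"])`, where `toy_mt` is any function mapping a word and a language code to a string, would take the German entry from the link table and the Italian entry from the translator.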

Notes

1. See also [10] for a discussion from a machine learning perspective.
2. http://www.wikipedia.org
3. This paper is based on [36] and [37]. We expand our previous work by giving in Sect. 7.4 statistics for the current version of BabelNet, as well as an overview of how to access it programmatically in Sect. 7.5.
4. Sense disambiguated glosses are distributed by the Princeton WordNet project at http://wordnet.princeton.edu/glosstag.shtml.
5. Throughout this chapter, we use Sans Serif for words, Small Caps for Wikipedia pages and CAPITALS for Wikipedia categories.
6. Throughout the paper, unless otherwise stated, we use the general term concept to denote either a concept or a named entity.
7. Lexical relations link senses (e.g., dental_a¹ pertains-to tooth_n¹). However, relations between senses can be easily extended to the synsets which contain them, thus making all the relations connect synsets.
8. We use the Google Translate API. An initial prototype used a statistical machine translation system based on Moses [18] and trained on Europarl [17]. However, we found such a system unable to cope with many technical names, such as those from the domains of science, literature, history, etc.
9. “The article should begin with a declarative sentence telling the nonspecialist reader what (or who) the subject is.”, extracted from http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section#First_sentence. This simple, albeit powerful, heuristic has been successfully used in previous work to construct a corpus of definitional sentences [40].
10. http://lucene.apache.org
11. http://www.wiktionary.org
12. http://www.globalwordnet.org
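
The extension of sense-level relations to synsets mentioned in note 7 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `lift_to_synsets`, the sense keys and the synset identifiers are hypothetical and do not come from WordNet or BabelNet.

```python
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def lift_to_synsets(
    sense_relations: Iterable[Triple],
    sense_to_synset: Dict[str, str],
) -> Set[Triple]:
    """Rewrite (source sense, relation, target sense) triples so that
    both endpoints are the synsets containing those senses."""
    lifted = set()
    for source, relation, target in sense_relations:
        lifted.add((sense_to_synset[source], relation, sense_to_synset[target]))
    return lifted


# Hypothetical identifiers standing in for the note's example
# dental_a^1 pertains-to tooth_n^1.
relations = [("dental#a#1", "pertains-to", "tooth#n#1")]
membership = {"dental#a#1": "synset_dental", "tooth#n#1": "synset_tooth"}
synset_relations = lift_to_synsets(relations, membership)
```

Using a set for the result means that several sense pairs belonging to the same two synsets collapse into a single synset-level edge.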

References

Agirre E, Soroa A (2009) Personalizing PageRank for Word Sense Disambiguation. In: Proceedings of the 12th conference of the European chapter of the association for computational linguistics, Athens, Greece, 30 March–3 April 2009, pp 33–41
Atserias J, Villarejo L, Rigau G, Agirre E, Carroll J, Magnini B, Vossen P (2004) The MEANING multilingual central repository. In: Proceedings of the 2nd international global WordNet conference, Brno, Czech Republic, 20–23 Jan 2004, pp 80–210
Baker CF, Fillmore CJ, Lowe JB (1998) The Berkeley FrameNet project. In: Proceedings of the 17th international conference on computational linguistics and 36th annual meeting of the association for computational linguistics, Montréal, Québec, Canada, 10–14 Aug 1998
Banerjee S, Pedersen T (2003) Extended gloss overlap as a measure of semantic relatedness. In: Proceedings of the 18th international joint conference on artificial intelligence, Acapulco, Mexico, 9–15 Aug 2003, pp 805–810
Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the Web. In: Proceedings of the 20th international joint conference on artificial intelligence, Hyderabad, India, 6–12 Jan 2007, pp 2670–2676
Barrón-Cedeño A, Rosso P, Agirre E, Labaka G (2010) Plagiarism detection across distant language pairs. In: Proceedings of the 23rd international conference on computational linguistics, Beijing, China, 23–27 Aug 2010, pp 37–45
Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R, Hellmann S (2009) DBpedia – a crystallization point for the web of data. J Web Semant 7(3):154–165
Buitelaar P, Cimiano P, Magnini B (eds) (2005) Ontology learning from text: methods, evaluation and applications. IOS, Amsterdam
Bunescu R, Paşca M (2006) Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of the 11th conference of the European chapter of the association for computational linguistics, Trento, Italy, 3–7 Apr 2006, pp 9–16
Domingos P (2007) Toward knowledge-rich data mining. Data Min Knowl Disc 15(1):21–28
Fellbaum C (ed) (1998) WordNet: an electronic lexical database. MIT, Cambridge, MA
Gabrilovich E, Markovitch S (2006) Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. In: Proceedings of the 21st national conference on artificial intelligence, Boston, MA, 16–20 July 2006, pp 1301–1306
Gurevych I, Eckle-Kohler J, Hartmann S, Matuschek M, Meyer CM, Wirth C (2012) UBY – a large-scale unified lexical-semantic resource based on LMF. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics, Avignon, France, 23–27 Apr 2012, pp 580–590
Harabagiu SM, Moldovan D, Paşca M, Mihalcea R, Surdeanu M, Bunescu R, Girju R, Rus V, Morarescu P (2000) FALCON: boosting knowledge for answer engines. In: Proceedings of the ninth text REtrieval conference, Gaithersburg, Maryland, 13–16 Nov 2000, pp 479–488
Henrich V, Hinrichs E, Vodolazova T (2011) Semi-automatic extension of GermaNet with sense definitions from Wiktionary. In: Proceedings of 5th language & technology conference, Poznań, Poland, 25–27 Nov 2011, pp 126–130
Hoffart J, Suchanek FM, Berberich K, Weikum G (2012) YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif Intell. doi:10.1016/j.artint.2012.06.001
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of machine translation summit X, Phuket, Thailand, 2005, pp 79–86
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Companion volume to the proceedings of the 45th annual meeting of the association for computational linguistics, Prague, Czech Republic, 23–30 June 2007, pp 177–180
Lemnitzer L, Kunze C (2002) GermaNet – representation, visualization, application. In: Proceedings of the 3rd international conference on language resources and evaluation, Las Palmas, Canary Islands, Spain, 29–31 May 2002, pp 1485–1491
Lenat DB (1995) Cyc: a large-scale investment in knowledge infrastructure. Commun ACM 38(11):33–38
Lita LV, Hunt WA, Nyberg E (2004) Resource analysis for question answering. In: Companion volume to the proceedings of the 42nd annual meeting of the association for computational linguistics, Barcelona, Spain, 21–26 July 2004, pp 162–165
Lu B, Tan C, Cardie C, Tsou BK (2011) Joint bilingual sentiment classification with unlabeled parallel corpora. In: Proceedings of the 49th annual meeting of the association for computational linguistics, Portland, OR, 19–24 June 2011, pp 320–330
Medelyan O, Milne D, Legg C, Witten IH (2009) Mining meaning from Wikipedia. Int J Hum-Comput Stud 67(9):716–754. doi:10.1016/j.ijhcs.2009.05.004
Mehdad Y, Negri M, Federico M (2011) Using bilingual parallel corpora for cross-lingual textual entailment. In: Proceedings of the 49th annual meeting of the association for computational linguistics, Portland, OR, 19–24 June 2011, pp 1336–1345
de Melo G, Weikum G (2009) Towards a universal wordnet by learning from combined evidence. In: Proceedings of the eighteenth ACM conference on information and knowledge management, Hong Kong, China, 2–6 Nov 2009, pp 513–522
de Melo G, Weikum G (2010) MENTA: inducing multilingual taxonomies from Wikipedia. In: Proceedings of the nineteenth ACM conference on information and knowledge management, Toronto, Canada, 26–30 Oct 2010, pp 1099–1108
Meyer CM, Gurevych I (2011) What psycholinguists know about chemistry: aligning Wiktionary and WordNet for increased domain coverage. In: Proceedings of the 5th international joint conference on natural language processing, Chiang Mai, Thailand, 8–13 Nov 2011, pp 883–892
Miller GA, Leacock C, Tengi R, Bunker R (1993) A semantic concordance. In: Proceedings of the 3rd DARPA workshop on human language technology, Plainsboro, NJ, pp 303–308
Moro A, Navigli R (2012) WiSeNet: building a Wikipedia-based semantic network with ontologized relations. In: Proceedings of the twenty-first ACM conference on information and knowledge management, Maui, Hawaii, 29 Oct–2 Nov 2012
Nastase V (2008) Topic-driven multi-document summarization with encyclopedic knowledge and activation spreading. In: Proceedings of the conference on empirical methods in natural language processing, Waikiki, Honolulu, HI, 25–27 Oct 2008, pp 763–772
Nastase V, Strube M (2008) Decoding Wikipedia category names for knowledge acquisition. In: Proceedings of the 23rd conference on the advancement of artificial intelligence, Chicago, IL, 13–17 July 2008, pp 1219–1224
Nastase V, Strube M (2012) Transforming Wikipedia into a large scale multilingual concept network. Artif Intell. doi:10.1016/j.artint.2012.06.008
Navigli R (2009) Word Sense Disambiguation: a survey. ACM Comput Surv 41(2):1–69
Navigli R (2012) A quick tour of Word Sense Disambiguation, induction and related approaches. In: Bieliková M, Friedrich G, Gottlob G, Katzenbeisser S, Turán G (eds) SOFSEM 2012: theory and practice of computer science. Lecture notes in computer science, vol 7147. Springer, Heidelberg, pp 115–129
Navigli R, Lapata M (2010) An experimental study on graph connectivity for unsupervised Word Sense Disambiguation. IEEE Trans Pattern Anal Mach Intell 32(4):678–692
Navigli R, Ponzetto SP (2010) BabelNet: building a very large multilingual semantic network. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, 11–16 July 2010, pp 216–225
Navigli R, Ponzetto SP (2012) BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif Intell. doi:10.1016/j.artint.2012.07.001
Navigli R, Ponzetto SP (2012) BabelRelate! A joint multilingual approach to computing semantic relatedness. In: Proceedings of the 26th conference on artificial intelligence, Toronto, ON, Canada, 22–26 July 2012, pp 108–114
Navigli R, Ponzetto SP (2012) Joining forces pays off: multilingual joint Word Sense Disambiguation. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational language learning, Jeju Island, South Korea, 12–14 July 2012, pp 1399–1410
Navigli R, Velardi P (2010) Learning Word-Class Lattices for definition and hypernym extraction. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, 11–16 July 2010, pp 1318–1327
Navigli R, Faralli S, Soroa A, de Lacalle OL, Agirre E (2011) Two birds with one stone: learning semantic models for Text Categorization and Word Sense Disambiguation. In: Proceedings of the twentieth ACM conference on information and knowledge management, Glasgow, Scotland, UK, 24–28 Oct 2011, pp 2317–2320
Ng HT, Lee HB (1996) Integrating multiple knowledge sources to disambiguate word senses: an exemplar-based approach. In: Proceedings of the 34th annual meeting of the association for computational linguistics, Santa Cruz, CA, 24–27 June 1996, pp 40–47
Niemann E, Gurevych I (2011) The people’s web meets linguistic knowledge: automatic sense alignment of Wikipedia and WordNet. In: Proceedings of the 9th international conference on computational semantics, Oxford, UK, pp 205–214
Paşca M (2007) Organizing and searching the world wide web of facts – step two: harnessing the wisdom of the crowds. In: Proceedings of the 16th world wide web conference, Banff, Canada, 8–12 May 2007, pp 101–110
Paşca M, Lin D, Bigham J, Lifchits A, Jain A (2006) Organizing and searching the world wide web of facts – step one: the one-million fact extraction challenge. In: Proceedings of the 21st national conference on artificial intelligence, Boston, MA, 16–20 July 2006, pp 1400–1405
Pianta E, Bentivogli L, Girardi C (2002) MultiWordNet: developing an aligned multilingual database. In: Proceedings of the 1st international global WordNet conference, Mysore, India, 21–25 Jan 2002, pp 21–25
Ponzetto SP, Navigli R (2009) Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In: Proceedings of the 21st international joint conference on artificial intelligence, Pasadena, CA, 14–17 July 2009, pp 2083–2088
Ponzetto SP, Navigli R (2010) Knowledge-rich Word Sense Disambiguation rivaling supervised systems. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, 11–16 July 2010, pp 1522–1531
Ponzetto SP, Strube M (2007) Knowledge derived from Wikipedia for computing semantic relatedness. J Artif Intell Res 30:181–212
Ponzetto SP, Strube M (2011) Taxonomy induction based on a collaboratively built knowledge repository. Artif Intell 175:1737–1756
Rahman A, Ng V (2011) Narrowing the modeling gap: a cluster-ranking approach to coreference resolution. J Artif Intell Res 40:469–521
Ruiz-Casado M, Alfonseca E, Castells P (2005) Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets. In: Advances in web intelligence. Lecture notes in computer science, vol 3528. Springer, Berlin/New York, pp 380–386
Schubert LK (2006) Turing’s dream and the knowledge challenge. In: Proceedings of the 21st national conference on artificial intelligence, Boston, MA, 16–20 July 2006, pp 1534–1538
Suchanek FM, Kasneci G, Weikum G (2008) Yago: a large ontology from Wikipedia and WordNet. J Web Semant 6(3):203–217
Toral A, Ferrández O, Agirre E, Muñoz R (2009) A study on linking Wikipedia categories to WordNet synsets using text similarity. In: Proceedings of the international conference on recent advances in natural language processing, Borovets, Bulgaria, 14–16 Sept 2009, pp 449–454
Tufiş D, Ion R, Ide N (2004) Fine-grained Word Sense Disambiguation based on parallel corpora, word alignment, word clustering, and aligned wordnets. In: Proceedings of the 20th international conference on computational linguistics, Geneva, Switzerland, 23–27 Aug 2004, pp 1312–1318
Vossen P (ed) (1998) EuroWordNet: a multilingual database with lexical semantic networks. Kluwer, Dordrecht
Wang P, Domeniconi C (2008) Building semantic kernels for text classification using Wikipedia. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, Las Vegas, Nevada, 24–27 Aug 2008, pp 713–721
Wu F, Weld D (2008) Automatically refining the Wikipedia infobox ontology. In: Proceedings of the 17th world wide web conference, Beijing, China, 21–25 Apr 2008, pp 635–644
Yarowsky D, Florian R (2002) Evaluating sense disambiguation across diverse parameter spaces. Nat Lang Eng 9(4):293–310
Yates A, Etzioni O (2009) Unsupervised methods for determining object and relation synonyms on the web. J Artif Intell Res 34:255–296
Article Google Scholar

Acknowledgements

The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234. Thanks go to Google for access to the University Research Program for Google Translate.

Author information

Authors and Affiliations

Dipartimento di Informatica, Sapienza University of Rome, Rome, Italy
Roberto Navigli & Simone Paolo Ponzetto

Authors

Roberto Navigli
Simone Paolo Ponzetto

Corresponding author

Correspondence to Simone Paolo Ponzetto.

Editor information

Editors and Affiliations

Department of Computer Science, Ubiquitous Knowledge Processing (UKP) Lab, Technische Universität Darmstadt, Darmstadt, Germany
Iryna Gurevych & Jungi Kim

About this chapter

Cite this chapter

Navigli, R., Ponzetto, S.P. (2013). An Overview of BabelNet and its API for Multilingual Language Processing. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_7

DOI: https://doi.org/10.1007/978-3-642-35085-6_7
Published: 21 February 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35084-9
Online ISBN: 978-3-642-35085-6
eBook Packages: Computer Science, Computer Science (R0)
