Abstract
In this chapter we present BabelNet, a very large multilingual semantic network. We first describe the two-stage approach used to build it: (a) the automatic integration of lexicographic information from WordNet with encyclopedic knowledge from Wikipedia; and (b) the combination of Wikipedia’s manually edited translations with the output of a state-of-the-art machine translation system. Next, we present detailed statistics on the current version of BabelNet, a very large semantic network with lexicalizations for six languages (Catalan, English, French, German, Italian and Spanish). These figures indicate that our methodology yields a knowledge repository containing wide-coverage lexical knowledge for many different languages. Finally, we give an overview of the Application Programming Interface (API), which enables easy programmatic access to all levels of information encoded in BabelNet.
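The two-stage approach summarized above can be illustrated with a minimal Python sketch. It is not the BabelNet API or the authors' implementation: the `Concept` class, the `build_concept` function, the `toy_translate` stub and all data values are hypothetical. The sketch also assumes that the WordNet-to-Wikipedia mapping is already given and that machine translation is used only for languages lacking a manually edited translation.

```python
from dataclasses import dataclass, field

# The six languages named in the abstract (ISO 639-1 codes).
LANGUAGES = ("ca", "en", "fr", "de", "it", "es")


@dataclass
class Concept:
    """Toy multilingual concept: one set of lexicalizations per language."""
    concept_id: str
    lexicalizations: dict = field(default_factory=dict)

    def add(self, lang, word):
        self.lexicalizations.setdefault(lang, set()).add(word)


def build_concept(concept_id, wordnet_synset, wiki_title, manual_translations, translate):
    """Build one concept from an already-mapped WordNet synset and Wikipedia page."""
    concept = Concept(concept_id)

    # Stage (a): integrate WordNet words with the Wikipedia page title.
    for word in wordnet_synset:
        concept.add("en", word)
    concept.add("en", wiki_title)

    # Stage (b): use manually edited translations where available,
    # otherwise fall back to machine translation (an assumed policy).
    for lang in LANGUAGES:
        if lang == "en":
            continue
        if lang in manual_translations:
            concept.add(lang, manual_translations[lang])
        else:
            concept.add(lang, translate(wiki_title, lang))
    return concept


def toy_translate(text, lang):
    """Hypothetical stand-in for a machine translation system."""
    return f"[MT-{lang}] {text}"


# Illustrative data only; identifiers and translations are made up for the example.
concept = build_concept(
    "toy:0001",
    wordnet_synset={"balloon"},
    wiki_title="Balloon",
    manual_translations={"de": "Ballon"},
    translate=toy_translate,
)
```

Keeping the translation function as a parameter keeps the sketch independent of any particular translation system.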
Notes
- 1.
See also [10] for a discussion from a machine learning perspective.
- 4.
Sense disambiguated glosses are distributed by the Princeton WordNet project at http://wordnet.princeton.edu/glosstag.shtml.
- 5.
Throughout this chapter, we use Sans Serif for words, Small Caps for Wikipedia pages and CAPITALS for Wikipedia categories.
- 6.
Throughout this chapter, unless otherwise stated, we use the general term concept to denote either a concept or a named entity.
- 7.
Lexical relations link senses (e.g., dental^1_a pertains-to tooth^1_n, where the subscript gives the part of speech and the superscript the sense number). However, relations between senses can easily be extended to the synsets which contain them, so that all relations connect synsets.
- 9.
“The article should begin with a declarative sentence telling the nonspecialist reader what (or who) the subject is.” (extracted from http://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section#First_sentence). This simple yet powerful heuristic has been successfully used in previous work to construct a corpus of definitional sentences [40].
References
Agirre E, Soroa A (2009) Personalizing PageRank for Word Sense Disambiguation. In: Proceedings of the 12th conference of the European chapter of the association for computational linguistics, Athens, Greece, 30 March–3 April 2009, pp 33–41
Atserias J, Villarejo L, Rigau G, Agirre E, Carroll J, Magnini B, Vossen P (2004) The MEANING multilingual central repository. In: Proceedings of the 2nd international global WordNet conference, Brno, Czech Republic, 20–23 Jan 2004, pp 80–210
Baker CF, Fillmore CJ, Lowe JB (1998) The Berkeley FrameNet project. In: Proceedings of the 17th international conference on computational linguistics and 36th annual meeting of the association for computational linguistics, Montréal, Québec, Canada, 10–14 Aug 1998
Banerjee S, Pedersen T (2003) Extended gloss overlap as a measure of semantic relatedness. In: Proceedings of the 18th international joint conference on artificial intelligence, Acapulco, Mexico, 9–15 Aug 2003, pp 805–810
Banko M, Cafarella MJ, Soderland S, Broadhead M, Etzioni O (2007) Open information extraction from the Web. In: Proceedings of the 20th international joint conference on artificial intelligence, Hyderabad, India, 6–12 Jan 2007, pp 2670–2676
Barrón-Cedeño A, Rosso P, Agirre E, Labaka G (2010) Plagiarism detection across distant language pairs. In: Proceedings of the 23rd international conference on computational linguistics, Beijing, China, 23–27 Aug 2010, pp 37–45
Bizer C, Lehmann J, Kobilarov G, Auer S, Becker C, Cyganiak R, Hellmann S (2009) DBpedia – a crystallization point for the web of data. J Web Semant 7(3):154–165
Buitelaar P, Cimiano P, Magnini B (eds) (2005) Ontology learning from text: methods, evaluation and applications. IOS, Amsterdam
Bunescu R, Paşca M (2006) Using encyclopedic knowledge for named entity disambiguation. In: Proceedings of the 11th conference of the European chapter of the association for computational linguistics, Trento, Italy, 3–7 Apr 2006, pp 9–16
Domingos P (2007) Toward knowledge-rich data mining. Data Min Knowl Disc 15(1):21–28
Fellbaum C (ed) (1998) WordNet: an electronic lexical database. MIT, Cambridge, MA
Gabrilovich E, Markovitch S (2006) Overcoming the brittleness bottleneck using Wikipedia: enhancing text categorization with encyclopedic knowledge. In: Proceedings of the 21st national conference on artificial intelligence, Boston, MA, 16–20 July 2006, pp 1301–1306
Gurevych I, Eckle-Kohler J, Hartmann S, Matuschek M, Meyer CM, Wirth C (2012) UBY – a large-scale unified lexical-semantic resource based on LMF. In: Proceedings of the 13th conference of the European chapter of the association for computational linguistics, Avignon, France, 23–27 Apr 2012, pp 580–590
Harabagiu SM, Moldovan D, Paşca M, Mihalcea R, Surdeanu M, Bunescu R, Girju R, Rus V, Morarescu P (2000) FALCON: boosting knowledge for answer engines. In: Proceedings of the ninth text REtrieval conference, Gaithersburg, Maryland, 13–16 Nov 2000, pp 479–488
Henrich V, Hinrichs E, Vodolazova T (2011) Semi-automatic extension of GermaNet with sense definitions from Wiktionary. In: Proceedings of 5th language & technology conference, Poznań, Poland, 25–27 Nov 2011, pp 126–130
Hoffart J, Suchanek FM, Berberich K, Weikum G (2012) YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif Intell. doi:10.1016/j.artint.2012.06.001
Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of machine translation summit X, Phuket, Thailand, 2005, pp 79–86
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W, Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit for statistical machine translation. In: Companion volume to the proceedings of the 45th annual meeting of the association for computational linguistics, Prague, Czech Republic, 23–30 June 2007, pp 177–180
Lemnitzer L, Kunze C (2002) GermaNet – representation, visualization, application. In: Proceedings of the 3rd international conference on language resources and evaluation, Las Palmas, Canary Islands, Spain, 29–31 May 2002, pp 1485–1491
Lenat DB (1995) Cyc: a large-scale investment in knowledge infrastructure. Commun ACM 38(11):33–38
Lita LV, Hunt WA, Nyberg E (2004) Resource analysis for question answering. In: Companion volume to the proceedings of the 42nd annual meeting of the association for computational linguistics, Barcelona, Spain, 21–26 July 2004, pp 162–165
Lu B, Tan C, Cardie C, Tsou KB (2011) Joint bilingual sentiment classification with unlabeled parallel corpora. In: Proceedings of the 49th annual meeting of the association for computational linguistics, Portland, OR, 19–24 June 2011, pp 320–330
Medelyan O, Milne D, Legg C, Witten IH (2009) Mining meaning from Wikipedia. Int J Hum-Comput Stud 67(9):716–754. doi:10.1016/j.ijhcs.2009.05.004
Mehdad Y, Negri M, Federico M (2011) Using bilingual parallel corpora for cross-lingual textual entailment. In: Proceedings of the 49th annual meeting of the association for computational linguistics, Portland, OR, 19–24 June 2011, pp 1336–1345
de Melo G, Weikum G (2009) Towards a universal wordnet by learning from combined evidence. In: Proceedings of the eighteenth ACM conference on information and knowledge management, Hong Kong, China, 2–6 Nov 2009, pp 513–522
de Melo G, Weikum G (2010) MENTA: inducing multilingual taxonomies from Wikipedia. In: Proceedings of the nineteenth ACM conference on information and knowledge management, Toronto, Canada, 26–30 Oct 2010, pp 1099–1108
Meyer CM, Gurevych I (2011) What psycholinguists know about chemistry: aligning Wiktionary and WordNet for increased domain coverage. In: Proceedings of the 5th international joint conference on natural language processing, Chiang Mai, Thailand, 8–13 Nov 2011, pp 883–892
Miller GA, Leacock C, Tengi R, Bunker R (1993) A semantic concordance. In: Proceedings of the 3rd DARPA workshop on human language technology, Plainsboro, NJ, pp 303–308
Moro A, Navigli R (2012) WiSeNet: building a Wikipedia-based semantic network with ontologized relations. In: Proceedings of the twenty-first ACM conference on information and knowledge management, Maui, Hawaii, 29 Oct–2 Nov 2012
Nastase V (2008) Topic-driven multi-document summarization with encyclopedic knowledge and activation spreading. In: Proceedings of the conference on empirical methods in natural language processing, Waikiki, Honolulu, HI, 25–27 Oct 2008, pp 763–772
Nastase V, Strube M (2008) Decoding Wikipedia category names for knowledge acquisition. In: Proceedings of the 23rd conference on the advancement of artificial intelligence, Chicago, IL, 13–17 July 2008, pp 1219–1224
Nastase V, Strube M (2012) Transforming Wikipedia into a large scale multilingual concept network. Artif Intell. doi:10.1016/j.artint.2012.06.008
Navigli R (2009) Word Sense Disambiguation: a survey. ACM Comput Surv 41(2):1–69
Navigli R (2012) A quick tour of Word Sense Disambiguation, induction and related approaches. In: Bieliková M, Friedrich G, Gottlob G, Katzenbeisser S, Turán G (eds) SOFSEM 2012: theory and practice of computer science. Lecture notes in computer science, vol 7147. Springer, Heidelberg, pp 115–129
Navigli R, Lapata M (2010) An experimental study on graph connectivity for unsupervised Word Sense Disambiguation. IEEE Trans Pattern Anal Mach Intell 32(4):678–692
Navigli R, Ponzetto SP (2010) BabelNet: building a very large multilingual semantic network. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, 11–16 July 2010, pp 216–225
Navigli R, Ponzetto SP (2012) BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif Intell. doi:10.1016/j.artint.2012.07.001
Navigli R, Ponzetto SP (2012) BabelRelate! A joint multilingual approach to computing semantic relatedness. In: Proceedings of the 26th conference on artificial intelligence, Toronto, ON, Canada, 22–26 July 2012, pp 108–114
Navigli R, Ponzetto SP (2012) Joining forces pays off: multilingual joint Word Sense Disambiguation. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, Jeju Island, South Korea, 12–14 July 2012, pp 1399–1410
Navigli R, Velardi P (2010) Learning Word-Class Lattices for definition and hypernym extraction. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, 11–16 July 2010, pp 1318–1327
Navigli R, Faralli S, Soroa A, de Lacalle OL, Agirre E (2011) Two birds with one stone: learning semantic models for Text Categorization and Word Sense Disambiguation. In: Proceedings of the twentieth ACM conference on information and knowledge management, Glasgow, Scotland, UK, 24–28 Oct 2011, pp 2317–2320
Ng HT, Lee HB (1996) Integrating multiple knowledge sources to disambiguate word senses: an exemplar-based approach. In: Proceedings of the 34th annual meeting of the association for computational linguistics, Santa Cruz, CA, 24–27 June 1996, pp 40–47
Niemann E, Gurevych I (2011) The people’s web meets linguistic knowledge: automatic sense alignment of Wikipedia and WordNet. In: Proceedings of the 9th international conference on computational semantics, Oxford, UK, pp 205–214
Paşca M (2007) Organizing and searching the world wide web of facts – step two: harnessing the wisdom of the crowds. In: Proceedings of the 16th world wide web conference, Banff, Canada, 8–12 May 2007, pp 101–110
Paşca M, Lin D, Bigham J, Lifchits A, Jain A (2006) Organizing and searching the world wide web of facts – step one: the one-million fact extraction challenge. In: Proceedings of the 21st national conference on artificial intelligence, Boston, MA, 16–20 July 2006, pp 1400–1405
Pianta E, Bentivogli L, Girardi C (2002) MultiWordNet: developing an aligned multilingual database. In: Proceedings of the 1st international global WordNet conference, Mysore, India, 21–25 Jan 2002, pp 21–25
Ponzetto SP, Navigli R (2009) Large-scale taxonomy mapping for restructuring and integrating Wikipedia. In: Proceedings of the 21st international joint conference on artificial intelligence, Pasadena, CA, 14–17 July 2009, pp 2083–2088
Ponzetto SP, Navigli R (2010) Knowledge-rich Word Sense Disambiguation rivaling supervised systems. In: Proceedings of the 48th annual meeting of the association for computational linguistics, Uppsala, Sweden, 11–16 July 2010, pp 1522–1531
Ponzetto SP, Strube M (2007) Knowledge derived from Wikipedia for computing semantic relatedness. J Artif Intell Res 30:181–212
Ponzetto SP, Strube M (2011) Taxonomy induction based on a collaboratively built knowledge repository. Artif Intell 175:1737–1756
Rahman A, Ng V (2011) Narrowing the modeling gap: a cluster-ranking approach to coreference resolution. J Artif Intell Res 40:469–521
Ruiz-Casado M, Alfonseca E, Castells P (2005) Automatic assignment of Wikipedia encyclopedic entries to WordNet synsets. In: Advances in web intelligence. Lecture notes in computer science, vol 3528. Springer, Berlin/New York, pp 380–386
Schubert LK (2006) Turing’s dream and the knowledge challenge. In: Proceedings of the 21st national conference on artificial intelligence, Boston, MA, 16–20 July 2006, pp 1534–1538
Suchanek FM, Kasneci G, Weikum G (2008) Yago: a large ontology from Wikipedia and WordNet. J Web Semant 6(3):203–217
Toral A, Ferrández O, Agirre E, Muñoz R (2009) A study on linking Wikipedia categories to WordNet synsets using text similarity. In: Proceedings of the international conference on recent advances in natural language processing, Borovets, Bulgaria, 14–16 Sept 2009, pp 449–454
Tufiş D, Ion R, Ide N (2004) Fine-grained Word Sense Disambiguation based on parallel corpora, word alignment, word clustering, and aligned wordnets. In: Proceedings of the 20th international conference on computational linguistics, Geneva, Switzerland, 23–27 Aug 2004, pp 1312–1318
Vossen P (ed) (1998) EuroWordNet: a multilingual database with lexical semantic networks. Kluwer, Dordrecht
Wang P, Domeniconi C (2008) Building semantic kernels for text classification using Wikipedia. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, Las Vegas, Nevada, 24–27 Aug 2008, pp 713–721
Wu F, Weld D (2008) Automatically refining the Wikipedia infobox ontology. In: Proceedings of the 17th world wide web conference, Beijing, China, 21–25 Apr 2008, pp 635–644
Yarowsky D, Florian R (2002) Evaluating sense disambiguation across diverse parameter spaces. Nat Lang Eng 8(4):293–310
Yates A, Etzioni O (2009) Unsupervised methods for determining object and relation synonyms on the web. J Artif Intell Res 34:255–296
Acknowledgements
The authors gratefully acknowledge the support of the ERC Starting Grant MultiJEDI No. 259234. Thanks go to Google for access to the University Research Program for Google Translate.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Navigli, R., Ponzetto, S.P. (2013). An Overview of BabelNet and its API for Multilingual Language Processing. In: Gurevych, I., Kim, J. (eds) The People’s Web Meets NLP. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35085-6_7
DOI: https://doi.org/10.1007/978-3-642-35085-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35084-9
Online ISBN: 978-3-642-35085-6
eBook Packages: Computer Science, Computer Science (R0)