
Web 2.0, Language Resources and standards to automatically build a multilingual Named Entity Lexicon

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

This paper proposes to advance the state of the art in automatic Language Resource (LR) building by taking into consideration three elements: (1) the knowledge available in existing LRs, (2) the vast amount of information available from the collaborative paradigm that has emerged from Web 2.0 and (3) the use of standards to improve interoperability. We present a case study in which a set of LRs for different languages (WordNet for English and Spanish, and Parole-Simple-Clips for Italian) is extended with Named Entities (NEs) by exploiting Wikipedia and the aforementioned LRs. The practical result is a multilingual NE lexicon connected to these LRs and to two ontologies: SUMO and SIMPLE. Furthermore, the paper addresses interoperability, an important problem currently affecting Computational Linguistics, by using the ISO LMF standard to encode this lexicon. The different steps of the procedure (mapping, disambiguation, extraction, NE identification and postprocessing) are comprehensively explained and evaluated. The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively. Finally, to check the usefulness of the constructed resource, we apply it to a state-of-the-art Question Answering system and evaluate its impact; the NE lexicon improves the system’s accuracy by 28.1%. Compared to previous approaches to building NE repositories, the current proposal represents a step forward in terms of automation, language independence, number of NEs acquired and richness of the information represented.


Notes

  1. By Named Entities we refer in this paper to entities belonging to several semantic types (e.g. person, location, organisation) which take the form of proper nouns.

  2. http://www.geonames.org.

  3. http://www.clef-campaign.org.

  4. As of 2008/03/11, the English version had 9,141,485 registered users.

  5. http://www.stern.de/media/pdf/wiki_test_750.jpg.

  6. http://www.stern.de/computer-technik/internet/:stern-Test-Wikipedia-Brockhaus/604423.html?q=Brockhaus%20wikipedia.

  7. Specifically, Wikipedia, being an encyclopaedia and having strong policies regarding neutrality, does not suffer from such problems.

  8. http://www.tc37sc4.org.

  9. http://www.globalwordnet.org/.

  10. http://www.lsi.upc.es/~nlp/meaning/ (2002–2004).

  11. http://icgl.ctl.cityu.edu.hk.

  12. http://ai.stanford.edu/~rion/swn/.

  13. http://www.linguateca.pt/REPENTINO/.

  14. http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_in_academic_studies#Over_time.

  15. http://ilps.science.uva.nl/WiQA/.

  16. http://www.linguateca.pt/GikiCLEF/.

  17. http://www.sics.se/jussi/newtext.

  18. http://lit.csci.unt.edu/~wikiai08/index.php/Main_Page.

  19. http://www.ukp.tu-darmstadt.de/acl-ijcnlp-2009-workshop.

  20. http://www.ukp.tu-darmstadt.de/software/jwpl/.

  21. http://www.ukp.tu-darmstadt.de/software/jwktl/.

  22. Available for research at http://www.lsi.upc.edu/~nlp.

  23. The set of nouns that can be instantiated by means of an NE, e.g. “country” has instances such as “France”.

  24. E.g. in the category “Philosophers” there are subcategories that follow the hyponymy relation (e.g. “Philosophers by country”) but there are also others that do not (e.g. “Philosophy academics”).

  25. http://code.google.com/p/semanticvectors.

  26. http://lucene.apache.org.

  27. http://search.cpan.org/~dprice/Text-MediawikiFormat-0.05/lib/Text/MediawikiFormat.pm.

  28. http://search.cpan.org/~awrigley/html2text-0.003/html2text.pl.

  29. Downloaded from http://download.wikimedia.org.

  30. http://text-similarity.sourceforge.net.

  31. The combination strategies use Textual Entailment, Personalised PageRank and Word Overlap.

  32. Despite these results, the Wikipedia method is used for building the NE lexicon because web search engines limit the number of daily queries.

  33. The NE lexicon has also been applied to Machine Translation yielding notable results (Toral and Way 2011).

  34. The exact answers are assessed as: (1) Right: if correct; (2) Wrong: if incorrect; (3) Inexact: if the answer contained less or more information than that required by the query; or (4) Unsupported: if the supporting snippet did not contain the exact answer.

References

  • Agichtein, E., & Gravano, L. (2000). Snowball: Extracting relations from large plain-text collections. In Proceedings of the fifth ACM conference on digital libraries (pp. 85–94). New York, NY: ACM.

  • Agirre, E., & Soroa, A. (2009). Personalizing PageRank for word sense disambiguation. In Proceedings of the 12th conference of the European chapter of the ACL (EACL 2009) (pp. 33–41). Athens, Greece: Association for Computational Linguistics.

  • Ahn, D., Jijkoun, V., Mishne, G., Müller, K., de Rijke, M., & Schlobach, S. (2005). Using Wikipedia at the TREC QA track. In Proceedings of the thirteenth text retrieval conference (TREC 2004).

  • Alonge, A., Bertagna, F., Calzolari, N., & Roventini, A. (1999). The Italian Wordnet, EuroWordNet deliverable D032D033 part B5. Tech. rep.

  • Alshawi, H. (1987). Processing dictionary definitions with phrasal pattern hierarchies. Computational Linguistics, 13(3–4), 195–202.

  • Aristotle (1908). Metaphysics. In W. D. Ross (Ed.), The works of Aristotle translated into English, Vol. VIII. Oxford: Oxford University Press.

  • Atserias, J., Villarejo, L., Rigau, G., Agirre, E., Carroll, J., Magnini, B., et al. (2004). The MEANING multilingual central repository. In Proceedings of the second international Global WordNet conference (GWC’04). Brno, Czech Republic.

  • Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., & Ives, Z. (2008). DBpedia: A nucleus for a Web of open data. pp. 722–735.

  • Balahur, A., Lloret, E., Ferrández, Ó., Montoyo, A., Palomar, M., & Muñoz, R. (2008). The DLSIUAES Team’s participation in the TAC 2008 tracks. In Notebook papers of the text analysis conference, TAC 2008 workshop. Gaithersburg, Maryland, USA: National Institute of Standards and Technology.

  • Bunescu, R. C., & Pasca, M. (2006). Using Encyclopedic knowledge for named entity disambiguation. In EACL, the association for computer linguistics.

  • Buscaldi, D., & Rosso, P. (2006). Mining knowledge from Wikipedia for the question answering task. In Proceedings of the fifth international conference on language resources and evaluation.

  • Calzolari, N. (1992). Acquiring and representing semantic information in a lexical knowledge base. In Proceedings of the first SIGLEX workshop on lexical semantics and knowledge representation (pp. 235–243). London, UK: Springer.

  • Daudé, J., Padró, L., & Rigau, G. (2003). Making wordnet mappings robust. In Proceedings of the 19th congreso de la sociedad Española para el procesamiento del lenguaje natural. SEPLN, Universidad de Alcalá de Henares. Madrid, Spain.

  • De Loupy, C., Crestan, E., & Lemaire, E. (2004). Proper nouns thesaurus for document retrieval and question answering. In Atelier Question-Réponse, Traitement Automatique des Langues Naturelles (TALN).

  • Etzioni, O., Banko, M., Soderland, S., & Weld, D. S. (2008). Open information extraction from the web. Communications of the ACM, 51(12), 68–74.

  • Ferrández, Ó., Micol, D., Muñoz, R., & Palomar, M. (2007a). A perspective-based approach for solving textual entailment recognition. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing (pp. 66–71). Prague: Association for Computational Linguistics.

  • Ferrández, S., López-Moreno, P., Roger, S., Ferrández, A., Peral, J., Alvarado, X., et al. (2006). Monolingual and cross-lingual QA using AliQAn and BRILI systems for CLEF 2006. In Proceedings of CLEF’2006 (pp. 450–453).

  • Ferrández, S., Toral, A., Ferrández, Ó., Ferrández, A., & Muñoz, R. (2007b). Applying Wikipedia’s multilingual knowledge to cross-lingual question answering. In Z. Kedad, N. Lammari, E. Métais, F. Meziane, & Y. Rezgui (Eds.), NLDB, Springer, Lecture Notes in Computer Science, vol. 4592, pp. 352–363.

  • Fleischman, M., Echihabi, A., & Hovy, E. (2003). Offline strategies for online question answering: Answering questions before they are asked. In Proceedings of the ACL conference. Sapporo, Japan.

  • Francopoulo, G., Bel, N., George, M., Calzolari, N., Monachini, M., Pet, M., et al. (2008, forthcoming). Multilingual resources for NLP in the Lexical Markup Framework (LMF). Language Resources and Evaluation Journal, 43(1), 57–70.

  • Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of the twentieth international joint conference for artificial intelligence (pp. 1606–1611). Hyderabad, India.

  • Giles, J. (2005). Internet encyclopaedias go head to head. Nature, 438(7070), 900–901. doi:10.1038/438900a.

  • Gregorowicz, A., & Kramer, M. A. (2006). Mining a large-scale term-concept network from Wikipedia. Tech. rep., MITRE.

  • Hearst, M. A. (1992). Automatic acquisition of hyponyms from large text corpora. In COLING, pp. 539–545.

  • Hearst, M. A. (1998). Automated discovery of WordNet relations. Cambridge, MA: MIT Press.

  • ISO 24613 (2008). Language resource management – Lexical markup framework (LMF), rev.15 ISO TC37/SC4 FDIS. [Online; accessed 25-March-2008].

  • Jijkoun, V., Sang, E. T. K., Ahn, D., Müller, K., & de Rijke, M. (2005). The University of Amsterdam at QA@CLEF 2005. In Working notes of the CLEF 2005 workshop.

  • Jones, G., Fantino, F., Newman, E., & Zhang, Y. (2008). Domain-specific query translation for multilingual information access using machine translation augmented with dictionaries mined from Wikipedia. In 2nd international workshop on cross lingual information access addressing the information need of multilingual societies.

  • Karlgren, J. (Ed.) (2006). NEW TEXT, Wikis and blogs and other dynamic text sources. Trento, Italy.

  • Krstev, C., Vitas, D., Maurel, D., & Tran, M. (2005). Multilingual ontology of proper names. In Proceedings of the language and technology conference, pp. 116–119.

  • Lenat, D. (1998). From 2001 to 2001: Common sense and the mind of HAL (pp. 193–208). Cambridge, MA: MIT Press.

  • Lenci, A., Bel, N., Busa, F., Calzolari, N., Gola, E., Monachini, M., et al. (2000). SIMPLE: A general framework for the development of multilingual lexicons. International Journal of Lexicography, 13(4), 249–263.

  • Magnini, B., Giampiccolo, D., Forner, P., Ayache, C., Jijkoun, V., Osenova, P., et al. (2006). Overview of the CLEF 2006 Multilingual Question Answering Track. In Proceedings of CLEF’2006 (pp. 223–256).

  • Mann, G. (2002). Fine-grained proper noun ontologies for question answering. In Proceedings of SemaNet’02: Building and using semantic networks.

  • Maurel, D. (2008). Prolexbase: A multilingual relational lexical database of proper names. In Proceedings of the sixth international conference on language resources and evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA).

  • Medelyan, O., & Legg, C. (2008). Integrating Cyc and Wikipedia: Folksonomy meets rigorously defined common-sense. In AAAI 2008 workshop Wikipedia and artificial intelligence: An evolving synergy, Chicago, United States.

  • Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.

  • Miller, G. A., & Hristea, F. (2006). WordNet nouns: Classes and instances. Computational Linguistics, 32(1), 1–3.

  • Milne, D., & Witten, I. H. (2008). An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In AAAI 2008 workshop Wikipedia and artificial intelligence: An evolving synergy, Chicago, United States.

  • Milne, D., Medelyan, O., & Witten, I. H. (2006). Mining domain-specific thesauri from Wikipedia: A case study. In Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence (pp. 442–448). Washington, DC: IEEE Computer Society.

  • Nakamura, J., & Nagao, M. (1988). Extraction of semantic information from an ordinary English dictionary and its evaluation. In COLING-88, pp. 459–464.

  • Niles, I., & Pease, A. (2003). Linking lexicons and ontologies: Mapping wordnet to the suggested upper merged ontology. In Proceedings of the 2003 international conference on information and knowledge engineering, pp. 23–26.

  • Nothman, J., Murphy, T., & Curran, J. R. (2009). Analysing Wikipedia and gold standard corpora for NER training. In Proceedings of the 12th conference of the European chapter of the association for computational linguistics.

  • Pedro, V., Niculescu, S., & Lita, L. (2008). Okinet: Automatic extraction of a medical ontology from Wikipedia. In AAAI 2008 workshop Wikipedia and artificial intelligence: An evolving synergy, Chicago, USA.

  • Philpot, A., Hovy, E., & Pantel, P. (2005). The omega ontology. In IJCNLP workshop on ontologies and lexical resources (OntoLex-05) (pp. 59–66). Jeju Island, South Korea.

  • Ponzetto, S. P., & Strube, M. (2007). Knowledge derived from Wikipedia for computing semantic relatedness. Journal of Artificial Intelligence Research, 30, 181–212.

  • Pustejovsky, J. (1991). The generative lexicon. Computational Linguistics, 17(4), 409–441.

  • Richardson, S. D., Dolan, W. B., & Vanderwende, L. (1998). MindNet: Acquiring and structuring semantic information from text. In COLING-ACL, pp. 1098–1102.

  • Rigau, G. (1998). Automatic acquisition of lexical knowledge from MRDs. PhD thesis, Universitat Politècnica de Catalunya.

  • Roventini, A., & Ruimy, N. (2008). Mapping events and abstract entities from PAROLE-SIMPLE-CLIPS to ItalWordNet. In Proceedings of the sixth international conference on language resources and evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA).

  • Roventini, A., Ruimy, N., Marinelli, R., Ulivieri, M., & Mammini, M. (2007). Mapping concrete entities from PAROLE-SIMPLE-CLIPS to ItalWordNet: Methodology and results. In Proceedings of the 45th annual meeting of the association for computational linguistics (pp. 161–164). Prague, Czech Republic: Association for Computational Linguistics.

  • Ruimy, N., Corazzari, O., Gola, E., Spanu, A., Calzolari, N., & Zampolli, A. (1998). The European LE-PAROLE project: The Italian syntactic lexicon. In Proceedings of the first international conference on language resources and evaluation (LREC’98). Granada, Spain.

  • Ruimy, N., Monachini, M., Distante, R., Guazzini, E., Molino, S., Ulivieri, M., et al. (2002). CLIPS, a multi-level Italian computational lexicon: A Glimpse to data. In Proceedings of the third international conference on language resources and evaluation (LREC’02), Las Palmas de Gran Canaria, Spain.

  • Ruiz-Casado, M., Alfonseca, E., & Castells, P. (2006). From Wikipedia to semantic relationships: A semi-automated annotation approach. In Proceedings of ESWC2006.

  • Sarmento, L., Pinto, A. S., & Cabral, L. (2006). REPENTINO: A wide-scope gazetteer for entity recognition in Portuguese. In R. Vieira, P. Quaresma, M. da Graças Volpes Nunes, N. Mamede, C. Oliveira, & M. C. Dias (Eds.), Proceedings of the 7th workshop on computational processing of written and spoken Portuguese, PROPOR 2006 (pp. 31–40). Springer, Itatiaia, Rio de Janeiro, Brazil.

  • Sekine, S., Sudo, K., & Nobata, C. (2002). Extended named entity hierarchy. In Proceedings of third international conference on language resources and evaluation.

  • Sheremetyeva, S., Cowie, J., Nirenburg, S., & Zajac, R. (1998). Multilingual Onomasticon as a multipurpose NLP resource. In Proceedings of the first international conference on language resources and evaluation.

  • Snow, R., Jurafsky, D., & Ng, A. Y. (2006). Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics (pp. 801–808). Association for Computational Linguistics.

  • Suchanek, F. M., Kasneci, G., & Weikum, G. (2007). Yago: A core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web (pp. 697–706). New York, NY: ACM Press. doi:10.1145/1242572.1242667.

  • Sundheim, B. M., Mardis, S., & Burger, J. (2006). Gazetteer linkage to WordNet. In Proceedings of the third international WordNet conference, pp. 103–104.

  • Tjong Kim Sang, E. F. (2002). Introduction to the CoNLL-2002 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of CoNLL-2002 (pp. 155–158). Taipei, Taiwan.

  • Toral, A., & Way, A. (2011). Automatic acquisition of named entities for rule-based machine translation. In Second international workshop on free/open-source rule-based machine translation.

  • Tran, M., Grass, T., & Maurel, D. (2004). An ontology for multilingual treatment of proper names. In Proceedings of OntoLex 2004.

  • Verdejo, M. F. (1999). The Spanish Wordnet, EuroWordNet Deliverable D032D033 part B3. Tech. rep.

  • Vossen, P. (1998). EuroWordNet: A multilingual database with lexical semantic networks. Dordrecht: Kluwer Academic Publishers.

  • Widdows, D., & Ferraro, K. (2008). Semantic vectors: A scalable open source package and online technology management application. In Proceedings of the sixth international conference on language resources and evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA).

  • Wiebe, J., & Riloff, E. (2005). Creating subjective and objective sentence classifiers from unannotated texts. In Proceedings of CICLing-05, international conference on intelligent text processing and computational linguistics (Vol. 3406, pp. 475–486). Mexico City, MX: Springer-Verlag, Lecture Notes in Computer Science.

  • Wiebe, J. M., Wilson, T., Bruce, R. F., Bell, M., & Martin, M. (2004). Learning subjective language. Computational Linguistics, 30(3), 277–308.

  • Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques, 2nd edn. San Francisco, United States of America: Morgan Kaufmann.

  • Wu, F., Hoffmann, R., & Weld, D. S. (2008). Augmenting Wikipedia-extraction with results from the web. In AAAI 2008 workshop Wikipedia and artificial intelligence: An evolving synergy, Chicago, United States.

  • Zesch, T., Müller, C., & Gurevych, I. (2008). Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the sixth international conference on language resources and evaluation (LREC’08). Marrakech, Morocco: European Language Resources Association (ELRA).

Author information

Corresponding author

Correspondence to Antonio Toral.

Appendix: LMF output

This appendix contains a sample of the output, both in LMF format and as stored in the database. It is made up of three monolingual lexicons whose entries are linked by using the “SenseAxis” object of the LMF multilingual extension (see Appendix Tables 15–21).

Table 15 NE Repository (LexicalEntry table)
Table 16 NE Repository (FormRepresentation table)
Table 17 NE Repository (Sense table)
Table 18 NE Repository (SenseRelation table)
Table 19 NE Repository (SenseAxis table)
Table 20 NE Repository (SenseAxisElements table)
Table 21 NE Repository (SenseAxisExternalRef table)
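
The linking described in this appendix, three monolingual lexicons whose senses are grouped under a shared SenseAxis, can be illustrated with a minimal Python sketch. Only the object names Sense and SenseAxis come from the text; every field name, identifier and the toy entries below are assumptions made for illustration and do not reproduce the schema of Tables 15–21 or the ISO LMF serialisation.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    sense_id: str       # hypothetical identifier, e.g. "en_s1"
    written_form: str   # surface form of the Named Entity


@dataclass
class Lexicon:
    language: str
    senses: dict = field(default_factory=dict)  # sense_id -> Sense

    def add(self, sense: Sense) -> None:
        self.senses[sense.sense_id] = sense


@dataclass
class SenseAxis:
    axis_id: str
    # (language, sense_id) pairs grouped under this axis (assumed layout)
    elements: list = field(default_factory=list)


def translations(axis: SenseAxis, lexicons: dict, source_lang: str) -> dict:
    """Return {language: written_form} for each linked sense outside source_lang."""
    result = {}
    for lang, sense_id in axis.elements:
        if lang != source_lang:
            result[lang] = lexicons[lang].senses[sense_id].written_form
    return result


# Toy data invented for this sketch; not taken from the actual lexicon.
lexicons = {lang: Lexicon(lang) for lang in ("en", "es", "it")}
lexicons["en"].add(Sense("en_s1", "Florence"))
lexicons["es"].add(Sense("es_s1", "Florencia"))
lexicons["it"].add(Sense("it_s1", "Firenze"))

axis = SenseAxis("axis_1", [("en", "en_s1"), ("es", "es_s1"), ("it", "it_s1")])

print(translations(axis, lexicons, "en"))
```

Keeping the cross-lingual link in a separate axis object, rather than storing pairwise links inside each lexicon, means that adding a further language only requires appending one element to the axis.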

About this article

Cite this article

Toral, A., Ferrández, S., Monachini, M. et al. Web 2.0, Language Resources and standards to automatically build a multilingual Named Entity Lexicon. Lang Resources & Evaluation 46, 383–419 (2012). https://doi.org/10.1007/s10579-011-9148-x

  • DOI: https://doi.org/10.1007/s10579-011-9148-x
