Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining

  • Regular Paper
Knowledge and Information Systems

Abstract

Named Entity Recognition and Classification (NERC) is an important component of applications such as Opinion Tracking, Information Extraction, and Question Answering. When these applications must work in several languages, NERC becomes a bottleneck because its development requires language-specific tools and resources such as lists of names or annotated corpora. This paper presents a lightly supervised system that acquires lists of names and linguistic patterns from large raw text collections in western languages, starting from only a few seeds per class selected by a human expert. Experiments have been carried out with English and Spanish news collections and with the Spanish Wikipedia. Evaluation of NE classification on standard datasets shows that the NE lists achieve high precision and that contextual patterns increase recall significantly. The system should therefore be helpful for applications where annotated NERC data are not available, such as those that deal with several western languages or with information from different domains.
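
To make the idea of seed-based acquisition concrete, the following is a minimal sketch of a generic bootstrapping loop that alternates between learning context patterns from known names and harvesting new names with those patterns. It is not the system described in the paper: the function name `bootstrap`, the two-word left-context pattern, the capitalization heuristic for candidate names, the `min_hits` threshold, and the toy sentences are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Candidate names: runs of capitalized tokens. This heuristic is an
# assumption that only suits languages that capitalize proper nouns.
CANDIDATE = re.compile(r"\b[A-Z]\w+(?: [A-Z]\w+)*")


def left_context(sentence, match, width=2):
    """Return the `width` lowercased words before a candidate, or None."""
    words = sentence[:match.start()].split()
    return " ".join(words[-width:]).lower() if words else None


def bootstrap(sentences, seeds, rounds=2, min_hits=2):
    """Toy bootstrapping for one class: seeds -> patterns -> more names."""
    names, patterns = set(seeds), set()
    for _ in range(rounds):
        # Step 1: count the contexts in which already-known names occur.
        counts = Counter()
        for s in sentences:
            for m in CANDIDATE.finditer(s):
                ctx = left_context(s, m)
                if ctx and m.group() in names:
                    counts[ctx] += 1
        # Keep only contexts seen often enough to be trusted as patterns.
        patterns |= {c for c, n in counts.items() if n >= min_hits}
        # Step 2: accept every candidate that follows a learned pattern.
        for s in sentences:
            for m in CANDIDATE.finditer(s):
                if left_context(s, m) in patterns:
                    names.add(m.group())
    return names, patterns


# Hypothetical toy corpus and seeds for a single LOCATION-like class.
sentences = [
    "The mayor of Madrid spoke today.",
    "The mayor of Paris resigned.",
    "He is the mayor of Lisbon now.",
    "She met Obama in Berlin.",
]
names, patterns = bootstrap(sentences, seeds={"Madrid", "Paris"})
```

In this toy run the two seeds share the context "mayor of", which is promoted to a pattern and then picks up "Lisbon". A realistic system would also need to score patterns and candidates to limit semantic drift across rounds, which this sketch does not attempt.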

Corresponding author

Correspondence to César de Pablo-Sánchez.

About this article

Cite this article

de Pablo-Sánchez, C., Segura-Bedmar, I., Martínez, P. et al. Lightly supervised acquisition of named entities and linguistic patterns for multilingual text mining. Knowl Inf Syst 35, 87–109 (2013). https://doi.org/10.1007/s10115-012-0502-0

  • DOI: https://doi.org/10.1007/s10115-012-0502-0
