Evaluating Corpora for Named Entity Recognition Using Character-Level Features

  • Casey Whitelaw
  • Jon Patrick
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2903)

Abstract

We present a new collection of training corpora for the evaluation of language-independent named entity recognition (NER) systems. For the five languages included in this initial release (Basque, Dutch, English, Korean, and Spanish), we analyse the relative difficulty of the NER task, both for the language in general and as a supervised task using these corpora. We construct three strongly language-independent systems, each using only orthographic features, and compare their performance on both seen and unseen data. Combining these classifiers improves results, showing that ensemble approaches are suitable for language-independent problems.

Keywords

computational linguistics, machine learning

Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Casey Whitelaw (1)
  • Jon Patrick (1)
  1. Sydney Language Technology Research Group, Capital Markets Co-operative Research Centre, University of Sydney
