A Golden Resource for Named Entity Recognition in Portuguese

  • Diana Santos
  • Nuno Cardoso
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3960)


This paper presents a collection of texts manually annotated with named entities in context, which was used for HAREM, the first evaluation contest for named entity recognizers for Portuguese. We discuss the options taken and the originality of our approach compared with previous evaluation initiatives in the area. We document the choice of categories, their quantitative weight in the overall collection and how we deal with vagueness and underspecification.


Machine Translation; Name Entity Recognition; Entity Recognition; Semantic Labelling; Name Entity Recognition System





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Diana Santos (1)
  • Nuno Cardoso (2)
  1. Linguateca: Node of Oslo at SINTEF ICT, Portugal
  2. Linguateca: Node of XLDB at University of Lisbon, Portugal
