Web-Based Sources for an Annotated Corpus Building and Composite Proper Name Identification

  • Sofía N. Galicia-Haro
  • Alexander Gelbukh
  • Igor A. Bolshakov
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3034)


Collections of texts annotated on several levels are useful resources, but developing them for languages like Spanish requires considerable effort. In this work, we present the initial step, lexical-level annotation, in compiling an annotated Mexican corpus from Web-based sources. We also describe a method, based on heterogeneous knowledge and simple Web-based sources, for the proper name identification that such annotation requires. We focus on composite entities (names with coordinated constituents, names with several prepositional phrases, and names of songs, books, movies, etc.). Preliminary results are presented.


Keywords: Natural Language Processing; Name Entity Recognition; Computational Linguistics; Prepositional Phrase; Annotate Corpus
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Sofía N. Galicia-Haro (1)
  • Alexander Gelbukh (2, 3)
  • Igor A. Bolshakov (2)

  1. Faculty of Sciences, UNAM, Ciudad Universitaria, Mexico City, Mexico
  2. Center for Computing Research, National Polytechnic Institute, Mexico City, Mexico
  3. Department of Computer Science and Engineering, Chung-Ang University, Seoul, Korea
