Abstract
Text collections annotated on several levels are valuable resources, but developing them for languages such as Spanish requires considerable effort. In this work, we present the initial step, lexical-level annotation, in compiling an annotated Mexican corpus from Web-based sources. We also describe a method, based on heterogeneous knowledge and simple Web-based sources, for the proper name identification that such annotation requires. We focus on composite entities (names with coordinated constituents, names with several prepositional phrases, and names of songs, books, movies, etc.). Preliminary results are presented.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Galicia-Haro, S.N., Gelbukh, A., Bolshakov, I.A. (2004). Web-Based Sources for an Annotated Corpus Building and Composite Proper Name Identification. In: Favela, J., Menasalvas, E., Chávez, E. (eds) Advances in Web Intelligence. AWIC 2004. Lecture Notes in Computer Science, vol 3034. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24681-7_14
DOI: https://doi.org/10.1007/978-3-540-24681-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22009-1
Online ISBN: 978-3-540-24681-7
eBook Packages: Springer Book Archive