Web-Based Sources for an Annotated Corpus Building and Composite Proper Name Identification

Galicia-Haro, Sofía N.; Gelbukh, Alexander; Bolshakov, Igor A.

doi:10.1007/978-3-540-24681-7_14


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3034)

Included in the following conference series:

International Atlantic Web Intelligence Conference

Abstract

Text collections annotated at several levels are valuable resources, but developing them for languages such as Spanish requires a huge effort. In this work, we present the initial step, lexical-level annotation, in compiling an annotated Mexican corpus from Web-based sources. We also describe a method, based on heterogeneous knowledge and simple Web-based sources, for the proper name identification that such annotation requires. We focus on composite entities: names with coordinated constituents, names with several prepositional phrases, and names of songs, books, movies, etc. Preliminary results are presented.

Author information

Authors and Affiliations

Faculty of Sciences, UNAM, Ciudad Universitaria, Mexico City, Mexico
Sofía N. Galicia-Haro
Center for Computing Research, National Polytechnic Institute, Mexico City, Mexico
Alexander Gelbukh & Igor A. Bolshakov
Department of Computer Science and Engineering, Chung-Ang University, 221 Huksuk-Dong, DongJak-Ku, Seoul, 156-756, Korea
Alexander Gelbukh

Editor information

Editors and Affiliations

Department of Computer Science, CICESE Research Center, Ensenada, México
Jesús Favela
Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo s/n, 28660, Boadilla del Monte (Madrid), Spain
Ernestina Menasalvas
Escuela de Ciencias Físico-Matemáticas, Universidad Michoacana de San Nicolás de Hidalgo, Av. Francisco J. Mujica, Morelia - Michoacán, México
Edgar Chávez

About this paper

Cite this paper

Galicia-Haro, S.N., Gelbukh, A., Bolshakov, I.A. (2004). Web-Based Sources for an Annotated Corpus Building and Composite Proper Name Identification. In: Favela, J., Menasalvas, E., Chávez, E. (eds) Advances in Web Intelligence. AWIC 2004. Lecture Notes in Computer Science, vol 3034. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24681-7_14

DOI: https://doi.org/10.1007/978-3-540-24681-7_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22009-1
Online ISBN: 978-3-540-24681-7
eBook Packages: Springer Book Archive