An Integrated Statistical Model for Tagging and Chunking Unrestricted Text

Pla, Ferran; Molina, Antonio; Prieto, Natividad

doi:10.1007/3-540-45323-7_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1902)

Included in the following conference series:

International Workshop on Text, Speech and Dialogue

Abstract

In this paper, we present a corpus-based approach to tagging and chunking. The formalism is based on stochastic finite-state automata, so it can accommodate n-gram models as well as any stochastic finite-state automaton learnt with grammatical inference techniques. Because the models in our system are learnt automatically, the system is flexible and portable across languages and chunk definitions. To show the viability of the approach, we present tagging and chunking results for different combinations of bigrams and more complex automata learnt with the Error Correcting Grammatical Inference (ECGI) algorithm. The experiments were carried out on the Wall Street Journal corpus for English and on the Lexesp corpus for Spanish.
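
As a rough illustration of the simplest model class named in the abstract, a bigram model over part-of-speech tags can be viewed as a stochastic finite-state automaton whose states are tags, and tagging then amounts to finding the most probable path through it. The sketch below is not the authors' system: the toy corpus, the add-one smoothing, and the function names `train`, `log_prob`, and `viterbi` are all assumptions made for illustration.

```python
import math
from collections import defaultdict

# Toy tagged corpus (hypothetical data, for illustration only).
CORPUS = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

START, END = "<s>", "</s>"


def train(corpus):
    """Count tag-bigram transitions and (tag, word) emissions."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    vocab = set()
    for sent in corpus:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
        trans[prev][END] += 1
    return trans, emit, vocab


def log_prob(counts, context, event, n_events):
    """Add-one smoothed log P(event | context); the smoothing is an assumption."""
    total = sum(counts[context].values())
    return math.log((counts[context][event] + 1) / (total + n_events))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence under the bigram model."""
    tags = sorted(emit)
    n_trans = len(tags) + 1  # next event is any tag or END
    n_emit = len(vocab) + 1  # known words plus one slot for unknown words
    # best[i][t] = (log score of the best path ending in tag t at word i, previous tag)
    best = [{}]
    for t in tags:
        score = log_prob(trans, START, t, n_trans) + log_prob(emit, t, words[0], n_emit)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = log_prob(emit, t, words[i], n_emit)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(trans, p, t, n_trans) + e, p)
                for p in tags
            )
    # Pick the final tag, including the transition to END, then follow back-pointers.
    last = max(tags, key=lambda t: best[-1][t][0] + log_prob(trans, t, END, n_trans))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


trans, emit, vocab = train(CORPUS)
print(viterbi(["the", "dog", "sleeps"], trans, emit, vocab))
```

The decoder only needs transition and emission scores, so the same search would apply if the bigram counts were replaced by arc probabilities from a more complex learnt automaton; how the paper combines such models for chunking is not specified in the abstract and is not modelled here.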

This work has been supported by the Spanish Research Project TIC97-0671-C02-01/02.

Author information

Authors and Affiliations

Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València, Camí de Vera s/n, 46020, València
Ferran Pla, Antonio Molina & Natividad Prieto

Editor information

Editors and Affiliations

Faculty of Informatics, Department of Programming Systems and Communication, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Department of Information Technologies, Masaryk University, Botanická 68a, 602 00, Brno, Czech Republic
Ivan Kopeček & Karel Pala

About this paper

Cite this paper

Pla, F., Molina, A., Prieto, N. (2000). An Integrated Statistical Model for Tagging and Chunking Unrestricted Text. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2000. Lecture Notes in Computer Science, vol 1902. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45323-7_3

DOI: https://doi.org/10.1007/3-540-45323-7_3
Published: 15 August 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-41042-3
Online ISBN: 978-3-540-45323-9
eBook Packages: Springer Book Archive
