Extracting Stochastic Grammars from Treebanks

Bod, Rens

doi:10.1007/978-94-010-0201-1_19

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 20))

Abstract

The Data-Oriented Parsing (DOP) model employs an annotated corpus or treebank directly as a stochastic grammar. New input is parsed by combining subtrees from the treebank. The most probable analysis is estimated on the basis of the occurrence-frequencies of the treebank-subtrees. The model as originally defined imposes no constraints on the size and complexity of the subtrees that may be invoked in parsing new input. Both from a theoretical and from a computational perspective we may therefore wonder whether it is possible to impose constraints on the subtrees that are used, in such a way that the performance of the model does not deteriorate or perhaps even improves. That is the main question addressed in the current paper. Moreover, by imposing different constraints on the subtree set, we can simulate several other stochastic grammars, ranging from stochastic context-free grammars to stochastic lexicalized grammars, thus allowing for a proper performance comparison. Experiments with the ATIS and Wall Street Journal treebanks indicate that very few constraints on the treebank-subtrees are warranted. We conclude with a brief discussion of the consequences of our results.
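
The idea of treating a treebank as a stochastic grammar, with a constraint on subtree size, can be illustrated with a minimal sketch. It is not the chapter's implementation. It assumes trees encoded as nested tuples, a depth limit as the only constraint, and the relative-frequency estimator commonly associated with DOP (a fragment's count divided by the total count of fragments with the same root label), which the abstract does not spell out. The names `fragments`, `nodes`, `fragment_counts`, `fragment_probabilities`, and the toy treebank are all hypothetical.

```python
from collections import Counter
from itertools import product

# Assumed encoding: a tree is (label, child, ...); a word is a plain string.
# A cut-off child (frontier nonterminal) is written as the 1-tuple (label,).

def fragments(node, max_depth):
    """Yield every fragment rooted at `node` whose depth is <= max_depth."""
    if max_depth < 1:
        return
    label, *children = node
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])            # words are always kept
        else:
            choices = [(child[0],)]            # cut the child off here ...
            choices.extend(fragments(child, max_depth - 1))  # ... or expand it
            options.append(choices)
    for combo in product(*options):
        yield (label, *combo)

def nodes(tree):
    """Yield every nonterminal node of a tree, root first."""
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from nodes(child)

def fragment_counts(treebank, max_depth):
    """Count fragment occurrences over all nodes of all trees."""
    counts = Counter()
    for tree in treebank:
        for node in nodes(tree):
            counts.update(fragments(node, max_depth))
    return counts

def fragment_probabilities(counts):
    """Relative frequency of each fragment among fragments with the same root."""
    root_totals = Counter()
    for frag, n in counts.items():
        root_totals[frag[0]] += n
    return {frag: n / root_totals[frag[0]] for frag, n in counts.items()}

# Toy treebank with a single (hypothetical) tree.
treebank = [("S", ("NP", "she"), ("VP", ("V", "sleeps")))]

pcfg_like = fragment_counts(treebank, max_depth=1)   # one-level fragments only
unrestricted = fragment_counts(treebank, max_depth=3)
probs = fragment_probabilities(unrestricted)
print(len(pcfg_like), len(unrestricted))
```

On this toy tree, a depth limit of 1 leaves four one-level fragments, which behave like the rules of a stochastic context-free grammar, while a limit of 3 (the full tree depth) yields ten fragments. Other constraints, such as requiring a lexical anchor, could be added as filters on the output of `fragments`.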

Author information

Authors and Affiliations

School of Computing, University of Leeds, Leeds, LS2 9JT, UK
Rens Bod
Institute for Logic, Language and Computation, University of Amsterdam, The Netherlands
Rens Bod

Editor information

Editors and Affiliations

Université Paris 7, Paris, France
Anne Abeillé

Cite this chapter

Bod, R. (2003). Extracting Stochastic Grammars from Treebanks. In: Abeillé, A. (ed.) Treebanks. Text, Speech and Language Technology, vol 20. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0201-1_19

DOI: https://doi.org/10.1007/978-94-010-0201-1_19
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-1335-5
Online ISBN: 978-94-010-0201-1
eBook Packages: Springer Book Archive