Abstract
The Data-Oriented Parsing (DOP) model uses an annotated corpus, or treebank, directly as a stochastic grammar. New input is parsed by combining subtrees from the treebank. The most probable analysis is estimated on the basis of the occurrence frequencies of the treebank subtrees. The model as originally defined imposes no constraints on the size and complexity of the subtrees that may be invoked in parsing new input. From both a theoretical and a computational perspective, we may therefore ask whether constraints can be imposed on the subtrees that are used, in such a way that the performance of the model does not deteriorate or perhaps even improves. That is the main question addressed in this paper. Moreover, by imposing different constraints on the subtree set, we can simulate several other stochastic grammars, ranging from stochastic context-free grammars to stochastic lexicalized grammars, thus allowing for a proper performance comparison. Experiments with the ATIS and Wall Street Journal treebanks indicate that very few constraints on the treebank subtrees are warranted. We conclude with a brief discussion of the consequences of our results.
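The idea of treating treebank subtrees as a stochastic grammar can be illustrated with a minimal sketch. It is not the chapter's implementation. It assumes that a subtree's probability is its occurrence count divided by the total count of subtrees sharing its root label (a relative-frequency estimate; the abstract only says that frequencies are used), and it uses a depth bound as one example of a constraint on the subtree set. The tuple encoding of trees, the two-sentence toy treebank, and the names `fragments_at`, `internal_nodes`, and `fragment_probabilities` are all hypothetical.

```python
from collections import Counter
from itertools import product

# Assumed encoding: a tree is a tuple (label, child, ...); a word is a str.
# A fragment node written as (label,) is an open substitution site.

def fragments_at(node, max_depth):
    """All fragments rooted at `node` whose depth is at most `max_depth`."""
    label, children = node[0], node[1:]
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])           # words are always kept
        else:
            choices = [(child[0],)]           # cut here: leave the child open
            if max_depth > 1:                 # or expand it one level deeper
                choices.extend(fragments_at(child, max_depth - 1))
            options.append(choices)
    return [(label,) + combo for combo in product(*options)]

def internal_nodes(node):
    """Yield every non-word node of a tree."""
    yield node
    for child in node[1:]:
        if not isinstance(child, str):
            yield from internal_nodes(child)

def fragment_probabilities(treebank, max_depth):
    """Relative frequency of each fragment among fragments with the same root."""
    counts = Counter()
    for tree in treebank:
        for node in internal_nodes(tree):
            counts.update(fragments_at(node, max_depth))
    root_totals = Counter()
    for frag, n in counts.items():
        root_totals[frag[0]] += n
    return {frag: n / root_totals[frag[0]] for frag, n in counts.items()}

# Toy treebank (invented for illustration only).
treebank = [
    ("S", ("NP", "Mary"), ("VP", ("V", "sings"))),
    ("S", ("NP", "John"), ("VP", ("V", "sings"))),
]

pcfg_like = fragment_probabilities(treebank, max_depth=1)  # one-level subtrees
dop_like = fragment_probabilities(treebank, max_depth=3)   # all subtrees here

print(len(pcfg_like), len(dop_like))
```

With `max_depth=1` every fragment is a single rewrite rule, so the resulting table behaves like a stochastic context-free grammar read off the treebank; raising the bound admits larger fragments. Under this sketch, the probability of a derivation would be the product of the probabilities of the fragments it combines.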
Copyright information
© 2003 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Bod, R. (2003). Extracting Stochastic Grammars from Treebanks. In: Abeillé, A. (eds) Treebanks. Text, Speech and Language Technology, vol 20. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0201-1_19
DOI: https://doi.org/10.1007/978-94-010-0201-1_19
Publisher Name: Springer, Dordrecht
Print ISBN: 978-1-4020-1335-5
Online ISBN: 978-94-010-0201-1
eBook Packages: Springer Book Archive