Extracting Stochastic Grammars from Treebanks


Part of the book series: Text, Speech and Language Technology ((TLTB,volume 20))

Abstract

The Data-Oriented Parsing (DOP) model employs an annotated corpus or treebank directly as a stochastic grammar. New input is parsed by combining subtrees from the treebank. The most probable analysis is estimated on the basis of the occurrence-frequencies of the treebank-subtrees. The model as originally defined imposes no constraints on the size and complexity of the subtrees that may be invoked in parsing new input. Both from a theoretical and from a computational perspective we may therefore wonder whether it is possible to impose constraints on the subtrees that are used, in such a way that the performance of the model does not deteriorate or perhaps even improves. That is the main question addressed in the current paper. Moreover, by imposing different constraints on the subtree set, we can simulate several other stochastic grammars, ranging from stochastic context-free grammars to stochastic lexicalized grammars, thus allowing for a proper performance comparison. Experiments with the ATIS and Wall Street Journal treebanks indicate that very few constraints on the treebank-subtrees are warranted. We conclude with a brief discussion of the consequences of our results.
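
The idea of reading subtrees and their occurrence frequencies off a treebank can be illustrated with a small sketch. The tuple encoding of trees, the function names, the two-sentence toy treebank, and the use of a depth limit as the constraint are all assumptions made for illustration, and the relative-frequency estimate is only one simple way to turn counts into probabilities; none of this is taken from the chapter's own implementation.

```python
from collections import Counter
from itertools import product

# Assumed encoding (not from the chapter): a tree is a tuple
# (label, child, ...), a word is a plain string, and a 1-tuple (label,)
# marks a frontier nonterminal where another fragment can be substituted.

def internal_nodes(tree):
    """Yield every nonterminal node of a tree, root first."""
    yield tree
    for child in tree[1:]:
        if not isinstance(child, str):
            yield from internal_nodes(child)

def fragments(node, max_depth):
    """Yield all subtrees rooted at `node` whose depth is at most max_depth (>= 1)."""
    label, children = node[0], node[1:]
    options = []
    for child in children:
        if isinstance(child, str):
            options.append([child])        # words are always kept
        else:
            choices = [(child[0],)]        # cut here: leave a frontier nonterminal
            if max_depth > 1:              # or expand the child further
                choices.extend(fragments(child, max_depth - 1))
            options.append(choices)
    for combo in product(*options):
        yield (label, *combo)

def fragment_probabilities(treebank, max_depth):
    """Relative frequency of each fragment among fragments with the same root label."""
    counts = Counter()
    for tree in treebank:
        for node in internal_nodes(tree):
            counts.update(fragments(node, max_depth))
    root_totals = Counter()
    for frag, n in counts.items():
        root_totals[frag[0]] += n
    return {frag: n / root_totals[frag[0]] for frag, n in counts.items()}

# Hypothetical toy treebank, for illustration only.
TREEBANK = [
    ("S", ("NP", "John"), ("VP", ("V", "likes"), ("NP", "Mary"))),
    ("S", ("NP", "Mary"), ("VP", ("V", "likes"), ("NP", "John"))),
]

if __name__ == "__main__":
    for depth in (1, 2, 3):
        probs = fragment_probabilities(TREEBANK, depth)
        print(f"max_depth={depth}: {len(probs)} distinct fragments")
```

In this sketch, setting `max_depth=1` keeps only one-level fragments, which behave like the rewrite rules of a treebank-derived stochastic context-free grammar, while larger values admit progressively larger subtrees. This is one way to picture how constraining the subtree set moves between different kinds of stochastic grammar.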

Copyright information

© 2003 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Bod, R. (2003). Extracting Stochastic Grammars from Treebanks. In: Abeillé, A. (eds) Treebanks. Text, Speech and Language Technology, vol 20. Springer, Dordrecht. https://doi.org/10.1007/978-94-010-0201-1_19

  • DOI: https://doi.org/10.1007/978-94-010-0201-1_19

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-1335-5

  • Online ISBN: 978-94-010-0201-1

  • eBook Packages: Springer Book Archive
