Learning probabilistic context-free grammars from treebanks

  • Jose L. Verdú-Mas
  • Jorge Calera-Rubio
  • Rafael C. Carrasco
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2905)


This paper describes the application of a new model to learn probabilistic context-free grammars (PCFGs) from a treebank corpus. The model estimates the probabilities according to a generalized k-gram scheme for trees. It allows for faster parsing, considerably decreases the perplexity of the test samples, and tends to give more structured and refined parses. In addition, it allows several smoothing techniques, such as backing-off or interpolation, that are used to avoid assigning zero probability to any sentence.
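
As a rough illustration of the ideas named in the abstract, the following Python sketch estimates rule probabilities from a toy treebank by relative frequency, once without context and once conditioned on the parent label, and then combines the two by linear interpolation. The abstract does not spell out the generalized k-gram scheme, so parent-label conditioning is only an assumed stand-in for "more context". The toy trees, the function names and the weight `lam` are likewise hypothetical and are not taken from the paper.

```python
from collections import Counter

# Trees are nested tuples: (label, child, child, ...); leaves are plain strings.
# This tiny treebank is invented for illustration only.
TOY_TREEBANK = [
    ("S", ("NP", "she"), ("VP", ("V", "sees"), ("NP", "stars"))),
    ("S", ("NP", "stars"), ("VP", ("V", "shine"))),
]

def rules(tree, parent=None):
    """Yield (parent_label, lhs, rhs) for every internal node of a tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield parent, label, rhs
    for c in children:
        if not isinstance(c, str):
            yield from rules(c, label)

def estimate(treebank):
    """Relative-frequency estimates without and with parent context."""
    rule_n, lhs_n = Counter(), Counter()
    ctx_rule_n, ctx_n = Counter(), Counter()
    for tree in treebank:
        for parent, lhs, rhs in rules(tree):
            rule_n[lhs, rhs] += 1
            lhs_n[lhs] += 1
            ctx_rule_n[parent, lhs, rhs] += 1
            ctx_n[parent, lhs] += 1
    # P(rhs | lhs): the plain treebank-grammar estimate.
    plain = {r: n / lhs_n[r[0]] for r, n in rule_n.items()}
    # P(rhs | parent, lhs): the same estimate with one more level of context.
    ctx = {r: n / ctx_n[r[0], r[1]] for r, n in ctx_rule_n.items()}
    return plain, ctx

def interpolated(plain, ctx, parent, lhs, rhs, lam=0.7):
    """Linear interpolation of the contextual and the plain estimate.

    lam is an arbitrary illustrative weight, not a value from the paper.
    """
    specific = ctx.get((parent, lhs, rhs), 0.0)
    general = plain.get((lhs, rhs), 0.0)
    return lam * specific + (1 - lam) * general

plain, ctx = estimate(TOY_TREEBANK)
# NP -> she never occurs under a VP parent in the toy data, so the contextual
# estimate alone would be zero; interpolation keeps the probability positive.
print(interpolated(plain, ctx, "VP", "NP", ("she",)))
```

In this sketch the more specific model is sharper where it has data, and the more general model prevents unseen combinations from receiving zero probability, which is the role the abstract assigns to smoothing.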



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Jose L. Verdú-Mas (1)
  • Jorge Calera-Rubio (1)
  • Rafael C. Carrasco (1)
  1. Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain
