Inducing Head-Driven PCFGs with Latent Heads: Refining a Tree-Bank Grammar for Parsing

  • Detlef Prescher
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3720)


Although state-of-the-art parsers for natural language are lexicalized, it was recently shown that an accurate unlexicalized parser for the Penn tree-bank can be read directly off a manually refined tree-bank. Lexicalized parsers often suffer from sparse data, whereas manual mark-up is costly and largely based on individual linguistic intuition. Thus, across domains, languages, and tree-bank annotations, a fundamental question arises: Is it possible to automatically induce an accurate parser from a tree-bank without resorting to full lexicalization? In this paper, we show how to induce a probabilistic parser with latent head information from simple linguistic principles. Our parser achieves an F1 score of 85.1% (labeled precision/labeled recall), which is as good as that of early lexicalized parsers. This is remarkable since the induction of probabilistic grammars is in general a hard task.


Keywords: Training Corpus; Latent Head; Input Tree; Auxiliary Node; Statistical Parser



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Detlef Prescher (1)
  1. Institute for Logic, Language and Computation, University of Amsterdam
