Treebank-Based Acquisition of Multilingual Unification Grammar Resources

Research on Language and Computation

Abstract

Deep unification- (constraint-)based grammars are usually hand-crafted. Scaling such grammars from fragments to unrestricted text is time-consuming and expensive. This problem can be exacerbated in multilingual broad-coverage grammar development scenarios. Cahill et al. (2002, 2004) and O’Donovan et al. (2004) present an automatic f-structure annotation-based methodology to acquire broad-coverage, deep, Lexical-Functional Grammar (LFG) resources for English from the Penn-II Treebank. In this paper we show how this model can be adapted to a multilingual grammar development scenario to induce robust, wide-coverage, PCFG-based LFG approximations for German from the TIGER Treebank. We show how the architecture of LFG, in particular the distinction between c-structure and f-structure representations, facilitates multilingual, treebank-based unification grammar induction, allowing us to cross-linguistically reuse the lexical extraction and parsing modules from O’Donovan et al. (2004) and Cahill et al. (2004), respectively. We evaluate our grammars against the PARC 700 Dependency Bank (King et al., 2003), against dependency structures for 2000 held-out sentences from the TIGER Corpus as well as against a hand-crafted dependency gold standard for 100 TIGER trees. Currently, our resources achieve 81.79% f-score against the PARC 700, a 2.19% improvement over the best result reported for a hand-crafted grammar in Kaplan et al. (2004), 74.6% against the 2000 held-out TIGER dependency structures and 71.08% against the 100-sentence TIGER gold standard, with substantially improved coverage compared to hand-crafted resources. We have since applied our methodology to induce wide-coverage LFG resources for Chinese (Burke et al., 2004b) from the Penn Chinese Treebank (Xue et al., 2002) and for Spanish from the CAST3LB Treebank (Civit, 2003).
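The f-scores quoted above are computed over dependency structures. As a minimal sketch of how such a score can be derived, assuming each dependency is encoded as a (relation, head, dependent) triple and that triples are matched exactly, the following Python computes precision, recall and f-score. The triple encoding, the function name `triple_prf` and the toy data are illustrative assumptions, not the evaluation software used in the paper.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical encoding of one dependency: (relation, head, dependent).
Triple = Tuple[str, str, str]


def triple_prf(gold: Iterable[Triple], test: Iterable[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, f-score) for exact triple matches."""
    gold_counts = Counter(gold)
    test_counts = Counter(test)
    # Multiset intersection: a triple counts as matched at most as often
    # as it occurs in both the gold and the test structure.
    matched = sum((gold_counts & test_counts).values())
    n_test = sum(test_counts.values())
    n_gold = sum(gold_counts.values())
    precision = matched / n_test if n_test else 0.0
    recall = matched / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data (invented for illustration only).
gold_triples = [
    ("subj", "sleep", "john"),
    ("tense", "sleep", "past"),
    ("adjunct", "sleep", "soundly"),
]
test_triples = [
    ("subj", "sleep", "john"),
    ("tense", "sleep", "pres"),
    ("adjunct", "sleep", "soundly"),
    ("num", "john", "sg"),
]

precision, recall, f_score = triple_prf(gold_triples, test_triples)
```

In this toy case two of the four proposed triples are correct and two of the three gold triples are found, so precision is 1/2, recall is 2/3 and the f-score is 4/7.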

References

  • Abeillé A. (ed.) (2003) Treebanks, Building and Using Parsed Corpora. Kluwer, Dordrecht.

  • Bender E., Flickinger D., Oepen S. (2002) The Grammar Matrix: An Open-Source Starter-Kit for the Rapid Development of Cross-Linguistically Consistent Broad-Coverage Precision Grammars. In: Carroll J., Oostdijk N., Sutcliffe R. (eds), Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics. Taipei, Taiwan, pp. 8–14.

  • Bouma G., van Noord G., Malouf R. (2000) Alpino: Wide-coverage Computational Analysis of Dutch. In: Daelemans W., Sima’an K., Veenstra J., Zavrel J. (eds), Computational Linguistics in The Netherlands 2000. Rodopi, Amsterdam, pp. 45–59.

  • Brants T., Dipper S., Hansen S., Lezius W., Smith G. (2002) The TIGER Treebank. In: Hinrichs E., Simov K. (eds), Proceedings of the first Workshop on Treebanks and Linguistic Theories (TLT’02). Sozopol, Bulgaria, pp. 24–41.

  • Bresnan J. (2001) Lexical-Functional Syntax. Blackwell, Oxford.

  • Burke M., Cahill A., O’Donovan R., van Genabith J., Way A. (2004a) The Evaluation of an Automatic Annotation Algorithm against the PARC 700 Dependency Bank. In: Proceedings of the Ninth International Conference on LFG. Christchurch, New Zealand, pp. 101–121.

  • Burke M., Lam O. Chan R., Cahill A., O’Donovan R., Bodomo A., van Genabith J., Way A. (2004b) Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar. In Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation. Tokyo, Japan, pp. 161–172.

  • Butt M., Dyvik H., King T. H., Masuichi H., Rohrer C. (2002) The Parallel Grammar Project. In Proceedings of COLING 2002, Workshop on Grammar Engineering and Evaluation. Taipei, Taiwan, pp. 1–7.

  • Butt M., King T. H., Niño M. E., Segond F. (1999) A Grammar Writer’s Cookbook. CSLI Publications, Stanford, CA.

  • Cahill A., Burke M., O’Donovan R., van Genabith J., Way A. (2004) Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain, pp. 320–327.

  • Cahill A., McCarthy M., van Genabith J., Way A. (2002) Parsing with PCFGs and Automatic F-Structure Annotation. In: Butt M., King T. H. (eds), Proceedings of the Seventh International Conference on LFG. CSLI Publications, Stanford, CA, pp. 76–95.

  • Cahill A., McCarthy M., van Genabith J., Way A. (2003) Quasi-Logical Forms for the Penn Treebank. In: Bunt H., van der Sluis I., Morante R. (eds), Proceedings of the Fifth International Workshop on Computational Semantics, IWCS-05. Tilburg, The Netherlands, pp. 55–71.

  • Charniak E. (2000) A Maximum Entropy Inspired Parser. In Proceedings of the First Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000). Seattle, WA, pp. 132–139.

  • Civit M. (2003) Criterios de etiquetación y desambiguación morfosintáctica de corpus en español. Ph.D. thesis, Universitat de Barcelona, Spain.

  • Collins M. (1999) Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.

  • Crouch R., Kaplan R., King T. H., Riezler S. (2002) A comparison of evaluation metrics for a broad coverage parser. In Proceedings of the LREC Workshop: Beyond PARSEVAL – Towards Improved Evaluation Measures for Parsing Systems. Las Palmas, Canary Islands, Spain, pp. 67–74.

  • Dalrymple M. (2001) Lexical-Functional Grammar. Academic Press, San Diego, CA, and London.

  • Dipper S. (2003) Implementing and Documenting Large-scale Grammars – German LFG. Ph.D. thesis, IMS, University of Stuttgart. Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung (AIMS), Volume 9, Number 1.

  • Flickinger D. (2000) On Building a More Efficient Grammar by Exploiting Types. Natural Language Engineering 6(1), 15–28. doi:10.1017/S1351324900002370

  • Forst M. (2003a) Treebank Conversion – Creating an f-structure bank from the TIGER Corpus. In: Butt M., King T. H. (eds), Proceedings of the Eighth International Conference on LFG. CSLI Publications, Stanford, CA, pp. 205–216.

  • Forst M. (2003b) Treebank Conversion – establishing a test suite for a broad-coverage LFG from the TIGER treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC’03). Budapest, Hungary, pp. 25–32.

  • Frank A., Sadler L., van Genabith J., Way A. (2003) From Treebank Resources to LFG F-Structures. In: Abeillé A. (ed.), Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers, Dordrecht/Boston/London, The Netherlands, pp. 367–389.

  • Gamon M., Lozano C., Pinkham J., Reutter T. (1997) Practical Experience with Grammar Sharing in Multilingual NLP. In: Burstein J., Leacock C. (eds), Proceedings of the Workshop From Research to Commercial Applications: Making NLP Work in Practice, 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL-EACL’97). Madrid, Spain, pp. 49–56.

  • Hemphill C.T., Godfrey J.J., Doddington G.R. (1990) The ATIS spoken language systems pilot corpus. In Proceedings of a workshop on Speech and natural language. Morgan Kaufmann Publishers Inc., Hidden Valley, PA. pp. 96–101.

  • Hockenmaier J. (2003) Parsing with Generative models of Predicate-Argument Structure. In Proceedings of the 41st Annual Conference of the Association for Computational Linguistics. Sapporo, Japan, pp. 359–366.

  • Hockenmaier J., Steedman M. (2002) Generative Models for Statistical Parsing with Combinatory Categorial Grammar. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, PA. pp. 335–342.

  • Johnson M. (1999) PCFG models of linguistic tree representations. Computational Linguistics 24(4), 613–632.

  • Kaplan R., Bresnan J. (1982) Lexical Functional Grammar, a Formal System for Grammatical Representation. In: Bresnan J. (ed.), The Mental Representation of Grammatical Relations. MIT Press, Cambridge, MA, pp. 173–281.

  • Kaplan R., Riezler S., King T. H., Maxwell J. T., Vasserman A., Crouch R. (2004) Speed and Accuracy in Shallow and Deep Stochastic Parsing. In Proceedings of the Human Language Technology Conference and the 4th Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL’04). Boston, MA, pp. 97–104.

  • Kaplan R., Zaenen A. (1989) Long-distance Dependencies, Constituent Structure and Functional Uncertainty. In: Baltin M., Kroch A., (eds), Alternative Conceptions of Phrase Structure. Chicago University Press, Chicago, pp. 17–42, Reprinted in M. Dalrymple et al. (editors), Formal Issues in Lexical-Functional Grammar. CSLI Publications, 1995.

  • Kaplan R.M., Netter K., Wedekind J., Zaenen A. (1989) Translation by structural correspondences. In Proceedings of the 4th Meeting of the European Chapter of the Association for Computational Linguistics. UMIST Manchester, UK, pp. 272–281.

  • Kim R., Dalrymple M., Kaplan R., King T. H., Masuichi H., Ohkuma T. (2003) Multilingual Grammar Development via Grammar Porting. In Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development, ESSLLI 2003. Vienna, Austria, pp. 49–56.

  • King T. H., Crouch R., Riezler S., Dalrymple M., Kaplan R. (2003) The PARC700 dependency bank. In Proceedings of the EACL03: 4th International Workshop on Linguistically Interpreted Corpora (LINC-03). Budapest, Hungary, pp. 1–8.

  • Macleod C., Meyers A., Grishman R. (1994) The COMLEX Syntax Project: The First Year. In Proceedings of the ARPA Workshop on Human Language Technology. Princeton, NJ., pp. 669–703.

  • Magerman D. (1994) Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Department of Computer Science, Stanford University, CA.

  • Marcus M., Kim G., Marcinkiewicz M. A., MacIntyre R., Bies A., Ferguson M., Katz K., Schasberger B. (1994) The Penn Treebank: Annotating Predicate Argument Structure. In Proceedings of the ARPA Workshop on Human Language Technology. Princeton, NJ, pp. 110–115.

  • Masuichi H., Okuma T. (2003) Japanese Parser on the basis of the Lexical-Functional Grammar Formalism and its Evaluation. Journal of Natural Language Processing 10(2), 79–109.

  • Miyao Y., Ninomiya T., Tsujii J. (2003) Probabilistic modeling of argument structures including non-local dependencies. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP). Borovets, Bulgaria, pp. 285–291.

  • Miyao Y., Ninomiya T., Tsujii J. (2004) Corpus-oriented Grammar Development for Acquiring a Head-driven Phrase Structure Grammar from the Penn Treebank. In Proceedings of The First International Joint Conference on Natural Language Processing (IJCNLP-04). Hainan Island, China, pp. 390–397.

  • Müller S., Kasper W. (2000) HPSG Analysis of German. In Verbmobil: Foundations of Speech-to-Speech Translation. Springer-Verlag, Artificial Intelligence, Berlin, Heidelberg, New York, pp. 238–253.

  • O’Donovan R., Burke M., Cahill A., van Genabith J., Way A. (2004) Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain, pp. 368–375.

  • Pollard C., Sag I. (1994) Head-driven Phrase Structure Grammar. CSLI Publications, Stanford, CA.

  • Riezler S., King T., Kaplan R., Crouch R., Maxwell J. T., Johnson M. (2002) Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics (ACL-02). Philadelphia, PA. pp. 271–278.

  • Schmid H. (2004) Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2004). Geneva, Switzerland, pp. 162–168.

  • Siegel M., Bender E. (2002) Efficient deep processing of Japanese. In Proceedings of the 19th International Conference on Computational linguistics (COLING 2002). Taipei, Taiwan. pp. 31–38.

  • van Genabith J., Crouch R. (1996) Direct and Underspecified Interpretations of LFG f-Structures. In 16th International Conference on Computational Linguistics (COLING 96). Copenhagen, Denmark, pp. 262–267.

  • van Genabith J., Crouch R. (1997) How to Glue a Donkey to an f-Structure or Porting a Dynamic Meaning Representation Language into LFG’s Linear Logic Based Glue Language Semantics. In: Bunt H., Muskens R. (eds), Computing Meaning, volume 1, Studies in Linguistics and Philosophy, volume 73. Kluwer Academic Press, Dordrecht, Boston and London, pp. 129–148.

  • van Genabith J., Way A., Sadler L. (1999) Data-driven Compilation of LFG Semantic Forms. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC-99). Bergen, Norway, pp. 69–76.

  • Xue N., Chiou F.-D., Palmer M. (2002) Building a Large-Scale Annotated Chinese Corpus. In Proceedings of the 19th International Conference on Computational linguistics (COLING 2002). Taipei, Taiwan.

Author information

Corresponding author

Correspondence to Josef van Genabith.

About this article

Cite this article

Cahill, A., Burke, M., Forst, M. et al. Treebank-Based Acquisition of Multilingual Unification Grammar Resources. Res Lang Comput 3, 247–279 (2005). https://doi.org/10.1007/s11168-005-1296-y

  • DOI: https://doi.org/10.1007/s11168-005-1296-y
