Abstract
Deep unification- (constraint-)based grammars are usually hand-crafted. Scaling such grammars from fragments to unrestricted text is time-consuming and expensive. This problem can be exacerbated in multilingual broad-coverage grammar development scenarios. Cahill et al. (2002, 2004) and O’Donovan et al. (2004) present an automatic f-structure annotation-based methodology to acquire broad-coverage, deep, Lexical-Functional Grammar (LFG) resources for English from the Penn-II Treebank. In this paper we show how this model can be adapted to a multilingual grammar development scenario to induce robust, wide-coverage, PCFG-based LFG approximations for German from the TIGER Treebank. We show how the architecture of LFG, in particular the distinction between c-structure and f-structure representations, facilitates multilingual, treebank-based unification grammar induction, allowing us to cross-linguistically reuse the lexical extraction and parsing modules from O’Donovan et al. (2004) and Cahill et al. (2004), respectively. We evaluate our grammars against the PARC 700 Dependency Bank (King et al., 2003), against dependency structures for 2000 held-out sentences from the TIGER Corpus as well as against a hand-crafted dependency gold standard for 100 TIGER trees. Currently, our resources achieve 81.79% f-score against the PARC 700, a 2.19% improvement over the best result reported for a hand-crafted grammar in Kaplan et al. (2004), 74.6% against the 2000 held-out TIGER dependency structures and 71.08% against the 100-sentence TIGER gold standard, with substantially improved coverage compared to hand-crafted resources. We have since applied our methodology to induce wide-coverage LFG resources for Chinese (Burke et al., 2004b) from the Penn Chinese Treebank (Xue et al., 2002) and for Spanish from the CAST3LB Treebank (Civit, 2003).
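The f-scores quoted above are computed over dependency structures. As a minimal sketch of how such a score can be obtained, the following compares a set of test dependency triples against a gold set. The triple encoding `(relation, head, dependent)`, the function name `precision_recall_f`, and the toy data are assumptions made for illustration only; they are not the evaluation software used by the authors.

```python
from typing import Iterable, Tuple

# Assumed encoding of one dependency: (relation, head, dependent).
Triple = Tuple[str, str, str]


def precision_recall_f(gold: Iterable[Triple],
                       test: Iterable[Triple]) -> Tuple[float, float, float]:
    """Return (precision, recall, f-score) of test triples against gold triples."""
    gold_set, test_set = set(gold), set(test)
    matched = len(gold_set & test_set)  # triples present in both sets
    precision = matched / len(test_set) if test_set else 0.0
    recall = matched / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # F-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical toy data, not taken from the PARC 700 or TIGER.
gold = [("subj", "see", "John"), ("obj", "see", "Mary"), ("tense", "see", "past")]
test = [("subj", "see", "John"), ("obj", "see", "Mary"),
        ("tense", "see", "pres"), ("adjunct", "see", "today")]

p, r, f = precision_recall_f(gold, test)
print(f"precision={p:.4f} recall={r:.4f} f-score={f:.4f}")
```

Here two of the four test triples match the three gold triples, so precision is 2/4, recall is 2/3, and the f-score is their harmonic mean.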
References
Abeillé A. (ed.) (2003) Treebanks, Building and Using Parsed Corpora. Kluwer, Dordrecht.
Bender E., Flickinger D., Oepen S. (2002) The Grammar Matrix: An Open-Source Starter-Kit for the Rapid Development of Cross-Linguistically Consistent Broad-Coverage Precision Grammars. In: Carroll J., Oostdijk N., Sutcliffe R. (eds), Proceedings of the Workshop on Grammar Engineering and Evaluation at the 19th International Conference on Computational Linguistics. Taipei, Taiwan, pp. 8–14.
Bouma G., van Noord G., Malouf R. (2000) Alpino: Wide-coverage Computational Analysis of Dutch. In: Daelemans W., Sima’an K., Veenstra J., Zavrel J. (eds), Computational Linguistics in The Netherlands 2000. Rodopi, Amsterdam, pp. 45–59.
Brants T., Dipper S., Hansen S., Lezius W., Smith G. (2002) The TIGER Treebank. In: Hinrichs E., Simov K. (eds), Proceedings of the first Workshop on Treebanks and Linguistic Theories (TLT’02). Sozopol, Bulgaria, pp. 24–41.
Bresnan J. (2001) Lexical-Functional Syntax. Blackwell, Oxford.
Burke M., Cahill A., O’Donovan R., van Genabith J., Way A. (2004a) The Evaluation of an Automatic Annotation Algorithm against the PARC 700 Dependency Bank. In: Proceedings of the Ninth International Conference on LFG. Christchurch, New Zealand, pp. 101–121.
Burke M., Lam O., Chan R., Cahill A., O’Donovan R., Bodomo A., van Genabith J., Way A. (2004b) Treebank-Based Acquisition of a Chinese Lexical-Functional Grammar. In Proceedings of the 18th Pacific Asia Conference on Language, Information and Computation. Tokyo, Japan, pp. 161–172.
Butt M., Dyvik H., King T. H., Masuichi H., Rohrer C. (2002) The Parallel Grammar Project. In Proceedings of COLING 2002, Workshop on Grammar Engineering and Evaluation. Taipei, Taiwan, pp. 1–7.
Butt M., King T.H., Niño M.E., Segond F. (1999) A Grammar Writer’s Cookbook. CSLI Publications, Stanford, CA.
Cahill A., Burke M., O’Donovan R., van Genabith J., Way A. (2004) Long-Distance Dependency Resolution in Automatically Acquired Wide-Coverage PCFG-Based LFG Approximations. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain, pp. 320–327.
Cahill A., McCarthy M., van Genabith J., Way A. (2002) Parsing with PCFGs and Automatic F-Structure Annotation. In: Butt M., King T.H. (eds), Proceedings of the Seventh International Conference on LFG. CSLI Publications, Stanford, CA, pp. 76–95.
Cahill A., McCarthy M., van Genabith J., Way A. (2003) Quasi-Logical Forms for the Penn Treebank. In: Bunt H., van der Sluis I., Morante R. (eds), Proceedings of the Fifth International Workshop on Computational Semantics, IWCS-05. Tilburg, The Netherlands, pp. 55–71.
Charniak E. (2000) A Maximum Entropy Inspired Parser. In Proceedings of the First Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL 2000). Seattle, WA, pp. 132–139.
Civit M. (2003) Criterios de etiquetación y desambiguación morfosintáctica de corpus en español. Ph.D. thesis, Universitat de Barcelona, Spain.
Collins M. (1999) Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.
Crouch R., Kaplan R., King T. H., Riezler S. (2002) A comparison of evaluation metrics for a broad coverage parser. In Proceedings of the LREC Workshop: Beyond PARSEVAL – Towards Improved Evaluation Measures for Parsing Systems. Las Palmas, Canary Islands, Spain, pp. 67–74.
Dalrymple M. (2001) Lexical-Functional Grammar. Academic Press, San Diego, CA; London.
Dipper S. (2003) Implementing and Documenting Large-scale Grammars – German LFG. Ph.D. thesis, IMS, University of Stuttgart. Arbeitspapiere des Instituts für Maschinelle Sprachverarbeitung (AIMS), Volume 9, Number 1.
Flickinger D. (2000) On Building a More Efficient Grammar by Exploiting Types. Natural Language Engineering 6(1), 15–28. doi:10.1017/S1351324900002370
Forst M. (2003a) Treebank Conversion – Creating an f-structure bank from the TIGER Corpus. In: Butt M., King T. H. (eds), Proceedings of the Eighth International Conference on LFG. CSLI Publications, Stanford, CA, pp. 205–216.
Forst M. (2003b) Treebank Conversion – establishing a test suite for a broad-coverage LFG from the TIGER treebank. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC’03). Budapest, Hungary, pp. 25–32.
Frank A., Sadler L., van Genabith J., Way A. (2003) From Treebank Resources to LFG F-Structures. In: Abeillé A. (ed.), Treebanks: Building and Using Syntactically Annotated Corpora. Kluwer Academic Publishers, Dordrecht/Boston/London, The Netherlands, pp. 367–389.
Gamon M., Lozano C., Pinkham J., Reutter T. (1997) Practical Experience with Grammar Sharing in Multilingual NLP. In: Burstein J., Leacock C. (eds), Proceedings of the Workshop From Research to Commercial Applications: Making NLP Work in Practice, 35th Annual Meeting of the Association for Computational Linguistics and the 8th Conference of the European Chapter of the Association for Computational Linguistics (ACL-EACL’97). Madrid, Spain, pp. 49–56.
Hemphill C.T., Godfrey J.J., Doddington G.R. (1990) The ATIS spoken language systems pilot corpus. In Proceedings of a workshop on Speech and natural language. Morgan Kaufmann Publishers Inc., Hidden Valley, PA. pp. 96–101.
Hockenmaier J. (2003) Parsing with Generative models of Predicate-Argument Structure. In Proceedings of the 41st Annual Conference of the Association for Computational Linguistics. Sapporo, Japan, pp. 359–366.
Hockenmaier J., Steedman M. (2002) Generative Models for Statistical Parsing with Combinatory Categorial Grammar. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Philadelphia, PA. pp. 335–342.
Johnson M. (1999) PCFG models of linguistic tree representations. Computational Linguistics 24(4), 613–632.
Kaplan R., Bresnan J. (1982) Lexical Functional Grammar, a Formal System for Grammatical Representation. In: Bresnan J. (ed.), The Mental Representation of Grammatical Relations. MIT Press, Cambridge, MA, pp. 173–281.
Kaplan R., Riezler S., King T.H., Maxwell J.T., Vasserman A., Crouch R. (2004) Speed and Accuracy in Shallow and Deep Stochastic Parsing. In Proceedings of the Human Language Technology Conference and the 4th Annual Meeting of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL’04). Boston, MA., pp. 97–104.
Kaplan R., Zaenen A. (1989) Long-distance Dependencies, Constituent Structure and Functional Uncertainty. In: Baltin M., Kroch A., (eds), Alternative Conceptions of Phrase Structure. Chicago University Press, Chicago, pp. 17–42, Reprinted in M. Dalrymple et al. (editors), Formal Issues in Lexical-Functional Grammar. CSLI Publications, 1995.
Kaplan R.M., Netter K., Wedekind J., Zaenen A. (1989) Translation by structural correspondences. In Proceedings of the 4th Meeting of the European Chapter of the Association for Computational Linguistics. UMIST Manchester, UK, pp. 272–281.
Kim R., Dalrymple M., Kaplan R., King T. H., Masuichi H., Ohkuma T. (2003) Multilingual Grammar Development via Grammar Porting. In Proceedings of the Workshop on Ideas and Strategies for Multilingual Grammar Development, ESSLLI 2003. Vienna, Austria, pp. 49–56.
King T. H., Crouch R., Riezler S., Dalrymple M., Kaplan R. (2003) The PARC700 dependency bank. In Proceedings of the EACL03: 4th International Workshop on Linguistically Interpreted Corpora (LINC-03). Budapest, Hungary, pp. 1–8.
Macleod C., Meyers A., Grishman R. (1994) The COMLEX Syntax Project: The First Year. In Proceedings of the ARPA Workshop on Human Language Technology. Princeton, NJ., pp. 669–703.
Magerman D. (1994) Natural Language Parsing as Statistical Pattern Recognition. Ph.D. thesis, Department of Computer Science, Stanford University, CA.
Marcus M., Kim G., Marcinkiewicz M. A., MacIntyre R., Bies A., Ferguson M., Katz K., Schasberger B. (1994) The Penn Treebank: Annotating Predicate Argument Structure. In Proceedings of the ARPA Workshop on Human Language Technology. Princeton, NJ., pp. 110–115.
Masuichi H., Okuma T. (2003) Japanese Parser on the basis of the Lexical-Functional Grammar Formalism and its Evaluation. Journal of Natural Language Processing 10(2), 79–109.
Miyao Y., Ninomiya T., Tsujii J. (2003) Probabilistic modeling of argument structures including non-local dependencies. In Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP). Borovets, Bulgaria, pp. 285–291.
Miyao Y., Ninomiya T., Tsujii J. (2004) Corpus-oriented Grammar Development for Acquiring a Head-driven Phrase Structure Grammar from the Penn Treebank. In Proceedings of The First International Joint Conference on Natural Language Processing (IJCNLP-04). Hainan Island, China, pp. 390–397.
Müller S., Kasper W. (2000) HPSG Analysis of German. In Verbmobil: Foundations of Speech-to-Speech Translation. Springer-Verlag, Artificial Intelligence, Berlin, Heidelberg, New York, pp. 238–253.
O’Donovan R., Burke M., Cahill A., van Genabith J., Way A. (2004) Large-Scale Induction and Evaluation of Lexical Resources from the Penn-II Treebank. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. Barcelona, Spain, pp. 368–375.
Pollard C., Sag I. (1994) Head-driven Phrase Structure Grammar. CSLI Publications, Stanford, CA.
Riezler S., King T., Kaplan R., Crouch R., Maxwell J. T., Johnson M. (2002) Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. In Proceedings of the 40th Annual Conference of the Association for Computational Linguistics (ACL-02). Philadelphia, PA. pp. 271–278.
Schmid H. (2004) Efficient Parsing of Highly Ambiguous Context-Free Grammars with Bit Vectors. In Proceedings of the 18th International Conference on Computational Linguistics (COLING 2004). Geneva, Switzerland, pp. 162–168.
Siegel M., Bender E. (2002) Efficient deep processing of Japanese. In Proceedings of the 19th International Conference on Computational linguistics (COLING 2002). Taipei, Taiwan. pp. 31–38.
van Genabith J., Crouch R. (1996) Direct and Underspecified Interpretations of LFG f-Structures. In 16th International Conference on Computational Linguistics (COLING 96). Copenhagen, Denmark, pp. 262–267.
van Genabith J., Crouch R. (1997) How to Glue a Donkey to an f-Structure or Porting a Dynamic Meaning Representation Language into LFG’s Linear Logic Based Glue Language Semantics. In: Bunt H., Muskens R. (eds), Computing Meaning, volume 1, Studies in Linguistics and Philosophy, volume 73. Kluwer Academic Press, Dordrecht, Boston and London, pp. 129–148.
van Genabith J., Way A., Sadler L. (1999) Data-driven Compilation of LFG Semantic Forms. In Proceedings of the EACL Workshop on Linguistically Interpreted Corpora (LINC-99). Bergen, Norway, pp. 69–76.
Xue N., Chiou F.-D., Palmer M. (2002) Building a Large-Scale Annotated Chinese Corpus. In Proceedings of the 19th International Conference on Computational linguistics (COLING 2002). Taipei, Taiwan.
Cite this article
Cahill, A., Burke, M., Forst, M. et al. Treebank-Based Acquisition of Multilingual Unification Grammar Resources. Res Lang Comput 3, 247–279 (2005). https://doi.org/10.1007/s11168-005-1296-y
DOI: https://doi.org/10.1007/s11168-005-1296-y