Research on Language and Computation, Volume 3, Issue 2–3, pp 247–279

Treebank-Based Acquisition of Multilingual Unification Grammar Resources

  • Aoife Cahill
  • Michael Burke
  • Martin Forst
  • Ruth O’Donovan
  • Christian Rohrer
  • Josef van Genabith
  • Andy Way
Article

DOI: 10.1007/s11168-005-1296-y

Cite this article as:
Cahill, A., Burke, M., Forst, M. et al. Res Lang Comput (2005) 3: 247. doi:10.1007/s11168-005-1296-y

Abstract

Deep unification- (constraint-)based grammars are usually hand-crafted. Scaling such grammars from fragments to unrestricted text is time-consuming and expensive. This problem can be exacerbated in multilingual broad-coverage grammar development scenarios. Cahill et al. (2002, 2004) and O’Donovan et al. (2004) present an automatic f-structure annotation-based methodology to acquire broad-coverage, deep, Lexical-Functional Grammar (LFG) resources for English from the Penn-II Treebank. In this paper we show how this model can be adapted to a multilingual grammar development scenario to induce robust, wide-coverage, PCFG-based LFG approximations for German from the TIGER Treebank. We show how the architecture of LFG, in particular the distinction between c-structure and f-structure representations, facilitates multilingual, treebank-based unification grammar induction, allowing us to cross-linguistically reuse the lexical extraction and parsing modules from O’Donovan et al. (2004) and Cahill et al. (2004), respectively. We evaluate our grammars against the PARC 700 Dependency Bank (King et al., 2003), against dependency structures for 2000 held-out sentences from the TIGER Corpus as well as against a hand-crafted dependency gold standard for 100 TIGER trees. Currently, our resources achieve 81.79% f-score against the PARC 700, a 2.19% improvement over the best result reported for a hand-crafted grammar in Kaplan et al. (2004), 74.6% against the 2000 held-out TIGER dependency structures and 71.08% against the 100-sentence TIGER gold standard, with substantially improved coverage compared to hand-crafted resources. We have since applied our methodology to induce wide-coverage LFG resources for Chinese (Burke et al., 2004b) from the Penn Chinese Treebank (Xue et al., 2002) and for Spanish from the CAST3LB Treebank (Civit, 2003).

Copyright information

© Springer 2005

Authors and Affiliations

  • Aoife Cahill (1)
  • Michael Burke (1, 2)
  • Martin Forst (2)
  • Ruth O’Donovan (1)
  • Christian Rohrer (3)
  • Josef van Genabith (1, 2)
  • Andy Way (1, 2)

  1. National Centre for Language Technology, School of Computing, Dublin City University, Dublin 9, Ireland
  2. Centre for Advanced Studies, IBM, Dublin, Ireland
  3. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Stuttgart, Germany
