
Wide-Coverage Parsing, Semantics, and Morphology

In: Turkish Natural Language Processing

Abstract

Wide-coverage parsing poses three demands: broad coverage, preferably over free text; depth of semantic representation, for purposes such as inference in question answering; and computational efficiency. We show for Turkish that these goals are not inherently contradictory when we assign categories to sub-lexical elements in the lexicon. The presumed computational burden of processing such lexicons does not arise when we work with automata-constrained formalisms that are trainable on word–meaning correspondences at the level of predicate-argument structures for any string, which is characteristic of radically lexicalizable grammars. This is helpful in morphologically simpler languages too, where word-based parsing has been shown to benefit from sub-lexical training.


Notes

  1.

    Lewis and Steedman (2014) describe what is at stake if we incorporate distributional semantics of content words but not compositional semantics coming out of such heads.

  2.

    The notion of morpheme is controversial in linguistics; Matthews (1974), Stump (2001), and Aronoff and Fudeman (2011) provide some discussion. Without delving into morphological theory, we shall adopt the computational view summarized by Roark and Sproat (2007): morphology can be characterized by finite-state mechanisms. Models of morphological processing differ in the way they handle lexical access. For example, two-level morphology is finite-state and linear-time in its morphotactics, but it incurs an exponential cost for surface form–lexical form pairings during morphological processing (see Koskenniemi 1983, Koskenniemi and Church 1988, and Barton et al. 1987 for extensive discussion). On the other hand, if we have lexical categories for sub-lexical items, then, given a string and its decomposition, we can check efficiently (in polynomial time) whether the category–string correspondences are parsable: the problem is in NP (nondeterministic polynomial time) not because of parsing but because of ambiguity. Lexical access could then use the same mechanism for words, morphemes, and morpholexical rules if desired.
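To make the parsability check concrete, here is a minimal sketch of CKY-style recognition over CCG categories, restricted to forward and backward application: once a decomposition and its categories are fixed, recognition is cubic in the number of lexical categories. The notation, the application-only rule set, and the toy lexical assignment for an SOV clause are our simplifying assumptions, not the chapter's parser.

```python
# Minimal CKY-style recognizer over CCG categories, restricted to forward
# and backward application. Illustrative sketch only; notation and toy
# lexical assignments are simplifying assumptions, not the chapter's parser.

def parse_cat(s):
    """Parse a category like '(S\\NP)\\NP' into nested (result, slash, arg) tuples."""
    s = s.strip()
    if s.startswith('(') and s.endswith(')'):
        depth = 0
        for i, c in enumerate(s):
            depth += (c == '(') - (c == ')')
            if depth == 0 and i < len(s) - 1:
                break                      # outer parens do not wrap the whole string
        else:
            return parse_cat(s[1:-1])      # strip one wrapping layer of parentheses
    depth = 0
    for i in range(len(s) - 1, -1, -1):    # rightmost slash at depth 0 is the main connective
        c = s[i]
        depth += (c == ')') - (c == '(')
        if c in '/\\' and depth == 0:
            return (parse_cat(s[:i]), c, parse_cat(s[i + 1:]))
    return s                               # atomic category, e.g. 'S' or 'NP'

def apply_rules(left, right):
    """Forward application X/Y Y => X and backward application Y X\\Y => X."""
    out = []
    if isinstance(left, tuple) and left[1] == '/' and left[2] == right:
        out.append(left[0])
    if isinstance(right, tuple) and right[1] == '\\' and right[2] == left:
        out.append(right[0])
    return out

def parsable(cats, goal='S'):
    """CKY recognition over a fixed sequence of lexical categories: O(n^3)."""
    cats = [parse_cat(c) for c in cats]
    n = len(cats)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, c in enumerate(cats):
        chart[i][i + 1].add(c)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for l in chart[i][k]:
                    for r in chart[k][j]:
                        chart[i][j].update(apply_rules(l, r))
    return parse_cat(goal) in chart[0][n]

# A toy SOV clause with hypothetical categories: subject, object, transitive verb.
print(parsable(['NP', 'NP', '(S\\NP)\\NP']))   # True
```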

  3.

    For example, for the word el-ler-im-de-ki “the ones in my hands,” with the morphological breakdown el-plu-poss.1s-loc-rel, where el is “hand,” the lexical ending is -plu-poss.1s-loc-rel.

  4.

    Notice that, left unconstrained, we face n × 2^164 ≈ n × 10^49.4 word-like forms in Turkish, from 164 morphemes and n lexemes. A much smaller search space is attested because of morphological, semantic, and lexical constraints, but 50,000 and counting is still an enormous search space.
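The order-of-magnitude figure is just arithmetic; the snippet below reproduces it.

```python
import math

MORPHEMES = 164
# Each morpheme freely present or absent gives 2**164 combinations per lexeme.
print(math.log10(2 ** MORPHEMES))   # 49.369..., i.e. 2**164 is about 10**49.4
```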

  5.

    Using complexity results in this way has sometimes been controversial; see, for example, Berwick and Weinberg (1982), Barton et al. (1987), and Koskenniemi and Church (1988). One view, which we do not follow, is to eliminate alternatives in the model by insisting on tractable algorithms, as in the Tractable Cognition thesis (van Rooij 2008). The view we follow treats complexity as a mixture of source and data that in the end allows efficient parsing, feasible and transparent training, and scalable performance. For example, Clark and Curran’s (2007) CCG parser is cubic time, whereas the A* CCG parser of Lewis and Steedman (2014) is exponential time in the worst case but, with training, shows superlinear parsing performance on long sentences. Another example of this view is the PAC learnability of Valiant (2013).

  6.

    Item-and-Arrangement (IA) morphology treats word structure as consisting of morphemes put one after another, like segments. Item-and-Process (IP) morphology uses lexemes and processes associated with them, which are not necessarily segmental. Another alternative is Word-and-Paradigm, which is similar to IP but takes the word as the basic unit. The terminology is due to Hockett (1959).
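The contrast can be caricatured in a few lines of code: IA stores segmental items and arranges them, while IP stores a lexeme and applies processes to it, which need not be segmental. The toy functions below are ours and ignore Turkish morphophonology such as vowel harmony.

```python
# Item-and-Arrangement: a word is an arrangement of stored segmental items.
def ia_word(morphemes):
    return ''.join(morphemes)

print(ia_word(['el', 'ler', 'im']))              # 'ellerim' ("my hands")

# Item-and-Process: a lexeme plus processes applied to it; a process need not
# concatenate a segment (it could be reduplication, ablaut, a template, ...).
def pluralize(stem):
    return stem + 'ler'

def possessive_1s(stem):
    return stem + 'im'

def ip_word(lexeme, processes):
    for process in processes:
        lexeme = process(lexeme)
    return lexeme

print(ip_word('el', [pluralize, possessive_1s])) # 'ellerim'
```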

  7.

    Nonconcatenative and nonsegmental morphological processes, which are not only characteristic of templatic languages but also abundant across diverse morphological typologies, for example in German, Tagalog, and Alabama, are painful reminders that IA cannot be a universal model for all lexicons.

  8.

    What this means is that, if “archer” in (4) were a quantified phrase, for example her okçu “every archer,” then the quantifier’s lexically value-raised category would lead to her okçu := NP∖(NP∖NP): λpλqλx.(∀x)pa′x → qx. Value-raising is the distribution of type-raising to arguments, as shown in the logical form.

  9.

    Here we pass over the mechanism that maintains lexical integrity, which has the effect of doing category combination of bound items before doing it across words. The idea was first stipulated in CCG by Bozşahin (2002) and revised for explanation in Steedman and Bozşahin (2018). In practical parser training the same effect has been achieved in various ways. For example, in a maximum entropy model of Turkish, a category feature for a word is decided based on whether it arises from a suffix of the word (Akkuş 2014). Wang et al. (2014) rely on a morphological analyzer before training, to keep category inference within a word. Ambati et al. (2013) rank intra-chunk (morphological) dependencies higher than inter-chunk (phrasal) dependencies in coming up with CCG categories, which has the same effect.

  10.

    Notice that the adverb kolayca is necessarily a VP modifier, unlike kolay of (7b), which is underspecified. We avoid ungrammatical coordination of parts of words, while allowing suspended affixation, by radically lexicalizing the conjunction category. For example, [target-ACC hit and bystander-ACC missed ]-REL archer is ungrammatical, and the coordination category carries the constraint (ω), which requires that phonological wordhood be satisfied by all Xs. The left conjunct in this hypothetical example could not project an X_ω because its right periphery, which projects X, would not be a phonological word, as Kabak (2007) showed. It is a forced move in CCG that such constraints on formally available combinations must be derived from information available at the perceptual interfaces.
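A rough rendering of the wordhood condition is sketched below, under our own simplified representation of a conjunct as a category paired with its right-periphery form; the is_phonological_word test is a stand-in for the phonology, not the chapter's mechanism.

```python
# Sketch of the (omega) constraint on the lexicalized coordination category:
# coordination of two Xs is licensed only if each conjunct's right periphery
# is a phonological word. The (category, right_edge) representation and the
# wordhood test below are our illustrative stand-ins.

def is_phonological_word(form):
    # Stand-in: bound forms are marked with a hyphen; a real grammar consults phonology.
    return not (form.startswith('-') or form.endswith('-'))

def coordinate(left, right):
    """License [left CONJ right] as category X only if both conjuncts project X_omega."""
    (lcat, ledge), (rcat, redge) = left, right
    if lcat != rcat:
        return None            # conjuncts must be of the same category X
    if not (is_phonological_word(ledge) and is_phonological_word(redge)):
        return None            # the omega (phonological wordhood) constraint fails
    return lcat                # coordination projects X, satisfying X_omega

print(coordinate(('NP', 'kitap'), ('NP', 'defter')))      # 'NP'  -> licensed
print(coordinate(('S\\NP', 'vur-'), ('S\\NP', 'vurdu')))  # None  -> left edge is not a word
```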

  11.

    We note that another wide-coverage parser for Turkish, Eryiğit et al. (2008), which uses dependency parsing, achieves its highest results in terms equivalent to a subset of our sub-lexical training (inflectional groups, in their case). Their comparison includes word-trained lexicons. CCG adds to this perspective a richer inventory of types to train with, and the benefit of naturally extending the coverage to long-range dependencies that are abundant in large corpora, once heads of syntactic constructions bear combinatory categories in the lexicon. We say more about these aspects subsequently.

  12.

    Honnibal and Curran (2009), Honnibal et al. (2010), and Honnibal (2010) have shown that English also benefits in parsing performance from sub-lexical training, although parsing in their case is word-based. One key ingredient appears to be lexicalizing the unary rules as “hat categories,” which makes such CCG categories true supertags, because they can be taken into account in training before the parser sees them, whereas the earlier use of “supertag” in CCG is equivalent to “combinatory lexical category.”

  13.

    In linguistics, the term “lexeme” can mean a base lexeme together with all of its paradigm forms, all receiving one and the same part of speech.

  14.

    The example is from Çakıcı (2008). The convention we follow in display of Turkish treebank data is: word|POS|Category–gloss.
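For readers handling the display format programmatically, a small helper suffices; the sample token below is invented for illustration.

```python
# Split a treebank display token of the form word|POS|Category–gloss.
# The sample token is invented for illustration only.

def split_token(token):
    word, pos, rest = token.split('|', 2)
    category, _, gloss = rest.partition('–')   # en dash separates category from gloss
    return {'word': word, 'pos': pos, 'category': category, 'gloss': gloss}

print(split_token('okçu|Noun|NP–archer'))
# {'word': 'okçu', 'pos': 'Noun', 'category': 'NP', 'gloss': 'archer'}
```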

  15.

    Figures are from Çakıcı (2008).

  16.

    In fact, both interpretations are possible, and type-shifting from NP to S would be preferable. For example, “Arabadaki Mehmet.” (car-loc-ki Mehmet) could mean “Mehmet, the one in the car” or “The one in the car is Mehmet,” with the given punctuation. The difference between the interpretations is clear in the following alternative continuations: Yarın gidiyormuş./Ahmet değil. “He is leaving tomorrow./Not Ahmet.” The first requires the NP reading for the initial example, and the second the S (propositional) reading. Going the other way, i.e., from a lexically specified S for a nominal predicate to an NP, is much more restricted in Turkish; such type-shifting is in fact headed by verbal inflection.

  17.

    Although the gold-standard CCG categories (supertags) are used, this number is slightly below 100%, possibly because of an implementation discrepancy.

References

  • Akkuş BK (2014) Supertagging with combinatory categorial grammar for dependency parsing. Master’s thesis, Middle East Technical University, Ankara
  • Ambati BR, Deoskar T, Steedman M (2013) Using CCG categories to improve Hindi dependency parsing. In: Proceedings of ACL, Sofia, pp 604–609
  • Ambati BR, Deoskar T, Steedman M (2014) Improving dependency parsers using combinatory categorial grammar. In: Proceedings of EACL, Gothenburg, pp 159–163
  • Aronoff M, Fudeman K (2011) What is morphology?, 2nd edn. Wiley-Blackwell, Chichester
  • Atalay NB, Oflazer K, Say B (2003) The annotation process in the Turkish treebank. In: Proceedings of the workshop on linguistically interpreted corpora, Budapest, pp 33–38
  • Bangalore S, Joshi AK (eds) (2010) Supertagging. MIT Press, Cambridge, MA
  • Barton G, Berwick R, Ristad E (1987) Computational complexity and natural language. MIT Press, Cambridge, MA
  • Berwick R, Weinberg A (1982) Parsing efficiency, computational complexity, and the evaluation of grammatical theories. Linguist Inquiry 13:165–192
  • Birch A, Osborne M, Koehn P (2007) CCG supertags in factored statistical machine translation. In: Proceedings of WMT, pp 9–16
  • Bos J, Bosco C, Mazzei A (2009) Converting a dependency treebank to a categorial grammar treebank for Italian. In: Proceedings of the international workshop on treebanks and linguistic theories, Milan, pp 27–38
  • Bozşahin C (2002) The combinatory morphemic lexicon. Comput Linguist 28(2):145–186
  • Bozşahin C (2012) Combinatory linguistics. Mouton De Gruyter, Berlin
  • Çakıcı R (2005) Automatic induction of a CCG grammar for Turkish. In: Proceedings of the ACL student research workshop, Ann Arbor, MI, pp 73–78
  • Çakıcı R (2008) Wide-coverage parsing for Turkish. PhD thesis, University of Edinburgh, Edinburgh
  • Çakıcı R, Steedman M (2009) A wide-coverage morphemic CCG lexicon for Turkish. In: Proceedings of the ESSLLI workshop on parsing with categorial grammars, Bordeaux, pp 11–15
  • Çakıcı R, Steedman M (2018) Wide coverage CCG parsing for Turkish (in preparation)
  • Cha J, Lee G, Lee J (2002) Korean combinatory categorial grammar and statistical parsing. Comput Hum 36(4):431–453
  • Clark S (2002) A supertagger for combinatory categorial grammar. In: Proceedings of the TAG+ workshop, Venice, pp 19–24
  • Clark S, Curran JR (2006) Partial training for a lexicalized grammar parser. In: Proceedings of NAACL-HLT, New York, NY, pp 144–151
  • Clark S, Curran JR (2007) Wide-coverage efficient statistical parsing with CCG and log-linear models. Comput Linguist 33:493–552
  • Çöltekin Ç, Bozşahin C (2007) Syllable-based and morpheme-based models of Bayesian word grammar learning from CHILDES database. In: Proceedings of the annual meeting of the cognitive science society, Nashville, TN, pp 880–886
  • Eryiğit G, Nivre J, Oflazer K (2008) Dependency parsing of Turkish. Comput Linguist 34(3):357–389
  • Göksel A (2006) Pronominal participles in Turkish and lexical integrity. Ling Linguaggio 5(1):105–125
  • Hall J, Nilsson J (2006) CoNLL-X shared task: multi-lingual dependency parsing. MSI Report 06060, School of Mathematics and Systems Engineering, Växjö University, Växjö
  • Hockenmaier J (2003) Data models for statistical parsing with combinatory categorial grammar. PhD thesis, University of Edinburgh, Edinburgh
  • Hockenmaier J (2006) Creating a CCGbank and a wide-coverage CCG lexicon for German. In: Proceedings of COLING-ACL, Sydney, pp 505–512
  • Hockenmaier J, Steedman M (2007) CCGbank: a corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Comput Linguist 33(3):356–396
  • Hockett CF (1959) Two models of grammatical description. Bobbs-Merrill, Indianapolis, IN
  • Hoeksema J, Janda RD (1988) Implications of process-morphology for categorial grammar. In: Oehrle RT, Bach E, Wheeler D (eds) Categorial grammars and natural language structures. D. Reidel, Dordrecht
  • Honnibal M (2010) Hat categories: representing form and function simultaneously in combinatory categorial grammar. PhD thesis, University of Sydney, Sydney
  • Honnibal M, Curran JR (2009) Fully lexicalising CCGbank with hat categories. In: Proceedings of EMNLP, Singapore, pp 1212–1221
  • Honnibal M, Kummerfeld JK, Curran JR (2010) Morphological analysis can improve a CCG parser for English. In: Proceedings of COLING, Beijing, pp 445–453
  • Kabak B (2007) Turkish suspended affixation. Linguistics 45:311–347
  • Koskenniemi K (1983) Two-level morphology: a general computational model for word-form recognition and production. PhD thesis, University of Helsinki, Helsinki
  • Koskenniemi K, Church KW (1988) Complexity, two-level morphology and Finnish. In: Proceedings of COLING, Budapest, pp 335–339
  • Lewis M, Steedman M (2014) A* CCG parsing with a supertag-factored model. In: Proceedings of EMNLP, Doha, pp 990–1000
  • Lieber R (1992) Deconstructing morphology: word formation in syntactic theory. The University of Chicago Press, Chicago, IL
  • MacWhinney B (2000) The CHILDES project: tools for analyzing talk, 3rd edn. Lawrence Erlbaum Associates, Mahwah, NJ
  • Matthews P (1974) Morphology: an introduction to the theory of word-structure. Cambridge University Press, Cambridge
  • McConville M (2006) An inheritance-based theory of the lexicon in combinatory categorial grammar. PhD thesis, University of Edinburgh, Edinburgh
  • McDonald R, Crammer K, Pereira F (2005) Online large-margin training of dependency parsers. In: Proceedings of ACL, Ann Arbor, MI, pp 91–98
  • Nivre J, Hall J, Nilsson J, Chanev A, Eryiğit G, Kübler S, Marinov S, Marsi E (2007) MaltParser: a language-independent system for data-driven dependency parsing. Nat Lang Eng 13(2):95–135
  • Oflazer K (2003) Dependency parsing with an extended finite-state approach. Comput Linguist 29(4):515–544
  • Oflazer K, Göçmen E, Bozşahin C (1994) An outline of Turkish morphology. www.academia.edu/7331476/An_Outline_of_Turkish_Morphology (7 May 2018)
  • Oflazer K, Say B, Hakkani-Tür DZ, Tür G (2003) Building a Turkish treebank. In: Treebanks: building and using parsed corpora. Kluwer Academic Publishers, Berlin
  • Roark B, Sproat RW (2007) Computational approaches to morphology and syntax. Oxford University Press, Oxford
  • Sak H, Güngör T, Saraçlar M (2011) Resources for Turkish morphological processing. Lang Resour Eval 45(2):249–261
  • Schmerling S (1983) Two theories of syntactic categories. Linguist Philos 6(3):393–421
  • Sells P (1995) Korean and Japanese morphology from a lexical perspective. Linguist Inquiry 26(2):277–325
  • Steedman M (1996) Surface structure and interpretation. MIT Press, Cambridge, MA
  • Steedman M (2000) The syntactic process. MIT Press, Cambridge, MA
  • Steedman M (2011) Taking scope. MIT Press, Cambridge, MA
  • Steedman M, Baldridge J (2011) Combinatory categorial grammar. In: Borsley R, Börjars K (eds) Non-transformational syntax: formal and explicit models of grammar: a guide to current models. Wiley-Blackwell, West Sussex
  • Steedman M, Bozşahin C (2018) Projecting from the lexicon. MIT Press (submitted)
  • Stump GT (2001) Inflectional morphology: a theory of paradigm structure. Cambridge University Press, Cambridge
  • Tse D, Curran JR (2010) Chinese CCGbank: extracting CCG derivations from the Penn Chinese treebank. In: Proceedings of COLING, Beijing, pp 1083–1091
  • Valiant L (2013) Probably approximately correct: nature’s algorithms for learning and prospering in a complex world. Basic Books, New York, NY
  • van Rooij I (2008) The tractable cognition thesis. Cogn Sci 32(6):939–984
  • Wang A, Kwiatkowski T, Zettlemoyer L (2014) Morpho-syntactic lexical generalization for CCG semantic parsing. In: Proceedings of EMNLP, Doha, pp 1284–1295
  • Yuret D, Türe F (2006) Learning morphological disambiguation rules for Turkish. In: Proceedings of NAACL-HLT, New York, NY, pp 328–334


Author information

Correspondence to Ruket Çakıcı.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter


Cite this chapter

Çakıcı, R., Steedman, M., Bozşahin, C. (2018). Wide-Coverage Parsing, Semantics, and Morphology. In: Oflazer, K., Saraçlar, M. (eds) Turkish Natural Language Processing. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-90165-7_8


  • DOI: https://doi.org/10.1007/978-3-319-90165-7_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-90163-3

  • Online ISBN: 978-3-319-90165-7

  • eBook Packages: Computer Science (R0)
