Distributional learning of parallel multiple context-free grammars
Abstract
Natural languages require grammars beyond context-free for their description. Here we extend a family of distributional learning algorithms for context-free grammars to the class of Parallel Multiple Context-Free Grammars (pmcfgs). These grammars have two additional operations beyond the simple context-free operation of concatenation: the ability to interleave strings of symbols, and the ability to copy or duplicate strings. This allows the grammars to generate some non-semilinear languages, which are outside the class of mildly context-sensitive languages. These grammars, if augmented with a suitable feature mechanism, are capable of representing all of the syntactic phenomena that have been claimed to exist in natural language.
We present a learning algorithm for a large subclass of these grammars, which includes all regular languages but not all context-free languages. This algorithm relies on a generalisation of the notion of distribution, as a function from tuples of strings to entire sentences; we define nonterminals using finite sets of these functions. Our learning algorithm uses a non-probabilistic learning paradigm which allows for membership queries as well as positive samples; it runs in polynomial time.
Keywords
Mildly context-sensitive · Grammatical inference · Semilinearity

1 Introduction and motivation
Natural languages present some particular challenges for machine learning—primarily the fact that the classes of representations that will ultimately be required for a satisfactory description of natural language syntax are clearly much richer than the simple Markov models that underpin much machine learning work. Indeed, even context-free grammars are insufficiently powerful to represent many linguistic phenomena. Accordingly, it is important to develop learning algorithms that are capable of learning the sorts of dependencies that we observe in natural languages.
We situate this problem in the field of grammatical inference. In its purest form, we are interested in algorithms which receive as input a sequence of strings of symbols, and are required to infer a grammar that represents a formal language: a set of strings that is typically infinite. We recall at this point the classic negative results of Gold (1967), who considered the situation where the learner only has access to the strings that are in the language, and there are no non-trivial restrictions on the sequences of examples that the learner must learn from. Within such a restrictive framework we can only learn classes of languages that have some language-theoretic closure properties; see for example the structurally very similar algorithms for learning subclasses of regular languages given by Angluin (1982), for learning subclasses of context-free grammars by Clark and Eyraud (2007), and for subclasses of multiple context-free grammars (mcfgs) by Yoshinaka (2011a). These classes of languages become increasingly limited as we ascend the hierarchy: indeed, in the case of the mcfg learning approach, even some languages consisting of only a single string are not learnable. It seems important therefore to consider a somewhat weaker and less restrictive learning model. As an alternative to a probabilistic learning model, which enlarges the class of languages that can be learned by limiting the data sequences and the convergence criterion in a realistic way, we approach this problem by allowing the learner an additional very high quality source of information: we consider an active learning model where the learner can ask membership queries. In other words, the learner is not entirely a passive recipient of examples but can query an oracle as to whether a particular string of symbols is in the language or not. This takes us partly towards the minimally adequate teacher model (mat) introduced by Angluin (1987), also called exact query learning. We do not, however, go this far—we keep a stream of positive examples as part of the learning model.
We are motivated at a high level by a desire to understand first language acquisition—in general by attempting to operationalise and extend the discovery procedures of American structuralist linguistics, for which we use the umbrella term “distributional learning”. Clearly the situation of language acquisition is quite different from the highly idealised learning models we consider in this paper, notably in that the child learner can interact with the environment in a number of ways not captured by the simple idealisation of a membership query and in the importance of semantics or meaning in the acquisition process. This raises a number of methodological issues that are outside the scope of this paper: see Clark and Lappin (2011) for a further justification of the approach in this article, but see also Berwick et al. (2011) for an opposing view.
Semilinearity is intended to be an approximate characterization of the linguistic intuition that sentences of a natural language are built from a finite set of clauses of bounded structure using certain simple linear operations.
These two boundaries are clearly related because of the following theorem: a language is semilinear iff it is letter-equivalent to a regular language. Clearly the class of semilinear languages is not directly useful as it is uncountable and thus contains undecidable languages, but it serves to help demarcate the class(es) of mildly context-sensitive (mcs) languages (Joshi et al. 1991). All standardly used grammatical formalisms are semilinear—regular grammars, context-free grammars, multiple context-free grammars, tree adjoining grammars and so on all define subsets of the class of semilinear languages. Examples of non-semilinear languages include \(\{a^{2^{n}} \mid n > 0\}\) and \(\{a^{n^{2}} \mid n > 0\}\), which can be parsed in linear time, yet cannot be expressed by mcs formalisms. Formalisms that can define non-semilinear languages include Elementary Formal Systems (Smullyan 1961), Range Concatenation Grammars (Boullier 1999), Literal Movement Grammars (Groenink 1995) and the representation we will use in this paper, Parallel Multiple Context-Free Grammars (Seki et al. 1991), that we define in Sect. 3.
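To make concrete the point that such non-semilinear languages are nonetheless trivially efficient to recognise, here is a minimal sketch (our own illustration, not part of the paper's formalism) of a linear-time recogniser for \(\{a^{2^{n}} \mid n > 0\}\):

```python
def is_power_of_two_as(s):
    """Recognise { a^(2^n) | n > 0 }: non-semilinear, yet decidable in linear time."""
    if not s or set(s) != {"a"}:
        return False
    n = len(s)
    # n >= 2 is a power of two iff exactly one bit of n is set
    return n > 1 and (n & (n - 1)) == 0
```

A similar check on the length (testing whether \(n\) is a perfect square) handles \(\{a^{n^{2}} \mid n > 0\}\).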
Recent work in grammatical inference has made significant progress in learning semilinear, non-regular languages using representations such as context-free grammars (Clark and Eyraud 2007) and multiple context-free grammars (Yoshinaka 2011a). Crucially, these representations use only concatenation—substrings are combined, but never copied. The richer operations used by mcfgs are just generalisations of concatenation to tuples of strings; these include, for example, various types of intercalation where a string can be inserted into a gap in another string.
There is a broad consensus that natural language string sets are semilinear, and so attention has focused largely on properties of formalisms that generate semilinear languages. We review these formalisms in Sect. 2. However, there are a number of cases where linguistic data suggest that there are richer processes involved—processes that either require or might benefit from a more powerful formalism. These data, which we examine in detail in the second half of Sect. 2, are still controversial. However, regardless of the final determinations on these examples, it is still useful to have richer learning algorithms available: even if these formalisms are not strictly speaking necessary, the additional descriptive power that they give us may allow for a more succinct grammar than we could obtain with a semilinear formalism.
In this paper we extend distributional learning to the inference of non-semilinear languages; the major technical development is the extension of the notion of context. We give an intuitive explanation of this in Sect. 4, and present the technical details of the learning target and algorithm, together with the proof of its correctness, in Sect. 5.
2 Language theory and linguistics
For many years, a default assumption in computational linguistics has been that context-free grammars are more or less adequate for defining natural language syntax. In linguistics, on the other hand, the orthodox view has been that they are clearly inadequate. The starting point for this debate is invariably the Chomsky hierarchy (Chomsky 1956). While seminal, and an immensely important contribution, this hierarchy is now showing its age in a number of respects. Most importantly, since it predates the theory of computational complexity, there is no natural characterisation within this family of the class of efficiently recognizable languages—under standard assumptions this is the class ptime. Similarly, the class of context-sensitive grammars is far too powerful to be of any practical use, and moreover it is difficult to define classes between context-free grammars and context-sensitive grammars within this family of formalisms. Finally, it is hard to attach semantics to a top-down derivation: as Kracht (2011) argues, in our view convincingly, semantic interpretation can only be viewed naturally as a process of composition proceeding bottom-up.
Recently a broad consensus has been forming that it is more appropriate to base a language hierarchy on a family of bottom-up systems, as in Smullyan (1961), rather than on the top-down string rewriting systems of the original Chomsky hierarchy. Accordingly, in this paper we consider a family of formalisms that are bottom-up in this sense. A long-standing debate in computational linguistics concerns the exact language-theoretic position of natural languages, with respect to the original Chomsky hierarchy or its more modern derivatives. There are several important questions here that we wish to keep distinct. Suppose we have a natural language like English or Kayardild, and we assume that we can in some way stipulate a sharp dividing line between grammatical and ungrammatical utterances, giving a presumably infinite set of possible grammatical sentences. For each such language and a putative class of grammars \(\mathcal{G}\) we can ask the following questions. Firstly, whether the language, considered as a set of strings, lies in the class generable by the formalism—this is the weakest claim, and therefore a negative answer to this question provides the strongest evidence that \(\mathcal{G}\) is inadequate. Secondly, supposing that we can find a grammar that generates the right set of strings, we can ask further whether there is a grammar that generates the right set of structures. For example, we might have a language like Dutch, which is apparently “weakly” context-free, in the sense that the set of strings is a context-free language, but not strongly context-free, in the sense that no context-free grammar can generate an adequate set of structures. Finally, we can ask whether we can find a reasonably sized grammar that generates it.
It is important to remember that so far nobody has been able to construct an adequate grammar for any natural language in any formalism. Therefore our answers to these questions above will only be partial. We can come up with convincing negative answers: it is possible to show that a given formalism is inadequate using various mathematical techniques, typically exploiting closure properties of the formalism, such as closure under intersection with regular languages. However, we cannot at the moment come up with a definitive positive argument that a particular natural language is in a given class, since that would require producing a completely adequate grammar for that language, where adequacy is defined in one of the senses above. The absence of an argument showing that a formalism \(\mathcal{G}\) is inadequate can of course be taken as defeasible evidence that the class is adequate for natural language description. We review here the current status of this debate, and take an integrated view of the various examples that appear to be weakly or strongly beyond the expressive power of cfgs. Though our focus in this paper is on learning languages as sets of strings, we nonetheless want to be able to produce grammars that give reasonable structures.
Corresponding to this syntactic hierarchy of productions we have a strict hierarchy of grammars and languages, shown in Fig. 1. At the top we have the class ptime, which is the set of all languages that can be defined using grammars of this type (Ljunglöf 2005; Groenink 1997). In this paper we target the class of pmcfgs, which allows all but the conjunctive rules. If we allow only conjunction together with cfg rules, then we obtain the class of Conjunctive Grammars (Okhotin 2001). If we allow only mcfg rules, then we obtain the class of multiple context-free grammars (mcfgs), which are equivalent, modulo some minor technical details, to the class of Linear Context-Free Rewriting Systems (Vijay-Shanker et al. 1987). We note that the class of four convergent mildly context-sensitive formalisms studied by Vijay-Shanker and Weir (1994) is equivalent to a class intermediate between mcfg and cfg, which we mark with tag in the diagram; this is equivalent to the class of well-nested mcfgs of dimension 2; indeed there is a complex infinite hierarchy on the arc between cfg and mcfg (Seki et al. 1991). As we shall see, we will use two parameters, the dimension and the rank, to describe the classes of grammars that we use, together with a third which controls the degree of copying.
We now consider various phenomena which motivate the use of a formalism more powerful than cfgs.
2.1 Displacement/movement
A phenomenon which was taken to indicate the necessity for a formalism more powerful than context-free grammars is displacement or movement—typified by wh-movement in English. For example, we have the declarative sentence ‘John liked that restaurant’. We can form a question from this using what is called wh-movement: ‘Which restaurant did John like?’
We emphasize that these examples do not show that the set of strings is not a context-free language. One can stay within the class of context-free languages, and represent these phenomena by using a richer formalism that is capable of modeling these structures with richly structured nonterminals, using for example the metagrammar approach of gpsg (Gazdar et al. 1985). However, from a learnability point of view it seems desirable to learn these using a richer formalism which can directly represent the displacement, rather than folding it into some feature system. More precisely, within the framework of distributional learning, displacement causes a significant problem for the inference of context-free grammars, because it causes a potentially exponential increase in the number of nonterminals we need in a grammar. In addition, the sorts of derivation trees that we get for these examples seem to be inappropriate for supporting natural language interpretation.
2.2 Cross-serial dependencies
The example which definitively established that cfgs were not weakly adequate was the case of cross-serial dependencies in Swiss German (Huybregts 1984; Shieber 1985). We present here the data in a form very close to the original presentation. In the particular dialect of Swiss German considered by Shieber, the data concern a sequence of embedded clauses.
Let’s abstract this a little and consider a formal language for this non-context-free fragment of Swiss German. We have the following words or word types: \(N_{a}, N_{d}\), which are respectively accusative and dative noun phrases; \(V_{a}, V_{d}\), which are verb phrases that require accusative and dative noun phrases respectively; and finally \(C\), a complementizer which appears at the beginning of the clause. Thus the “language” we are looking at consists of sequences like \(C N_{a} V_{a}\) and \(C N_{d} V_{d}\) and \(C N_{a} N_{a} N_{d} V_{a} V_{a} V_{d}\), but crucially does not contain examples where the sequence of accusative/dative markings on the nouns differs from the sequence of requirements on the verbs. So it does not contain \(C N_{d} V_{a}\), because the verb requires an accusative and it only has a dative, nor does it include \(C N_{a} N_{d} V_{d} V_{a}\), because though there are the right number of accusative and dative arguments (one each) they are in the wrong order—the reverse order.^{2} More precisely, for a string \(w\) in \(\{V_{a},V_{d}\}^{+}\), we write \(\overline{w}\) for the corresponding string in \(\{N_{a},N_{d}\}^{+}\): formally we define \(\overline{V_{a}} = N_{a}\), \(\overline{V_{d}} = N_{d}\), \(\overline{V_{a} \alpha} = N_{a} \overline{\alpha}\) and \(\overline{V_{d} \alpha} = N_{d} \overline{\alpha}\). The sublanguage we are concerned with is the language \(L_{sg} = \{C \overline{w} w \mid w \in\{V_{a},V_{d}\}^{+}\,\}\). This language is obtained from the original language by intersection with a suitable regular language followed by a homomorphism relabelling the strings. Since context-free languages are closed under these operations, and \(L_{sg}\) is clearly not context-free, this establishes the non-context-freeness of the original language.
Example 1
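The string set \(L_{sg}\) is easy to check mechanically. The following sketch (with our own token encoding 'C', 'Na', 'Nd', 'Va', 'Vd' for the word types above) tests membership in \(L_{sg}\):

```python
def in_L_sg(s):
    """Membership test for L_sg = { C w-bar w | w in {Va,Vd}+ }.
    Tokens: 'C', 'Na', 'Nd', 'Va', 'Vd' (our own encoding of the word types)."""
    toks = s.split()
    if not toks or toks[0] != "C":
        return False
    rest = toks[1:]
    if len(rest) < 2 or len(rest) % 2 != 0:
        return False
    k = len(rest) // 2
    nouns, verbs = rest[:k], rest[k:]
    bar = {"Va": "Na", "Vd": "Nd"}  # each verb demands the matching noun case
    return all(v in bar for v in verbs) and nouns == [bar[v] for v in verbs]
```

Note that the case sequences must match position by position, which is exactly the cross-serial pattern a cfg cannot enforce.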
2.3 Copying
While semilinearity is generally considered to be a property that holds of all natural languages, there is an increasing amount of evidence that some phenomena take natural languages out of the class of semilinear languages. None of the arguments here is as conclusive as Shieber’s argument that natural languages are not weakly context-free (Shieber 1985), but they are nonetheless suggestive. In what follows we will describe fragments of languages that are not semilinear and will provide toy pmcfgs for these fragments.
2.4 Reduplication
Another important area where (non-recursive) copying operations may occur is morphology and phonology (Inkelas and Zoll 2005; Inkelas 2008). This can range from duplication of a limited amount of material at the beginning of a word to the complete copying of a full stem.
Language-theoretically we need to be clear about the exact status of this copying; we focus on the most interesting case of full-stem reduplication. Let us assume that we are dealing with a language where the plural is formed by duplicating the entire stem. There are three positions: (A) one could say that the lexicon is finite, and as a result the learner need only memorise the correct plural form for each word, so there is no language-theoretic issue at all; (B) one could say that it is sufficient to have a grammar which gives the copy language over some finite alphabet of phonemes; or finally (C) one could say that the learner must have a true primitive copying operation. There are two related differences between these three approaches—the first is the size of the grammars. The Type A learner will produce a grammar with one rule per lexical item; the Type B learner will have a grammar with one rule per phoneme; but the Type C learner has only one rule in total. The second difference is the rapidity of learning, or equivalently the generalisation ability of the algorithm. For example, if a Type A learner is confronted with an unseen word, then it will fail to generalise correctly, whereas a Type B learner will be able to generalise correctly. Correspondingly, if we give a Type B learner a new phoneme, it will fail to generalise—only the Type C learner has learned a real copying rule. Thus deciding whether it is appropriate to require a Type A, B or C learner in each case depends on the extent to which we require the learner to generalise. If we are interested in modeling language acquisition, then the three different learners make different predictions about behaviour: Type A learners will not be able to produce the correct plural for a novel word; Type B learners will be unable to generalise if presented with a word containing a novel phoneme; only Type C learners can generalise fully.
In the specific case of reduplication these three learners correspond to three different levels of the hierarchy: Type A learners can just use list grammars, Type B can use mcfgs, and Type C requires a pmcfg. We claim that a full explanation of acquisition may require a pmcfg acquisition model that has a true copying operation even if from a purely descriptive languagetheoretic approach we only need a much weaker model (see Chandlee and Heinz 2012 for an alternative view). Thus in this model we consider both full stem and partial reduplication as instances of the same copying process.
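The behavioural differences between the three learners can be sketched as follows. This is a toy illustration under our own naming; the stems and lexicon are hypothetical:

```python
# Three learners for full-stem reduplication (plural = stem copied twice).

def type_a_plural(lexicon, stem):
    """Type A: memorised plurals only (a list grammar); fails on unseen words."""
    return lexicon.get(stem)  # None for a novel stem

def type_b_plural(phonemes, stem):
    """Type B: one rule per phoneme (an mcfg-style grammar over a fixed
    alphabet); fails on a word containing a novel phoneme."""
    if all(p in phonemes for p in stem):
        return stem + stem
    return None

def type_c_plural(stem):
    """Type C: a single primitive copy rule (pmcfg-style); always generalises."""
    return stem + stem
```

Only the Type C learner treats copying as an operation independent of the particular lexicon or alphabet, which is the behaviour a pmcfg rule models directly.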
Example 2
Example 3
2.5 Case-stacking
Case-stacking (or Suffixaufnahme) is a comparatively rare phenomenon that occurs in some European and several Australian languages, such as Martuthunira and Kayardild. In languages with case-stacking, a single word may receive more than one case-marker—a suffix that indicates the grammatical status of a word. In languages without case-stacking, like German, a noun may receive a marking that depends on the case assigned by the verb; this serves to indicate the syntactic relationship between the noun and the verb which governs it. In English, for example, which has only a vestigial case system, in the noun phrase “John’s dog” the proper noun “John” bears a genitive suffix or clitic which indicates that John is the possessor of the dog. In the noun phrase “John’s father’s dog” (the dog of the father of John), the word “John” is embedded more deeply but still receives only one suffix.

maku-wa yalawu-jarra yakuri-na dangka-karra-nguni-na mijil-nguni-na

woman-NOM catch-PST fish-MABL man-GEN-INSTR-MABL net-INSTR-MABL

The woman caught fish in the man’s net.
Here the word ‘dangka’ (man) receives three case markers—karra (GEN, a genitive marker), nguni (INSTR, an instrumental case marker) and na (MABL, a modal ablative, which indicates past tense).
A particular example of Suffixaufnahme has received recent theoretical attention: Old Georgian is a now-dead language that has a particularly extreme form of suffix-stacking. The exact status of the data is controversial (Michaelis and Kracht 1997; Bhatt and Joshi 2004); unfortunately, since Old Georgian is extinct, there is no way of verifying the exact data. Here we assume that the arguments are valid.
\(\{nv,\; nngv,\; nngnggv,\; nngnggngggv,\;\ldots\}\)

More formally, defining \(u_{i} = n g^{i}\), this is the language \(L_{\mathrm{OG}} = \{ n u_{1} \cdots u_{k} v \mid k \ge 0 \}\). This is not semilinear, since the total number of occurrences of \(g\) is a quadratic function of the number of \(n\)s in the string. We can describe this string set with the following grammar.
Example 4
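As a sanity check on the definition of \(L_{\mathrm{OG}}\), here is a small membership tester (our own sketch, independent of the grammar in Example 4):

```python
def in_L_OG(s):
    """Membership for L_OG = { n u_1 ... u_k v | k >= 0 }, where u_i = n g^i."""
    if not (s.startswith("n") and s.endswith("v")):
        return False
    body = s[1:-1]  # drop the leading n and the trailing v
    i = 1
    while body:
        block = "n" + "g" * i  # the next suffix-stacked block u_i
        if not body.startswith(block):
            return False
        body = body[len(block):]
        i += 1
    return True
```

The loop consumes blocks \(u_1, u_2, \ldots\) of strictly growing length, mirroring the quadratic growth in \(g\)s noted above.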
2.6 Yoruba
Kobele (2006) argues that Yoruba, a Nigerian language, has a certain type of recursive copying in relative clauses. Yoruba can form relative clauses by copying entire verb phrases—the verb phrases can contain nouns which can themselves have relative clauses; the end result is a language which, under intersection with a suitable regular language and after applying a homomorphism, gives the language \(\{a^{2^{n}} \mid n \geq0\}\). This means that Yoruba cannot be adequately represented by an mcfg. Yoruba has noun phrases of the form (Example 4.48 of Kobele 2006) ‘rira NP ti Ade ra NP’ (the fact that Ade bought NP), where NP is a noun phrase which must be copied; the two occurrences of NP must be identical.^{3}
Noun phrases can also be formed into sentences like ‘Ade ra NP’ (Ade bought NP) or ‘NP ko da’ (NP is not good).
Example 5
We can verify that this language is not semilinear by counting the number of occurrences of t in grammatical sentences. Similar phenomena occur widely in other West African languages, such as Wolof, though Yoruba has perhaps the most complex system of this type.
2.7 Chinese number names
In Mandarin Chinese, a certain subset of number names can be formed from ‘wu’ (5) and ‘zhao’ (10^{12}). Here we will write a for ‘wu’ and b for ‘zhao’. The well-formed expressions, intersected with a suitable regular language, form the language \(L_{\mathrm{CN}}= \{ a b^{k_{1}} a b^{k_{2}} \cdots a b^{k_{n}} \mid k_{1} > k_{2} >\cdots> k_{n} \ge0 \}\). These data are controversial, as it is not clear whether the well-formedness of number expressions should form part of the syntax of the language. Here we assume that it does, in which case the language is not semilinear (Radzinski 1991). A grammar for this is:
Example 6
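Membership in \(L_{\mathrm{CN}}\) amounts to checking that the lengths of the b-runs are strictly decreasing; a small sketch of our own:

```python
def in_L_CN(s):
    """Membership for L_CN = { a b^k1 a b^k2 ... a b^kn | k1 > k2 > ... > kn >= 0 }."""
    if not s or s[0] != "a" or set(s) - {"a", "b"}:
        return False
    # lengths of the maximal b-runs following each a
    runs = [len(r) for r in s.split("a")[1:]]
    return all(runs[i] > runs[i + 1] for i in range(len(runs) - 1))
```

The strict descent \(k_1 > k_2 > \cdots\) is what blocks semilinearity: the shortest string with \(n\) occurrences of a has quadratically many bs.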
3 Preliminaries
We now define our formalism more precisely, starting with some basic definitions.
The sets of nonnegative and strictly positive integers are denoted by \(\mathbb{N}\) and \(\mathbb{N}_{+}\), respectively. A sequence over an alphabet Σ is called a word. The empty word is denoted by λ. Σ^{∗} denotes the set of all words and Σ^{+}=Σ^{∗}−{λ}. Any subset of Σ^{∗} is called a language (over Σ). An m-word is an m-tuple of words and we denote the set of m-words by \(\mathcal{S}_{m}\) for \(m \in\mathbb{N}\). Any m-word is a multi-word. We define \(\mathcal{S}_{\le m} = \bigcup_{i \le m} \mathcal{S}_{i}\) and \(\mathcal{S}_{*} = \bigcup_{i \in\mathbb{N}} \mathcal{S}_{i}\). We note that the only 0-word is the empty tuple. We usually identify a 1-word (u) with the word u, and a string of length 1 with an element of Σ.
3.1 Parallel multiple context-free grammars
A ranked alphabet is a pair 〈N,dim〉 of an alphabet N and a function \(\mathrm{dim}:N \to\mathbb{N}_{+}\). The number dim(A) is called the dimension of A. We often simply express a ranked alphabet 〈N,dim〉 by N if no confusion arises. By N _{ d } we denote the subset of N whose elements have dimension d.
Seki et al. (1991) introduced parallel multiple context-free grammars (pmcfgs) as a generalization of context-free grammars. A pmcfg is a tuple G=〈Σ,N,S,P〉 where Σ is an alphabet whose letters are called terminals, N is a ranked alphabet whose elements are called nonterminals, S∈N is a special nonterminal of dimension 1 called the start symbol, and P is a set of production rules.
Example 7
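As a concrete illustration of how a copying rule drives a pmcfg derivation, the following sketch enumerates the language of the one-nonterminal grammar \(G_{1}\) of Example 7, which we reconstruct as the two rules \(S(a)\) and \(S(x_{1}x_{1}) :\!- S(x_{1})\); the encoding is our own, not the paper's notation:

```python
def generate_G1(max_len):
    """Bottom-up closure of the two rules of G1, a pmcfg for { a^(2^n) | n >= 0 }:
         S(a)                    -- terminating rule
         S(x1 x1) :- S(x1)       -- the copying rule duplicates its argument
    Returns all derivable strings of length at most max_len."""
    derived = {"a"}          # from the terminating rule
    frontier = {"a"}
    while frontier:
        # apply the copying rule to everything newly derived
        nxt = {u + u for u in frontier if 2 * len(u) <= max_len}
        frontier = nxt - derived
        derived |= frontier
    return sorted(derived, key=len)
```

Because each application of the second rule doubles the string, the derivable lengths are exactly the powers of two, which is how a pmcfg escapes semilinearity with a single rule.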
The following lemma states that the pattern in a rule is easily reconstructed from a word derived using that rule.
Lemma 1
Suppose that a rule \(B(\boldsymbol{\pi}) :\!- B_{1}(x_{1}),\ldots,B_{k}(x_{k})\) is used to obtain \(u \in\mathcal{L}(G)\). Then \(u\) can be represented as \(u=\pi_{0}[\theta(\boldsymbol{\pi})]\) for some \(\pi_{0} \in\mathcal{C}_{\dim(B),1}\) and some substitution \(\theta\) whose domain consists of exactly the variables of \(x_{1},\ldots,x_{k}\).
Proof
(Sketch) Suppose that we have \(\vdash_{G} B(\theta(\boldsymbol{\pi}))\) following \(\vdash_{G} B_{i}(v_{i})\) for \(i=1,\ldots,k\), where \(\theta=[v_{1},\ldots,v_{k}]\). The derivation process consists solely of concatenation operations: strings obtained during the derivation process are never deleted or split. Therefore it is easily seen that if some derivation tree for \(\vdash_{G} A(u)\) contains the derivation tree corresponding to \(\vdash_{G} B(\theta(\boldsymbol{\pi}))\), then \(u\) can be represented as \(u=\pi'[\theta(\boldsymbol{\pi})]\) for some tuple of patterns \(\pi'\). In particular, for \(u \in\mathcal{L}(G)\), we have \(u = \pi_{0}'[\theta(\boldsymbol{\pi})]\) for some \(\pi_{0}' \in\mathcal{C}_{d,*}\) with \(d=\dim(B)\). By replacing all but one occurrence of each variable \(x\) by \(\theta(x)\) in \(\pi_{0}'\), one obtains \(\pi_{0} \in\mathcal{C}_{d,1}\) with the desired property. □
We denote by \(\mathbb{G}(p,q,r)\) the class of pmcfgs such that the dimension of every nonterminal is at most p, and every production rule has at most q nonterminals on the right-hand side and is r-copying. For the grammars in Example 7, we have \(G_{1} \in\mathbb {G}(1,1,2)\), \(G_{2} \in\mathbb{G}(2,1,2)\) and \(G_{3} \in\mathbb{G}(2,2,2)\).
Theorem 1
(Seki et al. 1991)
The uniform membership problem for \(\mathbb{G}(p,q,r)\) is solvable in polynomial time, where the degree of the polynomial is linear in pq.
4 Intuition
We will now give an informal introduction to the extension of distributional learning to these formalisms; the basic idea is quite natural but may be obscured by the unavoidable complexity of the notation.
In distributional learning we typically consider a context (l,r) which we can wrap around a substring u to give a complete string lur. Consider this context instead as a function f from a substring to a full sentence, \(u \mapsto lur\), which in our notation is represented by what we call a 1-copying 1-context \(l x_{1} r\), an element of \(\mathcal{C}_{1,1}\).
In the derivation of a string with respect to a cfg, these functions correspond to the operation that takes the yield of a nonterminal and integrates it into the rest of the sentence: given a derivation like \(S \overset{*}{\Rightarrow}lNr \overset{*}{\Rightarrow}lur\), we can consider the derivation \(S \overset{*}{\Rightarrow}lNr\) to be applying a function \(f \in\mathcal{C}_{1,1}\) to the yield of N.
In a parallel cfg, we might again have a nonterminal N that derives a string u. However, the part of the derivation that produces the whole sentence from u may include rules that copy u.
Example 8
Therefore, with this richer class of grammars we need to consider a larger class of functions that correspond to r-copying contexts, and, when we consider tuples of strings in the full pmcfg formalism, to r-copying d-contexts: the class \(\mathcal{C}_{d,r}\). Given such a set of functions, we can consider the ‘distribution’, in this extended sense, of a substring in a language to be the set of functions that, when applied to that substring, give an element of the language.
In this paper we use a dual approach—the nonterminals are defined by small finite sets of patterns/functions, and incorrect rules will be eliminated by strings or tuples of strings.
In Example 8, we can see how the nonterminals that we need can be picked out. The symbol S will correspond as usual to the single simple pattern \(x_{1}\)—the identity function, which corresponds to the empty context (λ,λ) in distributional learning of cfgs (e.g. Clark and Eyraud 2007). The symbol W corresponds to the 2-copying context \(x_{1} x_{1}\). It is easy to see that the set of strings generated by W is exactly the set of strings which can occur in the context \(x_{1} x_{1}\). In the notation we defined earlier, we have a singleton set \(C_{W}=\{x_{1} x_{1}\}\) such that \(\mathcal{L}(G,W) = \{v \in\varSigma^{*} \mid C_{W}[v] \subseteq\mathcal{L}(G)\}\). Note that for any grammar, the start symbol S will be characterised by the single context \(x_{1}\). We shall show that languages that have this nice property—that the language defined by each nonterminal can be picked out by a small set of contexts—are learnable by a straightforward algorithm, very similar to ones that have been used before for distributional learning of cfgs.
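The extended notion of distribution can be made concrete with a membership oracle. In the sketch below (helper names are our own; contexts are written as string patterns over the variable x1), we collect the candidate substrings that fit a possibly copying context; for illustration we query an oracle for \(\{a^{2^{n}} \mid n \ge 0\}\) rather than the grammar of Example 8:

```python
def strings_fitting_context(oracle, candidates, context):
    """Return the candidates u for which context[u] is in the language.
    `context` is a string pattern over the variable 'x1'; a copying context
    such as 'x1x1' mentions x1 more than once.  `oracle(s)` is a membership query."""
    return {u for u in candidates if oracle(context.replace("x1", u))}

# Illustration: a membership oracle for { a^(2^n) | n >= 0 }.
oracle = lambda s: bool(s) and set(s) == {"a"} and (len(s) & (len(s) - 1)) == 0
candidates = {"a" * i for i in range(1, 10)}
```

Here the 2-copying context 'x1x1' and the non-copying context 'ax1' pick out different string sets from the same candidates, which is exactly the extra discriminating power the learner exploits.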
4.1 Algorithm description
We will now give an informal description of the algorithm. The algorithm receives some examples of strings from a target language: we use D to denote the finite set of examples. In addition, we assume that the learner can ask membership queries (mqs). This means that the learner, in addition to passively observing the examples it is given, can construct a string and query whether it is in the language being learned or not. We discuss the implications of this choice of learning model below in Sect. 5.2.
This rule is correct if, when we apply the pattern π to each of the strings in \(C_{i}^{(K)}\), the result is in fact in \(C_{0}^{\dagger}\). So we take all of the strings or tuples of strings that correspond to the nonterminals on the right-hand side of the rule, combine them using the recipe in the pattern π, and then test, using membership queries, that the resulting tuple can occur in each of the contexts in \(C_{0}\). The final grammar that we construct consists of all rules that pass this test.
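The rule-testing step can be sketched as follows, for the simplified case of dimension-1 nonterminals (function and parameter names are our own; `mq` stands for the membership oracle):

```python
from itertools import product

def validate_rule(mq, pattern, arg_sets, contexts):
    """Keep a candidate rule only if, for every combination of argument strings
    drawn from `arg_sets` (one set per right-hand-side nonterminal), the string
    built by `pattern` passes every context test under the membership oracle `mq`."""
    for args in product(*arg_sets):
        u = pattern(*args)
        if not all(mq(ctx(u)) for ctx in contexts):
            return False  # one failing membership query eliminates the rule
    return True
```

For instance, with an oracle for \(\{a^{2^{n}}\}\) and the identity context for the start symbol, the copying pattern \(x \mapsto xx\) survives this test while plain concatenation \((x,y) \mapsto xy\) is eliminated.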
5 Learning target and algorithm
Definition 2
By \(\mathbb{G}(p,q,r,s)\) we denote the subclass of \(\mathbb {G}(p,q,r)\) of grammars with the (r,s)-fcp. The class of languages generated by grammars in \(\mathbb{G}(p,q,r,s)\) is denoted by \(\mathbb{L}(p,q,r,s)\).
Clearly the above definition is a generalisation of the sfcp (Clark 2010). All regular languages are in \(\mathbb{L}(1,1,1,1)\) and the Dyck language is in \(\mathbb{L}(1,2,1,1)\).
Recall the grammar \(G_{1}\) in Example 7, which has only one nonterminal symbol. We have \(\mathcal{L}(G_{1}) = \{ a^{2^{n}} \mid n \ge0 \} \in\mathbb {L}(1,1,2,1)\), since the 1-context \(x_{1}\) always characterises the start symbol S.
There are context-free languages which are not in \(\mathbb {L}(p,q,r,s)\) for any values of p, q, r, s: for example, the language {a^n b^m ∣ n, m > 0, m ≠ n}. This is because the required nonterminals cannot be picked out by any finite set of contexts.
5.1 Linguistic examples
We now consider the examples in Sect. 2 with respect to this representational assumption.
In the case of Swiss German, we do not need any copying operations, but we do need to be able to define the two nonterminals in the grammar in Example 1: S, which is of dimension 1, and D, which is of dimension 2. S is trivially definable using the single context x_1. D can be defined using the single context π = C N_a x_1 N_d V_a x_2 V_d, which is a 1-copying 2-context. We can easily verify that {π}^†, the set of 2-words that can occur in this context, is the set \(\{ (\bar{w}, w) \mid w \in\{V_{a},V_{d}\}^{*} \}\). This differs slightly from the set of strings generated by the nonterminal D in the specific grammar we defined earlier, in that it includes the empty 2-word (λ, λ). However, we can modify that grammar slightly to obtain the following slightly larger grammar:
Example 9
In the Old Georgian case, Example 4, we can use the single context π = x_1 x_2 n x_2 g v to pick out the multiwords generated by the nonterminal N of dimension 2. We recall that \(\mathcal{L}(G,N) = \{ (n,\lambda) \} \cup\{ (nu_{1} \cdots u_{k} n , g^{k+1}) \mid k \geq0 \}\). Note that π is a 2-copying 2-context: the variable x_2 occurs twice. Suppose that (w_1, w_2) ∈ {π}^†. This means that w_1 w_2 n w_2 g v ∈ L. By considering the number of occurrences of the symbols n and g in w_1 and w_2, we can verify that this can only happen if w_2 is a string of zero or more occurrences of g, and therefore if \((w_{1},w_{2}) \in\mathcal{L}(G,N)\). Moreover we can see that \(\pi[\mathcal {L}(G,N)] \subseteq\mathcal{L}(G)\). Therefore we have \(\{ \pi\}^{\dagger} = \mathcal{L}(G,N)\), as desired.
In the case of the duplication in Indonesian, as shown in Examples 2 and 3, we can in both cases find appropriate sets of contexts. Taking the mcfg grammar first, the single context x_1 x_2 does not suffice to pick out the nonterminal P of dimension 2, since {x_1 x_2}^† includes 2-words like (ha, khak) as well as the desired 2-words. Indeed, no single 1-copying 2-context can pick out exactly the right set of 2-words. Suppose we have a 1-copying 2-context, which will be of the form u x_1 v x_2 w for some strings u, v, w; then {u x_1 v x_2 w}^† will include, for any string y, both the 2-words (y, wuyv) and (vywu, y). However, we can pick out the correct set of 2-words using two distinct contexts: if we define C = {a x_1 a x_2, b x_1 b x_2}, then C^† = {(w,w) ∣ w ∈ Σ^+}. Therefore this grammar is in the class \(\mathbb{G}(2,1,1,2)\) and the language is in \(\mathbb{L}(2,1,1,2)\).
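The claim that the pair of contexts C = {a x_1 a x_2, b x_1 b x_2} picks out only equal pairs can be checked by brute force over short strings. The sketch below is ours and assumes a bare copy language {ww ∣ w ∈ {a,b}^+} as the target, rather than the full Indonesian grammar of Example 2.

```python
from itertools import product

def member(s):  # oracle for the copy language { w w | w in {a,b}+ }
    n = len(s)
    return n > 0 and n % 2 == 0 and s[:n // 2] == s[n // 2:]

# C = { a x1 a x2 , b x1 b x2 } as functions from 2-words to sentences.
C = [lambda v: "a" + v[0] + "a" + v[1],
     lambda v: "b" + v[0] + "b" + v[1]]

# Brute-force approximation of C-dagger over short 2-words.
words = ["".join(t) for k in range(3) for t in product("ab", repeat=k)]
dagger = [(w1, w2) for w1 in words for w2 in words
          if all(member(c((w1, w2))) for c in C)]

# Every surviving pair has equal components; a single context would not
# achieve this (e.g. ("", "aa") passes the a-context but fails the b-context).
print(all(w1 == w2 for (w1, w2) in dagger))  # True
```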
If we now consider the pmcfg grammar for the same language, shown in Example 3, we can define the nonterminal P of dimension 2 using the single 2-copying 1-context x_1 x_1. Therefore this grammar is in \(\mathbb{G}(1,2,2,1)\); indeed, although we have written it using a grammar with a rule with two nonterminals on the right-hand side, we could also have used a simpler grammar using only rules of rank 1, which would give a grammar in \(\mathbb{G}(1,1,2,1)\).
In the Yoruba case, Example 5, the NP class is picked out by the context x _{1} v _{ f } and the VP class (just the symbol v _{ f } in this trivial example) by the context nx _{1}. Therefore the grammar belongs to \(\mathbb{G}(2,2,2,1)\).
5.2 Learning model
The learner receives a presentation of positive data in the identification-in-the-limit paradigm. We assume that the learner has, in addition, access to an oracle answering membership queries (mqs), i.e. telling the learner whether an arbitrary string u belongs to the learning target L_∗. See for example Yoshinaka (2010) for details.
A positive presentation of a language L_∗ over Σ is an infinite sequence of words w_1, w_2, … ∈ Σ^∗ such that L_∗ = {w_i ∣ i ≥ 1}. A learner is given a positive presentation of the language \(L_{*} = \mathcal{L}(G_{*})\) of the target grammar G_∗, and each time a new example w_i is given, it outputs a grammar G_i computed from w_1, …, w_i with the aid of a membership oracle. One may query the oracle whether an arbitrary string w is in L_∗, and the oracle answers in constant time. We say that a learning algorithm identifies G_∗ in the limit from positive data and membership queries if for any positive presentation w_1, w_2, … of \(\mathcal{L}(G_{*})\), there is an integer n such that G_n = G_m for all m ≥ n and \(\mathcal{L}(G_{n}) = \mathcal{L}(G_{*})\). Trivially, every individual grammar admits a successful learning algorithm: the learner that always outputs that grammar identifies it. An algorithm should instead learn a rich class of grammars in a uniform way. We say that a learning algorithm identifies a class \(\mathbb {G}\) of grammars in the limit from positive data and membership queries if and only if it identifies all \(G \in\mathbb{G}\).
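The shape of this protocol can be sketched as an outer loop. For illustration only, the "grammar" conjectured below is simply the finite set D of examples seen so far (which identifies any finite language in the limit); the actual learner of Sect. 5 instead builds \(\mathcal{G}(K,F)\) from substrings and contexts of D.

```python
def learn(presentation, member):
    """Skeleton of the protocol: positive presentation plus MQs.

    At every step the learner may call the membership oracle `member`;
    here we merely use it to sanity-check the presentation, and we
    conjecture the trivial 'grammar' D (the data seen so far).
    """
    D = set()
    conjectures = []
    for w in presentation:
        assert member(w)        # every presented word is in the target
        D.add(w)
        conjectures.append(frozenset(D))  # G_i, computed from w_1..w_i
    return conjectures

# A finite target, so the trivial conjecture converges.
target = {"a", "aa", "aaaa"}
member = lambda s: s in target
gs = learn(["a", "aa", "aaaa", "aa", "a"], member)
print(gs[-1] == frozenset(target))  # True: the conjecture has converged
```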
We remark that since we have membership queries, learning algorithms based on exhaustive enumeration will work; hence a learner should have further properties in terms of efficiency. Accordingly we require the learner to operate in polynomial time: we assume that membership queries are answered in constant time (or, equivalently, in time polynomial in the length of the query). Thus at each step t, the learner can only use time bounded by a polynomial in the total size of the data seen so far, ∥w_1∥ + ⋯ + ∥w_t∥.
We note that our interest here is in showing that a particular algorithm is correct and efficient, and not in showing that a particular class of languages is learnable with respect to a particular learning model. Indeed, this learning model is not restrictive, in the sense that trivial enumerative algorithms using a delaying trick can also learn the classes \(\mathbb {L}(p,q,r,s)\) we discuss in this paper. The algorithm we present does not use such tricks.
5.3 Construction of hypothesis
Hereafter we arbitrarily fix rather small natural numbers p,q,r,s≥1 and a target language \(L_{*} \in\mathbb{L}(p,q,r,s)\) to be learnt.
We would like each nonterminal 〚C〛 to generate C^† = {v ∣ C[v] ⊆ L_∗}. If a grammar G_∗ generating the target language L_∗ has a nonterminal A which is characterised by a context set C_A with C_A ⊆ F, then the nonterminal 〚C_A〛 of our grammar should be used to simulate A. Recall that the start symbol of any grammar is characterised by {x_1}, which is why 〚{x_1}〛 is the start symbol of our grammar.

A potential rule of the form (2), 〚C_0〛(π) :− 〚C_1〛(x_1), …, 〚C_k〛(x_k), must satisfy the following conditions:

0 ≤ k ≤ q,

π is a d_0-tuple of patterns whose concatenation is an r-copying d-context,

|x_i| = d_i for each i = 1, …, k, and the variables from x_1, …, x_k constitute X_π,

there are \(\mathbf{v}_{i} \in\mathcal{S}_{d_{i}}\) for i = 1, …, k and \(\pi_{0} \in\mathcal{C}_{d_{0},1}\) such that π_0[π[v_1, …, v_k]] ∈ F_0,

the inclusion (4) holds.
Example 10
Example 11
If (n,g)∈K, the first rule does not satisfy (4) since x _{1} x _{1} x _{2} v[n,g]∈L _{OG} and x _{1} x _{1} x _{2} v[n,gnng]∉L _{OG}.
If (nn,g)∈K, the second rule does not satisfy (4) since x _{1} x _{2} nx _{2} gv[nn,g]∈L _{OG} and x _{1} x _{1} x _{2} v[nn,g]∉L _{OG}.
The third rule always satisfies (4) whatever K is.
Lemma 2
One can construct \(\mathcal{G}(K,F)\) in polynomial time in ∥D∥.
Proof
By F ⊆ Con_{≤p,r}(D) and K ⊆ Sub_{≤p}(D), ∥F∥ and ∥K∥ are bounded by a polynomial in ∥D∥ whose degree is linear in pr and p, respectively. For each nonterminal 〚C〛, we have |C| ≤ s, which implies that the number \(\hat{N}\) of nonterminals satisfies \(\hat{N} \le |F|^{s}\).
We first estimate the number of potential rules of the form (2). By k ≤ q, at most \((\hat{N}+1)^{q+1}\) combinations of nonterminals are possible. It remains to count the number of possible π. There must exist u ∈ F_0 ⊆ D, v_i ∈ Sub_{≤p}(u) for i = 1, …, k and π_0 ∈ Con_{≤p,1}(u) such that u = π_0[π[v_1, …, v_k]]. Determining π can be seen as determining the left and right positions in u to which each occurrence of a variable in π_0 and π corresponds. Note that π_0 and π contain at most p and pqr occurrences of variables, respectively. Thus we extract at most |u|^{2(p+pqr)} variants of π from a string u ∈ F_0. Therefore, we have at most \((\hat{N}+1)^{q+1}|F_{0}|\ell^{2(p+pqr)}\) production rules in \(\mathcal{G}(K,F)\), where ℓ is the length of a longest word in F_0.
The algorithm verifies whether the potential rules satisfy the last condition (4). To compute \(C_{i}^{(K)}\), we call the membership oracle on at most |C_i| ⋅ |K| words. To see whether (4) holds, it is enough to check membership of at most \(|C_{0}|\prod_{1 \le i \le k}|C_{i}^{(K)}| \le s|K|^{q}\) words.
All in all, one can compute \(\mathcal{G}(K,F)\) in polynomial time in ∥D∥, where the degree of the polynomial linearly depends on pqrs. □
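These bounds rest on the fact that K and F are drawn from substrings and contexts of the data. A sketch, for the simplest case p = 1 and dimension 1 (function names are ours), of the extraction sets and their quadratic size per word:

```python
def sub1(D):
    """Sub_<=1(D): all non-empty substrings of words in D (dimension-1
    multiwords; for p > 1 one would take p-tuples of such substrings)."""
    return {w[i:j] for w in D
            for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

def con1(D):
    """1-contexts u x1 v obtained by splitting out one non-empty
    substring of a word in D, represented as (u, v) pairs."""
    return {(w[:i], w[j:]) for w in D
            for i in range(len(w)) for j in range(i + 1, len(w) + 1)}

D = ["abab"]
# Both sets have at most l^2 elements per word of length l, in line
# with the polynomial bounds used in the proof of Lemma 2.
print(len(sub1(D)) <= len("abab") ** 2)   # True
print(("a", "b") in con1(D))              # True: abab = a . ba . b
```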
Just like the algorithms based on syntactic concept lattices (e.g. Clark 2010), we establish the following monotonicity lemma: expansion of F expands the hypothesised language while expansion of K shrinks the hypothesised language.
Lemma 3
(Monotonicity) Let \(\hat{G} = \mathcal{G}(K,F)\) and \(\hat{G}' = \mathcal{G}(K',F')\).
 (1)
If K⊆K′ and F=F′, then \(\mathcal{L}(\hat{G}) \supseteq\mathcal{L}(\hat{G}')\).
 (2)
If K=K′ and F⊆F′, then \(\mathcal{L}(\hat{G}) \subseteq\mathcal{L}(\hat{G}')\).
Proof
(1) Every rule of \(\hat{G}'\) is also a rule of \(\hat{G}\). (2) Every rule of \(\hat{G}\) is also a rule of \(\hat{G}'\). □
Lemma 4
Every context set F admits a multiword set K, of cardinality polynomial in ∥F∥, such that \(\hat{G} = \mathcal{G}(K,F)\) has no incorrect rules.
Proof
Suppose that a rule 〚C_0〛(π) :− 〚C_1〛(x_1), …, 〚C_k〛(x_k) is incorrect. Then there exist multiwords \(\mathbf{v}_{i} \in C_{i}^{\dagger}\) for i = 1, …, k such that C_0[π[v_1, …, v_k]] ⊈ L_∗. If v_1, …, v_k ∈ K, such a rule is suppressed. That is, at most q multiwords suffice to get rid of each incorrect rule. Recall from the proof of Lemma 2 that the number of potential rules is bounded by a polynomial in ∥F∥ and ℓ, where ℓ = max{ |u| ∣ u ∈ F_0 }. This proves the lemma. □
We say that K is fiducial on F (with respect to L _{∗}) if \(\mathcal{G}(K,F)\) has no incorrect rules. If K is fiducial on F, then so is every superset of K by definition.
Lemma 5
If \(\hat{G} = \mathcal{G}(K,F)\) has no incorrect rules, then \(\mathcal{L}(\hat{G}) \subseteq L_{*}\).
Proof
We show by induction that whenever \(\hat{G}\) derives a multiword v from the nonterminal 〚C〛, we have C[v] ⊆ L_∗. In particular, when v is derived from the start symbol 〚{x_1}〛 of \(\hat{G}\), this gives v ∈ L_∗.
Suppose that \(\hat{G}\) derives π[v_1, …, v_k] from 〚C_0〛 by the rule 〚C_0〛(π) :− 〚C_1〛(x_1), …, 〚C_k〛(x_k), where v_i is derived from 〚C_i〛 for i = 1, …, k. By the induction hypothesis we have C_i[v_i] ⊆ L_∗ for i = 1, …, k. (When k = 0, this is the base case.) Since the rule is correct, we have C_0[π[v_1, …, v_k]] ⊆ L_∗. □
Let \(G_{*} = \langle\varSigma,N_{*},P_{*},S_{*} \rangle \in\mathbb {G}(p,q,r,s)\) generate L _{∗}. We say that F is adequate (with respect to G _{∗}) if F includes a characterising set C _{ A } for every nonterminal A∈N _{∗} and F _{0} contains a string \(v_{\rho}\in\mathcal{L}(G_{*})\) derived by using ρ for every rule ρ∈P _{∗}. If F is adequate, then so is every superset of F.
Lemma 6
If F is adequate, then \(L_{*} \subseteq\mathcal{L}(\mathcal {G}(K,F))\) for any K.
Proof
Let \(\hat{G} = \mathcal{G}(K,F)\). For a rule A_0(π) :− A_1(x_1), …, A_k(x_k) of G_∗, let \(C_{i} \subseteq F_{\mathrm{dim}(A_{i})}\) be a characterising set of A_i for i = 0, …, k. By the assumption, there are \(\mathbf{v}_{i} \in\mathcal{L}(G_{*},A_{i})\) for i = 1, …, k and a dim(A_0)-pattern π_0 such that \(\pi_{0}[\boldsymbol{\pi}[\mathbf{v}_{1},\ldots,\mathbf{v}_{k}]] \in\mathcal{L}(G_{*}) \cap F_{0}\) by Lemma 1. Thus 〚C_0〛(π) :− 〚C_1〛(x_1), …, 〚C_k〛(x_k) is a potential rule. For any \(\mathbf{u}_{i} \in C_{i}^{(K)} = \mathcal{L}(G_{*},A_{i}) \cap K\), we have \(\boldsymbol{\pi}[\mathbf{u}_{1},\ldots,\mathbf{u}_{k}] \in\mathcal{L}(G_{*},A_{0}) = C_{0}^{\dagger}\). That is, C_0[π[u_1, …, u_k]] ⊆ L_∗. Hence the rule is present in \(\hat{G}\) whatever K is. □
Lemmas 2 and 4–6 show that one can efficiently construct a correct grammar from a small amount of data.
5.4 Learning algorithm
Lemma 7
If the current conjecture \(\hat{G}\) is such that \(L_{*} \nsubseteq \mathcal{L}(\hat{G})\), then the learner will discard \(\hat{G}\) at some point.
Proof
At some point, some element \(u \in L_{*} \setminus \mathcal{L}(\hat{G})\) is given to the learner. The rule 〚{x_1}〛(u) :− is correct but not present in \(\hat{G}\). Once the learner receives u, it adds u to F_0 and thereby obtains this rule. □
Lemma 8
If \(\mathcal{L}(\hat{G}) \nsubseteq L_{*}\), then the learner will discard \(\hat{G}\) at some point.
Proof
By Lemma 5, the fact \(\mathcal{L}(\hat{G}) \nsubseteq L_{*}\) implies that \(\hat{G}\) has an incorrect rule 〚C_0〛(π) :− 〚C_1〛(x_1), …, 〚C_k〛(x_k), where \(C_{0}[\boldsymbol{\pi}[C_{1}^{\dagger},\ldots,C_{k}^{\dagger}]] \nsubseteq L_{*}\). That is, there are \(\mathbf{v}_{1},\ldots,\mathbf{v}_{k} \in\mathcal{S}_{\le p}\) such that C_i[v_i] ⊆ L_∗ for all i = 1, …, k and C_0[π[v_1, …, v_k]] ⊈ L_∗. Since v_i ∈ Sub_{≤p}(L_∗), at some point the learner will have D ⊆ L_∗ such that v_i ∈ Sub_{≤p}(D) for all i = 1, …, k. For K = Sub_{≤p}(D) we have \(\mathbf{v}_{i} \in C_{i}^{(K)}\) and \(C_{0}[\boldsymbol{\pi}[C_{1}^{(K)},\ldots,C_{k}^{(K)}]] \nsubseteq L_{*}\). The incorrect rule must then be removed. □
Theorem 3
The learner \(\mathcal{A}(p,q,r,s)\) identifies \(\mathbb{G}(p,q,r,s)\) in the limit.
Proof
Let \(L_{*} \in\mathbb{L}(p,q,r,s)\) be the learning target. By Lemmas 7 and 8, the learner never converges to a wrong hypothesis. The set F cannot change infinitely many times: F is monotonically expanded, and at some point F becomes adequate with respect to a target grammar G_∗ generating L_∗, after which the learner never updates F again, by Lemma 6. Then at some point K becomes fiducial on F, by Lemmas 8 and 4, so that \(\hat{G}\) has no incorrect rules. Thereafter no rules are added to or removed from \(\hat{G}\). □
6 Discussion and conclusion
6.1 Related work
There is very little work to which this paper can be directly compared; it is, of course, as noted earlier, an extension of recent work on distributional learning of context-free and multiple context-free grammars; in particular it subsumes the dual approaches to learning context-free grammars taken in Clark (2010), as corrected by Yoshinaka (2011b). Under a different learning model, the minimally adequate teacher (mat) model, Yoshinaka and Clark (2012) show that a class of multiple context-free grammars can be learned using distributional techniques. The class of grammars there is based on a different representational assumption: each nonterminal of dimension d corresponds to an equivalence class of d-words that are distributionally identical. As in the case of context-free grammars, this representational assumption significantly limits the class of languages that can be learned.
We can contrast our algorithm with the approach taken in Shinohara (1994), which concerns the inference of elementary formal systems in the sense of Smullyan (1961), in a number of respects. First, the algorithm presented here is polynomial, whereas Shinohara studies learnability without constraints on the computational resources of the learner, and indeed includes languages which are not in ptime. On the other hand, Shinohara considers a learning model where the learner only has positive examples, whereas we allow the learner to ask membership queries. Finally, Shinohara obtains learnability by bounding the number of clauses in the grammar. Here we need no a priori bound on the number of clauses (we can learn grammars of unbounded complexity), but we do bound the parameters that define the language-theoretic hierarchy.
A different class of representations, Marcus contextual grammars, is studied by Oates et al. (2006). These are incomparable to the representations that we use, but are capable of representing some non-context-free languages, though the class studied there cannot represent all regular languages. Again, this learning approach uses only positive data, but in this case the algorithm is also efficient, although the class of languages that can be learned is rather small. We use patterns in the definition of our rules, but our languages are very different from the pattern languages studied by Angluin (1980), though it is worth noting that every such pattern language can easily be defined by a pmcfg.
6.2 Hardness of primal approach
Of the two types of approaches in distributional learning, this paper takes the so-called dual approach to learning pmcfgs, in the sense of Yoshinaka (2011b). Dual approaches are those where the nonterminals are defined by sets of contexts, or generalisations of contexts. One might expect that the other, the primal approach, would work as well; primal approaches define nonterminals using sets of yields of nonterminals, i.e. sets of tuples of strings. For example, using the same learning model that we use here, Yoshinaka (2010) presents a learner for some mcfgs using a primal approach. There the nonterminals are defined using sets of tuples of strings, and the contexts are used to eliminate the incorrect rules.
Let us say that a grammar G has the s-fkp (finite kernel property) if every nonterminal A admits a finite string set K_A of cardinality at most s such that \(\pi[K_{A}] \subseteq\mathcal{L}(G)\) iff \(\pi[\mathcal{L}(G,A)] \subseteq\mathcal{L}(G)\) for any context π. However, the simplest nonlinear grammar, with the rule set {S(x_1 x_1) :− S(x_1), S(a) :−}, does not have the 1-fkp, which contrasts with the fact that every grammar with a single nonterminal has the 1-fcp. This grammar does have the 2-fkp, but we have not yet found a non-semilinear language generated by a grammar with the 1-fkp.
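The language of this simplest nonlinear grammar can be enumerated directly; the sketch below is ours, simply iterating the copying rule from the axiom.

```python
def generate(depth):
    """Enumerate L(G) for G = { S(x1 x1) :- S(x1),  S(a) :- } up to a
    bounded derivation depth: start from the axiom 'a' and repeatedly
    apply the copying rule x1 -> x1 x1, doubling the string each time."""
    words, w = [], "a"
    for _ in range(depth):
        words.append(w)
        w = w + w          # the copying rule S(x1 x1) :- S(x1)
    return words

# The lengths 1, 2, 4, 8, ... grow exponentially: a non-semilinear language.
print(generate(4))  # ['a', 'aa', 'aaaa', 'aaaaaaaa']
```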
6.3 Conclusion
In this paper, we have extended distributional learning to the inference of non-semilinear languages. This result also yields, as a corollary, a significant extension of the learnable classes of mcfgs in which the nonterminals are based on contexts: a dual model in the sense of Yoshinaka (2011b).
These algorithms are all polynomial, but the degree, as stated earlier, depends on the product of the parameters p, q, r and s. As a result, these algorithms will not be practical, in their naive forms, for anything other than very small values of these parameters: 1 or 2 at most. Of course, numerous heuristic modifications could be made to restrict the format of the rules considered to some more limited subset.
The combination of these two extensions gives, for the first time, an efficiently learnable class of languages that plausibly includes all natural languages, even under the worst-case assumption that all of the questionable examples in Sect. 2 are valid; more precisely, a class for which there is no argument suggesting that some natural language lies outside it. In particular we are able to learn the Swiss German example which motivated the development of the theory of mildly context-sensitive grammars. Previous primal algorithms for mcfgs were not able to learn this precise case, though they could learn some closely related languages.
This leaves open two interesting issues: finding an appropriate learnable feature calculus to represent the large set of nonterminals required, and the more fundamental question of whether these grammars are also strongly adequate: adequate not just in terms of the sets of strings that they generate but in terms of the sets of structural descriptions.
From a technical point of view, it is natural to ask whether this learning approach can be extended beyond the class of pmcfgs to use conjunction as well (Okhotin 2001). The linguistic motivations for such an extension do not seem particularly compelling, though there may be reasons to study it in other application domains such as program induction.
Acknowledgements
We would like to thank the anonymous reviewers for helpful comments.
References
 Andrews, A. (1996). Semantic case-stacking and inside-out unification. Australian Journal of Linguistics, 16(1), 1–55.
 Angluin, D. (1980). Finding patterns common to a set of strings. Journal of Computer and System Sciences, 21(1), 46–62.
 Angluin, D. (1982). Inference of reversible languages. Journal of the Association for Computing Machinery, 29(3), 741–765.
 Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75(2), 87–106.
 Berwick, R., Pietroski, P., Yankama, B., & Chomsky, N. (2011). Poverty of the stimulus revisited. Cognitive Science, 35, 1207–1242.
 Bhatt, R., & Joshi, A. (2004). Semilinearity is a syntactic invariant: a reply to Michaelis and Kracht 1997. Linguistic Inquiry, 35(4), 683–692.
 Boullier, P. (1999). Chinese numbers, MIX, scrambling, and range concatenation grammars. In Proceedings of the 9th conference of the European chapter of the Association for Computational Linguistics (EACL 99) (pp. 8–12).
 Chandlee, J., & Heinz, J. (2012). Bounded copying is subsequential: implications for metathesis and reduplication. In Twelfth meeting of the ACL special interest group on computational morphology and phonology (pp. 42–51). Association for Computational Linguistics.
 Chomsky, N. (1956). Three models for the description of language. IEEE Transactions on Information Theory, 2(3), 113–124.
 Clark, A. (2010). Learning context free grammars with the syntactic concept lattice. In Sempere and García (2010) (pp. 38–51).
 Clark, A., & Eyraud, R. (2007). Polynomial identification in the limit of substitutable context-free languages. Journal of Machine Learning Research, 8, 1725–1745.
 Clark, A., & Lappin, S. (2011). Linguistic nativism and the poverty of the stimulus. New York/Oxford: Wiley/Blackwell.
 Evans, N. (1995). A grammar of Kayardild: with historical-comparative notes on Tangkic (Vol. 15). Berlin: de Gruyter.
 Gazdar, G., Klein, E., Pullum, G., & Sag, I. (1985). Generalised phrase structure grammar. Oxford: Blackwell.
 Gold, E. M. (1967). Language identification in the limit. Information and Control, 10(5), 447–474.
 Groenink, A. (1995). Literal movement grammars. In Proceedings of the seventh conference of the European chapter of the Association for Computational Linguistics, University College, Dublin (pp. 90–97).
 Groenink, A. (1997). Mild context-sensitivity and tuple-based generalizations of context-free grammar. Linguistics and Philosophy, 20(6), 607–636.
 Huybrechts, R. A. C. (1984). The weak inadequacy of context-free phrase structure grammars. In G. de Haan, M. Trommelen, & W. Zonneveld (Eds.), Van Periferie naar Kern. Dordrecht: Foris.
 Inkelas, S. (2008). The dual theory of reduplication. Linguistics, 46(2), 351–401.
 Inkelas, S., & Zoll, C. (2005). Reduplication: doubling in morphology. Cambridge: Cambridge University Press.
 Joshi, A., Vijay-Shanker, K., & Weir, D. (1991). The convergence of mildly context-sensitive grammar formalisms. In P. Sells, S. Shieber, & T. Wasow (Eds.), Foundational issues in natural language processing (pp. 31–81). Cambridge: MIT Press.
 Kobele, G. (2006). Generating copies: an investigation into structural identity in language and grammar. PhD thesis, University of California, Los Angeles.
 Kracht, M. (2011). Interpreted languages and compositionality. Berlin: Springer.
 Ljunglöf, P. (2005). A polynomial time extension of parallel multiple context-free grammar. In P. Blache, E. Stabler, J. Busquets, & R. Moot (Eds.), Lecture notes in computer science: Vol. 3492. Logical aspects of computational linguistics (pp. 177–188). Berlin: Springer.
 Michaelis, J., & Kracht, M. (1997). Semilinearity as a syntactic invariant. In C. Retoré (Ed.), Logical aspects of computational linguistics (pp. 329–345). Berlin: Springer.
 Oates, T., Armstrong, T., Becerra-Bonache, L., & Atamas, M. (2006). Inferring grammars for mildly context sensitive languages in polynomial-time. In Y. Sakakibara, S. Kobayashi, K. Sato, T. Nishino, & E. Tomita (Eds.), Lecture notes in computer science (Vol. 4201, pp. 137–147). Berlin: Springer.
 Okhotin, A. (2001). Conjunctive grammars. Journal of Automata, Languages and Combinatorics, 6(4), 519–535.
 Radzinski, D. (1991). Chinese number-names, tree adjoining languages, and mild context-sensitivity. Computational Linguistics, 17(3), 277–299.
 Sadler, L., & Nordlinger, R. (2006). Case stacking in realizational morphology. Linguistics, 44(3), 459–487.
 Seki, H., Matsumura, T., Fujii, M., & Kasami, T. (1991). On multiple context-free grammars. Theoretical Computer Science, 88(2), 191–229.
 Sempere, J. M., & García, P. (Eds.) (2010). Grammatical inference: theoretical results and applications: 10th international colloquium, ICGI 2010. Berlin: Springer.
 Shieber, S. M. (1985). Evidence against the context-freeness of natural language. Linguistics and Philosophy, 8, 333–343.
 Shinohara, T. (1994). Rich classes inferrable from positive data: length-bounded elementary formal systems. Information and Computation, 108(2), 175–186.
 Smullyan, R. (1961). Theory of formal systems. Princeton: Princeton University Press.
 Vijay-Shanker, K., & Weir, D. J. (1994). The equivalence of four extensions of context-free grammars. Mathematical Systems Theory, 27(6), 511–546.
 Vijay-Shanker, K., Weir, D. J., & Joshi, A. K. (1987). Characterizing structural descriptions produced by various grammatical formalisms. In Proceedings of the 25th annual meeting of the Association for Computational Linguistics, Stanford (pp. 104–111).
 Yoshinaka, R. (2010). Polynomial-time identification of multiple context-free languages from positive data and membership queries. In Sempere and García (2010) (pp. 230–244).
 Yoshinaka, R. (2011a). Efficient learning of multiple context-free languages with multidimensional substitutability from positive data. Theoretical Computer Science, 412(19), 1821–1831.
 Yoshinaka, R. (2011b). Towards dual approaches for learning context-free grammars based on syntactic concept lattices. In G. Mauri & A. Leporati (Eds.), Lecture notes in computer science: Vol. 6795. Developments in language theory (pp. 429–440). Berlin: Springer.
 Yoshinaka, R., & Clark, A. (2012). Polynomial time learning of some multiple context-free languages with a minimally adequate teacher. In P. de Groote & M.-J. Nederhof (Eds.), Lecture notes in computer science: Vol. 7395. Formal grammar (pp. 192–207). Berlin: Springer.