
1 Introduction

Dependency grammars and dependency treebanks do not always use a unique linguistic model for lists of elements. Some define an enumeration as a linked list of elements; others define a list as a set of dependencies that link the same word, the head of the list, to the elements of the list.

Categorial Dependency Grammars [5] (CDG) support the second model through iterated dependency types. This construction introduces a list of dependencies with the same name and the same governor. The dependency structure (DS) in Fig. 1 shows a dependency A that is iterated five times, on the left and on the right. A CDG compatible with the example could assign the type \([N \backslash A \backslash S/A^*/L/A^*]\) to the word ran. The dependency name A appears three times, twice as the iterated dependency type \(A^*\). With this type, other DS are also possible: each \(A^*\) may introduce zero, one or several arguments linked to ran by a dependency A.

Fig. 1. A dependency structure with five dependencies A

However, iterated dependency types cannot be used when a list of elements is interleaved with a separator, as in the example of Fig. 2 from the Sequoia corpus [4]: “Les cyclistes et vététistes peuvent se réunir ce matin, à 9h, place Jacques-Bailleurs, à l’occasion d’une sortie d’entraînement.” (fr. the cyclists and mountain bikers may meet this morning, at 9, at Jacques-Bailleurs square, for a training ride).

Fig. 2. A dependency structure with a list of modifiers separated by commas

In this example, several modifiers alternate with a punctuation sign. The verb réunir may have the type \([\textit{aff} \backslash \textit{obj:obj} / (ponct\bullet mod)^* / mod]\). A regular expression for the part that corresponds to the modifiers and commas would be \(mod(ponct\,mod)^*\) or \((mod\,ponct)^*mod\). It is not an iterative choice between mod and ponct, like the regular expression \((mod\Vert ponct)^*\), but a repeatable sequence of mod and ponct. In order to formalize such structures, we propose to extend CDG types with a new construction that introduces finite sequences of dependencies. The system is an extension of classical CDG, because an iterated dependency type can be seen as a sequence iteration where the sequence has a length of one dependency name.
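This distinction can be checked mechanically. Below is a minimal illustration in Python; the space-separated token encoding, the variable names and the sample strings are ours, given only to make the two regular expressions concrete.

import re

# A repeatable sequence: mod (ponct mod)*
repeatable_sequence = re.compile(r"mod( ponct mod)*")
# An iterative choice: (mod || ponct)*
iterative_choice = re.compile(r"((mod|ponct) ?)*")

s = "mod ponct mod ponct mod"
print(bool(repeatable_sequence.fullmatch(s)))                  # True
print(bool(repeatable_sequence.fullmatch("ponct ponct mod")))  # False
print(bool(iterative_choice.fullmatch("ponct ponct mod")))     # True: the
# iterative choice accepts orders that are not well-formed comma-separated
# lists of modifiers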

We also study the learnability properties of CDG with sequence iterations when the grammar has to be inferred from a dependency treebank. The concept of identification in the limit is due to Gold [7]. Learning from strings refers to hypothetical grammars generated from finite sets of strings. More generally, the hypothetical grammars may be generated from finite sets of structures defined by the target grammar; this kind of learning is called learning from structures. Both concepts were intensively studied (see the excellent surveys in [2, 8, 9]). For CDG with sequence iterations, this concept leads to a new class of grammars that is learnable from positive examples of dependency structures (DS).

The plan of the paper is as follows. Section 2 introduces Categorial Dependency Grammars with sequence iteration and studies their parsing properties and expressive power. The section also presents the links with linear logic, noncommutative logic and Lambek calculus. Section 3 studies the learnability properties of such grammars from positive examples of dependency structures and defines new classes of such grammars that are learnable in this context. Section 4 presents experimental studies of sequence iterations in existing DS corpora. Section 5 concludes the paper.

2 CDG with Sequence Iterations

2.1 Classical Categorial Dependency Grammars

Categorial dependency grammars can be seen as assigning to words first-order dependency types of the form \(t=[L_m \backslash \ldots \backslash L_1 \backslash g / R_1 /\ldots /R_n]^P\). Intuitively, \(w\mapsto [\alpha \backslash d \backslash \beta ]^P\) means that the word w has a left subordinate through dependency d (similarly for the right part \([\alpha / d/\beta ]^P\)). Similarly, \(w\mapsto [\alpha \backslash d^* \backslash \beta ]^P\) means that w may have 0, 1 or several left subordinates through dependency d. The head type g in \(w\mapsto [\alpha \backslash g /\beta ]^P\) means that w is governed through dependency g. The assignment of Example 1 determines the projective DS in Fig. 3.

Example 1

$$\begin{array}{lll} in &\mapsto & [c\_copul / prepos\!-\!l] \\ the &\mapsto & [det] \\ beginning &\mapsto & [det \backslash prepos\!-\!l] \\ was &\mapsto & [c\_copul \backslash S / @fs / pred] \\ word &\mapsto & [det \backslash pred] \\ . &\mapsto & [@fs] \end{array}$$
Fig. 3. Projective dependency structure.
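As an illustration of how the lexicon of Example 1 yields Fig. 3, the derivation only uses the elimination rule \(\mathbf{L^l}\) introduced below (Definition 3) and a symmetric right elimination. The following Python sketch is our own encoding, not the paper's implementation: both argument lists are stored in consumption order (nearest neighbour first), so the right-hand list reverses the displayed \(/R_1/\ldots/R_n\) notation.

LEX = {
    "in":        ([], "c_copul", ["prepos-l"]),
    "the":       ([], "det", []),
    "beginning": (["det"], "prepos-l", []),
    "was":       (["c_copul"], "S", ["pred", "@fs"]),
    "word":      (["det"], "pred", []),
    ".":         ([], "@fs", []),
}

def reduce_once(seq):
    # Apply one elimination step between adjacent types, if possible.
    for i in range(len(seq) - 1):
        (a_left, a_head, a_right), (b_left, b_head, b_right) = seq[i], seq[i + 1]
        # L^l: a bare [C] on the left satisfies the next left argument C.
        if not a_left and not a_right and b_left and b_left[0] == a_head:
            return seq[:i] + [(b_left[1:], b_head, b_right)] + seq[i + 2:]
        # Right counterpart: a bare [C] on the right satisfies /C.
        if not b_left and not b_right and a_right and a_right[0] == b_head:
            return seq[:i] + [(a_left, a_head, a_right[1:])] + seq[i + 1:]
    return None

seq = [LEX[w] for w in "in the beginning was the word .".split()]
while (step := reduce_once(seq)) is not None:
    seq = step
print(seq)   # [([], 'S', [])] : the sentence reduces to the axiom S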

The intuitive meaning of the part P, called potential, is that it defines the discontinuous dependencies of the word w. P is a string of polarized valencies, i.e. of symbols of four kinds: \(\swarrow \!d\) (left negative valency d), \(\searrow \!d\) (right negative valency d), \(\nwarrow \!d\) (left positive valency d), \(\nearrow \!d\) (right positive valency d). Intuitively, \(v=\nwarrow \!d\) requires a subordinate through dependency d situated somewhere on the left, whereas the dual valency \(\breve{v}=\swarrow \!d\) requires a governor through the same dependency d situated somewhere on the right. Together they describe the discontinuous dependency d, and similarly for the other pair of dual valencies. For the negative valencies \(\swarrow \!d,\searrow \!d\), a special kind of types \(\#(\swarrow \!d),\) \(\#(\searrow \!d)\) is provided. Intuitively, they serve to check the adjacency to a host word of a distant word subordinate through discontinuous dependency d. The dependencies of these types are called anchors. For instance, the assignment of Example 2 determines the non-projective DS in Fig. 4.

Example 2

Fig. 4. Non-projective dependency structure.

Definition 1

(CDG dependency structures). Let \(W=a_1\ldots a_n\) be a list of words and \(\{d_1,\ldots ,d_m\}\) be a set of dependency names, each with a nature that can be local, discontinuous or anchor. A graph \(D=(W, E)\) with labeled arcs is a dependency structure (DS) of W if it has a root, i.e. a node \(a_i\in W\) such that (i) for any node \(a\in W,\) \(a\ne a_i,\) there is a path from \(a_i\) to a and (ii) there is no arc \((a^\prime ,d,a_i)\). An arc \((a,d,a^\prime )\in E\) is called a dependency d from a to \(a^\prime \); a is called a governor of \(a^\prime \) and \(a^\prime \) a subordinate of a through d. The linear order on W is the precedence order on D.

Definition 2

(CDG types). Let \(\mathbf{C}\) be a set of local dependency names and \(\mathbf{V}\) be a set of valency names.

The expressions of the form \(\swarrow \!v\), \(\nwarrow \!v\), \(\searrow \!v\), \(\nearrow \!v\), where \(v\in \mathbf{V}\), are called polarized valencies. \(\nwarrow \!v\) and \(\nearrow \!v\) are positive, \(\swarrow \!v\) and \(\searrow \!v\) are negative; \(\nwarrow \!v\) and \(\swarrow \!v\) are left, \(\nearrow \!v\) and \(\searrow \!v\) are right. Two polarized valencies with the same valency name and orientation but with opposite signs are dual. An expression of one of the forms \(\#(\swarrow \!v)\), \(\#(\searrow \!v)\), \(v\in \mathbf{V}\), is called an anchor type, or just an anchor. An expression of the form \(d^*\), where \(d\in \mathbf{C}\), is called an iterated dependency type. Local dependency names, iterated dependency types and anchor types are primitive types.

An expression of the form \(t=[L_m \backslash \ldots \backslash L_1 \backslash H / R_1 /\ldots /R_n]\) in which \(m,n \ge 0\), \(L_1,\ldots , L_m, R_1,\ldots ,R_n\) are primitive types and H is either a local dependency name or an anchor type, is called a basic dependency type. \(L_1,\ldots , L_m\) and \(R_1,\ldots ,R_n\) are respectively the left and right argument types of t. H is called the head type of t.

A (possibly empty) string P of polarized valencies, sorted using the standard lexicographical order \(<_{lex}\) compatible with the polarity order, is called a potential. A dependency type is an expression \(B^P\) in which B is a basic dependency type and P is a potential. \(\mathbf{CAT}(\mathbf{C},\mathbf{V})\) will denote the set of all dependency types over \(\mathbf{C}\) and \(\mathbf{V}\).

CDG are defined using the following calculus of dependency types. These rules are relativized with respect to the word positions in the sentence, which allows interpreting them as rules of construction of DS. Namely, when a type \(B^{v_1\ldots v_k}\) is assigned to the word in position i, we encode it using the state \((B,i)^{(v_1,i)\ldots (v_k,i)}\). In these rules, types must be adjacent.

Definition 3

(Relativized calculus of dependency types).

\(\mathbf{L^l}.\) \(\,\varGamma _1\,([C],i_1)^{P_1} ([C \backslash \beta ],i_2)^{P_2} \varGamma _2 \;\vdash \; \varGamma _1\,([\beta ],i_2)^{P_1 P_2}\varGamma _2\)

\(\mathbf{I^l}.\) \(\,\;\varGamma _1\,([C],i_1)^{P_1} ([C^* \backslash \beta ],i_2)^{P_2} \varGamma _2 \;\vdash \; \varGamma _1\,([C^* \backslash \beta ],i_2)^{P_1 P_2}\varGamma _2\)

\(\mathbf{\Omega ^l}.\) \(\varGamma _1\,([C^* \backslash \beta ],i)^{P} \varGamma _2 \;\vdash \; \varGamma _1\,([\beta ],i)^{P}\varGamma _2\)

\(\mathbf{D^l}.\)  \(\varGamma _1\,\alpha ^{P_1 (\swarrow \!C,i_1) P (\nwarrow \!C,i_2) P_2}\varGamma _2 \;\vdash \; \varGamma _1\,\alpha ^{P_1 P P_2}\varGamma _2,\)

if the potential \((\swarrow \!\!C,i_1) P (\nwarrow \!\!C,i_2)\) satisfies the following pairing rule \(\mathbf{FA}\) (first available) and if, moreover, \(i_1 < i_2\) (non-internal constraint).

\(\mathbf{FA}:~~~~P\ \text{ has } \text{ no } \text{ occurrences } \text{ of }\ (\swarrow \!\!C,i) \text{ or } (\nwarrow \!\!C,i) \text{, } \text{ for } \text{ any } i \)

\(\mathbf{L^l}\) is the classical elimination rule. Eliminating an argument type \(C\ne \#(\alpha )\), it constructs the (projective) dependency C and concatenates the potentials; \(C=\#(\alpha )\) creates anchor dependencies. \(\mathbf{I^l}\) derives \(k>0\) instances of C and \(\mathbf{\Omega ^l}\) serves in particular for the case \(k=0\). \(\mathbf{D^l}\) creates discontinuous dependencies: it pairs and eliminates dual valencies with name C satisfying the rule \(\mathbf{FA}\) to create the discontinuous dependency C.
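The pairing performed by \(\mathbf{D^l}\) can be sketched in a few lines of Python. The encoding below is ours (a potential as a word-ordered list of triples, with "SW" for \(\swarrow \) and "NW" for \(\nwarrow \)); only the left pair of dual valencies is treated, the right pair being symmetric. Matching each \(\nwarrow \!C\) with the nearest unpaired \(\swarrow \!C\) reproduces the nested pairing that FA requires.

def fa_pair_left(potential):
    pending = {}   # valency name -> stack of unpaired SW occurrences
    deps = []      # discontinuous dependencies (governor, name, subordinate)
    for pol, name, pos in potential:
        if pol == "SW":
            pending.setdefault(name, []).append(pos)
        elif pol == "NW" and pending.get(name):
            sub = pending[name].pop()        # first available SW name
            if sub < pos:                    # non-internal constraint i1 < i2
                deps.append((pos, name, sub))
    return deps

# SW A, SW A, NW A, NW A pairs as nested (2,3) and (1,4), as FA requires:
print(fa_pair_left([("SW","A",1), ("SW","A",2), ("NW","A",3), ("NW","A",4)]))
# [(3, 'A', 2), (4, 'A', 1)]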

Now, in this relativized calculus, for every proof \(\rho \) represented as a sequence of rule applications, we may define the DS \({DS}_{x}(\rho )\) constructed in this proof. Namely, let us consider the calculus relativized with respect to a sentence x with the set of word occurrences W. Then \(DS_x(\varepsilon )=(W,\emptyset )\) is the DS constructed in the empty proof \(\rho =\varepsilon \). Now, let \((\rho , R)\) be a nonempty proof with respect to x and \((W,E)=DS_x(\rho )\). Then \(DS_x((\rho , R))\) is defined as follows:

If \(R=\mathbf{L}^{\mathbf{l}}\) or \(R=\mathbf{I}^{\mathbf{l}}\), then \(DS_x((\rho , R))=(W,E\ \cup \ \{(a_{i_2},C,a_{i_1})\})\). When C is a local dependency name, the new dependency is local. In the case where C is an anchor, this is an anchor dependency.

If \(R=\mathbf{\Omega }^{\mathbf{l}}\), then \(DS_x((\rho , R))=DS_x(\rho )\).

If \(R=\mathbf{D^l}\), then \(DS_x((\rho , R))=(W,E\ \cup \ \{(a_{i_2},C,a_{i_1})\})\) and the new dependency is discontinuous.

Definition 4

(CDG). A categorial dependency grammar (CDG) is a system \(G=( W, \mathbf{C}, \mathbf{V}, S,\lambda ),\) where W is a finite set of words, \(\mathbf{C}\) is a finite set of local dependency names containing the selected name S (an axiom), \(\mathbf{V}\) is a finite set of discontinuous dependency names and \(\lambda ,\) called the lexicon, is a finite substitution on W such that \(\lambda (a) \subset \mathbf{CAT}(\mathbf{C},\mathbf{V})\) for each word \(a \in W\); \(\lambda \) is extended to sequences of words in \(W^*\) in the usual way.

For \(G=( W, \mathbf{C}, \mathbf{V}, S,\lambda ),\) a DS D and a sentence x, let G[D, x] denote the relation: there exist an assignment \(\Gamma \in \lambda (x)\) and a proof \(\rho \) of \(\Gamma \vdash ([S],i)\) (for some position i) in the relativized calculus such that \(D = DS_x(\rho )\).

Then the language generated by G is the set \(L(G) =_{\scriptscriptstyle df} \{ w \mid \exists D\ G[D,w]\}\) and the DS-language generated by G is the set \(\varDelta (G) =_{\scriptscriptstyle df} \{ D \mid \exists w\ G[D,w]\}\). \(\mathcal {D}(CDG)\) and \(\mathcal {L}(CDG)\) will denote the families of DS-languages and languages generated by these grammars.

Example 3

The proof in Fig. 5 shows that the DS in Fig. 4 belongs to the DS-language generated by a grammar containing the type assignments shown above for the French sentence Elle la lui a donnée (fr. she gave it to him/her; word positions are not shown on the types).

CDG are very expressive. Evidently, they generate all CF-languages. They can also generate non-CF languages.

Fig. 5. Dependency structure correctness proof.

Example 4

As an example, a CDG generating the language \(\{a^n b^n c^n \;|\ n > 0\}\) is given in [6].

2.2 CDG with Sequences and Sequence Iterations

The extended system introduced here defines sequences and sequence iterations. An extended type \([\alpha \backslash (C_1\bullet \cdots \bullet C_n) \backslash \beta ]^P\) is viewed as a type that contains a sequence of n primitive types. It is equivalent to \([\alpha \backslash C_n \backslash \cdots \backslash C_1 \backslash \beta ]^P\) (the sequence appears in reverse order). The starred version of a sequence, \([\alpha \backslash (C_1\bullet \cdots \bullet C_n)^* \backslash \beta ]^P\), is handled as a sequence of n primitive types that can be repeated zero, one or several times. This construction with \(n>1\) is not possible in classical CDG, which only allows the iteration of a primitive type (the case \(n=1\)). This type is equivalent to an infinite list of types:

\([\alpha \backslash \beta ]^P\),

\([\alpha \backslash (C_1\bullet \cdots \bullet C_n) \backslash \beta ]^P\equiv [\alpha \backslash C_n \backslash \cdots \backslash C_1 \backslash \beta ]^P\),

\([\alpha \backslash (C_1\bullet \cdots \bullet C_n\bullet C_1\bullet \cdots \bullet C_n) \backslash \beta ]^P\equiv [\alpha \backslash C_n \backslash \cdots \backslash C_1 \backslash C_n \backslash \cdots \backslash C_1 \backslash \beta ]^P\),

etc.

Definition 5

We call sequence iteration types the expressions \(B^P\) where P is a potential, \(B=[L_m \backslash \cdots \backslash L_1 \backslash H / R_1 / \cdots /R_n]\), H is either a local dependency name or an anchor type and \(L_m,\ldots ,L_1\), \(R_1,\ldots ,R_n\) are either anchor types, local dependency names, sequences of local dependency names or sequence iterations of local dependency names (a sequence of one local dependency name is identified with a local dependency name).

Rules for CDG with sequences and sequence iterations:

\(\mathbf{L^l}.\) \(\;\,\varGamma _1\,([C],i_1)^{P_1} ([C \backslash \beta ],i_2)^{P_2} \varGamma _2 \;\vdash \; \varGamma _1\,([\beta ],i_2)^{P_1 P_2}\varGamma _2\)

\(\mathbf{C^l}.\) \(\;\varGamma _1\,([(\alpha )^* \backslash \beta ],i)^{P} \varGamma _2 \;\vdash \; \varGamma _1\,([\alpha \backslash (\alpha )^* \backslash \beta ],i)^{P}\varGamma _2\), where \((\alpha )^*\) is a sequence iteration

\(\mathbf{W ^l}.\) \(\varGamma _1\,([(\alpha )^* \backslash \beta ],i)^{P}\varGamma _2 \;\vdash \; \varGamma _1\,([\beta ],i)^{P}\varGamma _2\), where \((\alpha )^*\) is a sequence iteration

\(\mathbf{S^l}.\) \(\;\;\varGamma _1\,([(\alpha \bullet C) \backslash \beta ],i)^{P}\varGamma _2 \;\vdash \; \varGamma _1\,([C \backslash \alpha \backslash \beta ],i)^{P}\varGamma _2\), where \((\alpha \bullet C)\) is a sequence

\(\mathbf{D^l}.\)  \(\,\varGamma _1\,\alpha ^{P_1 (\swarrow \!C,i_1) P (\nwarrow \!C,i_2) P_2}\varGamma _2 \;\vdash \; \varGamma _1\,\alpha ^{P_1 P P_2}\varGamma _2,\)

if the potential \((\swarrow \!\!C,i_1) P (\nwarrow \!\!C,i_2)\) satisfies \(\mathbf{FA}\) and if \(i_1<i_2\) (non-internal constraint).
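The three sequence rules only rewrite the leftmost argument of a type, so their bookkeeping is simple. A sketch in Python, with our own encoding (a type's left arguments as a list whose first element is eliminated first, ('seq', ...) for a sequence and ('star', ...) for a sequence iteration); this is an illustration, not the paper's implementation:

def rule_C(left):   # C^l: unfold one copy of the iterated sequence
    assert left and left[0][0] == 'star'
    return [('seq', left[0][1])] + left

def rule_W(left):   # W^l: erase the iterated sequence (zero copies)
    assert left and left[0][0] == 'star'
    return left[1:]

def rule_S(left):   # S^l: [(alpha . C)\beta] |- [C\alpha\beta]
    assert left and left[0][0] == 'seq'
    *alpha, c = left[0][1]
    rest = [('seq', tuple(alpha))] if alpha else []   # a singleton sequence
    return [c] + rest + left[1:]                      # is identified with a name

# Unfolding (d1 . d2)* once, interleaved with L^l eliminations (left[1:]):
left = [('star', ('d1', 'd2'))]
left = rule_C(left)   # [(d1 . d2) \ (d1 . d2)* \ ...]
left = rule_S(left)   # [d2 \ (d1) \ (d1 . d2)* \ ...]
left = left[1:]       # L^l eliminates d2 with a [d2] on the left
left = rule_S(left)   # [d1 \ (d1 . d2)* \ ...]
left = left[1:]       # L^l eliminates d1
left = rule_W(left)   # no further copy: the iteration is erased
print(left)           # []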

2.3 Links with Noncommutative Logic and Lambek Calculus

From a logical point of view, a CDG type \(B^P\) consists of a projective part B and a potential P. B can be seen as a logical formula in a resource-sensitive logic like linear logic. Because the order of formulas is also important, B can be seen either as a formula of noncommutative logic [1] or as a formula of Lambek calculus [10].

In Lambek calculus, a sequence of primitive types is the product of the primitive types. In contrast, a sequence iteration of primitive types has no equivalent in Lambek calculus.

In noncommutative logic, a type can be seen as a linear formula built from the left and right linear implications. The sequence of primitive types \((C_1\bullet \cdots \bullet C_n)\) is the multiplicative noncommutative product \((C_1\odot \cdots \odot C_n)\). Valid implications of noncommutative logic justify the rules for CDG sequences.

The sequence iteration of primitive types \((C_1\bullet \cdots \bullet C_n)^*\) corresponds to \(?(C_1\odot \cdots \odot C_n)\): an iteration is seen as the dual of the exponential applied to the multiplicative product of the primitive types. Provable sequents of noncommutative logic likewise justify the rules for CDG sequence iterations.

Thus, it is possible to interpret the projective part of CDG types as a formula of noncommutative logic. The search for a valid analysis of a sentence becomes a proof search in noncommutative logic for a sequent whose formulae are one of the possible lists of types of the words given by the lexicon of the grammar. This interpretation automatically gives a compositional semantic interpretation à la Montague.

2.4 Parsing and Expressive Power

Sequences can be seen as syntactic sugar for types; thus, they change neither the parsing properties of languages nor the expressive power of grammars. From a formal point of view, sequence iterations do not introduce new string languages with respect to classical CDG. In fact, it is possible to emulate a sequence iteration by a simple iteration where one dependent corresponds to a designated element of the sequence (for instance the leftmost element) and governs the other elements of the sequence. In contrast, sequence iterations introduce a new construction that is very common in DS corpora. For instance, the Sequoia treebank [4] models a list of elements as an alternation of elements and punctuation marks. The introduction presents an example where the modifiers of the verb réunir alternate with commas: “Les cyclistes et vététistes peuvent se réunir ce matin, à 9h, place Jacques-Bailleurs, à l’occasion d’une sortie d’entraînement.”

The parsing of CDG with sequence iterations is not very different from the parsing of classical CDG (i.e. with iterated dependency types). A sequence iteration at the leftmost position of a type \([(d_1\bullet \cdots \bullet d_n)^* \backslash L_1 \backslash \cdots \backslash H/R_1/\cdots ]^{P_2}\) is rewritten into \([d_{n-1} \backslash \cdots \backslash d_1 \backslash (d_1\bullet \cdots \bullet d_n)^* \backslash L_1 \backslash \cdots \backslash H/R_1/\cdots ]^{P_1P_2}\) when a type \([d_n]^{P_1}\) is on its left (the potential \(P_1P_2\) may generate non-projective dependencies).

3 Learnability Results

This section studies the learnability properties of CDG with sequence iterations from positive examples of dependency structures (because sequences can be seen as syntactic sugar, the grammars are supposed to contain no plain sequences). It ends with the definition of a new family of classes of such grammars that are learnable in this context.

3.1 Inference Algorithm

The vicinity of a word is the part of its type that is used in a DS.

Definition 6

(Vicinity). Given a DS D, the incoming and outgoing dependencies of a word w can be local, anchor or discontinuous. For a discontinuous dependency d on a word w, we define its polarity p (\(\nwarrow , \searrow , \swarrow , \nearrow \)) according to its direction (left or right), negative if it is incoming to w and positive otherwise.

Let D be a DS in which an occurrence of a word w has: the incoming projective dependency or anchor H (or the axiom S), the left projective dependencies or anchors \(L_k,\ldots ,L_1\) (in this order), the right projective dependencies or anchors \(R_1,\ldots ,R_m\) (in this order), and the discontinuous dependencies \(d_1,\ldots ,d_n\in \mathbf{V}\) with their respective polarities \(p_1,\ldots ,p_n\).

Then the vicinity of w in D is the type

$$ V(w,D) = [L_1 \backslash \cdots \backslash L_k \backslash H / R_m / \cdots / R_1]^P, $$

in which P is a permutation of \(p_1 d_1,\ldots ,p_n d_n\) sorted in the standard lexicographical order \(<_{lex}\) compatible with the polarity order.

For instance, donnée in Fig. 4 has a vicinity that is nearly the same as the type of donnée in the lexicon, because this type has no sequence iteration (nor iterated dependency type); the only difference is the order in which the polarized valencies \(\nwarrow \!clit\!-\!a\!-\!obj\) and \(\nwarrow \!clit\!-\!3d\!-\!obj\) appear. The vicinity of the verb réunir in Fig. 2 is \([\textit{aff} \backslash \textit{obj:obj} / mod / ponct / mod / ponct / mod / ponct / mod]\). A type compatible with this vicinity could be \([\textit{aff} \backslash \textit{obj:obj} / (ponct\bullet mod)^* / mod]\). In this case, the type in the lexicon and the vicinity are different.
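Computing the vicinity of a word from a DS is direct. A Python sketch under our own encoding (arcs as (governor, name, dependent) triples over word positions; the potential part and the discontinuous valencies are left out for brevity; the positions below are illustrative, not Sequoia's):

def vicinity(pos, arcs, head):
    # Root-simplified vicinity [L_1\...\L_k\H/R_m/.../R_1] of the word at
    # position pos, where L_1 and R_1 are the dependents nearest to the word.
    out = sorted((d, n) for g, n, d in arcs if g == pos)   # by position
    lefts  = [n for d, n in out if d < pos][::-1]   # L_1 nearest ... L_k leftmost
    rights = [n for d, n in out if d > pos][::-1]   # R_m rightmost ... R_1 nearest
    core = "\\".join(lefts + [head])
    return "[" + core + "".join("/" + r for r in rights) + "]"

# réunir in Fig. 2 (position 5, with illustrative positions for its dependents):
arcs = [(5, "aff", 4), (5, "mod", 6), (5, "ponct", 7), (5, "mod", 8),
        (5, "ponct", 9), (5, "mod", 10), (5, "ponct", 11), (5, "mod", 12)]
print(vicinity(5, arcs, "obj:obj"))
# [aff\obj:obj/mod/ponct/mod/ponct/mod/ponct/mod]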

Definition 7

(Algorithm). Figure 6 presents an inference algorithm \(\mathbf{TGE}^{(K)}_{J-seq}\) which, for every next DS in a training sequence, transforms the observed local, anchor and discontinuous dependencies of every word into a type with repeated local dependency sequences, by introducing a sequence iteration for each group of at least K consecutive identical sequences of local dependencies. J indicates the maximal internal length of the sequences that are transformed into sequence iterations.

Fig. 6. Inference algorithm \(\mathbf{TGE}^{(K)}_{J-seq}\); the inner loop defines \(TGen^{(K)}_{J-seq}(t_w)\) on types.
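The outer loop of \(\mathbf{TGE}^{(K)}_{J-seq}\) accumulates, for each word, the generalization of its observed vicinities. A sketch in Python, assuming the `vicinity` helper sketched above and a generalization function such as `tgen_lml` (sketched after the LML examples below) applied to the computed vicinity; all names are ours, and the paper's Fig. 6 is the authoritative description:

def tge(training_sequence, generalize, K=2, J=2):
    lexicon = {}                                  # word -> set of types
    for ds in training_sequence:                  # for every next DS...
        for word, pos in ds.words:                # ...and every word in it
            t = vicinity(pos, ds.arcs, ds.head_name[pos])
            # generalize applies the inner loop TGen to each side of t
            lexicon.setdefault(word, set()).add(generalize(t, K, J))
    return lexicon                                # the hypothesized grammar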

Definition 8

(Generalization). The notation \(TGen^{(K)}_{J-seq}(t_w)\), which applies the inner-loop algorithm of Fig. 6 to a type \(t_w\), is extended to sets of types, lexicons and grammars in the usual way, such that each assignment \(w \mapsto t\) becomes \(w \mapsto TGen^{(K)}_{J-seq}(t)\).

Ambiguities. Note that this process may be ambiguous. For instance, for \(K=J=2\), the generalization of \([a \backslash b \backslash a \backslash b \backslash a \backslash b \backslash a \backslash H]\) could be \([(b \bullet a)^* \backslash a \backslash H]\) or \([a \backslash (a \bullet b)^* \backslash H]\). With the same conditions on K and J, the generalization of \([b \backslash a \backslash a \backslash a \backslash a \backslash a \backslash H]\) could be \([b \backslash a^* \backslash H]\) or \([b \backslash (a\bullet a)^* \backslash a \backslash H]\). There are several ways to overcome this: the [ALL mode] adds all such types in the internal loop, whereas the [LML mode] adds only the type corresponding to a leftmost longest sequence iteration with the shortest pattern. We could also consider different limiting neighbourhood conditions around the repeating pattern.

Definition 9

(LML mode). We consider three parameters of a repeated sequence: the start position, the pattern length and the total length. In the [LML mode], the parameters have priorities in that order: we consider first the leftmost start position, then the smallest pattern length, then the maximal number of repetitions.

This mode is detailed by the following examples; a small implementation sketch (in Python) follows the list.

  • \(TGen^{(2)}_{2-seq}([a \backslash b \backslash a \backslash b \backslash a \backslash b \backslash a \backslash H]) =[(b \bullet a)^* \backslash a \backslash H]\) and not \([b \backslash (a \bullet b)^* \backslash H]\) because the leftmost repeated sequences for \(K=J=2\) start with the leftmost a of \([a \backslash b \backslash a \backslash b \backslash a \backslash b \backslash a \backslash H]\)

  • \(TGen^{(2)}_{2-seq}([H / a / a / a / a / a]) =[H / a^*]\) and not \([H / (a \bullet a)^*]\) because the sequences for \(a^*\) and \((a \bullet a)^*\) both start with the leftmost a in \([H / a / a / a / a / a]\) but the pattern length of \(a^*\) is one (the smallest) and the pattern length of \((a \bullet a)^*\) is two.

  • \(TGen^{(2)}_{2-seq}([H / a / b / a / b / a / b / a]) =[H / (b\bullet a)^* / a]\) and not \([H / (b\bullet a)^* / a / b / a]\) because for \(K=J=2\) even if there are two repeated sequences starting at the leftmost a with a pattern length of two (\(b\bullet a\)) that are \(a / b / a / b\) and \(a / b / a / b / a / b\), the maximal number of repetitions is three and corresponds to \(a / b / a / b / a / b\).
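A compact sketch of the inner-loop generalization in LML mode, operating on one side of a type written as a Python list (outermost argument first). The function name and encoding are ours; modulo the display conventions noted in the comments, it reproduces the three examples above.

def tgen_lml(side, K=2, J=2):
    # Replace the leftmost run of >= K consecutive identical patterns
    # (shortest pattern length <= J wins, repetitions taken maximal) by a
    # starred pattern (a tuple ending in '*'), then continue rightwards.
    # In the paper's bullet notation the pattern is displayed in reverse order.
    out, i, n = [], 0, len(side)
    while i < n:
        found = None
        for p in range(1, J + 1):                        # smallest pattern first
            reps = 1
            while side[i + reps*p : i + (reps+1)*p] == side[i : i + p]:
                reps += 1                                # maximal repetitions
            if reps >= K:
                found = (p, reps)
                break
        if found:
            p, reps = found
            out.append(tuple(side[i : i + p]) + ('*',))
            i += reps * p
        else:
            out.append(side[i])
            i += 1
    return out

# The three parameters of the LML mode on the examples above:
print(tgen_lml(['a','b','a','b','a','b','a']))   # [('a','b','*'), 'a']
#   displayed as [(b . a)* \ a \ H] on a left side (bullet order reversed)
print(tgen_lml(['a','a','a','a','a']))           # [('a','*')] : the smallest
#   pattern wins over ('a','a')
print(tgen_lml(['b','a','a','a','a','a']))       # ['b', ('a','*')]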

3.2 Algorithm Properties

Some Terminology. The following definitions are introduced for ease of writing.

Definition 10

(argument-form). By an argument-form we mean a part of a type of the form \(L_m \backslash \ldots \backslash L_1 \backslash \) or of the form \( / R_1 / \ldots / R_n\), where each \(L_i\), \(R_j\) is a possible argument in a CDG type (in short, an argument-form is a writing fragment on one side of a CDG type).

Definition 11

(Component). By a star-component in a type or an argument-form t, we mean any \(x^* \backslash \) or \( / x^*\) occurring in the writing of t. By a primitive component in a type or an argument-form t, we mean any \(x^* \backslash \), \( / x^* \), \(d \backslash \), or \( / d\), where d is a local dependency name or an anchor type, occurring in the writing of t. These notions are extended to the forms without \( \backslash \) or \( / \).

Definition 12

(Parallel Decomposition). If \(t'\) is the result of the algorithm \({TGen}^{(K)}_{J-seq}\) on \(t=[L_1 \backslash \cdots \backslash H / \cdots / R_1]^P\) in the LML mode, we can decompose in parallel: \(t=[\alpha _1 \cdots H\cdots \alpha _N]^P\) and \(t'=[\beta _1 \cdots H\cdots \beta _N]^{P'}\) where \(P'=sort(P)\), each \(\alpha _i\) is an argument-form, \(\beta _i\) is a primitive component and:

\(\beta _1 = {TGen}^{(K)}_{J-seq}(\alpha _1)\) ...\(\beta _j = {TGen}^{(K)}_{J-seq}(\alpha _j)\) ...and \(\beta _N = {TGen}^{(K)}_{J-seq}(\alpha _N)\)

The pair (\(\alpha _1 \ldots \alpha _N\), \(\beta _1 \ldots \beta _N\)) defines the parallel decomposition of \((t,t')\) in the LML mode; we call \((\alpha _i, \beta _i)\) a block and we say that each index i selects block \((\alpha _i, \beta _i)\) in the decomposition.

Construction and Key Lemmas

Definition 13

(Expansion). For any type t, we define its full expansion FE(t) as the set of types obtained from t by erasing or by replacing its star-components \(x^*\) (\(d^*\) or \((d_1 \bullet d_2)^*\) when \(J=2\)) by any successive repetitions of x.

Note. This set is infinite when there is at least one star-component, but it is used as an intermediate for proofs. It corresponds to the possible vicinities that can be associated with a word in a DS.

Definition 14

(Expansion of Rank \(K'\)). For any t, type or argument-form, we define its full expansion of rank \(K'\), \(FE^{K'}(t)\), as the set of types obtained from t by erasing or by replacing all its star-components \(x^*\) by any successive repetitions of x not more than \(K'\) times.
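The finite sets \(FE^{K'}(t)\) are easy to enumerate. A Python sketch on one side of a type, reusing the encoding of the tgen_lml sketch above (ours, for illustration only):

from itertools import product

def fe_rank(side, K):
    # Full expansion of rank K of one side of a type: each starred pattern
    # (a tuple ending in '*') is replaced by 0..K successive repetitions of
    # its pattern; plain names are kept as they are.
    options = []
    for item in side:
        if isinstance(item, tuple) and item[-1] == '*':
            pattern = list(item[:-1])
            options.append([pattern * k for k in range(K + 1)])
        else:
            options.append([[item]])
    return {tuple(x for chunk in choice for x in chunk)
            for choice in product(*options)}

print(sorted(fe_rank([('a', '*'), 'b'], 3), key=len))
# [('b',), ('a', 'b'), ('a', 'a', 'b'), ('a', 'a', 'a', 'b')]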

Lemma 1

Let \(K>1\), \(J=1 \text{ or } 2\) and \(K' \ge K+1\). For any type t:

$$\begin{aligned} TGen^{(K)}_{J-seq}(FE^{K'}(t)) = TGen^{(K)}_{J-seq}(FE^{K+1}(t)) \end{aligned}$$
(1)

Proof

We show (1). Obviously \(TGen^{(K)}_{J-seq}(FE^{K+1}(t))\subseteq TGen^{(K)}_{J-seq}(FE^{K'}(t))\). We show the converse for \(J=2\) (\(J=1\) is a subcase of \(J=2\)). Suppose \(t_1 \in FE^{K'}(t_0)\), let \(t_2=TGen^{(K)}_{J-seq}(t_1)\) and let \(\alpha _j\), \(\beta _j\), for \(1 \le j \le N\), denote the parallel decomposition of \((t_1, t_2)\) in the LML mode. We proceed by induction on the construction of \(t_0\), considering the parallel decomposition.

We consider the leftmost star-component \(x^*\) in \(t_0\) repeated more than \(K+1\) times in \(t_1\). We show that \(t_1\) can be replaced by a type \(t'_1\) with only \(K+1\) repetitions of this pattern (unchanged elsewhere).

- If \(|x|=1\), then \(x^*\) of \(t_0\) corresponds to \(d \backslash d \backslash \cdots \backslash d \backslash \) or \( / d / d / \cdots / d\) in \(t_1\).

(1.1) If this argument-form of \(t_1\) (and \(x^*\) of \(t_0\)) corresponds to a unique block i in the parallel decomposition of \((t_1,t_2)\), then \(\alpha _i\) contains more than \(K+1\) occurrences of x and \(\beta _i=x^*\); in that case, we define \(t'_1\) by replacing in \(\alpha _i\) all the repetitions of x with only \(K+1\) repetitions of x. Then \(x^*\) of \(t_0\) corresponds to \(K+1\) occurrences of x in \(t'_1\) and the algorithm yields the same type.

(1.2) If the argument-form corresponds to several adjacent blocks in the parallel decomposition of \((t_1,t_2)\), the leftmost x is the end of a block i with \(\beta _i=(x\bullet d_1)^*\) and the others are in block \(i+1\) with \(\beta _{i+1}=x^*\); \(\alpha _{i+1}\) contains at least K occurrences of x. We define \(t'_1\) by replacing in \(\alpha _{i+1}\) all the repetitions of x with only K repetitions of x. In this case, \(x^*\) of \(t_0\) corresponds to \(K+1\) occurrences of x in \(t'_1\), which yields the same algorithm output.

- If \(|x|=2\), then x is the succession of \(d_1\) and \(d_2\) (\(x=d_2\bullet d_1\), and it corresponds to \(d_1 \backslash d_2 \backslash d_1 \backslash \cdots \backslash d_1 \backslash d_2 \backslash \) or \( / d_1 / d_2 / d_1 / \cdots / d_1 / d_2\)):

(2.1) If \(x^*\) of \(t_0\) corresponds to a unique block i, then, as in (1.1), we define \(t'_1\) by replacing in \(\alpha _i\) the repetitions of \(d_1\) and \(d_2\) with \(K+1\) repetitions of \(d_1\) and \(d_2\). In this case, \(x^*\) of \(t_0\) corresponds to \(K+1\) occurrences of x in \(t'_1\), which yields the same algorithm output.

(2.2) If \(d_1 \ne d_2\) and \(x^*\) corresponds to several adjacent blocks in the parallel decomposition of \((t_1,t_2)\) starting at block i, this means that in the LML mode the leftmost \(d_1\) corresponds to the end of block i, the rightmost \(d_2\) corresponds to the beginning of block \(i+2\), and the other local dependency names \(d_2,d_1,d_2,\ldots ,d_2,d_1\) correspond to block \(i+1\) with \(\beta _{i+1}=(d_2\bullet d_1)^*\). We define \(t'_1\) by replacing in \(\alpha _{i+1}\) the repetitions of \(d_2\) and \(d_1\) with K repetitions of \(d_2\) and \(d_1\). In this case, \(x^*\) of \(t_0\) corresponds to \(K+1\) occurrences of x in \(t'_1\), which yields the same algorithm output.

(2.3) If \(d_1=d_2\), we have the same cases as in (1.1) and (1.2), but with more than \(2K+2\) local dependency names.

We can repeat this process until no expansion is made more than \(K+1\) times, hence the converse inclusion.

For example, with \(J=2, K=2, K'=4\), the decomposition for \(t_1=a \backslash a \backslash a \backslash b \backslash a \backslash b \backslash a \backslash b \backslash a \backslash b \backslash b \backslash b \backslash H \) (with \(K'=K+2\) repetitions) can be compared to that of \(t'_1=a \backslash a \backslash a \backslash b \backslash a \backslash b \backslash a \backslash b \backslash b \backslash b \backslash H \) with \(K+1\) repetitions (recall that the display order is reversed for an internal sequence used as an argument).

Note that \(a \backslash a \backslash a \backslash b \backslash a \backslash b \backslash b \backslash b \backslash \), with K repetitions only, yields a different decomposition.

Corollary 1

Let \(K>1\) and \(J=1 \text{ or } 2\). For any type t, the result of the algorithm \(TGen^{(K)}_{J-seq}\) on the full expansion of t is a finite set, and it is the same set as the result of this algorithm on \(FE^{K+1}(t)\).

The definitions of \(FE^{K}\) and \(FE^{K+1}\) are extended to sets, lexicons and grammars in the usual way.

Lemma 2

Let \(K>1\) and \(J=1 \text{ or } 2\). Let G be a CDG with sequence iterations. We have:

(1) all vicinities of words in DS of \(\varDelta (G)\) belong to some FE(t), where t is assigned by G;

(2) if \(\sigma \) is a finite sequence in \(\varDelta (G)\), then \(\varDelta (TGE^{(K)}_{J-seq}(\sigma )) \subseteq \varDelta (G')\), where \(G'\) is the result of \(TGE^{(K)}_{J-seq}\) on \(FE^{K+1}(G)\).

Proof

If G generates \(D \in \sigma \) where a word w occurs with a vicinity \(t_w\), for which G uses the assignment \(w \mapsto t\) in the derivation, then \(t_w\) must be in FE(t). Finally, we use Corollary 1 relating FE(t) to \(FE^{K+1}(t)\).

Theorem 1

(Convergence).  Let \(K>1\) and \(J=1 \text{ or } 2\). Let G be any CDG. The algorithm \(TGE^{(K)}_{J-seq}\) stabilizes on every training sequence in \(\varDelta (G)\) to a grammar with assignments in the result of \(TGE^{(K)}_{J-seq}\) on \(FE^{K+1}(G)\).

Proof

We have (1): \(TGE^{(K)}_{J-seq}(\sigma [i]) \subseteq TGE^{(K)}_{J-seq}(\sigma [i\!\!+\!\!1]) \subseteq \ldots \) As observed in Lemma 2, the vicinities of the words of the DS in \(\sigma \) belong to FE(G). Suppose we had an infinite chain of types \(t'_i = TGE^{(K)}_{J-seq}(t_i)\), with assignments \(w_i \mapsto t'_i\) in \(TGE^{(K)}_{J-seq}(\sigma [i])\) but not in \(TGE^{(K)}_{J-seq}(\sigma [i-1])\) (we could consider one such chain concerning the same word w, as the lexicon of G is finite). Every \(t_i\) belongs to some \(FE^{K_i}(G)\); by Lemma 1, there then exists \(t''_i\) in \(FE^{K+1}(G)\) such that \(t'_i = TGE^{(K)}_{J-seq}(t''_i)\). We can thus view the set of the \(t'_i\) as the result of \(TGE^{(K)}_{J-seq}\) on a subset of \(FE^{K+1}(G)\); since \(FE^{K+1}(G)\) is finite, this is a contradiction.

Therefore for any G and any \(K>1\):

\(\exists N, \forall N'\ge N\) \(TGE^{(K)}_{J-seq}(\sigma [N'])= TGE^{(K)}_{J-seq}(\sigma [N])\)

Furthermore, if \(w \mapsto t' \in TGE^{(K)}_{J-seq}(\sigma [N])\), there exists \(w \mapsto t'' \in FE^{K+1}(G)\) such that \(t' = TGE^{(K)}_{J-seq}(t'')\): in that sense, the assignments in \(TGE^{(K)}_{J-seq}(\sigma [N])\) are in the result of \(TGE^{(K)}_{J-seq}\) on \(FE^{K+1}(G)\).

Proposition 1

Let \(K>1\) and \(J=1 \text{ or } 2\).

If G is a CDG and \(\sigma \) is a sequence in \(\varDelta (G)\) then

(1) \(TGE^{(K)}_{J-seq}(\sigma [i]) \subseteq TGE^{(K)}_{J-seq}(\sigma [i+1])\) (monotonicity/incrementality);

(2) \(\sigma [i] \subseteq \varDelta (TGE^{(K)}_{J-seq}(\sigma [i]))\) (expansivity);

(3) \(\varDelta (TGE^{(K)}_{J-seq}(\sigma [i])) \subseteq \varDelta (G')\), where \(G'\) is the result of \(TGE^{(K)}_{J-seq}\) on \(FE^{K+1}(G)\).

Proof

(1) holds by definition of the algorithm (which only expands the lexicon); (2) can be shown by adapting the derivation; (3) follows from Lemma 2.

3.3 A Family of Learnable Classes

Definition 15

Two grammars are said to be strongly equivalent if they generate the same dependency structure language. The strong equivalence criterion is:

(i) G is strongly equivalent to the result of \(TGE^{(K)}_{J-seq}\) on \(FE^{K+1}(G)\).

This criterion defines the subclass, written \(\mathcal{C}CDG^{K}_{J-seq}\), of grammars satisfying (i).

Theorem 2

Let \(K>1\) and \(J=1 \text{ or } 2\). The algorithm \(TGE^{(K)}_{J-seq}\) learns the class of CDG satisfying the strong equivalence criterion (i) from labelled dependency structures.

Proof

From Proposition 1(1): \(TGE^{(K)}_{J-seq}(\sigma [i]) \subseteq TGE^{(K)}_{J-seq}(\sigma [i\!+\!1]) \subseteq ...\)

The stabilization property holds (Theorem 1):

\(\exists N, \forall N'\ge N\) \(TGE^{(K)}_{J-seq}(\sigma [N'])= TGE^{(K)}_{J-seq}(\sigma [N])\)

Then by Proposition 1(2): \(\varDelta (G) \subseteq \varDelta (TGE^{(K)}_{J-seq}(\sigma [N]))\),

and using (i) and Proposition 1(3): \(\varDelta (G) \subseteq \varDelta (TGE^{(K)}_{J-seq}(\sigma [N])) \subseteq \varDelta (G)\).

Therefore, for any grammar satisfying (i), we obtain convergence to a grammar generating the same structure language.

Observe that this class does not impose a bound on the number of types associated with a word (in contrast to k-valued grammars). The learnability for \(J=1\) was studied in [3], with a special case of our algorithm.

4 Extended CDG and Dependency Treebanks

From Dependency Treebanks to Vicinities. Our workflow applies to data in the CoNLL format. The CDG potentials in this section are considered empty.

For each governor unit in each corpus we have computed (using MySQL and Camelis): (1) its vicinity in the root-simplified form \([l_1 \backslash \ldots \backslash l_n \backslash root / r_m / \ldots / r_1]\) (where \(l_1\) to \(l_n\) on the left and \(r_1\) to \(r_m\) on the right are the successive dependency names from that governor); (2) its generalization as a star-vicinity, replacing consecutive repetitions of \(d_k\) on the same side with \(d_k^*\); and (3) its generalization as a vicinity_2seq, following the LML mode of the algorithm in Fig. 6 for J = K = 2.
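A few lines of Python suffice to compute the root-simplified vicinities from a CoNLL-U-like file. This sketch is ours and stands in for the MySQL/Camelis workflow of the paper; it assumes the usual ten-column layout with HEAD and DEPREL in columns 7 and 8.

import collections

def conll_vicinities(path):
    counts = collections.Counter()

    def flush(tokens):
        deps = collections.defaultdict(list)
        for tid, head, rel in tokens:
            if head > 0:
                deps[head].append((tid, rel))
        for gov, items in deps.items():
            items.sort()
            lefts  = [r for t, r in items if t < gov][::-1]  # l_1 nearest first
            rights = [r for t, r in items if t > gov][::-1]  # r_m farthest first
            v = "[" + "\\".join(lefts + ["root"]) + \
                "".join("/" + r for r in rights) + "]"
            counts[v] += 1

    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                flush(sent); sent = []
            elif not line.startswith("#"):
                cols = line.split("\t")
                if cols[0].isdigit():          # skip multiword token ranges
                    sent.append((int(cols[0]), int(cols[6]), cols[7]))
    if sent:
        flush(sent)
    return counts

The resulting counter maps each root-simplified vicinity to its number of governor units; the star- and 2seq-generalizations can then be obtained by applying a generalization such as tgen_lml (sketched in Sect. 3) to each side.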

Fig. 7. Simplified vicinities computed on the Sequoia corpus

Our development makes it possible to mine repetitions and to call several kinds of viewers: we use the interactive item/word description viewer Camelis and the sentence parse CoNLL viewer [11] or grew.

Figure 7, on its left, shows the root-simplified vicinities computed on the Sequoia corpus; the resulting file has been loaded as an interactive information context in Camelis. This tool manages three synchronised windows: the current query is on the top, selecting the objects on the right, and their properties can be browsed in the multi-facet index on the left.

Results on the French corpus Sequoia. We consider a version of the Sequoia corpus [4] that defines dependency structures. The study uses only the surface-syntax dependency trees. Sequoia is not validated by a dependency grammar in the sense of Mel’čuk and does not have to follow the repeatable principle.

The process yields 530 distinct star-vicinities having repetitions (a star), among 2660 distinct vicinities (on 67038 units, among which 37883 governors); the form “notables” with postag “NC” is one example.

We observe that the consecutive repeatable dependencies \(d_1.d_1\) on the left are: aff, dep, det, mod, ponct; the consecutive repeatable dependencies on the right are: coord, dep (+ dep.coord), mod (+ mod.app), obj:obj+obj.p, p_obj.o, ponct.

The most frequent star-vicinity covers 204 units and the most frequent vicinity_2seq covers 25 units; 166 units correspond to a repetition "(mod . ponct)*". Several repeated sequences of length 2 occur, either on the left or on the right; these patterns always include a ponct dependency.

Repeated sequences of length 3, with three distinct dependencies, seem to be rare. We found one sentence illustrating this case: “Ils ont vidé les supermarchés de nourriture, les pharmacies de médicaments, les usines de matériel médical, ils ont cambriolé les maisons et torturé des voisins et des amis.” (fr. they emptied the supermarkets of food, the pharmacies of medicine, the factories of medical equipment; they burgled the houses and tortured neighbours and friends), whose vicinity contains such a repetition.

Other corpora. Our development can handle other treebanks in the CoNLL format. Figure 8 summarizes some observations on two corpora, with the number of units corresponding to repetition patterns.

Fig. 8. Dependency repetitions, for K = 2 and sequence length J

In the fr-ud-train corpus, the most frequent star-vicinity covers 194 units; 45 units correspond to a repetition (adpmod . p)*. Eighteen distinct repeating patterns were found.

5 Conclusion

In this paper, we have extended classical Categorial Dependency Grammars with a new construction that handles repeatable sequences of several dependencies. The work was motivated by the observation of such patterns in existing treebanks. We have proposed a learning algorithm, a version of which has been implemented and applied to some treebanks (in the CoNLL format). Some design and computational variants are possible, depending on the reading of the repetition principle. On the formal side, further analysis could consider richer patterns. On the experimental side, other treebanks could be explored as well. It would also be interesting to reconsider these notions in other formalisms or application domains.