In a very general setting, compositionality can be defined as a homomorphic image (or functorial passage) from a syntactic algebra (or category) to a semantic algebra (or category). The only condition, then, is that the semantic algebra be weaker than the syntactic algebra: each syntactic operation needs to be interpretable by a semantic operation. To give a formal semantic account, one would map the proof terms of a categorial grammar, or the rewritings of a generative grammar, to the semantic operations of abstraction and application of some lambda term calculus. In a distributional model such as that of Coecke et al. (2010, 2013), derivations of a Lambek grammar are interpreted by linear maps on finite dimensional vector spaces. For our presentation it will suffice to say that the Lambek calculus can be considered to be a monoidal biclosed category, which makes the mapping to the compact closure of vector spaces straightforward. However, we want to employ the copying power of non-linear lambda calculus, and so we will move away from this direct interpretation.
What we end up with is in fact a more intricate target than in the direct case: target expressions are now lambda terms with a tensorial interpretation, i.e. a program with access to word embeddings. The next subsections outline the details: we consider the syntax of the Lambek calculus with limited contraction, the semantics of non-linear lambda calculus, and the interpretation of lambda terms in a lambda calculus with tensors.
Derivational Semantics: Formulas, Proofs and Terms
We start by introducing the Lambek Calculus with Limited Contraction (LLC), a conservative extension of the Lambek calculus L, as defined in Jäger’s monograph (Jäger 2006).
LLC was originally defined to deal with anaphoric binding, but it has many more applications, including verb phrase ellipsis and ellipsis with anaphora. The system extends the Lambek calculus with a single binary connective | that behaves like an implication for anaphora: a formula A|B says that a formula A can be bound to produce the formula B, while retaining the formula A. This non-linear behaviour allows the kind of resource multiplication in syntax that one expects when dealing with anaphoric binding and ellipsis.
More formally, the formulas, or types, of LLC are built over a set of basic types T according to the following definition:
Definition 1
Formulas of LLC are given by the following grammar, where T is a finite set of basic formulas:
$$\begin{aligned} A,B := T \ | \ A \bullet B \ | \ A \backslash B \ | \ B /A \ | \ A|B \end{aligned}$$
Intuitively, the Lambek connectives \(\bullet , \backslash , /\) represent a ‘logic of concatenation’: \(\bullet \) represents the concatenation of resources, while \(\backslash , /\) represent directional decatenation, behaving as the residual implications with respect to the multiplication \(\bullet \). The extra connective | is a separate implication that behaves non-linearly: its deduction rules allow a mix of permutations and contractions, which effectively treat anaphora and VP ellipsis markers as phrases that look leftward to find a proper binding antecedent. Our convention is that we read A|B as a function with an input of type A and an output of type B.
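For concreteness, the lexical type assignments that appear in the running example of Fig. 2, discussed below, are the following (with \(\texttt {dt}\) abbreviating the ellipsis marker does-too):
$$\begin{aligned} \texttt {alice}, \texttt {bob} : np \qquad \texttt {drinks} : np \backslash s \qquad \texttt {and} : (s \backslash s)/s \qquad \texttt {dt} : (np \backslash s) \,|\, (np \backslash s) \end{aligned}$$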
The rules of LLC are given in natural deduction style in Fig. 1. The Lex rule is an axiom of the logic: it allows us to relate the judgements of the logic to the words of the lexicon. For instance, in the example proof tree provided in Fig. 2, the judgement \(\texttt {alice} : np\) is related to the word Alice, the judgement \(\texttt {bob} : np\) to the word Bob, and the judgement \(\texttt {and} : (s \backslash s) / s\) to the word and. Then, as is usual in natural deduction, every connective has an introduction rule, marked with I, and an elimination rule, marked with E. In the introduction rules for / and \(\backslash \), the variable x stands for an axiom; in the introduction rule for \(\bullet \) and the elimination rules for \(\bullet , /\) and \(\backslash \), we have proofs of the premise types \(A, B, A \bullet B, A/B\) and \(A\backslash B\), i.e. general terms N and M.
Informally speaking, the introduction rule for \(\bullet \) takes two terms M and N, where M proves a formula A and N proves a formula B, and pairs them: it tells us that \(\langle M,N\rangle \) proves \(A \bullet B\). The elimination rule for \(\bullet \) takes a term M denoting a pair and tells us that its first projection \(\pi _1(M)\), i.e. the first element of the pair, proves A, and that its second projection proves B. The introduction rule for \(\backslash \) refers to the index of the rule where the formula A was proved using a variable x; given a proof tree that uses this rule, possibly together with other rules, to prove the formula B with a term M, it derives the formula \(A\backslash B\) with the lambda term \(\lambda x. M\). The lambda terms are explained later on; for now, think of this term as the function M abstracted over the variable x. The elimination rule for \(\backslash \) does the opposite of the introduction rule: it takes a term x for the formula A and a term y for the formula \(A \backslash B\), and tells us that we can apply y to x to obtain something of type B. The rules for / are similar, but with the opposite ordering, as is easily checked from their proof rules in Fig. 1.
That brings us to the main rules that differentiate LLC from L (the Lambek calculus): the rules for |. Here, the elimination rule tells us that if somewhere earlier in the proof we proved A from a term N and marked the result with an index i, and later we encounter a term M for A|B, then we are allowed to eliminate | and obtain B by applying the term M to the term N. This rule is very similar to the \(\backslash \) and / rules, in that it says we can eliminate the connective by applying its term to the term of one of its compartments, i.e. its input. The difference in the | elimination rule is that it allows this input, i.e. \([N : A]_i \), to occur not directly as the antecedent of the elimination rule, but at any earlier point in the proof. We can see how this rule applies in the proof tree of Fig. 2: we see a proof for \(\texttt {drinks} : np\backslash s\), on this occasion indexed with a label i; much later in the proof we encounter the term \(\texttt {dt}: (np \backslash s) | (np \backslash s)\) for the ellipsis marker does-too, and the E|, i rule allows us to apply the latter to the former, reaching all the way back, to obtain \(\texttt {dt drinks}: np\backslash s\). The | connective also has an introduction rule; a proper formulation of this rule, however, is more delicate. Since our anaphoric expressions are already typed in the lexicon, we do not need this rule in our paper and refer the reader to Jäger’s book (Jäger 2006, pp. 123–124) for different formulations and explanations of it.
The interpretation of proofs is established by a non-linear simply typed lambda term calculus, which labels the natural deduction rules of the calculus:
Definition 2
Given a countably infinite set of variables \(V = \{x,y,z,\ldots \}\), terms of \(\varvec{\lambda }\) are given by the following grammar:
$$\begin{aligned} M,N := V \ | \ \lambda x. M \ | \ M \ N \ | \ \langle M,\ N \rangle \ | \ \pi _1(M) \ | \ \pi _2(M) \end{aligned}$$
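As an illustration only, this term grammar could be represented as a datatype along the following lines (a minimal sketch in Python; the constructor names Var, Abs, App, Pair, Proj1 and Proj2 are ours, not part of the calculus):

```python
from dataclasses import dataclass

# Minimal sketch of the term grammar  M, N := V | λx.M | M N | <M, N> | π1(M) | π2(M)

@dataclass(frozen=True)
class Var:        # a variable drawn from the countable set V
    name: str

@dataclass(frozen=True)
class Abs:        # lambda abstraction λx. M
    var: str
    body: "Term"

@dataclass(frozen=True)
class App:        # application M N
    fun: "Term"
    arg: "Term"

@dataclass(frozen=True)
class Pair:       # pairing <M, N>
    fst: "Term"
    snd: "Term"

@dataclass(frozen=True)
class Proj1:      # first projection π1(M)
    pair: "Term"

@dataclass(frozen=True)
class Proj2:      # second projection π2(M)
    pair: "Term"

Term = Var | Abs | App | Pair | Proj1 | Proj2  # union of the constructors (Python 3.10+)

# Example: the non-linear copying term λx.<x, x> applied to a variable y; it β-reduces to <y, y>.
example = App(Abs("x", Pair(Var("x"), Var("x"))), Var("y"))
```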
Terms obey the standard \(\alpha \)-, \(\eta \)- and \(\beta \)-conversion rules:
Definition 3
For terms of \(\varvec{\lambda }\) we define three conversion relations:
1. \(\alpha \)-conversion: for any term M we have
$$\begin{aligned} \begin{array}{ccc} M&=_{\alpha }&M[x \mapsto y] \end{array} \end{aligned}$$
provided that y is a fresh variable, i.e. it does not occur in M.
2. \(\eta \)-conversion: for terms M we have
$$\begin{aligned} \begin{array}{lcl} \lambda x. M\ x & =_{\eta } & M \quad \text {(x does not occur in M)} \\ \langle \pi _1(M),\ \pi _2(M) \rangle & =_{\eta } & M \\ \end{array} \end{aligned}$$
3. \(\beta \)-conversion: for terms M we define
$$\begin{aligned} \begin{array}{ccc} (\lambda x.M)\ N & \rightarrow _{\beta } & M[x \mapsto N] \\ \pi _1(\langle M,\ N \rangle ) & \rightarrow _{\beta } & M \\ \pi _2(\langle M,\ N \rangle ) & \rightarrow _{\beta } & N \\ \end{array} \end{aligned}$$
We moreover write \(M \twoheadrightarrow _{\beta } N\) whenever M converts to N in multiple steps.
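For instance, the copying behaviour that linear calculi lack is already visible in a two-step reduction:
$$\begin{aligned} \pi _1\big ((\lambda x.\, \langle x\ N,\ x\ N \rangle )\ M\big ) \rightarrow _{\beta } \pi _1(\langle M\ N,\ M\ N \rangle ) \rightarrow _{\beta } M\ N \end{aligned}$$
so that \(\pi _1((\lambda x. \langle x\ N,\ x\ N \rangle )\ M) \twoheadrightarrow _{\beta } M\ N\), with the argument M duplicated by the abstraction.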
The full labelled natural deduction system is given in Fig. 1. Proofs and terms give the basis of the derivational semantics: given a lexical relation \(\sigma \subseteq \varSigma \times F\), for \(\varSigma \) a dictionary of words and F the set of formulas, we say that a sequence of words \(w_1,\ldots ,w_n\) derives the formula A whenever it is possible to derive a term M : A with free variables \(x_i\) of type \(\sigma (w_i)\). For the \(x_i\), one can substitute constants \(c_i\) of type \(\sigma (w_i)\) representing the meanings of the actual words \(w_1,\ldots ,w_n\). The abstract meaning of the sequence is thus given by the lambda term M. An example of such a proof is given for the elliptical phrase “Alice drinks and Bob does-too” in Fig. 2. More involved examples are given in Figs. 3 and 4; they will be discussed in Sect. 4.
Lexical Semantics: Lambdas, Tensors and Substitution
We complete the vector semantics by adding the second step in the interpretation process, which is the insertion of lexical entries for the assumptions occurring in a proof. In this step we face the issue that direct interpretation into a vector space is not an option, since there is no copying map that is linear, while at the same time lambda terms do not, at first sight, reflect vectors. We solve the issue by showing, following Muskens and Sadrzadeh (2016, 2019), that vectors can be emulated using a lambda calculus.
Lambdas and Tensors
The idea of modelling tensors with the lambda calculus is to represent vectors as functions from natural numbers to values in the underlying field. This representation treats vectors as lists of elements of a field, for instance the field of reals \(\mathbb {R}\), and the function enumerates the elements of this list. For instance, consider the following vector
$$\begin{aligned} \overrightarrow{v} = [a,b,c, \dots ] \qquad \text {for} \ a,b,c,{\dots } \in \mathbb {R}\end{aligned}$$
The representation of \(\overrightarrow{v}\) by a function f is then as follows
$$\begin{aligned} f(1) = a,\ f(2) = b,\ f(3) = c, \dots \text { and so on} \end{aligned}$$
For natural language applications, it is convenient to work with a fixed set of indices rather than directly working with natural numbers as the starting point. These indices will be the “context words” of the vector space model of word meaning. For demonstration purposes, suppose these context words are the following set of words
$$\begin{aligned} C = \{\text{ human, } \text{ painting, } \text{ army, } \text{ weapon, } \text{ marathon }\} \end{aligned}$$
Then a “target word”, i.e. the word whose meaning we are representing using these context words, will have values from \(\mathbb {R}\) in the entries of a vector space spanned by the above context set. For instance, consider three target words “warrior”, “sword”, and “athlete”. Their vector representations are as follows:
$$\begin{aligned} \overrightarrow{\text {warrior}}= & {} [4,1,2,9,1]\\ \overrightarrow{\text {sword}}= & {} [2,3,9,2,0]\\ \overrightarrow{\text {athlete}}= & {} [6,2,0,1,9] \end{aligned}$$
In functional notation, our index set is the set of context words, e.g. C as given above, and for each target word our function returns its value on each of the context words. So for instance, for a function f, the vector representation of “warrior” is as follows
$$\begin{aligned} f(\text {human}) = 4, f(\text {painting}) = 1, f(\text {army}) = 2, f(\text {weapon}) = 9, f(\text {marathon}) = 1 \end{aligned}$$
Type-theoretically, instead of working with a set of words as the domain of the representation function f, we enumerate the set of context words and use their indices as inputs to f. So we denote our set C above by indices \(i_1, i_2, \dots , i_5\), which changes the function representation to the following
$$\begin{aligned} f(i_1) = 4, f(i_2) = 1, f(i_3) = 2, f(i_4) = 9, f(i_5) = 1 \end{aligned}$$
That is, for any dimensionality n, we assume a basic type \(I_n\) representing a finite index set of context words (in concrete models the number of index types will be finite). The underlying field in the case of natural language applications remains the set of real numbers \(\mathbb {R}\); we denote it by the type R. For more information about \(\mathbb {R}\) as a type, see Muskens and Sadrzadeh (2016). As explained above, the type of a vector in \(\mathbb {R}^n\) becomes \(V^n = I_n \rightarrow R\). Similarly, the type of an \(n \times m\) matrix, which is a vector in a space whose basis consists of pairs of words, is \(M^{n\times m} = I_n \rightarrow I_m \rightarrow R\). In general, we may represent an arbitrary tensor with dimensions \(n,m,\ldots ,p\) by \(T^{n\times m \times \cdots \times p} = I_n \rightarrow I_m \rightarrow \cdots \rightarrow I_p \rightarrow R\). We abbreviate cubes \(T^{n \times m \times p}\) to C and hypercubes \(T^{n \times m \times p \times q}\) to H. We will leave out the superscripts denoting dimensionality when they are either irrelevant or understood from the context.
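To make this functional encoding concrete, here is a minimal sketch in Python (our illustration; the numbers are the “warrior” vector from above, and the identity matrix is merely a hypothetical example of a curried matrix):

```python
from typing import Callable

# A vector of type I_n -> R is a function from indices to reals;
# a matrix of type I_n -> I_m -> R is a curried function of two indices.
Vector = Callable[[int], float]
Matrix = Callable[[int], Callable[[int], float]]

# Context words C = {human, painting, army, weapon, marathon}, enumerated as i_1, ..., i_5 (0-based below).
context = ["human", "painting", "army", "weapon", "marathon"]

# The "warrior" vector [4, 1, 2, 9, 1] as a function f with f(i_1) = 4, f(i_2) = 1, ...
warrior_values = [4.0, 1.0, 2.0, 9.0, 1.0]
warrior: Vector = lambda i: warrior_values[i]

# A hypothetical 5x5 matrix (here the identity) as a curried function of two indices.
identity: Matrix = lambda i: (lambda j: 1.0 if i == j else 0.0)

print(warrior(3))      # 9.0, the value on the "weapon" dimension
print(identity(2)(2))  # 1.0
```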
By reference to index notation for linear algebra, we write \(v \ i\) as \(v_i\) whenever it is understood that i is of type I. We moreover assume constants for the basic operations of a vector space: \(0 : R, 1 : R, + : R \rightarrow R \rightarrow R, \cdot : R \rightarrow R \rightarrow R\) with their standard interpretation. Some standard operations can now be expressed using lambda terms:
Name | Symbol | Lambda term
Matrix transposition | T | \(\lambda mij.\ m_{ji} : M \rightarrow M\)
Matrix-vector multiplication | \(\times _1\) | \(\lambda mvi. \sum \limits _{j} m_{ij} \cdot v_j : M \rightarrow V \rightarrow V\)
Cube-vector multiplication | \(\times _2\) | \(\lambda cvij. \sum \limits _{k} c_{ijk} \cdot v_k : C \rightarrow V \rightarrow M\)
Hypercube-matrix multiplication | \(\times _3\) | \(\lambda cmij. \sum \limits _{k,l} c_{ijkl} \cdot m_{kl} : H \rightarrow M \rightarrow M\)
Vector element-wise multiplication | \(\odot \) | \(\lambda uvi.\ u_i \cdot v_i : V \rightarrow V \rightarrow V\)
Vector addition | \(+\) | \(\lambda uvi.\ u_i + v_i : V \rightarrow V \rightarrow V\)
We can also express many other operations in the same way, e.g. backwards matrix multiplication, by composing matrix transposition with standard multiplication: \(\times ^T := \lambda mvi. \sum \limits _j m_{ji} \cdot v_j : M \rightarrow V \rightarrow V\). In the same way, it is routine to define a cube-matrix multiplication and hypercube-cube and hypercube-vector multiplications; these operations do not occur in the current paper. Similarly, one can define addition and element-wise multiplication operations between matrices, cubes, and hypercubes. In what follows, we abuse notation and denote the latter two with the same symbols, that is with \(+\) and \(\odot \), regardless of the type of object they add or multiply.
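The operations above could be realised concretely with NumPy's einsum, as in the following sketch (our illustration; the function names are ours and tensors are assumed to be stored as arrays):

```python
import numpy as np

# The lambda-term operations of the table, realised as tensor contractions via einsum.
transpose = lambda m: np.einsum("ij->ji", m)              # λmij. m_ji
mat_vec   = lambda m, v: np.einsum("ij,j->i", m, v)       # ×1 : λmvi. Σ_j m_ij · v_j
cube_vec  = lambda c, v: np.einsum("ijk,k->ij", c, v)     # ×2 : λcvij. Σ_k c_ijk · v_k
hcube_mat = lambda h, m: np.einsum("ijkl,kl->ij", h, m)   # ×3 : λcmij. Σ_{k,l} c_ijkl · m_kl
elem_mult = lambda u, v: u * v                            # ⊙ : λuvi. u_i · v_i
vec_add   = lambda u, v: u + v                            # +  : λuvi. u_i + v_i
back_mult = lambda m, v: np.einsum("ji,j->i", m, v)       # ×^T : λmvi. Σ_j m_ji · v_j

# A quick sanity check on small random tensors: ×^T coincides with transposition followed by ×1.
v, m = np.random.rand(5), np.random.rand(5, 5)
assert np.allclose(back_mult(m, v), mat_vec(transpose(m), v))
```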
All of these operations, except for addition, are instances of the multilinear algebraic operation of tensor contraction, which applies to any two tensors of arbitrary rank as long as they share at least one index. The tensor contraction between them is formed by applying the following formula:
$$\begin{aligned}&\sum _{i_1,\ldots ,i_{n+k}} A_{i_1 i_2 \cdots i_n } B_{i_{n} i_{n+1} \cdots i_{n+k}} \in \underbrace{W \otimes \cdots \otimes W}_{n+k-1}\\&\text {For} \ \sum _{i_1,\ldots ,i_{n}} A_{i_1 i_2 \cdots i_n}\in \underbrace{W \otimes \cdots \otimes W}_n \quad \text {and} \quad \sum _{i_n,\ldots ,i_{n+k}} B_{i_{n} i_{n+1} \cdots i_{n+k}} \in \underbrace{W \otimes \cdots \otimes W}_{k+1} \end{aligned}$$
Element-wise multiplication between two vectors, matrices, or tensors of the same rank is also an instance of tensor contraction, where one of the arguments of the multiplication is raised to a tensor of higher rank, with the argument on its diagonal and its other entries padded with zero. For an instance of this, see Kartsaklis (2016), where coordination is treated in a DisCoCat model and the author shows how the linear algebraic closed form of element-wise multiplication arises as the result of a tensor contraction.
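This diagonal-raising trick can be checked numerically, as in the sketch below (our illustration, reusing the “warrior” and “sword” vectors from above):

```python
import numpy as np

u = np.array([4.0, 1.0, 2.0, 9.0, 1.0])  # the "warrior" vector
v = np.array([2.0, 3.0, 9.0, 2.0, 0.0])  # the "sword" vector

# Raise u to a rank-2 tensor with u on the diagonal and zeros elsewhere,
# then contract with v: the result equals the element-wise product u ⊙ v.
diag_u = np.diag(u)
assert np.allclose(np.einsum("ij,j->i", diag_u, v), u * v)
```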
Lexical Substitution
To obtain a concrete model for a phrase, we need to replace the abstract meaning term of a proof by a concrete tensor mapping. Since we map lambda terms to lambda terms, we only need to specify how constants c are mapped to tensors. This will automatically induce a type-respecting term homomorphism \({{\mathcal {H}}}\). A general map that sends constants to a contraction-friendly model is presented in Table 1.
Table 1 Translation that sends abstract terms to a tensor-based model using matrix and cube multiplication as the main operations; here and in the two tables that follow, the atomic types are np and s
The composition operators of Table 1 seem to be quite different: we have matrix multiplication for adjectival phrases, intransitive sentences and verb phrases, cube multiplication for transitive sentences, and pointwise multiplication for the conjunctive coordination.
Using Table 1, we can translate the proof term of Fig. 2 as follows:
$$\begin{aligned} (\texttt {and} \ ((\texttt {dt} \ \texttt {drinks}) \ \texttt {bob})) (\texttt {drinks} \ \texttt {alice}) : s \end{aligned}$$
and substitute the concrete terms to get the following \(\beta \)-reduced version:
$$\begin{aligned} \twoheadrightarrow _{\beta } (\mathbf {drinks} \times _1 \mathbf {alice}) \odot (\mathbf {drinks} \times _1 \mathbf {bob}) \end{aligned}$$
As another alternative, we can instantiate the proof terms in a multiplicative-additive model. This is a model in which sentence vectors are obtained by adding the individual word embeddings, and the overall result is obtained by multiplying the two sentence vectors. This model is presented in Table 2, according to which we obtain the following semantics for our example sentence above:
$$\begin{aligned} \twoheadrightarrow _{\beta } (\mathbf {drinks} + \mathbf {alice}) \odot (\mathbf {drinks} + \mathbf {bob}) \end{aligned}$$
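Both instantiations can be computed directly, as in the sketch below (our illustration; the embedding values are random placeholders, with drinks assumed to be a matrix in the tensor-based model and a vector in the multiplicative-additive one):

```python
import numpy as np

rng = np.random.default_rng(0)
alice, bob = rng.random(5), rng.random(5)   # noun vectors
drinks_mat = rng.random((5, 5))             # the verb as a matrix (tensor-based model)
drinks_vec = rng.random(5)                  # the verb as a vector (multiplicative-additive model)

# Tensor-based model: (drinks ×1 alice) ⊙ (drinks ×1 bob)
tensor_semantics = (drinks_mat @ alice) * (drinks_mat @ bob)

# Multiplicative-additive model: (drinks + alice) ⊙ (drinks + bob)
additive_semantics = (drinks_vec + alice) * (drinks_vec + bob)
```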
Another alternative is Table 3, which provides the same terms with a Kronecker-based tensor semantics, originally used by Grefenstette and Sadrzadeh (2011a) to model transitive sentences.
Table 2 Translation that sends abstract terms to a multiplicative-additive model
We symbolise the semantics of the basic elliptical phrase that comes out of any of these models for our example sentence as follows:
$$\begin{aligned} M(\mathbf {sub}_1, \mathbf {verb}) \ \star \ M(\mathbf {sub}_2, N(\mathbf {verb})) \end{aligned}$$
where M is a general term for an intransitive sentence, N is a term that modifies the verb tensor through the auxiliary verb, and \(\star \) is an operation that expresses the coordination of the two subclauses. For a transitive sentence version, the above changes to the following:
$$\begin{aligned} M(\mathbf {subj}_1, \mathbf {verb}, \mathbf {obj}_1) \ \star \ M(\mathbf {subj}_2, N(\mathbf {verb}), \mathbf {obj}_1) \end{aligned}$$
Such a description is very general, and in fact allows us to derive almost all compositional vector models that have been tested in the literature (see e.g. Milajevs et al. 2014). This flexibility is necessary for ellipsis, because it allows us to model the Cartesian behaviour that is unavailable in a categorical modelling of vectors and linear maps. Some models, however, can only be incorporated by changing the lexical formulas associated with the individual words. The proposal of Kartsaklis et al. (2016) is one such example: it relies heavily on the coordinator, and its typing and vector/tensor assignments result in the following lambda semantics for the phrase “Alice drinks and Bob does too”:
$$\begin{aligned} \mathbf {drinks} \times _1 (\mathbf {alice} \odot \mathbf {bob}) \end{aligned}$$
The above is obtained by assigning an identity linear map to the auxiliary phrase ‘does too’ and a complex linear map to the coordinator ‘and’, tailored so that it guarantees the derivation of the final meaning. In our framework, we would take a similar approach: we modify M to essentially return the verb-subject pair, take N to be the identity, and define and with the purpose-built term below, which takes two pairs of subjects and verbs but discards one copy of the verb, mimicking the model of Kartsaklis et al. (2016):
$$\begin{aligned} \texttt {and} \qquad \lambda \langle s, v \rangle . \lambda \langle t, w \rangle . v \times _1 (s \odot t) \end{aligned}$$
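In code, this tailored term for and could be sketched as follows (our illustration; pairs are represented as tuples, and \(\times _1\) and \(\odot \) are realised as in the earlier NumPy sketches):

```python
import numpy as np

# and := λ<s, v>. λ<t, w>. v ×1 (s ⊙ t)
# Takes two (subject, verb) pairs and discards the second verb copy w.
def and_term(pair1, pair2):
    s, v = pair1  # first subject vector and the verb matrix
    t, w = pair2  # second subject vector and the (discarded) verb matrix copy
    return v @ (s * t)  # v ×1 (s ⊙ t)

# Usage for "Alice drinks and Bob does too", in the style of Kartsaklis et al. (2016):
rng = np.random.default_rng(0)
alice, bob = rng.random(5), rng.random(5)
drinks = rng.random((5, 5))
result = and_term((alice, drinks), (bob, drinks))  # equals drinks ×1 (alice ⊙ bob)
```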
In either case, we can reasonably derive a large class of compositional functions that can be experimented with in a variety of tasks. With these tools in hand, we can give the desired interpretation to elliptical sentences in the next section.
Table 3 Translation that sends abstract terms to a Kronecker model. We abuse notation and denote the element-wise multiplication of two matrices with the same symbol, \(\odot \), as the element-wise multiplication of two vectors