# Compositional Operators in Distributional Semantics


## Abstract

This survey presents in some detail the main advances that have recently taken place in Computational Linguistics towards unifying the two prominent semantic paradigms: the compositional formal semantics view and distributional models of meaning based on vector spaces. After an introduction to these two approaches, I review the most important models that aim to provide compositionality in distributional semantics. I then present in more detail a particular framework [7] based on the abstract mathematical setting of category theory, as a more complete example capable of demonstrating the diversity of techniques and scientific disciplines that this kind of research can draw from. The paper concludes with a discussion of important open issues that need to be addressed by researchers in the future.

## Keywords

Natural language processing · Distributional semantics · Compositionality · Vector space models · Formal semantics · Category theory · Compact closed categories

## Introduction

Recent developments in the syntactic and morphological analysis of natural language text constitute the first step towards the more ambitious goal of assigning a proper form of *meaning* to arbitrary text compounds. Indeed, for truly 'intelligent' applications, such as machine translation, question-answering systems, paraphrase detection or automatic essay scoring, to name just a few, there will always exist a gap between raw linguistic information (such as part-of-speech labels, for example) and the knowledge of the real world that is needed for completing the task in a satisfactory way. Semantic analysis has exactly this role: it aims to close (or reduce as much as possible) this gap by linking the linguistic information with semantic representations that embody this elusive real-world knowledge.

The traditional way of adding semantics to sentences is a syntax-driven compositional approach: every word in the sentence is associated with a primitive symbol or a predicate, and these are combined into larger and larger logical forms based on the syntactic rules of the grammar. At the end of the syntactic analysis, the logical representation of the whole sentence is a complex formula that can be fed to a theorem prover for further processing. Although such an approach seems intuitive, it has been shown to be rather inefficient for practical applications (for example, Bos and Markert [5] report very low recall scores for a textual entailment task). Even more importantly, the meaning of the atomic units (the words) is captured in an axiomatic way, namely by ad hoc unexplained primitives that have nothing to say about the real semantic value of the specific words.

On the other hand, distributional models of meaning work by building co-occurrence vectors for every word in a corpus based on its context, following Firth's intuition that 'you shall know a word by the company it keeps' [12]. These models have proved useful in many natural language tasks (see section From Words to Sentence) and can provide concrete information about the words of a sentence, but they do not scale up to larger text constituents, such as phrases or sentences. Given the complementary nature of these two distinct approaches, it is not a surprise that the compositional abilities of distributional models have been the subject of much discussion and research in recent years. Towards this purpose researchers have exploited a wide variety of techniques, ranging from simple mathematical operations like addition and multiplication to neural networks and even category theory. The purpose of this paper is to provide a concise survey of the developments that have been taking place towards the goal of equipping distributional models of meaning with compositional abilities.

The plan is the following: in sections Compositional Semantics and Distributional Semantics I provide an introduction to compositional and distributional models of meaning, respectively, explaining the basic principles and assumptions on which they rely. I then review the most important methods aiming towards their unification (section Compositionality in Distributional Approaches). As a more complete example of such a method (and as a demonstration of the multidisciplinarity of Computational Linguistics), section A Categorical Framework for Natural Language describes the framework of Coecke et al. [7], based on the abstract setting of category theory. Section Verb and Sentence Spaces takes a closer look at the form of a sentence space, and at how our sentence-producing functions (i.e. the verbs) can be built from a large corpus. Finally, section Challenges and Open Questions discusses important philosophical and practical open questions and issues that form part of current and future research.

## Compositional Semantics

Compositionality in semantics offers an elegant way to address the inherent capacity of natural language to produce infinite structures (phrases and sentences) from finite resources (words). The *principle of compositionality* states that the meaning of a complex expression can be determined by the meanings of its constituents and the rules used for combining them. This idea is quite old, and glimpses of it can be spotted even in the works of Plato. In his dialogue *Sophist*, Plato argues that a sentence consists of a noun and a verb, and that the sentence is true if the verb denotes the action that the noun is currently performing. In other words, Plato argues that (a) a sentence has a structure, (b) the parts of the sentence have different functions and (c) the meaning of the sentence is determined by the function of its parts. Nowadays, this intuitive idea is often attributed to Gottlob Frege, who expresses similar views in 'The Foundations of Arithmetic', originally published in 1884. In an undated letter to Philip Jourdain, included in 'Philosophical and Mathematical Correspondence' [14], Frege provides an explanation for the reason this idea seems so intuitive:

The possibility of our understanding propositions which we have never heard before rests evidently on this, that we can construct the sense of a proposition out of parts that correspond to words.

This forms the basis of the *productivity* argument, often used as a proof for the validity of the principle: humans only know the meanings of words and the rules for combining them into larger constructs; yet, equipped with this knowledge, we are able to produce new sentences that we have never uttered or heard before. Indeed, this task seems natural even for a 3-year-old child; however, its formalization in a way reproducible by a computer has proven anything but trivial. Modern compositional models owe a lot to the seminal work of Richard Montague (1930–1971), who presented a systematic way of processing fragments of the English language in order to get semantic representations capturing their 'meaning' [30, 31, 32]. In his 'Universal Grammar' (1970b), Montague states that

There is in my opinion no important theoretical difference between natural languages and the artificial languages of logicians.

Montague supports this claim by detailing a systematization of the natural language, an approach which became known as Montague grammar. To use Montague’s method, one would need two things: first, a resource which will provide the logical forms of each specific word (a lexicon); and second, a way to determine the correct order in which the elements in the sentence should be combined in order to end up with a valid semantic representation. A natural way to address the latter, and one traditionally used in computational linguistics, is to use the syntactic structure as a means of driving the semantic derivation (an approach called *syntax-driven semantic analysis*). In other words, we assume that there exists a mapping from syntactic to semantic types, and that the composition in the syntax level implies a similar composition in the semantic level. This is known as the *rule-to-rule hypothesis* [1].

- (1)
a. every \(\vdash Det : \lambda P.\lambda Q.\forall x[P(x) \rightarrow Q(x)]\)

b. man \(\vdash N : \lambda y.man(y)\)

c. walks \(\vdash Verb_{IN} : \lambda z.walks(z)\).

The above use of formal logic (especially higher order) in conjunction with \(\lambda \)-calculus was first introduced by Montague, and from then on it constitutes the standard way of providing logical forms to compositional models. In the above lexicon, predicates of the form \(man(y)\) and \(walks(z)\) are true if the individuals denoted by \(y\) and \(z\) carry the property (or, respectively, perform the action) indicated by the predicate. From an extensional perspective, the semantic value of a predicate can be seen as the set of all individuals that carry a specific property: \(walks(john)\) will be true if the individual \(john\) belongs to the set of all individuals who perform the action of walking. Furthermore, \(\lambda \)-terms like \(\lambda x\) or \(\lambda Q\) have the role of placeholders that remain to be filled. The logical form \(\lambda y.man(y)\), for example, reflects the fact that the entity which is going to be tested for the property of manhood is still unknown and it will be later specified based on the syntactic combinatorics. Finally, the form in (1a) reflects the traditional way for representing a universal quantifier in natural language, where the still unknown part is actually the predicates acting over a range of entities.

- (2)
Every man walks.
- (3)
\( \begin{array}{ll} NP \rightarrow Det\,\,N & \text{a noun phrase consists of a determiner and a noun.} \end{array}\)

\( \begin{array}{ll} S \rightarrow NP\, Verb_{IN} & \text{ a sentence consists of a noun phrase and an intransitive verb.} \end{array}\)

The first rule states that the logical form of the noun phrase is derived by applying the logical form of the determiner *every* (1a) to the logical form of the noun *man* (1b). The details of this reduction are presented below:

- (4)
\(\begin{array}{ll} \lambda P.\lambda Q.\forall x[P(x) \rightarrow Q(x)](\lambda y.man(y)) \\ \quad \rightarrow _\beta \lambda Q.\forall x[(\lambda y.man(y))(x) \rightarrow Q(x)] \qquad \qquad P:= \lambda y.man(y)\\ \quad \rightarrow _\beta \lambda Q.\forall x[man(x) \rightarrow Q(x)] \qquad \qquad y := x \\ \end{array}\).

Similarly, the second rule signifies that the logical form of the whole sentence is derived by the combination of the logical form of the noun phrase as computed in (4) above with the logical form of the intransitive verb (1c):

- (5)
\(\begin{array}{ll} \lambda Q.\forall x[man(x) \rightarrow Q(x)](\lambda z.walks(z)) \\ \quad \rightarrow _\beta \forall x[man(x) \rightarrow (\lambda z.walks(z))(x)] \qquad \qquad Q:= \lambda z.walks(z) \\ \quad \rightarrow _\beta \forall x[man(x) \rightarrow walks(x)] \qquad \qquad z:= x\\ \end{array}\).

Thus we have arrived at a logical form which can be seen as a semantic representation of the whole sentence. The tree below provides a concise picture of the complete semantic derivation:

- (6)

A logical form such as \(\forall x[man(x) \rightarrow walks(x)]\) simply states the truth (or falsity) of the expression given the sub-expressions and the rules for composing them. It does not provide any quantitative interpretation of the result (e.g. grades of truth) and, even more importantly, it leaves the meaning of words as unexplained primitives (\(man\), \(walks\), etc.). In the next section we will see how distributional semantics can fill this gap.
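The extensional reading described above can be sketched in a few lines of code. The following is a minimal illustration (the toy domain and the sets denoting *man* and *walks* are invented for the example): predicates are membership tests over a universe of individuals, and the determiner is a curried function mirroring (1a), with function application playing the role of the β-reductions in (4) and (5).

```python
# Extensional semantics of the lexicon in (1): a predicate denotes the set of
# individuals that satisfy it, encoded here as a membership test.
domain = {"john", "mary", "rex"}           # toy universe of individuals
man = lambda y: y in {"john"}              # (1b)  λy.man(y)
walks = lambda z: z in {"john", "rex"}     # (1c)  λz.walks(z)

# (1a)  λP.λQ.∀x[P(x) → Q(x)], with implication encoded as (not P) or Q
every = lambda P: lambda Q: all((not P(x)) or Q(x) for x in domain)

# Applying 'every' to 'man' and then to 'walks' mirrors the derivation in (6)
print(every(man)(walks))   # True: the only man (john) walks
```

Note that the result is a bare truth value, which is exactly the limitation discussed above: nothing in the output tells us anything about the meanings of the words themselves.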

## Distributional Semantics

### The Distributional Hypothesis

Distributional models of meaning are based on the *distributional hypothesis* [18], stating that words that occur in the same contexts have similar meanings. Various forms of this popular idea keep recurring in the literature: Firth [12] calls it collocation, while Frege himself advises that one should 'never ask for the meaning of a word in isolation, but only in the context of a proposition' [13]. The attraction of this principle in the context of Computational Linguistics is that it provides a way of concretely representing the meaning of a word via mathematics: each word is a vector whose elements show how many times this word occurred in some corpus in the same context as every other word in the vocabulary. If, for example, our basis is \(\{cute,\,sleep,\,finance,\,milk\}\), the vector for the word 'cat' could have the form \((15,7,0,22)\), meaning that 'cat' appeared 15 times together with 'cute', 7 times with 'sleep' and so on. More formally, given an orthonormal basis \(\{\overrightarrow{n_i}\}_i\) for our vector space, a word is represented as:

$$\overrightarrow{word} = \sum _i c_i \overrightarrow{n_i}$$

where \(c_i\) is the number of times the word co-occurred with the \(i\)th basis word.
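As a concrete illustration of context counting, the following minimal sketch builds a co-occurrence vector for a target word over a toy corpus and a hypothetical basis (both invented for the example), using a symmetric word window:

```python
from collections import Counter

def cooccurrence_vector(target, tokens, basis, window=2):
    """Count how often each basis word appears within ±window of the target."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(w for w in context if w in basis)
    return [counts[b] for b in basis]

tokens = "the cute cat drinks milk and the cat likes milk".split()
basis = ["cute", "sleep", "finance", "milk"]
print(cooccurrence_vector("cat", tokens, basis))   # [1, 0, 0, 2]
```

In a realistic setting the corpus would contain millions of tokens and the basis thousands of words, but the counting principle is exactly the same.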

In practice, the raw counts are usually replaced by some weighting scheme, such as *point-wise mutual information* (PMI), which can reflect the relationship between a context word \(c\) and a target word \(t\) as follows:

$$PMI(c,t) = \log \frac{p(c,t)}{p(c)\,p(t)}$$
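A PMI weight can be computed directly from raw corpus counts; the sketch below uses hypothetical counts for a context/target pair and simply implements the ratio of the joint probability to the product of the marginals:

```python
import math

def pmi(count_ct, count_c, count_t, total):
    """PMI(c, t) = log [ p(c,t) / (p(c) * p(t)) ], from raw corpus counts."""
    p_ct = count_ct / total
    p_c = count_c / total
    p_t = count_t / total
    return math.log(p_ct / (p_c * p_t))

# Hypothetical counts: 'cat' and 'milk' co-occur 40 times more often than
# chance would predict, giving a strongly positive PMI.
print(round(pmi(count_ct=20, count_c=100, count_t=50, total=10000), 3))  # 3.689
```

A positive value indicates that the two words co-occur more often than expected under independence; a value near zero indicates no association.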

A two-dimensional plot of such a word space is instructive^{1} (the vectors, originally 2,000-dimensional, are projected onto two dimensions for visualization). Note how words form distinct groups of points according to their semantic correlation. Furthermore, it is interesting to see how ambiguous words behave in these models: the ambiguous word 'mouse' (whose two meanings are that of a rodent and of a computer pointing device), for example, is placed almost equidistantly from the group related to IT concepts (lower left part of the diagram) and the animal group (top left part of the diagram), having a meaning that can indeed be seen as the average of both senses.

### Forms of Word Spaces

The simplest word spaces define the context of a target word as a flat word window of length \(k\) around it, a choice which is completely blind to syntax. A window-based method cannot, for example, relate the word 'movie' with the word 'likes' in the following sentence:

- (7)
The movie I saw and John said he really likes.

One of the earliest attempts to add syntactical information in a word space was that of Grefenstette [17], who used a structured vector space with a basis constructed by grammatical properties such as ‘subject-of-buy’ or ‘argument-of-useful’, denoting that the target word has occurred in the corpus as subject of the verb ‘buy’ or as argument of the adjective ‘useful’. The weights of a word vector were binary, either 1 for at least one occurrence or 0 otherwise. Lin [25] moves one step further, replacing the binary weights with frequency counts. Following a different path, Erk and Padó [10] argue that a single vector is not enough to capture the meaning of a word; instead, the vector of a word is accompanied by a set of vectors \(R\) representing the lexical preferences of the word for its arguments positions, and a set of vectors \(R^{-1}\), denoting the inverse relationship; that is, the usage of the word as argument in the lexical preferences of other words. Subsequent works by Thater et al. [45, 46] present a version of this idea extended with the inclusion of grammatical dependency contexts.

In an attempt to provide a generic distributional framework, Padó and Lapata [34] presented a model based on dependency relations between words. The interesting part of this work is that, given the proper parametrization, it is indeed able to essentially subsume a large number of other works on the same subject. For this reason, it is worth a more detailed description, which I provide in the next section.

### A Dependency-Based Model

- (8)

In this model, the context of a target word consists of the paths in the dependency graph that start from that word (the dependency relations are taken to be *undirected* in order to catch relationships between e.g. a subject and an object, which otherwise would be impossible). The creation of a vector space involves the following steps:

- 1.
For each target word \(t\), collect all the undirected paths that start from this word. This will be the initial context for \(t\), denoted by \(\Pi _t\).

- 2.
Apply a context selection function \(cont\): \(W \rightarrow 2^{\Pi _t}\) (where \(W\) is the set of all tokens of type \(t\)), which assigns to \(t\) a subset of the initial context. Given a word-window \(k\), for example, this function might be based on the absolute difference between the position of \(t\) and the position of the end word for each path.

- 3.
For every path \(\pi \in cont(t)\), specify a relative importance value by applying a path value function of the form \(v\): \(\Pi \rightarrow \mathbb {R}\).

- 4.
For every \(\pi \in cont(t)\), apply a basis-mapping function \(\mu \): \(\Pi \rightarrow B\), which maps each path \(\pi \) to a specific basis element.

- 5.
Calculate the co-occurrence frequency of \(t\) with a basis element \(b\) by a function \(f\): \(B \times T \rightarrow \mathbb {R}\), defined as:

$$f(b,t) = \sum \limits _{w \in W(t)} \; \sum \limits _{\pi \in cont(w) \wedge \mu (\pi )=b} v(\pi )$$
- 6.
Finally, and in order to remove potential frequency bias due to raw counts, calculate the log-likelihood ratio \(\hbox {G}^2\) [9] for all basis elements.
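The pipeline above can be sketched schematically, with the functions \(cont\), \(v\) and \(\mu\) passed in as parameters. In the toy instantiation below (all paths, weights and function choices are invented for the illustration), a path is a tuple of dependency edges; \(cont\) keeps only short paths, \(v\) weights a path by its inverse length, and \(\mu\) maps a path to its end word:

```python
from collections import defaultdict

def dependency_vector(paths, cont, v, mu, basis):
    """Steps 2-5 of the scheme as plug-in functions: cont selects paths,
    v assigns each path an importance value, mu maps it to a basis element."""
    freq = defaultdict(float)
    for path in cont(paths):
        freq[mu(path)] += v(path)
    return [freq[b] for b in basis]

# Toy paths for some target word, each a tuple of (relation, word) edges.
paths = [
    (("subj", "eat"),),
    (("subj", "eat"), ("obj", "fish")),
    (("subj", "eat"), ("obj", "fish"), ("mod", "tasty")),  # too long, filtered out
]
vec = dependency_vector(
    paths,
    cont=lambda ps: [p for p in ps if len(p) <= 2],  # context selection
    v=lambda p: 1.0 / len(p),                        # path value function
    mu=lambda p: p[-1][1],                           # basis-mapping: end word
    basis=["eat", "fish", "tasty"],
)
print(vec)   # [1.0, 0.5, 0.0]
```

Different choices for the three functions recover different models from the literature, which is exactly the sense in which this framework subsumes earlier work.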

### From Words to Sentence

Distributional models of meaning have been widely studied and successfully applied on a variety of language tasks, especially during the last decade with the availability of large-scale corpora, like Gigaword [15] and ukWaC [11], which provide a reliable resource for training the vector spaces. For example, Landauer and Dumais [24] use vector space models in order to model and reason about human learning rates in language; Schütze [40] performs word sense induction and disambiguation; Curran [8] shows how distributional models can be applied to automatic thesaurus extraction and Manning et al. [28] discuss possible applications in the context of information retrieval.

However, due to the infinite capacity of a natural language to produce new sentences from finite resources, no corpus, regardless of its size, can provide vector representations for anything but very small text fragments, usually only words. In this light, equipping distributional models with compositional abilities similar to those described in section Compositional Semantics seems a very appealing solution that could offer the best of both worlds in a unified manner. The goal of such a system would be to use the compositional rules of a grammar, as described in section Compositional Semantics, in order to combine the context vectors of the words into vectors of larger and larger text constituents, up to the level of a sentence. A sentence vector could then be compared with other sentence vectors, providing a way of assessing the semantic similarity between sentences as if they were words. The benefits of such a feature are obvious for many natural language processing tasks, such as paraphrase detection, machine translation and information retrieval, and in the following sections I review the most important approaches and current research towards this challenging goal.

## Compositionality in Distributional Approaches

### Vector Mixture Models

The simplest compositional models combine two word vectors element-wise, by addition or point-wise multiplication, so that the output is a *mixture* of the input vectors. Figure 3 demonstrates this; each element of the output vector can be seen as an 'average' of the two corresponding elements in the input vectors. In the additive case, the components of the result are simply the cumulative scores of the input components, so in a sense the output element embraces both input elements, resembling a union of the input features. The multiplicative version, on the other hand, is closer to intersection: a zero element in one of the input vectors will eliminate the corresponding feature in the output, no matter how high the other component was.

Vector mixture models constitute the simplest compositional method in distributional semantics. Despite their simplicity, though (or because of it), these approaches have proved very popular and useful in many NLP tasks, and they are considered hard-to-beat baselines for many of the more sophisticated models we are going to discuss next. In fact, the comparative study of Blacoe and Lapata [4] suggests something really surprising: that, for certain tasks, additive and multiplicative models can be almost as effective as state-of-the-art deep learning models, which will be the subject of section Deep Learning Models.
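A minimal sketch of the two mixture operations, using invented toy vectors, makes the union/intersection contrast visible:

```python
import numpy as np

# Hypothetical 4-dimensional count vectors for two words.
red = np.array([6.0, 2.0, 0.0, 4.0])
car = np.array([3.0, 0.0, 5.0, 1.0])

additive = red + car        # union-like: keeps features present in either word
multiplicative = red * car  # intersection-like: a zero kills the feature

print(additive)        # [9. 2. 5. 5.]
print(multiplicative)  # [18.  0.  0.  4.]
```

Note how the second and third components survive in the additive result but are zeroed out in the multiplicative one, exactly the union-versus-intersection behaviour described above.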

### Tensor Product and Circular Convolution

Although tensor product models solve the bag-of-words problem, they unfortunately introduce an important new issue: given that the dimension of the vector space is \(d\), the space complexity grows exponentially as more constituents are composed together. With \(d=300\), and assuming a typical floating-point machine representation (8 bytes per number), the vector of Eq. 10 would require \(300^5\times 8= 1.944 \times 10^{13}\) bytes (\(\approx 19.5\) terabytes). Even more importantly, the use of the tensor product as above only allows the comparison of sentences that share the same structure; there is no way, for example, to compare a transitive sentence with an intransitive one, a fact that severely limits the applicability of such models.
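The blow-up is easy to verify numerically; the sketch below reproduces the byte count quoted above and shows, on tiny invented vectors, how the order of the representation grows with each tensor product:

```python
import numpy as np

d = 300       # dimension of the word space
words = 5     # number of composed constituents
print(d ** words * 8)   # 19440000000000 bytes (~19.5 TB) for an order-5 tensor

# With tiny 4-dimensional vectors the growth in order is already visible:
a, b = np.ones(4), np.ones(4)
pair = np.tensordot(a, b, axes=0)        # order-2 tensor, shape (4, 4)
triple = np.tensordot(pair, a, axes=0)   # order-3 tensor, shape (4, 4, 4)
print(pair.shape, triple.shape)
```

Each additional word multiplies the storage requirement by \(d\), which is why raw tensor product representations are unusable beyond very short phrases.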


### Tensor-Based Models

In tensor-based models, relational words such as verbs and adjectives are represented as multilinear maps (tensors^{3} of higher order) that apply to one or more arguments (vectors or tensors of lower order). An adjective, for example, is no longer a simple vector but a matrix (a tensor of order 2) that, when matrix-multiplied with the vector of a noun, returns a modified version of it (Fig. 4).
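A minimal sketch of adjective-as-matrix composition, with an invented 3-dimensional noun space (the matrix for 'red' would in practice be learned from corpus data):

```python
import numpy as np

# Hypothetical 3x3 matrix for the adjective 'red', mapping a noun vector
# to the vector of the modified noun phrase.
red = np.array([[1.0, 0.5, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 0.2]])
car = np.array([2.0, 4.0, 6.0])   # toy noun vector

red_car = red @ car   # matrix multiplication: 'red' applied to 'car'
print(red_car)        # the composed vector for 'red car'
```

The result lives in the same noun space as the input, so it can be modified again by further adjectives, mirroring the type-logical behaviour of adjectives as noun modifiers.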

The transition from a map to a tensor is justified by the *Choi–Jamiołkowski isomorphism*: every linear map from \(V\) to \(W\) (where \(V\) and \(W\) are finite-dimensional Hilbert spaces) stands in one-to-one correspondence with a tensor living in the tensor product space \(V \otimes W\). For the case of a multilinear map (a function with more than one argument), this can be generalized to the following: a multilinear map \(f: V_1 \times \cdots \times V_k \rightarrow W\) corresponds to a tensor living in the space \(V_1 \otimes \cdots \otimes V_k \otimes W\).

Composition takes the form of *tensor contraction*. Given two tensors of orders \(m\) and \(n\), the tensor contraction operation always produces a new tensor of order \(n+m-2\). Under this setting, the meaning of a simple transitive sentence is computed by contracting the order-3 verb tensor with the object vector and then with the subject vector, leaving a vector in the sentence space.
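The successive contractions can be sketched with `numpy.einsum`; the tiny dimensions and random tensors below are placeholders for trained representations:

```python
import numpy as np

n, s = 3, 2                      # toy noun and sentence space dimensions
rng = np.random.default_rng(0)
verb = rng.random((n, s, n))     # transitive verb: an order-3 tensor in N⊗S⊗N
subj = rng.random(n)             # subject noun vector
obj = rng.random(n)              # object noun vector

# Contracting with the subject reduces the order from 3+1 to 2, and
# contracting with the object reduces it from 2+1 to 1, as n+m-2 predicts.
sentence = np.einsum('i,isj,j->s', subj, verb, obj)
print(sentence.shape)            # (2,): the result lives in sentence space S
```

Whatever the structure of the input, the final result is always a vector in \(S\), which is what makes sentences of different structures directly comparable.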

Tensor-based models provide an elegant solution to the problems of vector mixtures: they are not bag-of-words approaches and they respect the type-logical identities of special words, following an approach very much aligned with the formal semantics perspective. Furthermore, they do not suffer from the space complexity problems of models based on raw tensor product operations (section Tensor Product and Circular Convolution), since the tensor contraction operation guarantees that every sentence will eventually live in our sentence space \(S\). On the other hand, the highly linguistic perspective they adopt also has a downside: for a tensor-based model to be fully effective, an appropriate mapping to vector spaces must be devised for every functional word, such as prepositions, relative pronouns or logical connectives. As we will see in sections Treatment of Functional Words and Treatment of Logical Words, this problem is far from trivial; in fact it constitutes one of the most important open issues, and at the moment it restricts the application of these models to well-defined text structures (for example, simple transitive sentences of the form subject-verb-object, or adjective–noun compounds).

The notion of a framework where relational words act as linear maps on noun vectors has been formalized by Coecke et al. [7] in the abstract setting of category theory and compact closed categories, a topic we are going to discuss in more detail in section A Categorical Framework for Natural Language. Baroni and Zamparelli's [2] composition method for adjectives and nouns also follows the very same principle.

### Deep Learning Models

A third class of compositional models is based on *deep learning* techniques: machine learning algorithms (usually neural networks) that build models as multiple layers of representations, where the higher-level concepts are induced from the lower-level ones. For example, Socher et al. [42, 43, 44] use recursive neural networks in order to produce compositional vector representations for sentences, with very promising results in a number of tasks. In its most general form, a neural network like this takes as input a pair of word vectors \(\overrightarrow{w_1},\overrightarrow{w_2}\) and returns a new composite vector \(\overrightarrow{y}\) according to the following equation:

$$\overrightarrow{y} = g\left(\mathbf{W}[\overrightarrow{w_1};\overrightarrow{w_2}] + \overrightarrow{b}\right)$$

where \([\overrightarrow{w_1};\overrightarrow{w_2}]\) denotes the concatenation of the two input vectors, \(\mathbf{W}\) and \(\overrightarrow{b}\) are the parameters of the model, and \(g\) is a non-linear function such as \(\tanh\).
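A minimal sketch of this composition step, with randomly initialized parameters standing in for trained ones (in practice \(\mathbf{W}\) and \(\overrightarrow{b}\) are learned via backpropagation):

```python
import numpy as np

d = 4                                 # toy word vector dimension
rng = np.random.default_rng(1)
W = rng.standard_normal((d, 2 * d))   # weight matrix, learned in practice
b = rng.standard_normal(d)            # bias vector, learned in practice

def compose(w1, w2):
    """y = g(W [w1; w2] + b) with g = tanh, as in recursive neural models."""
    return np.tanh(W @ np.concatenate([w1, w2]) + b)

kids, play = rng.standard_normal(d), rng.standard_normal(d)
y = compose(kids, play)
print(y.shape)   # (4,): the composite vector lives in the same space
```

Because the output has the same dimension as the inputs, the same network can be applied recursively up a parse tree, composing words into phrases and phrases into a sentence.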

Neural networks can vary in design and topology. Kalchbrenner and Blunsom [20], for example, model sentential compositionality using a *convolutional* neural network in an element-wise fashion. Specifically, the input of the network is a vector representing a single feature, the elements of which are collected across all the word vectors in the sentence. Each layer of the network applies convolutions of kernels of increasing size, producing at the output a single value that will form the corresponding feature in the resulting sentence vector. This method was used for providing sentence vectors in the context of a discourse model, and was tested with success in a task of recognizing dialogue acts of utterance within a conversation. Furthermore, it has been used as a sentence generation apparatus in a machine translation model with promising results [19].

In contrast to the previously discussed compositional approaches, deep learning methods rely on a large amount of pre-training: the parameters \(\mathbf {W}\) and \(\overrightarrow{b}\) in the network of Fig. 5 must be learned through an iterative algorithm known as *backpropagation*, a process that can be very time-consuming and in general cannot guarantee optimality. However, the non-linearity, in combination with the layered approach on which neural networks are based, provides these models with great power, allowing them to simulate the behaviour of a range of functions much wider than the linear maps of tensor-based approaches. Indeed, the work of Socher et al. [42, 43, 44] has been tested on various paraphrase detection and sentiment analysis tasks, delivering results that at the time of this writing remain state-of-the-art.

## A Categorical Framework for Natural Language

Tensor-based models stand in between the two extremes of vector mixtures and deep learning methods, offering an appealing alternative that can be powerful enough and at the same time fully aligned with the formal semantics view of natural language. Actually, it has been shown that the linear-algebraic formulas for the composite meanings produced by a tensor-based model emerge as the natural consequence of a structural similarity between a grammar and finite-dimensional vector spaces. In this section I will review the most important points of this work.

### Unifying Grammar and Meaning

This result is based on the fact that the base type-logic of the framework, a pregroup grammar [23], shares the same abstract structure with finite-dimensional vector spaces, that of a compact closed category. Mathematically, the transition from grammar types to vector spaces has the form of a strongly monoidal functor, that is, of a map that preserves the basic structure of compact closed categories. The following section provides a short introduction to the category theoretic notions above.

### Introduction to Categorical Concepts

*Category theory* is an abstract area of mathematics, the aim of which is to study and identify universal properties of mathematical concepts that often remain hidden by traditional approaches. A *category* is a collection of objects and morphisms that hold between these objects, with composition of morphisms as the main operation. That is, for two morphisms \(A\mathop{\rightarrow}\limits^{f} B\mathop{\rightarrow}\limits^{g}C\), we have \(g\circ f:A \rightarrow C\). Morphism composition is associative, so that \((f \circ g) \circ h = f \circ (g \circ h)\). Furthermore, every object \(A\) has an identity morphism \(1_A:A \rightarrow A\); for \(f:A \rightarrow B\) we moreover require that

$$f \circ 1_A = f \qquad \text{and} \qquad 1_B \circ f = f$$

A *monoidal category* is a special type of category equipped with another associative operation, a monoidal tensor \(\otimes :\mathcal {C}\times \mathcal {C}\rightarrow \mathcal {C}\) (where \(\mathcal {C}\) is our basic category). Specifically, for each pair of objects \((A,B)\) there exists a composite object \(A\otimes B\), and for every pair of morphisms \((f:A \rightarrow C, g:B \rightarrow D)\) a parallel composite \(f\otimes g: A \otimes B \rightarrow C \otimes D\). For a *symmetric monoidal category*, it is also the case that \(A\otimes B \cong B \otimes A\). Furthermore, there is a unit object \(I\) which satisfies the following isomorphisms:

$$A \otimes I \cong A \qquad I \otimes A \cong A$$

A monoidal category is *compact closed* if every object \(A\) has a left and a right adjoint, denoted \(A^l\) and \(A^r\) respectively, for which the following special morphisms exist:

$$\eta ^l: I \rightarrow A \otimes A^l \qquad \eta ^r: I \rightarrow A^r \otimes A$$

$$\epsilon ^l: A^l \otimes A \rightarrow I \qquad \epsilon ^r: A \otimes A^r \rightarrow I$$

For the case of a *symmetric compact closed category*, the left and right adjoints collapse into one, so that \(A^{*}:=A^{l}=A^r\).

A *pregroup grammar* [23] is a type-logical grammar built on the rigorous mathematical basis of a pregroup algebra, i.e. a partially ordered monoid with unit 1, each element \(p\) of which has a left adjoint \(p^l\) and a right adjoint \(p^r\). The elements of the monoid are the objects of the category, the partial orders are the morphisms, the pregroup adjoints correspond to the adjoints of the category, and the monoid multiplication is the tensor product with 1 as unit. In the context of a pregroup, the special morphisms above take the form of the following partial orders:

$$p^l p \le 1 \le p\,p^l \qquad p\,p^r \le 1 \le p^r p$$

- (9)

The category of finite-dimensional vector spaces and linear maps is also compact closed (in fact a *symmetric* compact closed category); the \(\epsilon \)-maps correspond to the inner product between the involved context vectors:

$$\epsilon : W \otimes W \rightarrow \mathbb {R} \, , \qquad \overrightarrow{v} \otimes \overrightarrow{w} \mapsto \langle \overrightarrow{v} \,|\, \overrightarrow{w} \rangle$$

The transition from grammar types to vector spaces is then carried out by a *strongly monoidal functor* \(\mathcal {F}\) of the form:

$$\mathcal {F}: \mathbf {Preg} \rightarrow \mathbf {FdVect}$$

The significance of the categorical framework lies exactly in this fact, that it provides an elegant mathematical counterpart of the formal semantics perspective as expressed by Montague [30], where words are represented and interact with each other according to their type-logical identities. Furthermore, it seems to imply that approaching the problem of compositionality in a tensor-based setting is a step in the right direction, since the linear-algebraic manipulations come as a direct consequence of the grammatical derivation. The framework itself is a high-level recipe for composition in distributional environments, leaving a lot of room for further research and experimental work. Concrete implementations have been provided, for example, by Grefenstette and Sadrzadeh [16] and Kartsaklis et al. [21]. For more details on pregroup grammars and their type dictionary, see [23]. The functorial passage from a pregroup grammar to finite-dimensional vector spaces is described in detail in [22].

## Verb and Sentence Spaces

Until now I have been deliberately vague when talking about the sentence space \(S\) and the properties that a structure like this should have. In this section I will discuss this important issue in more detail, putting some emphasis on how it is connected to another blurry aspect of the discussion so far: the form of relational words such as verbs. From a mathematical perspective, the decisions that need to be taken concern (a) the dimension of \(S\), that is, the cardinality of its basis; and (b) the form of the basis. In other words, how many and what kind of features will comprise the meaning of a sentence?

This question finds a trivial answer in the setting of vector mixture models; since everything in that approach lives in the same base space, a sentence vector has to share the same size and features as words. It is instructive to pause for a moment and consider what this really means in practice. What a distributional vector for a word actually shows us is to what extent all other words in the vocabulary are related to this specific target word. If our target word is a verb, then the components of its vector can be thought of as relating to the *action* described by the verb: a vector for the verb 'run' reflects the degree to which a 'dog' can run, a 'car' can run, a 'table' can run and so on. The element-wise mixing of the vectors \(\overrightarrow{dog}\) and \(\overrightarrow{run}\) in order to produce a compositional representation for the meaning of the simple intransitive sentence 'dogs run' then finds an intuitive interpretation: the output vector will reflect the extent to which things that are related to dogs can also run; in other words, it shows how *compatible* the verb is with the specific subject.

A tensor-based model, on the other hand, goes beyond a simple compatibility check between the relational word and its arguments; its purpose is to *transform* the noun into a sentence. Furthermore, the size and the form of the sentence space become tunable parameters of the model, which can depend on the specific task at hand. Let us assume that in our model we select sentence and noun spaces such that \(S \in \mathbb {R}^s\) and \(N \in \mathbb {R}^n\), respectively; here, \(s\) refers to the number of distinct features that we consider appropriate for representing the meaning of a sentence in our model, while \(n\) is the corresponding number for nouns. An intransitive verb, then, like 'play' in 'kids play', will live in \(N \otimes S \in \mathbb {R}^{n \times s}\), and will be a map \(f: N \rightarrow S\) built (somehow) in a way to take as input a noun and produce a sentence; similarly, a transitive verb will live in \(N\otimes S\otimes N\in \mathbb {R}^{n\times s\times n}\) and will correspond to a map \(f_{tr}: N\otimes N \rightarrow S\). We can now provide some intuition about how the verb space should be linked to the sentence space: the above description clearly suggests that, in a certain sense, the verb tensor should somehow encode the meaning of *every* possible sentence that can be produced by the specific verb, and emit the one that best matches the given input.

The matrix for *sleep* provides a clear intuition of a verb structure that encodes all possible sentence meanings that can be emitted from the specific verb. Notice that each row of \(\overline{sleep}\) corresponds to a potential sentence meaning given a different subject vector; the role of the subject is to specify which row should be selected and produced as the output of the verb.
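A toy numerical version of this intuition, with made-up numbers and an idealized one-hot subject vector:

```python
import numpy as np

# Rows of the (hypothetical) verb matrix are the sentence meanings that
# 'sleep' can emit, one row per potential subject.
sleep = np.array([[0.9, 0.1],    # row for 'dog sleeps'
                  [0.2, 0.8],    # row for 'cat sleeps'
                  [0.0, 0.1]])   # row for 'table sleeps'

dog = np.array([1.0, 0.0, 0.0])  # idealized one-hot subject vector

# The subject selects its row of the verb matrix:
print(dog @ sleep)               # -> [0.9, 0.1], the stored 'dog sleeps' meaning
```

With a realistic (dense) subject vector the output is a weighted blend of the rows rather than an exact selection, but the mechanism is the same.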

It is not difficult to imagine how this situation scales up when we move on from this simple example to the case of high-dimensional real-valued vector spaces. Regarding the first question we posed at the beginning of this section, this implies some serious practical limitations on how large \(s\) can be. Although it is generally true that the higher the dimension of the sentence space, the subtler the differences we would be able to detect from sentence to sentence, in practice a verb tensor must be able to fit in a computer’s memory to be of any use to us; with today’s machines, this roughly means that \(s \le 300\).^{4} Using very small vectors is also common practice in deep learning approaches, aiming to reduce the training times of the expensive optimization process; Socher et al. [43], for example, use 50-dimensional vectors for both words and phrases/sentences.
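The memory argument is easy to make concrete. The helper below is illustrative, assuming 8-byte floating-point numbers and \(s = n\):

```python
# Back-of-the-envelope memory cost of a verb tensor of a given order.
def tensor_bytes(order, dim, bytes_per_number=8):
    """Storage for a dense tensor with `order` indices of size `dim` each."""
    return dim ** order * bytes_per_number

GB = 10 ** 9
# Transitive verb (order 3) and ditransitive verb (order 4) at dim 300:
print(tensor_bytes(3, 300) / GB)   # ~0.2 GB per verb: manageable
print(tensor_bytes(4, 300) / GB)   # ~65 GB per verb: impractical
```

The exponential growth in the order of the tensor, not the dimension itself, is what makes higher-arity verbs prohibitive at \(s = n = 300\).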

The second question posed in this section, regarding *what kind* of properties should comprise the meaning of a sentence, seems more philosophical than technical. Although in principle model designers are free to select whatever features (that is, basis vectors of \(S\)) they think might better serve the purpose of the model, in practice an empirical approach is usually taken in order to sidestep deeper philosophical issues regarding the nature of a sentence. In a deep learning setting, for example, the sentence vector emerges as a result of an *objective function* with respect to which the parameters of the model have been optimized. In [42], the objective function assesses the quality of a parent vector (e.g. vector \(\overrightarrow{v}\) in Fig. 5) by how faithfully it can be deconstructed into the two original children vectors. Hence, the sentence vector at the top of the diagram in Fig. 5 is a vector constructed in a way that allows the optimal reconstruction of the vectors of its two children, the noun phrase ‘kids’ and the verb phrase ‘play games’. The important point here is that no attempt has been made to interpret the components of the sentence vector individually; the only thing that matters is how faithfully the resulting vector fulfils the adopted constraints.
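A minimal sketch of this reconstruction objective, with untrained random weights; the dimensions and initialization are arbitrary assumptions for illustration, not the actual configuration of [42]:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                            # shared word/phrase dimension

W_enc = rng.standard_normal((d, 2 * d)) * 0.1    # encoder: children -> parent
W_dec = rng.standard_normal((2 * d, d)) * 0.1    # decoder: parent -> children

def compose(c1, c2):
    """Parent vector for two children (one autoencoder step)."""
    return np.tanh(W_enc @ np.concatenate([c1, c2]))

def reconstruction_error(c1, c2):
    """The objective: how faithfully the parent rebuilds its children."""
    parent = compose(c1, c2)
    rebuilt = W_dec @ parent
    return np.sum((rebuilt - np.concatenate([c1, c2])) ** 2)

kids, play_games = rng.random(d), rng.random(d)
sentence = compose(kids, play_games)             # vector for 'kids play games'
```

Training would adjust `W_enc` and `W_dec` to minimize `reconstruction_error` over a corpus of parse trees; the sentence vector is then whatever the optimized encoder happens to produce, with no individually interpretable components.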

## Challenges and Open Questions

The previous sections hopefully provided a concise introduction to the important developments that have been noted in the recent years on the topic of compositional distributional models. Despite this progress, however, the provision of distributional models with compositionality is an endeavour that still has a long way to go. This section outlines some important issues that current and future research should face in order to provide a more complete account to the problem.

### Evaluating the Correctness of Distributional Hypothesis

The idea presented in section Verb and Sentence Spaces for creating distributional vectors of constructions larger than single words has its roots in an interesting thought experiment, the purpose of which is to investigate the potential distributional behaviour of large phrases and sentences, and the extent to which such a distributional approach is plausible for longer-than-word text constituents. The argument goes like this: if we had an infinitely large corpus of text, we could create a purely distributional representation of a phrase or sentence, exactly as we do for words, by taking into account the different contexts within which this text fragment occurs. The assumption would then be that the vector produced by the composition of the individual word vectors should be a synthetic ‘reproduction’ of this distributional sentence vector. This thought experiment poses some interesting questions. First of all, it is not clear whether the distributional hypothesis does indeed scale up to text constituents larger than words; second, even if we assume it does, what would ‘context’ mean in this case? For example, what would be an appropriate context for a 20-word sentence?

Although there is no such thing as an ‘infinitely large corpus’, it is still possible to gain insight into these important issues if we restrict ourselves to small constructs, say two-word constituents, for which we can still get reliable frequency counts from a large corpus. In the context of their work with adjectives (shortly discussed in section Verb and Sentence Spaces), Baroni and Zamparelli [2] performed an interesting experiment along these lines using the ukWaC corpus, consisting of 2.8 billion words. As we saw, their work follows the tensor-based paradigm, where adjectives are represented as linear maps, learnt using linear regression, that act on the context vectors of nouns. The composed vectors were compared with observed vectors of adjective–noun compounds, created from the contexts of each compound in the corpus. The results, although perhaps encouraging, are far from perfect: for 25 % of the composed vectors, the observed vector was not even in the top 1,000 of their nearest neighbours; in 50 % of the cases the observed vector was in the top 170; and only for 25 % of the cases was the observed vector in the top 17 of the nearest neighbours.
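The regression step of this paradigm can be sketched as follows. The data here are synthetic stand-ins for corpus-derived vectors, and recovering a planted ‘gold’ adjective matrix only illustrates the mechanics, not the empirical difficulty reported by [2]:

```python
import numpy as np

rng = np.random.default_rng(2)
n_pairs, d = 50, 10

# Synthetic stand-ins for corpus-derived context vectors:
nouns = rng.random((n_pairs, d))          # noun vectors ('car', 'dress', ...)
true_red = rng.random((d, d))             # planted 'gold' adjective map
# Observed adjective-noun vectors = gold map applied to nouns, plus noise:
observed = nouns @ true_red.T + 0.01 * rng.standard_normal((n_pairs, d))

# Learn the adjective matrix by least-squares regression, as in [2]:
red, *_ = np.linalg.lstsq(nouns, observed, rcond=None)
red = red.T

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

car = rng.random(d)
composed = red @ car                      # composed vector for 'red car'
print(cosine(composed, true_red @ car))   # close to 1 on this clean data
```

With real corpus vectors the ‘observed’ compound vectors are noisy and sparse, which is precisely why the nearest-neighbour ranks reported above fall so far short of this idealized setting.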

As the authors point out, one way to explain this level of performance is data sparseness. However, the fact that a 2.8 billion-word corpus is not sufficient for modelling elementary two-word constructs would be really disappointing. As Pulman [37] mentions:

It is worth pointing out that a corpus of 2.83 billion is already thousands of times as big as the number of words it is estimated a 10-year-old person would have been exposed to [33], and many hundreds of times larger than any person will hear or read in a lifetime.

If we set aside for a moment the possibility of data sparseness, then we might have to start worrying about the validity of our fundamental assumptions, i.e. the distributional hypothesis and the principle of compositionality. Does the result mean that the distributional hypothesis holds only for individual words such as nouns, but is not very effective for larger constituents such as adjective–noun compounds? This is doubtful, since a noun like ‘car’ and an adjective–noun compound like ‘red car’ represent similar entities, share the same structure and occur within similar contexts. Is this then an indication that the distributional hypothesis suffers from some fundamental flaw that limits its applicability even for the case of single words? That would be a very strong and rather unjustified claim to make, since distributional models have repeatedly been shown to capture the meaning of words, at least to some extent, in many real-world applications (see examples in section From Words to Sentence). Perhaps we should instead seek the reason for the sub-optimal performance in the specific methods used for composition in that particular experiment. The adjectives, for example, are modelled as linear functions over their arguments (the nouns they modify), which raises another important question: Is *linearity* an appropriate model for composition in natural language? Further research is needed in order to provide a clearer picture of the expectations we should have regarding the true potential of compositional distributional models, and all the issues raised in this section are very important towards this purpose.

### What is ‘Meaning’?

Even if we accept that the distributional hypothesis is correct, and that a context vector can indeed capture the ‘meaning’ of a word, it would be far too simplistic to assume that this holds for *every* kind of word. The meaning of some words can be determined by their denotations; it is reasonable to claim, for example, that the meaning of the word ‘tree’ is the set of all trees, and we are even able to answer the question ‘what is a tree?’ by pointing to a member of this set, a technique known as *ostensive* definition. But this is not true for all words. In ‘Philosophy’ (published in ‘Philosophical Occasions: 1912–1951’, [50]), Ludwig Wittgenstein notes that there exist certain words, like ‘time’, the meaning of which is quite clear to us until the moment we have to explain it to someone else; then we realize that we are suddenly no longer able to express in words what we certainly know; it is as if we have forgotten what that specific word really means. Wittgenstein claims that ‘if we have this experience, then we have arrived at the limits of language’. This observation is related to one of the central ideas of his work: that the meaning of a word does not need to rely on some kind of definition; what really matters is the way we use this word in our everyday communications. In ‘Philosophical Investigations’ [49], Wittgenstein presents a thought experiment:

Now think of the following use of language: I send someone shopping. I give him a slip marked five red apples. He takes the slip to the shopkeeper, who opens the drawer marked ‘apples’; then he looks up the word ‘red’ in a table and finds a colour sample opposite it; then he says the series of cardinal numbers—I assume that he knows them by heart—up to the word five and for each number he takes an apple of the same colour as the sample out of the drawer. It is in this and similar ways that one operates with words.

For the shopkeeper, the meaning of words ‘red’ and ‘apples’ was given by ostensive definitions (provided by the colour table and the drawer label). But what was the meaning of word ‘five’? Wittgenstein is very direct on this:

No such thing was in question here, only how the word ‘five’ is used.

If language is not expressive enough to describe certain fundamental concepts of our world, then the application of distributional models is by definition limited. Indeed, the subject of this entire section is the concept of ‘meaning’; yet how useful would it be to use text such as this as a resource for constructing a context vector for that very concept? To what extent would such a vector be an appropriate semantic representation for the word ‘meaning’?

### Treatment of Functional Words

- (10)
\(\exists x\exists y.[\textit{man}(x) \wedge \textit{uniform}(y) \wedge \textit{in}(x,y)],\)

- (11)

- (12)

- 1.
The sentence dimension of the verb is ‘killed’, that is, discarded; this can be seen as a collapse of the order-3 tensor into a tensor of order 2 along the sentence dimension.

- 2.
The new version of verb, now retaining information only from the dimensions linked to the subject and the object, interacts with the object and produces a new vector.

- 3.
This new vector is ‘merged’ with the subject, in order to modify it appropriately.

From a linear-algebraic perspective, the meaning of the noun phrase is given as the computation \((\overline{verb}\times \overrightarrow{obj})\odot \overrightarrow{subj}\), where \(\times \) denotes matrix multiplication and \(\odot \) point-wise multiplication. Note that this final equation does not include any tensor for ‘that’; the word solely acts as a ‘router’, moving information around and controlling the interactions between the content words. The above treatment of relative pronouns is flexible and possibly opens a door for modelling other functional words, such as prepositions. What remains to be seen is how effective it can be in an appropriate large-scale evaluation, since the current results, although promising, are limited to a small example dataset.
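The final equation can be sketched directly in code; the dimensionality and the vectors below are arbitrary toy data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5

# The verb, collapsed along its sentence dimension, is an n×n matrix;
# 'that' itself contributes no tensor, it only routes information.
verb = rng.random((n, n))
subj, obj = rng.random(n), rng.random(n)

# (verb × obj) ⊙ subj : matrix multiplication, then point-wise product.
noun_phrase = (verb @ obj) * subj

# The result lives in the noun space, as a modified version of the subject.
assert noun_phrase.shape == (n,)
```

Note that the output has the type of a noun rather than a sentence, matching the intuition that a relative clause produces a modified noun (‘the man that ...’).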

### Treatment of Logical Words

One way to equip vector spaces with logical operations comes from *quantum logic*, originated by Birkhoff and von Neumann [3], where the logical connectives operate on linear subspaces of a vector space. Under this setting, the negation of a subspace is given by its orthogonal complement. Given two vector spaces \(A\) and \(V\), with \(A\) a subspace of \(V\) (denoted by \(A \le V\)), the orthogonal complement of \(A\) is defined as follows:

\(A^{\perp } = \{v \in V : \langle v|a \rangle = 0 \text{ for all } a \in A\}.\)

An excellent introduction to quantum logic, specifically oriented to information retrieval, can be found in [47].
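A minimal sketch of negation by orthogonal complement, in the spirit of Widdows [48]; the vectors are toy data and the sense labels are purely illustrative:

```python
import numpy as np

def negate(v, a):
    """Project v onto the orthogonal complement of span{a}:
    the component of v 'not a'."""
    a = a / np.linalg.norm(a)
    return v - (v @ a) * a

suit = np.array([0.7, 0.5, 0.2])      # toy vector for 'suit'
lawsuit = np.array([0.1, 0.9, 0.3])   # direction of the sense to remove

suit_not_lawsuit = negate(suit, lawsuit)

# The result is orthogonal to the negated direction:
print(suit_not_lawsuit @ lawsuit)     # ~0, up to floating point
```

Negating a single vector removes one direction; negating a larger subspace would project out each basis direction of that subspace in turn.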

### Quantification

Quantification is expressed in natural language through phrases such as *all* men, *some* women, *three* cats and *at least* four days. This fits nicely in the logical view of formal semantics. Consider for example the sentence ‘John loves every woman’, which has the following logical form:

- (13)
\(\forall x (\textit{woman}(x) \rightarrow \textit{loves}(john,x)).\)

Given that the set of women in our universe is \(\{\textit{alice},\textit{mary},\textit{helen}\}\), the above expression will be true if the relation \(\textit{loves}\) includes the pairs \((john,alice)\), \((john,mary)\) and \((john,helen)\). Unfortunately, this treatment does not make sense in vector space models, since they lack any notion of individuality; furthermore, creating a context vector for a quantifier is meaningless. As a result, quantifiers are simply treated as ‘noise’ and ignored in current practice, producing unjustified equalities such as \(\overrightarrow{every\,man}=\overrightarrow{some\,man}=\overrightarrow{man}\). The extent to which vector spaces can be equipped with quantification (and whether this is possible at all) remains another open question for further research. Preller [36] provides valuable insights on the topic of moving from functional to distributional models and how these two approaches are related.
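The set-theoretic truth conditions of (13) are trivial to verify in code, which makes vivid exactly what vector space models lack, namely a domain of individuals:

```python
# A tiny model: the universe of women and the 'loves' relation as sets.
women = {"alice", "mary", "helen"}
loves = {("john", "alice"), ("john", "mary"), ("john", "helen")}

# ∀x (woman(x) → loves(john, x)):
john_loves_every_woman = all(("john", x) in loves for x in women)
print(john_loves_every_woman)                      # True

# Remove one pair and the universal quantification fails:
loves.discard(("john", "helen"))
print(all(("john", x) in loves for x in women))    # False
```

Nothing in a context vector plays the role of the individuals `alice`, `mary` and `helen`, which is why this evaluation procedure has no distributional counterpart.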

## Some Closing Remarks

Compositional distributional models of meaning constitute a technology with great potential, which can drastically influence and improve the practice of natural language processing. Admittedly, our efforts are still in their infancy, as should be evident from the discussion in section Challenges and Open Questions. For many people, the ultimate goal of capturing the meaning of a sentence in a computer’s memory might currently seem rather utopian, of merely theoretical interest for researchers to play with. That would be wrong; computers are already capable of doing a lot of amazing things: they can adequately translate text from one language to another, they respond to vocal instructions, they score on TOEFL^{5} tests at least as well as human beings do (see, for example, [24]), and, perhaps most important of all, they offer us all the knowledge of the world from the convenience of our desk with just a few mouse clicks. What I hope is apparent from the current presentation, at least, is that compositional distributional models of meaning are a technology that is slowly but steadily evolving into a *useful tool*, which is after all the ultimate purpose of all scientific research.

## Footnotes

- 1.
BNC is a 100 million-word text corpus consisting of samples of written and spoken English. It can be found online at http://www.natcorp.ox.ac.uk/.

- 2.
Actually, Plate proposes a workaround for the commutativity problem; however, this is not quite satisfactory and not specific to his model, since it can be used with any other commutative operation such as vector addition or point-wise multiplication.

- 3.
Here, the word *tensor* refers to a geometric object that can be seen as a generalization of a vector to higher dimensions. A matrix, for example, is an order-2 tensor.

- 4.
To understand why, imagine that a ditransitive verb is a tensor of order 4 (a ternary function); by taking \(s=n=300\), this means that the required space for just one ditransitive verb would be \(300^4 \times 8\) bytes per number \(\approx 65\) gigabytes.

- 5.
Test of English as a Foreign Language.

## Notes

### Acknowledgments

This paper discusses topics that I have been studying and working on for the past two and a half years, an endeavour that would have been doomed to fail without the guidance and help of Mehrnoosh Sadrzadeh, Stephen Pulman and Bob Coecke. I would also like to thank my colleagues in Oxford, Nal Kalchbrenner and Edward Grefenstette, for all those insightful and interesting discussions, as well as Anne Preller for her useful comments on quantification. The contribution of the two anonymous reviewers to the final form of this article was invaluable; their suggestions led to a paper that has been tremendously improved compared to the original version. Last, but not least, support by EPSRC Grant EP/F042728/1 is gratefully acknowledged.

## References

- 1.Bach E (1976) An extension of classical transformational grammar. In: Proceedings of the conference at Michigan State University on problems in linguistic metatheory, Michigan State University, Lansing, p 183–224Google Scholar
- 2.Baroni M, Zamparelli R (2010) Nouns are vectors, adjectives are matrices. In: Proceedings of conference on empirical methods in natural language processing (EMNLP), Seattle, p 1427–1432Google Scholar
- 3.Birkhoff G, von Neumann J (1936) The logic of quantum mechanics. Ann Math 37:823–843CrossRefGoogle Scholar
- 4.Blacoe W, Lapata M (2012) A comparison of vector-based representations for semantic composition. In: Proceedings of the joint conference on empirical methods in natural language processing and computational natural language learning. Association for Computational Linguistics, Jeju Island, Korea, p 546–556Google Scholar
- 5.Bos J, Markert K (2006) When logical inference helps determining textual entailment (and when it doesn’t). In: Proceedings of the second PASCAL challenges workshop on recognizing textual entailment, Citeseer, Venice, ItalyGoogle Scholar
- 6.Clark S, Pulman S (2007) Combining symbolic and distributional models of meaning. In: Proceedings of the AAAI spring symposium on quantum interaction, p 52–55, Stanford, California, March 2007Google Scholar
- 7.Coecke B, Sadrzadeh M, Clark S (2010) Mathematical foundations for distributed compositional model of meaning. Lambek Festschrift. Ling Anal 36:345–384Google Scholar
- 8.Curran J (2004) From distributional to semantic similarity. PhD thesis, School of Informatics, University of EdinburghGoogle Scholar
- 9.Dunning T (1993) Accurate methods for the statistics of surprise and coincidence. Comput Ling 19(1):61–74Google Scholar
- 10.Erk K, Padó S (2008) A structured vector–space model for word meaning in context. In: Proceedings of conference on empirical methods in natural language processing (EMNLP), p 897–906Google Scholar
- 11.Ferraresi A, Zanchetta E, Baroni M, Bernardini S (2008) Introducing and evaluating ukWaC, a very large web-derived corpus of English. In: Proceedings of the 4th web as corpus workshop (WAC-4) Can we beat Google, p 47–54Google Scholar
- 12.Firth J (1957) A synopsis of linguistic theory 1930–1955. In: Studies in linguistic analysis, Philological Society, Oxford, p 1–32Google Scholar
- 13.Frege G (1980a) The foundations of arithmetic: a logico-mathematical enquiry into the concept of number (Translation: Austin, J.L.). Northwestern Univ Press, EvanstonGoogle Scholar
- 14.Frege G (1980b) Letter to jourdain. In: Gabriel G (ed) Philoshopical and mathematical correspondence. Chicago University Press, Chicago, pp 78–80Google Scholar
- 15.Graff D, Kong J, Chen K, Maeda K (2003) English gigaword. Linguistic Data Consortium, PhiladelphiaGoogle Scholar
- 16.Grefenstette E, Sadrzadeh M (2011) Experimental support for a categorical compositional distributional model of meaning. In: Proceedings of conference on empirical methods in natural language processing (EMNLP), p 1394–1404Google Scholar
- 17.Grefenstette G (1994) Explorations in automatic thesaurus discovery. Springer, Heidelberg, GermanyCrossRefGoogle Scholar
- 18.Harris Z (1968) Mathematical structures of language. Wiley, New YorkGoogle Scholar
- 19.Kalchbrenner N, Blunsom P (2013a) Recurrent continuous translation models. In: Proceedings of the conference on empirical methods in natural language processing (EMNLP), Association for Computational Linguistics, Seattle, USAGoogle Scholar
- 20.Kalchbrenner N, Blunsom P (2013b) Recurrent convolutional neural networks for discourse compositionality. In: Proceedings of the workshop on continuous vector space models and their compositionality, Bulgaria, SofiaGoogle Scholar
- 21.Kartsaklis D, Sadrzadeh M, Pulman S (2012) A unified sentence space for categorical distributional-compositional semantics: theory and experiments. In: Proceedings of 24th international conference on computational linguistics (COLING 2012): Posters, The COLING 2012 Organizing Committee, Mumbai, India, p 549–558Google Scholar
- 22.Kartsaklis D, Sadrzadeh M, Pulman S, Coecke B (2014) Reasoning about meaning in natural language with compact closed categories and frobenius algebras. In: Chubb J, Eskandarian A, Harizanov V (eds.) Logic and algebraic structures in quantum computing and information, Association for Symbolic Logic Lecture Notes in Logic, Cambridge University Press, Cambridge (To appear)Google Scholar
- 23.Lambek J (2008) From word to sentence. Polimetrica, MilanGoogle Scholar
- 24.Landauer T, Dumais S (1997) A solution to Plato’s problem: the latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol Rev 104(2):211–240Google Scholar
- 25.Lin D (1998) Automatic retrieval and clustering of similar words. In: Proceedings of the 17th international conference on Computational linguistics, vol. 2. Association for Computational Linguistics, Morristown, p 768–774Google Scholar
- 26.Lowe W (2001) Towards a theory of semantic space. In: Proceedings of the 23rd Annual conference of the Cognitive Science Society, Edinburgh, p 576–581Google Scholar
- 27.Lowe W, McDonald S (2000) The direct route: mediated priming in semantic space. In: Proceedings of the 22nd Annual conference of the Cognitive Science Society. Philadelphia, PA, p 675–680Google Scholar
- 28.Manning C, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, CambridgeCrossRefGoogle Scholar
- 29.Mitchell J, Lapata M (2008), Vector-based models of semantic composition. In: Proceedings of the 46th Annual meeting of the Association for Computational Linguistics, Columbus, p 236–244Google Scholar
- 30.Montague R (1970a) English as a formal language. Linguaggi nella società e nella tecnica. Edizioni di Comunità, MilanGoogle Scholar
- 31.Montague R (1970b) Universal grammar. Theoria 36:373–398CrossRefGoogle Scholar
- 32.Montague R (1973) The proper treatment of quantification in ordinary English. In: Hintikka J et al (eds.) Approaches to natural language, Reidel, Dordrecht, p 221–242Google Scholar
- 33.Moore RK (2003) A comparison of the data requirements of automatic speech recognition systems and human listeners. In: Interspeech: 8th European conference of speech communication and technology, ISCA, vol 3. Geneva, Switzerland, p 2582–2584Google Scholar
- 34.Padó S, Lapata M (2007) Dependency-based construction of semantic space models. Comput Ling 33(2):161–199CrossRefGoogle Scholar
- 35.Plate T (1991) Holographic reduced representations: convolution algebra for compositional distributed representations. In: Proceedings of the 12th international joint conference on artificial intelligence, Morgan Kaufmann, San Mateo, CA, p 30–35Google Scholar
- 36.Preller A (2013) From functional to compositional models. In: Proceedings of the 10th conference of quantum physics and logic (QPL 2013), Barcelona, SpainGoogle Scholar
- 37.Pulman S (2013) Combining compositional and distributional models of semantics. In: Heyden C, Sadrzadeh M, Grefenstette E (eds) Quantum physics and linguistics: a compositional, diagrammatic discourse. Oxford University Press, OxfordGoogle Scholar
- 38.Rubenstein H, Goodenough J (1965) Contextual correlates of synonymy. Commun ACM 8(10):627–633CrossRefGoogle Scholar
- 39.Sadrzadeh M, Clark S, Coecke B (2013) The frobenius anatomy of word meanings I: subject and object relative pronouns. J Logic Comput 23(6):1293–1317CrossRefGoogle Scholar
- 40.Schütze H (1998) Automatic word sense discrimination. Comput Ling 24:97–123Google Scholar
- 41.Smolensky P (1990) Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artif Intell 46:159–216CrossRefGoogle Scholar
- 42.Socher R, Huang E, Pennington J, Ng A, Manning C (2011) Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. Adv Neural Inf Process Syst 24:801–809Google Scholar
- 43.Socher R, Huval B, Manning CD, Ng AY (2012) Semantic compositionality through recursive matrix-vector spaces. In: Conference on empirical methods in natural language processing, p 1201–1211Google Scholar
- 44.Socher R, Manning C, Ng A (2010) Learning continuous phrase representations and syntactic parsing with recursive neural networks. In: Proceedings of the NIPS-2010 deep learning and unsupervised feature learning workshop, p 1–9Google Scholar
- 45.Thater S, Fürstenau H, Pinkal M (2010) Contextualizing semantic representations using syntactically enriched vector models. In: Proceedings of the 48th Annual meeting of the Association for Computational Linguistics, Uppsala, Sweden, p 948–957Google Scholar
- 46.Thater S, Fürstenau H, Pinkal M (2011) Word meaning in context: a simple and effective vector model. In: Proceedings of the 5th international joint conference of natural language processing. Asian Federation of Natural Language Processing, Chiang Mai, Thailand, p 1134–1143Google Scholar
- 47.van Rijsbergen K (2004) The geometry of information retrieval. Cambridge University Press, CambridgeCrossRefGoogle Scholar
- 48.Widdows D (2003) Orthogonal negation in vector spaces for modelling word-meanings and document retrieval. In: Proceedings of the 41st Annual Meeting on Association for Computational Linguistics, vol. 1. Association for Computational Linguistics, p 136–143Google Scholar
- 49.Wittgenstein L (1963) Philosophical investigations. Blackwell, OxfordGoogle Scholar
- 50.Wittgenstein L (1993) Philosophy. In: Klagge J, Nordmann A (eds.) Philosophical occasions 1912–1951, Hackett, Indianapolis, p 171Google Scholar