1 Introduction

Computational models across tasks can potentially profit from combining corpus-based, textual information with perceptual information, because word meanings are grounded in the external environment and in sensorimotor experience and therefore cannot be learned from linguistic symbols alone, cf. the grounding problem (Harnad 1990). Accordingly, various approaches to determining semantic relatedness have been shown to improve by using multi-modal models that enrich textual linguistic representations with information from visual, auditory, or cognitive modalities (Feng and Lapata 2010, Silberer and Lapata 2012, Roller and im Walde 2013, Bruni et al. 2014, Kiela et al. 2014, Kiela and Clark 2015, Lazaridou et al. 2015).

While multi-modal models may be realized as either count or predict approaches, increasing attention is being devoted to the development, improvement and properties of low-dimensional continuous word representations (so-called embeddings), following the success of word2vec (Mikolov et al. 2013). Similarly, recent advances in computer vision, particularly in the field of deep learning, have led to better visual representations. Here, features are extracted from convolutional neural networks (CNNs) (LeCun et al. 1998) that were previously trained on object recognition tasks. For example, Kiela and Bottou (2014) showed that CNN-based image representations predict semantic relatedness better than other visual representations, such as an aggregation of SIFT features (Lowe 1999) into a bag of visual words (Sivic and Zisserman 2003).

However, insight into the typically high-dimensional CNN-based representations remains sparse. It is known that dimensionality reduction techniques, such as Singular Value Decomposition (SVD), improve performance on word similarity tasks when applied to word representations (Deerwester et al. 1990). In particular, Bullinaria and Levy (2012) observed highly significant improvements after applying SVD to standard corpus vectors. In addition, Nguyen et al. (2016) proposed a method to remove noisy information from word embeddings, resulting in superior performance on a variety of word similarity and relatedness benchmarks.

In this paper, we provide an in-depth exploration of improving visual representations within a semantic model that predicts semantic similarity and relatedness, by applying dimensionality reduction and denoising. Furthermore, we introduce a novel approach that modifies visual representations in relation to corpus-based textual information. Following the methodology of Kiela et al. (2016), evaluations are carried out across three different CNN architectures, three different image sources and two different evaluation datasets. We assess the performance of the visual modality by itself, and we zoom in on a multi-modal setup in which the visual representations are combined with textual representations. Our findings show that all methods but SVD improve the visual representations; this improvement is especially large on the word relatedness task.

2 Methods

In this section we introduce two dimensionality reduction techniques (Sect. 2.1), a denoising approach (Sect. 2.2) and our new approach \( ContextVision \) (Sect. 2.3).

2.1 Dimensionality Reduction

Singular Value Decomposition (SVD) (Golub and Van Loan 1996) is a matrix algebra operation that can be used to reduce matrix dimensionality, yielding a new lower-dimensional space. SVD is a commonly used technique, also referred to as Latent Semantic Analysis (LSA) when applied to word similarity. Non-negative matrix factorization (NMF) (Lee and Seung 1999) is a matrix factorization approach in which the reduced matrix contains only non-negative real numbers (Lin 2007). NMF has a wide range of applications, including topic modeling, (soft) clustering and image feature representation (Lee and Seung 1999).
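As a minimal sketch of how both reductions can be applied to the representations used here (assuming scikit-learn; the matrix X, its shape, and the target dimensionality of 300 are illustrative, and NMF additionally requires non-negative input, which holds for ReLU-based CNN features and raw co-occurrence counts):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD, NMF

# X: one row per word, e.g. 4096-dimensional CNN features (illustrative random data).
X = np.abs(np.random.RandomState(0).randn(1578, 4096))

# SVD: project onto the top 300 left singular directions.
svd = TruncatedSVD(n_components=300, random_state=0)
X_svd = svd.fit_transform(X)    # shape (1578, 300)

# NMF: factorize X ~ W @ H with W, H >= 0 and keep W as the reduced representation.
nmf = NMF(n_components=300, init="nndsvda", max_iter=400, random_state=0)
X_nmf = nmf.fit_transform(X)    # shape (1578, 300)
```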

2.2 Denoising

Nguyen et al. (2016) proposed a denoising method (DEN) that uses a non-linear, parameterized, feed-forward neural network as a filter on word embeddings to reduce noise. The method aims to strengthen salient context dimensions and to weaken unnecessary contexts. While Nguyen et al. (2016) increase the dimensionality, we apply the same technique to reduce dimensionality.
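The exact filter architecture and training objective of Nguyen et al. (2016) are not spelled out here, so the following is only a rough sketch of the general idea under simplifying assumptions: a small feed-forward filter that maps the input embeddings to a 300-dimensional space, trained here with a reconstruction loss as a stand-in objective rather than the original criterion.

```python
import torch
import torch.nn as nn

class DenoisingFilter(nn.Module):
    """Illustrative feed-forward filter mapping noisy embeddings to a
    lower-dimensional space.  Hidden size, non-linearity and the
    reconstruction-style loss below are assumptions for this sketch."""
    def __init__(self, d_in=4096, d_out=300):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(d_in, d_out), nn.Tanh())
        self.decode = nn.Linear(d_out, d_in)   # used only during training

    def forward(self, x):
        return self.encode(x)

def denoise(X, d_out=300, epochs=50, lr=1e-3):
    """X: float tensor of shape (n_words, d_in), e.g. CNN image vectors."""
    model = DenoisingFilter(X.shape[1], d_out)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        z = model.encode(X)
        loss = nn.functional.mse_loss(model.decode(z), X)  # stand-in reconstruction loss
        loss.backward()
        opt.step()
    return model.encode(X).detach()   # reduced, denoised vectors
```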

2.3 Context-Based Visual Representations

Our novel model \( ContextVision \) (CV) strengthens visual vector representations by taking corpus-based contextual information into account. Inspired by Lazaridou et al. (2015), our model jointly learns the linguistic and visual vector representations by combining the two modalities (i.e., the linguistic modality and the visual modality). Unlike the multi-modal Skip-gram model by Lazaridou et al. (2015), which aims to improve the linguistic representations while keeping the visual representations fixed, we focus on improving the visual representations.

The linguistic modality makes use of contextual information and linguistic negative contexts; in the visual modality, the visual vector representations are strengthened by taking the corresponding word vector representations, the contextual information, and the visual negative contexts into account.

We start by describing Skip-gram with negative sampling (SGNS) (Levy and Goldberg 2014), a variant of the Skip-gram model (Mikolov et al. 2013). Given a plain-text corpus, SGNS aims to learn word vector representations in which words that appear in similar contexts are encoded by similar vectors. Mathematically, the SGNS model optimizes the following objective function:

$$\begin{aligned} \mathrm {J}_{SGNS}&\,=\,\sum \limits _{w \in V_W} {\sum \limits _{c \in V_C}} \mathrm {J}_{ling}(w,\,c) \end{aligned}$$
(1)
$$\begin{aligned} \mathrm {J}_{ling}(w,\,c)&\,=\,\#(w,\,c)\log \sigma ({w},\,{c}) \nonumber \\&\quad \,+\,k_l \cdot \mathbb {E}_{c_N \sim P_D} [\log \sigma (-{w},\,{c}_N)] \end{aligned}$$
(2)

where \(\mathrm {J}_{ling}(w,\,c)\) is trained on a plain-text corpus of words \(w \in V_W\) and their contexts \(c \in V_C\), with \(V_W\) and \(V_C\) the word and context vocabularies, respectively. The collection of observed word–context pairs is denoted as D; the term \(\#(w,\,c)\) refers to the number of times the pair \((w,\,c)\) appears in D; \(\sigma (x)\) is the sigmoid function; \(k_l\) is the number of linguistic negative samples; and \(c_N\) is the sampled linguistic context, drawn according to the empirical unigram distribution \(P_D\). In our model, SGNS is applied to learn the linguistic modality.
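As a sketch, the per-pair linguistic objective in Eq. 2 can be written as follows; we read \(\sigma (w,\,c)\) as the sigmoid of the dot product of the word and context vectors, as in standard SGNS, and approximate the expectation over \(P_D\) by the \(k_l\) drawn negative samples (variable names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def j_ling(w_vec, c_vec, count_wc, neg_context_vecs):
    """Per-pair linguistic objective of Eq. 2.
    count_wc is #(w, c); neg_context_vecs holds the k_l negative context
    vectors sampled from the unigram distribution P_D."""
    positive = count_wc * np.log(sigmoid(w_vec @ c_vec))
    negative = np.sum(np.log(sigmoid(-(neg_context_vecs @ w_vec))))
    return positive + negative
```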

In the visual modality, we improve the visual representations through contextual information; therefore the visual and linguistic representations need to have the same dimensionality. We rely on the denoising approach (Nguyen et al. 2016) to reduce the dimensionality of the visual representations. The visual vector representations are then reinforced by (i) directly increasing the similarity between the visual and the corresponding linguistic vector representations, and by (ii) encouraging contextual information that co-occurs with the linguistic information. More specifically, we formulate the objective function of the visual modality, \(\mathrm {J}_{vision}(v_w,\,c)\), as follows:

$$\begin{aligned} \mathrm {J}_{vision}(v_w,\,c)&= \#(v_w,\,c)(cos({w},{v}_w) \nonumber \\&\quad \,\, + \min \{0, \theta - cos({v}_w,\,{c})\,+\,cos({w},\,{c})\}) \nonumber \\&\quad \,\, + k_v \cdot \mathbb {E}_{c_V \sim P_V} [\log \sigma (-{v}_w,\,{c}_V)] \end{aligned}$$
(3)

where \(\mathrm {J}_{vision}(v_w,\,c)\) is trained simultaneously with \(\mathrm {J}_{ling}(w,\,c)\) on the plain-text corpus of words w and their contexts c. \(v_w\) represents the visual information corresponding to the word w; the term \(\theta \) is the margin; and \(cos({x},\,{y})\) refers to the cosine similarity between x and y. The terms \(k_v\), \(\mathbb {E}_{c_V}\), and \(P_V\) are defined analogously to those of the linguistic modality. Note that if a word w is not associated with visual information \(v_w\), then \(\mathrm {J}_{vision}(v_w,\,c)\) is set to 0.
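Analogously, the per-pair visual objective of Eq. 3 can be sketched as below, again reading \(\sigma (\cdot ,\,\cdot )\) as the sigmoid of a dot product and approximating the expectation by the drawn visual negative samples (names and defaults are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def j_vision(w_vec, v_vec, c_vec, count_vc, neg_context_vecs, theta=0.3):
    """Per-pair visual objective of Eq. 3.
    v_vec is the (denoised, 300-dimensional) visual vector of word w;
    theta is the margin; neg_context_vecs holds the k_v visual negative
    samples drawn from P_V.  Words without visual information contribute 0."""
    if v_vec is None:
        return 0.0
    positive = count_vc * (cos(w_vec, v_vec)
                           + min(0.0, theta - cos(v_vec, c_vec) + cos(w_vec, c_vec)))
    negative = np.sum(np.log(sigmoid(-(neg_context_vecs @ v_vec))))
    return positive + negative
```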

In the final step, the objective function used to improve the visual vector representations combines Eqs. 2 and 3, summed over all word–context pairs as in Eq. 1, yielding Eq. 4:

$$\begin{aligned} \mathrm {J}\,=\,\sum \limits _{w \in V_W} {\sum \limits _{c \in V_C}} (\mathrm {J}_{ling}(w,\,c) \,+\,\mathrm {J}_{vision}(v_w,\,c) ) \end{aligned}$$
(4)

3 Experiments

3.1 Experimental Settings

We use an English Wikipedia dump from June 2016, containing approximately 1.9B tokens, as the corpus resource for training \( ContextVision \). We train our model with 300 dimensions, a window size of 5, 15 linguistic negative samples, 1 visual negative sample, and a learning rate of 0.025. The threshold \(\theta \) is set to 0.3. For the other methods, the dimensionality reduction is set to 300 dimensions. As image data, we rely on the publicly available visual embeddings from Kiela et al. (2016). The data was obtained from three different image sources, namely Google, Bing, and Flickr. For each image source, three state-of-the-art convolutional network architectures for image recognition were applied: AlexNet (Krizhevsky et al. 2012), GoogLeNet (Szegedy et al. 2015) and VGGNet (Simonyan and Zisserman 2014). In each source–CNN combination, the visual representation of a word is simply the centroid of the vectors of all images labeled with the word (mean aggregation). This centroid has 1024 dimensions for GoogLeNet and 4096 dimensions for the two remaining architectures. The size of the visual vocabulary for Google, Bing, and Flickr after computing the centroids is 1578, 1578, and 1582, respectively. For evaluation we relied on two human-annotated datasets, namely the 3000 pairs from MEN (Bruni et al. 2014) and the 999 pairs from SimLex (Hill et al. 2015). MEN focuses on relatedness, and SimLex focuses on similarity.
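A minimal sketch of the mean aggregation and of the Spearman evaluation against MEN or SimLex ratings is given below (assuming NumPy/SciPy; the data layout and function names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr

def centroid_representations(image_vectors):
    """Mean aggregation: the visual vector of a word is the centroid of the
    CNN vectors of all images labeled with that word.
    image_vectors: dict mapping word -> array of shape (n_images, dim)."""
    return {w: vecs.mean(axis=0) for w, vecs in image_vectors.items()}

def evaluate(vectors, pairs, gold):
    """Spearman's rho between model cosine similarities and human ratings
    (e.g. MEN or SimLex pairs); word pairs not covered by the visual
    vocabulary are skipped."""
    predicted, ratings = [], []
    for (w1, w2), rating in zip(pairs, gold):
        if w1 in vectors and w2 in vectors:
            a, b = vectors[w1], vectors[w2]
            predicted.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
            ratings.append(rating)
    return spearmanr(predicted, ratings).correlation
```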

3.2 Visual Representation Setup

Table 1 shows the results for each of the previously introduced methods, as well as for the unmodified image representations (Default). It can be seen that NMF, DEN and CV increase performance in all settings except for the combination Google & AlexNet. The performance of SVD always remains remarkably close to that of the original representations.

Table 1. Comparing dimensionality reduction techniques, showing Spearman’s \(\rho \) on SimLex-999 and MEN. * marks significance over the Default.

Furthermore, we computed the average difference for each method across all settings, as shown in Table 2. Performance increased especially on the MEN relatedness task: here, NMF obtains on average a \(\rho \) correlation \({\approx }.10\) higher than the original representations. DEN and CV also show a clear improvement, with the latter being most useful for the SimLex task.

Table 2. Average gain/loss in \(\rho \) across sources and architectures, in comparison to Default.

To test for significance we conducted Steiger's test (Steiger 1980) of the difference between two dependent correlations, comparing each of the methods against its Default performance.
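The sketch below shows one common formulation of Steiger's test for two dependent correlations sharing one variable (via Fisher's z-transform and a pooled correlation term); it is an illustrative approximation of the procedure, not necessarily the exact variant used:

```python
import numpy as np
from scipy.stats import norm

def steiger_z(r_gold_a, r_gold_b, r_ab, n):
    """r_gold_a, r_gold_b: correlations of methods A and B with the gold
    ratings; r_ab: correlation between A and B; n: number of word pairs.
    Returns the Z statistic and a two-tailed p-value."""
    z_a, z_b = np.arctanh(r_gold_a), np.arctanh(r_gold_b)
    rbar = (r_gold_a + r_gold_b) / 2.0
    psi = r_ab * (1 - 2 * rbar**2) - 0.5 * rbar**2 * (1 - 2 * rbar**2 - r_ab**2)
    cov = psi / (1 - rbar**2) ** 2          # covariance of the z-transformed correlations
    z = (z_a - z_b) * np.sqrt((n - 3) / (2 - 2 * cov))
    p = 2 * (1 - norm.cdf(abs(z)))
    return z, p
```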

Out of the 19 settings, NMF obtained significant improvements (*=\(p<0.001\)) in 11 cases. Despite their lower average gains (Table 2), DEN and CV obtained significant improvements in even more cases.

Overall, we observed the most significant improvements for images taken from Bing and for the GoogLeNet architecture.

3.3 Multi-modal Setup

In the previous section we explored the performance of the visual representations alone. We now investigate their performance in a multi-modal setup, combining them with a textual representation. Using the same parameters as in Sect. 3.1, we created word representations with an SGNS model (Mikolov et al. 2013). We combined the representations by score-level fusion (also called late fusion). Following Bruni et al. (2014) and Kiela and Clark (2015), we investigate the impact of both modalities by varying a weight parameter (\(\alpha \)). Similarity is computed as follows:

$$\begin{aligned} sim(x,\,y)\,=\,\alpha \cdot ling(x,\,y) +(1-\alpha )\cdot vis(x,\,y) \end{aligned}$$
(5)
Fig. 1. (a) Multi-modal results on SimLex-999, with image representations from Bing using AlexNet. (b) Multi-modal results on MEN, with image representations from Flickr using AlexNet. The y-axis shows Spearman's \(\rho \); the x-axis varies the weight of each modality, from image-only on the far left to text-only on the far right.

Here, \(ling(x,\,y)\) is the cosine similarity based on the textual representations only, and \(vis(x,\,y)\) the cosine similarity in the visual space.
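A minimal sketch of the late-fusion score in Eq. 5, assuming two dictionaries of textual and visual vectors (names are illustrative):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def fused_similarity(w1, w2, ling_vecs, vis_vecs, alpha=0.5):
    """Late fusion as in Eq. 5: a weighted combination of the cosine
    similarities in the linguistic and the visual space.
    alpha = 1 uses only the textual modality, alpha = 0 only the visual one."""
    ling = cosine(ling_vecs[w1], ling_vecs[w2])
    vis = cosine(vis_vecs[w1], vis_vecs[w2])
    return alpha * ling + (1 - alpha) * vis
```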

For the following experiment we focus on AlexNet, using the Bing images for the SimLex task and the Flickr images for the MEN task. The results are shown in Fig. 1a for SimLex and in Fig. 1b for MEN.

It can be seen that all representations outperform the text-only representation (black dashed line; SimLex \(\rho \,=\,.384\), MEN \(\rho \,=\,.741\)). On SimLex, the highest correlation is obtained with the DEN or CV representations. Interestingly, these two methods perform best when both modalities are given equal weight (\(\alpha \) = 0.5), while the remaining methods, as well as the unmodified Default representations, peak when more weight is given to the textual representation. A similar picture emerges for the results on MEN, where NMF also obtains superior results (.748).

4 Conclusion

We successfully applied dimensionality reduction and denoising techniques, as well as our newly proposed method \( ContextVision \), to enhance visual representations within semantic vector space models. Except for SVD, all investigated methods showed significant improvements in single- and multi-modal setups on the tasks of predicting similarity and relatedness.