9.1 Introduction

As defined in Wikipedia, a modality is the classification of a single independent channel of sensory input/output between a computer and a human. More generally, modalities are different means of information exchange between human beings and the real world. The classification is usually based on the form in which information is presented to a human. Typical modalities in the real world include text, audio, images, and videos.

Cross-modal representation learning is an important part of representation learning. In fact, artificial intelligence is inherently a multi-modal task [30]. Human beings are exposed to multi-modal information every day, and it is normal for us to integrate information from different modalities and make comprehensive judgments. Furthermore, different modalities are not independent but are more or less correlated. For example, the judgment of a syllable depends not only on the sound we hear but also on the movement of the speaker's lips and tongue that we see. An experiment in [48] shows that a voiced /ba/ paired with a visual /ga/ is perceived by most people as /da/. Another example is human beings' ability to consider the 2D image and 3D scan of the same object together and reconstruct its structure: correlations between the image and the scan can be found based on the fact that a discontinuity of depth in the scan usually indicates a sharp line in the image [52]. Inspired by this, it is natural to combine inputs from multiple modalities in artificial intelligence systems and generate cross-modal representations.

Ngiam et al. [52] explore the possibility of merging multiple modalities into one learning task. The authors divide a typical machine learning task into three stages: feature learning, supervised learning, and prediction. They further propose four kinds of learning settings for multiple modalities:

(1) Single-modal learning: all stages are done on just one modality.

(2) Multi-modal fusion: all stages are done with all modalities available.

(3) Cross-modal learning: in the feature learning stage, all modalities are available, but in supervised learning and prediction, only one modality is used.

(4) Shared representation learning: in the feature learning stage, all modalities are available. In supervised learning, only one modality is used, and in prediction, a different modality is used.

Experiments show promising results for these multi-modal tasks. When more modalities are provided (such as multi-modal fusion, cross-modal learning, and shared representation learning), the performance of the system is generally better.

In the rest of this chapter, we will first introduce cross-modal representation models, which are the fundamental part of cross-modal representation learning in NLP. Then, we will introduce several critical applications, such as image captioning, visual relationship detection, and visual question answering.

9.2 Cross-Modal Representation

Cross-modal representation learning aims to build embeddings using information from multiple modalities. Existing cross-modal representation models involving the text modality can be generally divided into two categories: (1) some works [30, 77] try to fuse information from different modalities into unified embeddings (e.g., visually grounded word representations); (2) other works try to build embeddings for different modalities in a common semantic space, which allows the model to compute cross-modal similarity. Such cross-modal similarity can be further utilized for downstream tasks, such as zero-shot recognition [5, 14, 18, 53, 65] and cross-media retrieval [23, 55]. In this section, we will introduce these two kinds of cross-modal representation models, respectively.

9.2.1 Visual Word2vec

Computing word embeddings is a fundamental task in representation learning for natural language processing. Typical word embedding models (like Word2vec [49]) are trained on a text corpus. These models, while being extremely successful, cannot discover implicit semantic relatedness between words that could be expressed in other modalities. Kottur et al. [30] provide an example: even though eat and stare at seem unrelated from text, images might show that when people are eating something, they also tend to stare at it. This implies that considering other modalities when constructing word embeddings may help capture more implicit semantic relatedness.

Vision, being one of the most critical modalities, has attracted attention from researchers seeking to improve word representations. Several models that improve word embeddings by incorporating visual information have been proposed, and we introduce two typical ones in the following.

9.2.1.1 Word Embedding with Global Visual Context

Xu et al. [77] propose a model that makes a natural attempt to incorporate visual features. The authors observe that most word representation models consider only local context information (e.g., predicting a word from its neighboring words and phrases), while global information (e.g., the topic of the passage) is often neglected. Their model extends a simple local context model by using visual information as the global feature (see Fig. 9.1).

Fig. 9.1 The architecture for word embedding with global visual context

The input of the model is an image I and a sequence describing it. The model is based on a simple local context language model: when we consider a certain word \(w_t\) in a sequence, its local feature is the average of the embeddings of the words in a window, i.e., \(\{ w_{t-k}, \ldots , w_{t-1}, w_{t+1}, \ldots , w_{t+k} \}\). The visual feature is computed directly from the image I using a CNN and serves as the global feature. The local feature and the global feature are then concatenated into a vector \(\mathbf{f}\). The predicted probability of a word \(w_t\) filling this position is the softmax-normalized product of \(\mathbf{f}\) and the word embedding \(\mathbf{w}_t\):

$$\begin{aligned}&o_{w_t} = \mathbf {w}_t^T \mathbf {f}, \end{aligned}$$
(9.1)
$$\begin{aligned} P(w_t | w_{t-k}, \ldots , w_{t-1},&w_{t+1}, \ldots , w_{t+k}; I) = \frac{\exp (o_{w_t})}{\sum _i{\exp (o_{w_i})}}. \end{aligned}$$
(9.2)

The model is optimized by maximizing the average log probability:

$$\begin{aligned} \mathscr {L} = \frac{1}{T} \sum _{t=k}^{T-k} \log P(w_t | w_{t-k},\ldots , w_{t-1}, w_{t+1},\ldots , w_{t+k}; I). \end{aligned}$$
(9.3)

The classification error is back-propagated to the local text vectors (i.e., word embeddings), the visual vector, and all model parameters. This accomplishes the joint learning of a set of word embeddings, a language model, and the visual encoder.
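To make the pipeline concrete, below is a minimal PyTorch sketch of such a model, assuming the global visual feature has already been extracted by a CNN and is given as a fixed-size vector. For simplicity, the output layer is an untied linear classifier rather than the word embeddings themselves as in Eq. (9.1), and all module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualContextLM(nn.Module):
    """Predict the center word from the averaged local context plus a global visual feature."""
    def __init__(self, vocab_size, emb_dim, img_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)       # word embedding table
        self.img_proj = nn.Linear(img_dim, emb_dim)        # project the CNN feature into embedding space
        self.out = nn.Linear(2 * emb_dim, vocab_size)      # scores over the vocabulary

    def forward(self, context_ids, img_feat):
        local = self.emb(context_ids).mean(dim=1)          # average embedding of the window words
        global_feat = self.img_proj(img_feat)              # global visual feature
        f = torch.cat([local, global_feat], dim=-1)        # concatenated feature vector f
        return self.out(f)                                 # unnormalized scores o_w

model = VisualContextLM(vocab_size=10000, emb_dim=128, img_dim=2048)
context = torch.randint(0, 10000, (4, 6))                  # 2k = 6 context words per example
image = torch.randn(4, 2048)                               # precomputed CNN image features
target = torch.randint(0, 10000, (4,))                     # the word w_t to predict
loss = F.cross_entropy(model(context, image), target)      # negative of Eq. (9.3) for one batch
loss.backward()                                            # gradients reach embeddings and the projection
```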

9.2.1.2 Word Embedding with Abstract Visual Scene

Kottur et al. [30] also propose a neural model to capture fine-grained semantics from visual information. Instead of focusing on literal pixels, the abstract scene behind the vision is considered. The model takes as input a pair (I, w) of a visual scene I and a related word sequence w. At each training step, a window is applied to the word sequence w, forming a subsequence \(S_w\). All the words in \(S_w\) are fed into the input layer using one-hot encoding, and therefore the dimension of the input layer is \(|\mathscr {V}|\), the size of the vocabulary. The words are then transformed into their embeddings, and the hidden layer is the average of all these embeddings. The size of the hidden layer is \(N_H\), which is also the dimension of the word embeddings. The hidden layer and the output layer are connected by a fully connected matrix of dimension \(N_H \times N_K\) followed by a softmax function. The output layer can be regarded as a probability distribution over a discrete-valued function \(g(\cdot )\) of the visual scene I (details are given in the following paragraph). The entire model is optimized by minimizing the objective function:

$$\begin{aligned} \mathscr {L} = - \log P( g(w) | S_w ). \end{aligned}$$
(9.4)

The most important part of the model is the function \(g(\cdot )\). It maps the visual scene I into the set \(\{1, 2, \ldots , N_K\}\), which indicates what kind of abstract scene it is. In practice, it is learned offline using K-means clustering: each cluster represents the semantics of one kind of visual scene and, consequently, of the word sequences w that are designed to be related to such scenes.
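A minimal sketch of this setup is given below, with scikit-learn K-means standing in for the offline clustering that defines \(g(\cdot )\) and random tensors standing in for real scene features and word windows; all sizes are illustrative.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

# Offline: cluster abstract-scene features so that g(I) is the cluster id of scene I.
scene_feats = np.random.randn(500, 64).astype("float32")   # placeholder abstract-scene features
g = KMeans(n_clusters=20, n_init=10).fit(scene_feats)

class SceneCBOW(nn.Module):
    """Predict the scene cluster g(I) from the average embedding of a word window S_w."""
    def __init__(self, vocab_size, n_hidden, n_clusters):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_hidden)      # |V| x N_H embedding table
        self.out = nn.Linear(n_hidden, n_clusters)         # N_H x N_K output layer

    def forward(self, window_ids):
        hidden = self.emb(window_ids).mean(dim=1)          # hidden layer = average of window embeddings
        return self.out(hidden)                            # softmax over scene clusters applied in the loss

model = SceneCBOW(vocab_size=5000, n_hidden=128, n_clusters=20)
window = torch.randint(0, 5000, (8, 5))                    # word windows S_w
cluster = torch.as_tensor(g.predict(scene_feats[:8]), dtype=torch.long)  # targets g(I)
loss = F.cross_entropy(model(window), cluster)             # -log P(g(w) | S_w), Eq. (9.4)
loss.backward()
```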

9.2.2 Cross-Modal Representation for Zero-Shot Recognition

The success of deep learning methods is partially supported by large-scale datasets. Even though the scales of datasets continue to grow and more categories are involved, the annotation of datasets is expensive and time-consuming. For many categories, there are very limited or even no instances, which restricts the scalability of recognition systems.

Zero-shot recognition is proposed to solve the problem mentioned above: it aims to classify instances of categories that have not been seen during training. Many works propose to utilize cross-modal representation for zero-shot image classification [5, 14, 18, 53, 65]. Specifically, image representations and category representations are embedded into a common semantic space, where similarities between image and category representations can serve for further classification. For example, in such a common semantic space, the embedding of an image of a cat is expected to be closer to the embedding of the category cat than to the embedding of the category truck.

9.2.2.1 Deep Visual-Semantic Embedding

The challenge of zero-shot learning lies in the absence of instances of unseen categories, which makes it difficult to obtain well-performing classifiers for unseen categories. Frome et al. [18] present a model that utilizes both labeled images and information from large-scale plain text for zero-shot image classification. They try to leverage semantic information from word embeddings and transfer it to image classification systems.

Their model is motivated by the fact that word embeddings incorporate semantic information of concepts or categories, which can be potentially utilized as classifiers of corresponding categories. Similar categories cluster well in semantic space. For example, in word embedding space, the nearest neighbors of the term tiger shark are similar kinds of sharks, such as bull shark, blacktip shark, sandbar shark, and oceanic whitetip shark. In addition, boundaries between different clusters are clear. The aforementioned properties indicate that word embeddings can be further utilized as classifiers for recognition systems.

Specifically, the model first pretrains word embeddings using the Skip-gram text model on large-scale Wikipedia articles. For visual feature extraction, the model pretrains a deep convolutional neural network for 1,000 object categories on ImageNet. The pretrained word embeddings and the convolutional neural network are used to initialize the proposed Deep Visual-Semantic Embedding model (DeViSE).

To train the proposed model, they replace the softmax layer of the pretrained convolutional neural network with a linear projection layer. The model is trained to predict the word embeddings of categories for images using a hinge ranking loss:

$$\begin{aligned} \mathscr {L}(I, y) = \sum _{j \ne y} \max [0, \gamma - \mathbf {w}_{y}\mathbf {M}\mathbf {I} + \mathbf {w}_j\mathbf {M}\mathbf {I}], \end{aligned}$$
(9.5)

where \(\mathbf {w}_{y}\) and \(\mathbf {w}_{j}\) are the learned word embeddings of the positive label and a sampled negative label, respectively, \(\mathbf {I}\) denotes the feature of the image obtained from the convolutional neural network, \(\mathbf {M}\) is the trainable parameter matrix of the linear projection layer, and \(\gamma \) is a hyperparameter of the hinge ranking loss. Given an image, the objective requires the model to produce a higher score for the correct label than for randomly chosen labels, where the score is defined as the dot product of the projected image feature and the word embedding of the label term.

At test time, given a test image, the score of each possible category is obtained using the same approach during training. Note that a crucial difference at test time is that the classifiers (word embeddings) are expanded to all possible categories, including unseen categories. Thus the model is capable of predicting unseen categories.
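The following sketch illustrates a DeViSE-style hinge ranking loss (Eq. 9.5) and zero-shot scoring, assuming precomputed CNN image features and fixed pretrained label embeddings; negatives are randomly sampled and the margin is illustrative.

```python
import torch

def devise_loss(img_feat, M, word_emb, pos_idx, margin=0.1, n_neg=5):
    """Hinge ranking loss: the correct label should outscore sampled negative labels."""
    proj = img_feat @ M                                    # M I, shape (batch, emb_dim)
    pos_score = (word_emb[pos_idx] * proj).sum(-1)         # w_y M I
    neg_idx = torch.randint(0, word_emb.size(0), (img_feat.size(0), n_neg))
    neg_score = torch.einsum("bd,bnd->bn", proj, word_emb[neg_idx])  # w_j M I
    return torch.clamp(margin - pos_score.unsqueeze(1) + neg_score, min=0).sum(1).mean()

word_emb = torch.randn(1000, 300)                          # pretrained Skip-gram label embeddings (fixed)
M = torch.randn(2048, 300, requires_grad=True)             # trainable linear projection
images = torch.randn(16, 2048)                             # CNN image features
labels = torch.randint(0, 1000, (16,))
devise_loss(images, M, word_emb, labels).backward()

# Zero-shot test time: score every category (seen or unseen) by w_m M I and take the best.
all_emb = torch.randn(1500, 300)                           # embeddings of seen + unseen categories
pred = ((images @ M) @ all_emb.t()).argmax(dim=1)
```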

Experiment results show that DeViSE can make zero-shot predictions with more semantically reasonable errors, which means that even if the prediction is not exactly correct, it is semantically related to the ground truth class. However, a drawback is that although the model can utilize semantic information in word embeddings to perform zero-shot image classification, using word embeddings as classifiers restricts the flexibility of the model, which results in inferior performance on the original 1,000 categories compared to the original softmax classifier.

9.2.2.2 Convex Combination of Semantic Embeddings

Inspired by DeViSE, [53] proposes ConSE, a model that also utilizes semantic information from word embeddings for zero-shot classification. A vital difference from DeViSE is that ConSE obtains the semantic embedding of a test image as a convex combination of the word embeddings of seen categories, where the weight of each word embedding is determined by the classifier's score for the corresponding category.

Specifically, they train a deep convolutional neural network on seen categories. At test time, given a test image I (possibly from unseen categories), they obtain the top T confident predictions of seen categories, where T is a hyperparameter. Then the semantic embedding f(I) of I is determined by the convex combination of semantic embeddings of the top T confident categories, which can be formally defined as follows:

$$\begin{aligned} f(I) = \frac{1}{Z} \sum ^T_{t=1}P(\hat{y}_0(I, t) |I) \cdot \mathbf {w}(\hat{y}_0(I, t)), \end{aligned}$$
(9.6)

where \(\hat{y}_0(I, t)\) is the tth most confident training label for I, \(\mathbf {w}(\hat{y}_0(I, t))\) is the semantic embedding (word embedding) of \(\hat{y}_0(I, t)\), and Z is a normalization factor given by

$$\begin{aligned} Z = \sum ^T_{t=1} P(\hat{y}_0(I, t) | I). \end{aligned}$$
(9.7)

After obtaining the semantic embedding f(I), the score of the category m is given by the cosine similarity of f(I) and \(\mathbf {w}(m)\).
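A small sketch of ConSE inference (Eqs. 9.6 and 9.7), assuming a pretrained softmax classifier over seen categories and pretrained category word embeddings; all tensors are placeholders.

```python
import torch
import torch.nn.functional as F

def conse_embedding(logits, seen_emb, T=10):
    """Convex combination of the word embeddings of the top-T most confident seen categories."""
    probs = F.softmax(logits, dim=-1)
    top_p, top_idx = probs.topk(T, dim=-1)                 # P(y_hat(I, t) | I), t = 1..T
    weights = top_p / top_p.sum(dim=-1, keepdim=True)      # divide by the normalization factor Z
    return torch.einsum("bt,btd->bd", weights, seen_emb[top_idx])

seen_emb = torch.randn(1000, 300)                          # word embeddings of seen categories
all_emb = torch.randn(1500, 300)                           # word embeddings of seen + unseen categories
logits = torch.randn(4, 1000)                              # classifier outputs for 4 test images
f = conse_embedding(logits, seen_emb)                      # semantic embedding f(I)
scores = F.cosine_similarity(f.unsqueeze(1), all_emb.unsqueeze(0), dim=-1)
pred = scores.argmax(dim=1)                                # zero-shot prediction over all categories
```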

The motivation of ConSE is the assumption that novel categories can be modeled as convex combinations of seen categories. If the model is highly confident about a prediction (i.e., \(P(\hat{y}_0( {I}, 1) | {I}) \approx 1\)), the semantic embedding f(I) will be close to \(\mathbf {w}(\hat{y}_0( {I}, 1))\). If the predictions are ambiguous (e.g., \(P(\mathtt {tiger}| {I})=0.5, P(\mathtt {lion}| {I})=0.5\)), the semantic embedding f(I) will lie between \(\mathbf {w}(\mathtt {lion})\) and \(\mathbf {w}(\mathtt {tiger})\). In that case, the semantic embedding \(f( {I}) = 0.5\mathbf {w}(\mathtt {lion}) + 0.5\mathbf {w}(\mathtt {tiger})\) is expected to be close to the semantic embedding \(\mathbf {w}(\mathtt {liger})\) (a liger is a hybrid cross between a lion and a tiger).

Although ConSE and DeViSE share many similarities, there are also some crucial differences. DeViSE replaces the softmax layer of the pretrained visual model with a projection layer, while ConSE preserves the softmax layer. ConSE does not need to be further trained and uses a convex combination of semantic embeddings to perform zero-shot classification at test time. Experiment results show that ConSE outperforms DeViSE on unseen categories, indicating better generalization capability. However, the performance of ConSE on seen categories is not as competitive as that of DeViSE and the original softmax classifier.

9.2.2.3 Cross-Modal Transfer

Socher et al. [65] present a cross-modal representation model for zero-shot recognition. In their model, all word vectors are initialized with pretrained 50-dimensional word vectors and are kept fixed during training. Each image is represented by a vector \(\mathbf {I}\) constructed by a deep convolutional neural network. They first project an image into semantic word spaces by minimizing

$$\begin{aligned} \mathscr {L}(\varTheta ) = \sum _{y\in Y_s}\sum _{I^{(i)}\in X_y} \Vert \mathbf {w}_y - \theta ^{(2)}f(\theta ^{(1)} \mathbf {I}^{(i)})\Vert ^2, \end{aligned}$$
(9.8)

where \(Y_s\) denotes the set of classes seen in the training data, \(X_y\) denotes the set of image vectors of class y, \(\mathbf {w}_y\) denotes the word vector of class y, and \(\varTheta = (\theta ^{(1)}, \theta ^{(2)})\) denotes the parameters of the 2-layer neural network with \(f(\cdot ) = \tanh (\cdot )\) as the activation function.

They observe that instances from unseen categories are usually outliers of the complete data manifold. Following this observation, they first classify an instance into seen and unseen categories via outlier detection methods. Then the instance is classified using corresponding classifiers.

Formally, they marginalize over a binary random variable \(V\in \{s, u\}\) that denotes whether an instance belongs to the seen or the unseen categories, so that the class probability is given as

$$\begin{aligned} P(y|I) = \sum _{V \in \{s, u\}} P(y| V, I) P(V | I). \end{aligned}$$
(9.9)

For seen classes, they simply use a softmax classifier to determine P(y|s, I), while for unseen classes, they assume an isometric Gaussian distribution around each of the novel class word vectors and assign classes based on the likelihood. To detect novelty, they calculate a Local Outlier Probability using the Gaussian error function.
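The sketch below illustrates the marginalization of Eq. (9.9) in a simplified form: an outlier probability gates between a softmax over seen classes and isometric Gaussian likelihoods around unseen class word vectors (normalized here for readability); the actual Local Outlier Probability computation is omitted, and all tensors are placeholders.

```python
import torch
import torch.nn.functional as F

def cmt_predict(x, seen_logits, unseen_emb, p_unseen, sigma=1.0):
    """Combine P(y|s,I)P(s|I) over seen classes with Gaussian-based P(y|u,I)P(u|I) over unseen ones."""
    p_seen_classes = F.softmax(seen_logits, dim=-1) * (1 - p_unseen)
    dist = torch.cdist(x, unseen_emb)                      # distance to each unseen class word vector
    p_unseen_classes = F.softmax(-dist ** 2 / (2 * sigma ** 2), dim=-1) * p_unseen
    return torch.cat([p_seen_classes, p_unseen_classes], dim=-1)

x = torch.randn(4, 50)                                     # images projected into the 50-d word space
seen_logits = torch.randn(4, 8)                            # softmax classifier over 8 seen classes
unseen_emb = torch.randn(2, 50)                            # word vectors of 2 unseen classes
p_unseen = torch.rand(4, 1)                                # P(u|I) from the outlier detector
probs = cmt_predict(x, seen_logits, unseen_emb, p_unseen)  # distribution over all 10 classes
```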

9.2.3 Cross-Modal Representation for Cross-Media Retrieval

Learning cross-modal representation from different modalities in a common semantic space allows one to easily compute cross-modal similarities, which can facilitate many important cross-modal tasks, such as cross-media retrieval. With the rapid growth of multimedia data such as text, image, video, and audio on the Internet, the need to retrieve information across different modalities has become stronger. Cross-media retrieval is an important task in the multimedia area, which aims to perform retrieval across different modalities such as text and image. For example, a user may submit an image of a white horse, and retrieve relevant information from different modalities, such as textual descriptions of horses, and vice versa.

A significant challenge of cross-modal retrieval is the domain discrepancies between different modalities. Besides, for a specific area of interest, cross-modal data can be insufficient, which limits the performance of existing cross-modal retrieval methods. Many works have focused on the challenges as mentioned above in cross-modal retrieval [23, 24].

9.2.3.1 Cross-Modal Hybrid Transfer Network

Huang et al. [24] present a framework that tries to relieve the cross-modal data sparsity problem by transfer learning. They propose to leverage knowledge from a large-scale single-modal dataset to boost the model training on the small-scale dataset. The massive auxiliary dataset is denoted as the source domain, and the small-scale dataset of interest is denoted as the target domain. In their work, they adopt ImageNet [12], a large-scale image database as the source domain.

Formally, a training set consists of data from the source domain \(Src = \{I_s^p, y_s^p\}^P_{p=1}\) and the target domain \(Tar_{tr} = \{(I_s^j, t_s^j), y_s^j\}^J_{j=1}\), where (I, t) is an image/text pair with label y. Similarly, the test set can be denoted as \(Tar_{te} = \{(I_s^m, t_s^m), y_s^m\}^M_{m=1}\). The goal of their model is to transfer knowledge from Src to boost the model performance on \(Tar_{te}\) for cross-media retrieval.

Their model consists of a modal-sharing transfer subnetwork and a layer-sharing correlation subnetwork. In modal-sharing transfer subnetwork, they adopt the convolutional layers of AlexNet [32] to extract image features for source and target domains, and use word vectors to obtain text features. The image and text features pass through two fully connected layers, where single-modal and cross-modal knowledge transfer are performed.

Single-modal knowledge transfer aims to transfer knowledge from images in the source domain to images in the target domain. The main challenge is the domain discrepancy between the two image datasets. They propose to address this discrepancy by minimizing the Maximum Mean Discrepancy (MMD) of the image modality between the source and target domains. The MMD is calculated in a layer-wise style in the fully connected layers. By minimizing the MMD in a reproducing kernel Hilbert space, the image representations from the source and target domains are encouraged to have the same distribution, so knowledge from images in the source domain is expected to transfer to images in the target domain. Besides, the image encoder in the source domain is also fine-tuned by optimizing a softmax loss on labeled image instances.
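A minimal sketch of an RBF-kernel MMD term such as could be minimized layer-wise between source-domain and target-domain image features; the kernel bandwidth and feature sizes are illustrative.

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """Biased estimate of squared MMD between two feature batches under an RBF kernel."""
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

src_feat = torch.randn(32, 4096)                           # source-domain image features from one fc layer
tgt_feat = torch.randn(32, 4096)                           # target-domain image features from the same layer
transfer_loss = mmd_rbf(src_feat, tgt_feat)                # minimized to align the two distributions
```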

Cross-modal knowledge transfer aims to transfer knowledge between image and text in the target domain. Text and image representations from an annotated pair in the target domain are encouraged to be close to each other by minimizing their Euclidean distance. The cross-modal transfer loss of image and text representations is also computed in a layer-wise style in the fully connected layers. The domain discrepancy between image and text modalities is expected to be reduced in high-level layers.

In the layer-sharing correlation subnetwork, representations from the modal-sharing transfer subnetwork in the target domain are fed into shared fully connected layers to obtain the final common representation for both image and text. As the parameters are shared between the two modalities, the last two fully connected layers are expected to capture the cross-modal correlation. Their model also utilizes label information in the target domain by minimizing a softmax loss on labeled image/text pairs. After obtaining the final common representations, cross-media retrieval can be achieved by simply computing nearest neighbors in the semantic space.

9.2.3.2 Deep Cross-Media Knowledge Transfer

As an extension of [24], the work in [23] also focuses on dealing with domain discrepancy and insufficient cross-modal data for cross-media retrieval in specific areas. Huang and Peng [23] present a framework that transfers knowledge from a large-scale cross-media dataset (the source domain) to boost model performance on another small-scale cross-media dataset (the target domain).

A crucial difference from [24] is that the dataset in the source domain also consists of image/text pairs with label annotations, instead of the single-modal setting in [24]. Since both domains contain image and text media types, domain discrepancy comes from both the media-level discrepancy within the same media type and the correlation-level discrepancy in image/text correlation patterns between different domains. They propose to transfer intra-media semantic and inter-media correlation knowledge by jointly reducing domain discrepancies at the media level and the correlation level.

To extract distributed features for different media types, they adopt VGG19 [63] as the image encoder and Word CNN [29] as the text encoder. The two domains have the same architecture but do not share parameters. The extracted image/text features pass through two fully connected layers, respectively, where the media-level transfer is performed. Similar to [24], they reduce domain discrepancies within the same modality by minimizing the Maximum Mean Discrepancy (MMD) between the source and target domains. The MMD is computed in a layer-wise style to transfer knowledge within the same modality. They also minimize the Euclidean distance between paired image/text representations in both the source and target domains to preserve the semantic information across modalities.

Correlation-level transfer aims to reduce domain discrepancy in image/text correlation patterns in different domains. In two domains, both image and text representations share the last two fully connected layers to obtain the common representation for each domain. They optimize layer-wise MMD loss between the shared fully connected layers in different domains for correlation-level knowledge transfer, which encourages source and target domains to have the same image/text correlation patterns. Finally, both domains are trained with label information of image/text pairs. Note that the source domain and target domain do not necessarily share the same label set.

In addition, they propose a progressive transfer mechanism, a curriculum learning method aiming to promote the robustness of model training. This is achieved by selecting easy samples for model training in the early period and gradually increasing the difficulty during training. The difficulty of training samples is measured according to the bidirectional cross-media retrieval consistency.

9.3 Image Captioning

Image captioning is the task of automatically generating natural language descriptions for images. It is a fundamental task in artificial intelligence, which connects natural language processing and computer vision. Compared with other computer vision tasks, such as image classification and object detection, image captioning is significantly harder for two reasons: first, not only objects but also relationships between them have to be detected; second, besides basic judgments and classification, natural language sentences have to be generated.

Traditional methods for image captioning usually rely on retrieval models or generation models, whose ability to generalize is comparatively weaker than that of recent deep neural network models. In this section, we will introduce several typical models of each kind.

9.3.1 Retrieval Models for Image Captioning

The primary pipeline of retrieval models is (1) represent images and/or sentences using special features; (2) for new images and/or sentences, search for probable candidates according to the similarity of features.

Linking words to images has a rich history, and [50] (a retrieval model) is the first image annotation system. This paper tries to build a keyword assigning system for images from labeled data. The pipeline is as follows:

(1) Image segmentation. Every image is divided into several parts, using the simplest rectangular division. The reason for doing so is that an image is typically annotated with multiple labels, each of which often corresponds to only a part of it. Segmentation helps reduce noise in labeling.

(2) Feature extraction. Features of every part of the image are extracted.

(3) Clustering. Feature vectors of image segments are divided into several clusters. Each cluster accumulates word frequencies and thereby calculates word likelihood. Concretely,

$$\begin{aligned} P(w_{i}|c_{j}) = \frac{P(c_{j}|w_{i})P(w_{i})}{\sum _{k}{P(c_{j}|w_{k})P(w_{k})}} = \frac{n_{ji}}{N_{j}}, \end{aligned}$$
(9.10)

where \(n_{ji}\) is the number of times word \(w_{i}\) appears in cluster j, and \(N_{j}\) is the total number of word occurrences in cluster j. The calculation simply uses frequencies as probabilities.

(4) Inference. For a new image, the model divides it into segments, extracts features for every part, and finally, aggregates keywords assigned to every part to obtain the final prediction.

The key idea of this model is image segmentation. Take a landscape picture with two parts, mountain and sky, for instance: both parts will initially be annotated with both labels. However, if another picture has the two parts mountain and river, the two mountain parts would hopefully fall into the same cluster and thus reveal that they share the label mountain. In this way, labels can be assigned to the correct parts of the image, and noise can be alleviated.
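The counting behind Eq. (9.10) is simple enough to sketch directly; the cluster assignments and word ids below are placeholders.

```python
import numpy as np

n_clusters, vocab_size = 4, 6
counts = np.zeros((n_clusters, vocab_size))                # n_ji: times word i is attached to cluster j
training_pairs = [(0, 2), (0, 3), (1, 2), (3, 5), (3, 5), (1, 4)]  # (cluster id, word id) observations
for j, i in training_pairs:
    counts[j, i] += 1
p_word_given_cluster = counts / counts.sum(axis=1, keepdims=True).clip(min=1)  # Eq. (9.10)

# Inference: a new segment assigned to cluster j proposes its most likely keywords.
j = 3
keywords = np.argsort(-p_word_given_cluster[j])[:2]
```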

Another typical retrieval model is proposed in [17], which assigns a linking score between an image and a sentence. This score is computed through an intermediate meaning space, represented as a triple of the form \(\langle \)object, action, scene\(\rangle \). Each slot of the triple has a finite discrete candidate set. Mapping images and sentences into the meaning space involves solving a Markov random field.

Different from the previous model, this system can not only caption images but also do the inverse, that is, given a sentence, it provides probable associated images. At the inference stage, the image (sentence) is first mapped into the intermediate meaning space, and then the pool is searched for the sentence (image) with the best matching score.

After that, researchers also proposed many other retrieval models that consider different characteristics of images, such as [21, 28, 34].

9.3.2 Generation Models for Image Captioning

Different from the retrieval-based model, the basic pipeline of generation models is (1) use computer vision techniques to extract image features, (2) generate sentences from these features using methods such as language models or sentence templates.

Kulkarni et al. [33] propose a system that makes a tight connection between the particular image and the sentence-generating process. The model uses visual detectors to detect specific objects, as well as attributes of a single object and relationships between multiple objects. Then it constructs a conditional random field to incorporate unary image potentials and higher order text potentials and thereby predicts labels for the image. The labels predicted by the conditional random field (CRF) are arranged as a triple, e.g., \(\langle \langle \mathtt {white}, \mathtt {cloud}\rangle , \mathtt {in}, \langle \mathtt {blue}, \mathtt {sky}\rangle \rangle \).

Then sentences are generated according to the labels. There are two ways to build a sentence based on the triple skeleton. (1) The first is to use an n-gram language model. For example, when trying to decide whether or not to put a glue word x between a pair of meaningful words (which means they are inside the triple) a and b, the probabilities \(\hat{p}(axb)\) and \(\hat{p}(ab)\) are compared for the decision. \(\hat{p}\) is the standard length-normalized probability of the n-gram language model. (2) The second is to use a set of descriptive language templates, which alleviates the problem of grammar mistakes in the language model.

Further, [16] proposes a novel framework to explicitly represent the relationship between the structure of an image and the structure of its caption sentence. The method, Visual Dependency Representation, detects objects in the image and the relationships between these objects based on the proposed Visual Dependency Grammar, which includes eight typical relations such as beside or above. The image can then be arranged as a dependency graph, where nodes are objects and edges are relations. This image dependency graph can be aligned with the syntactic dependency representation of the caption sentence. The paper further provides four templates for generating descriptive sentences from the extracted dependency representation.

Besides these two typical works, there are many other generation models for image captioning, such as [15, 35, 78].

9.3.3 Neural Models for Image Captioning

In 2011, it was claimed in [33] that for image captioning tasks, natural language generation still remained an open research problem, and most previous work was based on retrieval and summarization. Since 2015, inspired by advances in neural language models and neural machine translation, a number of end-to-end neural image captioning models based on the encoder-decoder framework have been proposed. These new models significantly improve the ability to generate natural language descriptions.

9.3.3.1 The Basic Model

Traditional machine translation models typically stitch many subtasks together, such as individual word translation and reordering, to perform sentence and paragraph translation. Recent neural machine translation models, such as [8], use a single encoder-decoder model, which can be conveniently optimized by stochastic gradient descent. The task of image captioning is inherently analogous to machine translation because it can also be regarded as a translation task, where the source “language” is an image. The encoders and decoders used for machine translation are typically RNNs, which are a natural choice for sequences of words. For image captioning, a CNN is chosen as the encoder, and an RNN is still used as the decoder.

Fig. 9.2 The architecture of encoder-decoder framework for image captioning

Vinyals et al. [70] propose the most typical encoder-decoder model for image captioning (see Fig. 9.2). Concretely, a CNN model is used to encode the image into a fixed-length vector, which is believed to contain the necessary information for captioning. With this vector, an RNN language model is used to generate the natural language description; this is the decoder. Here, the decoder is similar to the LSTM used for machine translation. The first unit takes the image vector as its input vector, and the remaining units take the previous word embedding as input. Each unit outputs a vector \(\mathbf{o}\) and passes a hidden vector to the next unit. \(\mathbf{o}\) is further fed into a softmax layer, whose output \(\mathbf{p}\) is the probability of each word within the vocabulary. The calculated probabilities are treated differently in training and testing:

Training. The probabilities \(\mathbf{p}\) are used to calculate the likelihood of the provided description sentences. Considering the nature of RNNs, it is natural to factorize the joint probability into conditional probabilities:

$$\begin{aligned} \log P(s|I) = \sum _{t=0}^{N}{\log P(w_{t}|I, w_{0},\ldots , w_{t-1})}, \end{aligned}$$
(9.11)

where \(s=\{w_{1}, w_{2},..., w_{N}\}\) is the sentence with its words, \(w_{0}\) is a special START token, and I is the image. Stochastic gradient descent can thereby be performed to optimize the model.

Testing. There are multiple approaches to generating sentences given an image. The first one is called Sampling: at each step, the single word with the highest probability in \(\mathbf{p}\) is chosen and used as the input of the next unit, until the END token is generated or a maximal length is reached. The second one is called Beam Search: at each step (when the length of the partial sentences is t), the k best sentences are kept; each of them is extended to several new sentences of length \(t+1\), and again only the best k are kept. Beam Search provides a better approximation for

$$\begin{aligned} s^{*} = \arg \max _s{\log P({s|I})}. \end{aligned}$$
(9.12)
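Below is a minimal sketch of such a CNN-conditioned LSTM decoder with greedy sampling; beam search would instead keep the k best partial sentences at every step. The module layout, token ids, and sizes are illustrative, and the image feature is assumed to be precomputed by a CNN.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """LSTM decoder conditioned on the CNN image feature at the first step."""
    def __init__(self, vocab_size, emb_dim, img_dim, hidden):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)        # image vector fed as the step-0 input
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTMCell(emb_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, img_feat, start_id, end_id, max_len=20):
        h = torch.zeros(img_feat.size(0), self.lstm.hidden_size)
        c = torch.zeros_like(h)
        h, c = self.lstm(self.img_proj(img_feat), (h, c))  # step 0: show the image to the decoder
        word = torch.full((img_feat.size(0),), start_id, dtype=torch.long)
        caption = []
        for _ in range(max_len):
            h, c = self.lstm(self.emb(word), (h, c))
            word = self.out(h).argmax(dim=-1)              # Sampling: take the most likely word
            caption.append(word)
            if (word == end_id).all():                     # stop once every sentence emits END
                break
        return torch.stack(caption, dim=1)

decoder = CaptionDecoder(vocab_size=10000, emb_dim=256, img_dim=2048, hidden=512)
caption_ids = decoder.greedy_decode(torch.randn(2, 2048), start_id=1, end_id=2)
```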

9.3.3.2 Variants of the Basic Model

The research on image captioning tightly follows that on machine translation. Inspired by [6], which uses attention mechanism in machine translation, [76] introduces visual attention into the encoder-decoder image captioning model.

The major bottleneck of [70] is that information from the image is shown to the LSTM decoder only at the first decoding unit, which requires the encoder to squeeze all useful information into one fixed-length vector. In contrast, [76] does not require such compression. The CNN encoder does not produce one vector for the entire image; instead, it produces L region vectors \(\mathbf {I}_i\), each of which is the representation of a part of the image. At every step of decoding, the inputs include the standard LSTM inputs (i.e., the output and hidden state of the last step, \(\mathbf {o}_{t-1}\) and \(\mathbf {h}_{t-1}\)) and an input vector \(\mathbf{z}\) from the encoder. Here, \(\mathbf{z}\) is the weighted sum of the image vectors \(\mathbf {I}_i\): \(\mathbf{z} = \sum _i{\alpha _i \mathbf {I}_i}\), where \(\alpha _i\) is a weight computed from \(\mathbf {I}_i\) and \(\mathbf {h}_{t-1}\). Throughout the training process, the model learns to focus on the parts of the image relevant for generating the next word by producing larger weights \(\alpha \) on more relevant parts, as shown in Fig. 9.3.

Fig. 9.3 An example of image captioning with attention mechanism

While the above paper uses soft attention for the image, [27] makes explicit alignment between image fragments and sentence fragments before generating a description for the image. In the first stage, the alignment stage, sentence and image fragments are aligned by being mapped to a shared space. Concretely, sentence fragments (i.e., n consecutive words) are encoded using a bidirectional LSTM into the embeddings \(\mathbf{s}\), and image fragments (i.e., part of the image, and also the entire image) are encoded using a CNN into the embeddings \(\mathbf{I}\). The similarity score between image I and sentence s is computed as

$$\begin{aligned} \text {sim}(I, s) = \sum _{t \in g_s}{\text {max}_{i \in g_I}(0, \mathbf {I}_i^\top \mathbf {s}_t)}, \end{aligned}$$
(9.13)

where \(g_s\) is the sentence fragment set of sentence s, and \(g_I\) is the image fragment set of image I. The alignment is then optimized by minimizing the ranking loss \(\mathscr {L}\) for both sentences and images:

$$\begin{aligned} \mathscr {L} = \sum _I { \left[ \sum _s {\max (0, \text {sim}(I, s)-\text {sim}(I,I)+1)} + \sum _s {\max (0, \text {sim}(s,I)-\text {sim}(I, I)+1)} \right] }. \end{aligned}$$
(9.14)

The assumption behind this alignment procedure is similar to [50] (see Sect. 9.3.1): all description sentences are regarded as (possibly noisy) labels for every image fragment, and based on the massive training data, the model would hopefully learn to align caption fragments to their corresponding image fragments. The second stage is similar to the basic model in [70], but the alignment results are used to provide more precise training data.
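A small sketch of the fragment-matching score of Eq. (9.13) and the accompanying ranking objective, assuming fragment embeddings have already been produced by the bidirectional LSTM and the CNN; the margin of 1 follows Eq. (9.14), and the loop-based loss is written for clarity rather than efficiency.

```python
import torch

def sim(img_frags, sent_frags):
    """Score an (image, sentence) pair by matching each sentence fragment to its best image fragment."""
    dots = sent_frags @ img_frags.t()                      # (n_words, n_regions) fragment dot products
    return dots.max(dim=1).values.clamp(min=0).sum()       # sum over t of thresholded best matches

def ranking_loss(images, sentences, margin=1.0):
    """Matched pairs should outscore mismatched image-sentence pairs in both directions."""
    loss = torch.zeros(())
    for k, (img, sent) in enumerate(zip(images, sentences)):
        pos = sim(img, sent)                               # score of the ground-truth pair
        for l in range(len(images)):
            if l == k:
                continue
            loss = loss + torch.clamp(sim(img, sentences[l]) - pos + margin, min=0)
            loss = loss + torch.clamp(sim(images[l], sent) - pos + margin, min=0)
    return loss

images = [torch.randn(5, 128) for _ in range(3)]           # image fragment embeddings I_i per image
sentences = [torch.randn(7, 128) for _ in range(3)]        # sentence fragment embeddings s_t per caption
loss = ranking_loss(images, sentences)
```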

As mentioned above, [76] gives the decoder the ability to focus attention on different parts of the image for different words. However, there are some nonvisual words in the decoding process. For example, words such as the and of depend more on semantic information than on visual information. Furthermore, words such as phone following cell, or meter following parking, are usually generated by the language model rather than from visual evidence. To prevent the gradients of nonvisual words from decreasing the effectiveness of visual attention during caption generation, [43] adopts an adaptive attention model with a visual sentinel. At each time step, the model determines whether to rely on an image region or on the visual sentinel.

The adaptive attention model [43] applies attention in the process of generating a word rather than in updating the LSTM state; it utilizes a “visual sentinel” vector \(\mathbf {x}_t\) and image region vectors \(\mathbf {I}_i\). Here, \(\mathbf {x}_t\) is produced from the inputs and states of the LSTM at time step t, while \(\mathbf {I}_i\) is provided by the CNN encoder. The adaptive context vector \(\hat{\mathbf {c}}_t\) is then the weighted sum of the L image region vectors \(\mathbf {I}_i\) and the visual sentinel \(\mathbf{x}_t\):

$$\begin{aligned} \hat{\mathbf {c}}_t= \sum _i^L{\alpha _i \mathbf {I}_i}+\alpha _{L+1}\mathbf {x}_t, \end{aligned}$$
(9.15)

where the \(\alpha _{i}\) are the weights computed from \(\mathbf {I}_i\), \(\mathbf {x}_t\), and the LSTM hidden state \(\mathbf {h}_t\), with \(\sum _{i=1}^{L+1}\alpha _{i}=1\). Finally, the probability of a word in the vocabulary at time t can be calculated in a residual form:

$$\begin{aligned} \mathbf {p}_t ={\text {Softmax}}(\mathbf {W}_p(\hat{\mathbf {c}}_t+\mathbf {h}_t)), \end{aligned}$$
(9.16)

where \(\mathbf {W}_p\) is a learned weight parameter.
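A sketch of the adaptive context computation of Eqs. (9.15) and (9.16) is shown below; the attention scoring function is an illustrative single-layer form rather than the exact one used in [43], and all tensors are placeholders.

```python
import torch
import torch.nn.functional as F

def adaptive_context(regions, sentinel, hidden, W_att, w_score):
    """Weighted sum of L region vectors and the visual sentinel (Eq. 9.15)."""
    cand = torch.cat([regions, sentinel.unsqueeze(1)], dim=1)          # (batch, L + 1, d) candidates
    scores = torch.tanh(cand @ W_att + hidden.unsqueeze(1)) @ w_score  # illustrative attention scores
    alpha = F.softmax(scores, dim=1)                                   # L + 1 weights that sum to 1
    return (alpha.unsqueeze(-1) * cand).sum(dim=1)                     # adaptive context vector c_hat_t

L, d, vocab = 49, 512, 10000
regions = torch.randn(2, L, d)                             # CNN region vectors I_i
sentinel = torch.randn(2, d)                               # visual sentinel x_t from the LSTM
hidden = torch.randn(2, d)                                 # LSTM hidden state h_t
W_att, w_score, W_p = torch.randn(d, d), torch.randn(d), torch.randn(d, vocab)
c_hat = adaptive_context(regions, sentinel, hidden, W_att, w_score)
p_t = F.softmax((c_hat + hidden) @ W_p, dim=-1)            # Eq. (9.16), residual form
```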

Many existing attention-based image captioning models allocate attention over the image's regions, whose size is often \(7\times 7\) or \(14\times 14\), determined by the last pooling layer of the CNN encoder. Anderson et al. [2] first calculate attention at the level of objects. Their model employs Faster R-CNN [58], trained on ImageNet [60] and Visual Genome [31], to predict attribute classes such as open oven, green bottle, and floral dress. After that, it applies attention over the valid bounding boxes to obtain fine-grained attention that helps caption generation.

Besides, [11] rethinks the form of latent states in image captioning, where the two-dimensional visual feature maps encoded by the CNN are usually compressed into a one-dimensional vector as the input of the language model. They find that a language model with 2D states can preserve spatial locality, which links the input visual domain and the output linguistic domain, as observed by visualizing the transformation of hidden states.

Word embeddings and hidden states in [11] are 3D tensors of size \(C\times H\times W\), i.e., C channels, each of size \(H\times W\). The encoded feature maps are directly input to the 2D language model instead of going through an average pooling layer. In the 2D language model, the convolution operator takes the place of the matrix multiplication in the 1D model, and mean pooling is used to generate the output word probability distribution from the 2D hidden states. Figure 9.4 shows the activated region of a latent channel at the tth step. When a threshold is set for the activated regions, it is revealed that specific channels are associated with specific nouns in the decoding process, which helps us better understand the process of generating captions.

Fig. 9.4 An example of the activated region of a latent channel

Traditional methods train the captioning model by maximizing the likelihood of training examples, which creates a gap between the optimization objective and the evaluation metrics. To alleviate the problem, [59] uses reinforcement learning to directly maximize the CIDEr metric [69]. CIDEr reflects the diversity of generated captions by giving high weights to low-frequency n-grams in the training set, which reflects that people prefer detailed captions to universal ones, like a boy is playing a game. To encourage the distinctiveness of captions, [10] adopts contrastive learning. Their model learns to discriminate between the caption of a given image and the caption of a similar image by maximizing the difference between the ground truth positive pair and the mismatched negative pair. Experiments show that contrastive learning increases the diversity of captions significantly.

Furthermore, automatic evaluation metrics, such as BLEU [54], METEOR [13], ROUGE [38], CIDEr [69], and SPICE [1], may neglect novel expressions because they are constrained by the ground truth captions. To better evaluate the naturalness and diversity of captions, [9] proposes a framework based on Conditional Generative Adversarial Networks, whose generator tries to achieve a higher score from the evaluator, while the evaluator tries to distinguish between the generated caption and human descriptions for a given image, as well as between the given image and a mismatched description. A user study shows that the trained generator can generate more natural and diverse captions than a model trained by maximum likelihood estimation, while the trained evaluator is more consistent with human evaluation.

Besides the works we introduced above, there are also many variants of the basic encoder-decoder model, such as [20, 26, 40, 45, 51, 71, 73].

9.4 Visual Relationship Detection

Visual relationship detection is the task of detecting objects in an image and understanding the relationships between them. While detecting the objects is usually based on semantic segmentation or object detection methods, such as R-CNN, understanding the relationships is the key challenge of this task. While detecting visual relations with image information alone is intuitive and effective [25, 62, 84], leveraging information from language can further boost model performance [37, 41, 82].

9.4.1 Visual Relationship Detection with Language Priors

Lu et al. [41] propose a model that uses language priors to enhance the performance on infrequent relationships for which sufficient training instances are hard to obtain solely from images. The overall architecture is shown in Fig. 9.5.

Fig. 9.5 The architecture of visual relationship detection with language prior

They first train a CNN to calculate the unnormalized probability of a relation from visual inputs by

$$\begin{aligned} P_V(R_{\langle i, j, k\rangle }, \varTheta |\langle O_1, O_2\rangle ) = P_i(O_1)(\mathbf {z}_k^{\top }{\text {CNN}}(O_1, O_2) + s_k)P_j(O_2), \end{aligned}$$
(9.17)

where \(P_i(O_j)\) denotes the probability that bounding box \(O_j\) is entity i, and \(CNN(O_1, O_2)\) is the joint feature of box \(O_1\) with box \(O_2\). \(\varTheta = \{\mathbf {z}_k, s_k\}\) is the set of parameters.

Besides, the language prior is considered in this model by calculating the unnormalized probability that the entity pair \(\langle i, j\rangle \) has the relation k:

$$\begin{aligned} P_f(R, \mathbf {W}) = \mathbf {r}_k^{\top } [\mathbf {w}_i; \mathbf {w}_j] + b_k, \end{aligned}$$
(9.18)

where \(\mathbf {w}_i\) and \(\mathbf {w}_j\) are the word embeddings of the text of the subject and object, respectively, \(\mathbf {r}_k\) is the learned relational embedding of the relation k, and \(b_k\) is a bias term.

Given the probabilities of a relation from visual and textual inputs, respectively, the authors combine them into the integrated probability of a relation. The final prediction is the one with maximal integrated probability:

$$\begin{aligned} R^* = \max _{R}P_V(R_{\langle i, j, k\rangle }|\langle O_1, O_2\rangle )P_f(R, \mathbf {W}). \end{aligned}$$
(9.19)

The ground truth relationship R with bounding boxes \(O_1\) and \(O_2\) is encouraged to rank highest among all relationships using the following ranking loss function:

$$\begin{aligned} \begin{aligned} C(\varTheta , \mathbf {W}) = \sum _{\langle O_1, O_2\rangle , R} \max \{1- P_V(R, \varTheta |\langle O_1, O_2\rangle )P_f(R, \mathbf {W}) \\ + \max _{\langle O_1^\prime , O_2^\prime \rangle \ne \langle O_1, O_2\rangle , R^\prime \ne R} P_V(R^\prime , \varTheta |\langle O_1^\prime , O_2^\prime \rangle )P_f(R^\prime , \mathbf {W}), 0 \}. \end{aligned} \end{aligned}$$
(9.20)

In addition to the loss that optimizes the rank of the ground truth relationships, the authors also propose two regularization functions for language priors. The final loss function of this model is defined as

$$\begin{aligned} \mathscr {L} = C(\varTheta , \mathbf {W}) + \lambda _1L(\mathbf {W}) + \lambda _2 K(\mathbf {W}). \end{aligned}$$
(9.21)

\(K(\mathbf {W})\) is a variance function that encourages similar relationships to have close \(f(\cdot )\) scores:

$$\begin{aligned} K(\mathbf {W}) = Var\{\frac{[P_f(R, \mathbf {W}) - P_f(R', \mathbf {W})]^2}{d(R, R')} \}, \forall R, R', \end{aligned}$$
(9.22)

where \(d(R, R^\prime )\) is the sum of the cosine distances (in Word2vec space) between the two objects and the predicates of the two relationships R and \(R^\prime \).

\(L(\mathbf {W})\) is a function that encourages less frequent relations to have lower \(f(\cdot )\) scores. When R occurs more frequently than \(R'\), we have

$$\begin{aligned} L(\mathbf {W}) = \sum _{R, R'} \max \{P_f(R', \mathbf {W}) - P_f(R, \mathbf {W}) + 1, 0\}. \end{aligned}$$
(9.23)
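The sketch below puts the language module of Eq. (9.18) next to a visual score and combines them as in Eq. (9.19) for a single box pair; the visual scores are placeholders standing in for the CNN branch, and all sizes are illustrative.

```python
import torch

def language_score(subj_id, obj_id, rel_id, word_emb, R, b):
    """Language prior of Eq. (9.18): score of relation k for the word pair <i, j>."""
    pair = torch.cat([word_emb[subj_id], word_emb[obj_id]], dim=-1)  # [w_i; w_j]
    return R[rel_id] @ pair + b[rel_id]

n_obj, n_rel, d = 100, 70, 300
word_emb = torch.randn(n_obj, d)                           # word embeddings of object categories
R, b = torch.randn(n_rel, 2 * d), torch.randn(n_rel)       # relation embeddings r_k and biases b_k
vision_scores = torch.rand(n_rel)                          # P_V for one box pair, from the visual module
lang_scores = torch.stack([language_score(3, 7, k, word_emb, R, b) for k in range(n_rel)])
best_relation = (vision_scores * lang_scores).argmax()     # Eq. (9.19): combine the two modules
```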

9.4.2 Visual Translation Embedding Network

Inspired by recent progress in knowledge representation learning, [82] proposes VTransE, a visual translation embedding network. Objects and the relationships between objects are modeled as TransE [7]-like vector translations. VTransE first projects the subject and object into the same space as the relation translation vector \(\mathbf {r} \in \mathbb {R}^r\). The subject and object can be denoted as \(\mathbf {x}_s, \mathbf {x}_o\in \mathbb {R}^M\) in the feature space, where \(M \gg r\). Similar to the TransE relationship, VTransE establishes a relationship as

$$\begin{aligned} \mathbf {W}_s\mathbf {x}_s + \mathbf {r} \sim \mathbf {W}_o\mathbf {x}_o, \end{aligned}$$
(9.24)

where \(\mathbf {W}_s\) and \(\mathbf {W}_o\) are projection matrices. The overall architecture is shown in Fig. 9.6.
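The translation constraint itself is compact enough to sketch: the toy example below scores predicates by how well they satisfy Eq. (9.24). This is only an illustration of the idea, not the full VTransE training objective, which optimizes a predicate classifier end to end; all tensors are placeholders.

```python
import torch

d_feat, d_rel, n_predicates = 1024, 128, 70
W_s, W_o = torch.randn(d_rel, d_feat), torch.randn(d_rel, d_feat)  # subject / object projections
r = torch.randn(n_predicates, d_rel)                               # one translation vector per predicate

x_s, x_o = torch.randn(d_feat), torch.randn(d_feat)                # subject and object features
# Pick the predicate whose translation best satisfies W_s x_s + r ~ W_o x_o (Eq. 9.24).
residual = (W_s @ x_s).unsqueeze(0) + r - (W_o @ x_o).unsqueeze(0)
predicate = residual.norm(dim=1).argmin()
```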

Fig. 9.6 The architecture of VTransE model

9.4.3 Scene Graph Generation

Li et al. [37] further formulate visual relation detection as a scene graph generation task, where nodes correspond to objects and directed edges correspond to visual relations between objects, as shown in Fig. 9.7.

Fig. 9.7 An illustration for scene graph generation

This formulation allows [37] to leverage different levels of context information, such as information from objects, phrases (i.e., \(\langle \)subject, predicate, object\(\rangle \) triples), and region captions, to boost the performance of visual relation detection. Specifically, [37] proposes to construct a graph that aligns these three levels of information and perform feature refinement via message passing, as shown in Fig. 9.8. By leveraging complementary information from different levels, the performances of different tasks are expected to be mutually improved.

Dynamic Graph Construction. Given an image, they first generate three kinds of proposals that correspond to the three kinds of nodes in the proposed graph structure: object proposals, phrase proposals, and region proposals. The object and region proposals are generated using a Region Proposal Network (RPN) [57] trained with ground truth bounding boxes. Given N object proposals, phrase proposals are constructed based on the \(N(N-1)\) object pairs that fully connect the object proposals with directed edges, where each directed edge represents a potential phrase between an object pair.

Each phrase proposal is connected to the corresponding subject and object with two directed edges. A phrase proposal and a region proposal are connected if their overlap exceeds a certain fraction (e.g., 0.7) of the phrase proposal. There are no direct connections between objects and regions since they can be indirectly connected via phrases.

Fig. 9.8 Dynamical graph construction. a The input image. b Object (bottom), phrase (middle), and caption region (top) proposals. c The graph modeling connections between proposals. Some of the phrase boxes are omitted

Feature Refinement. After obtaining the graph structure of different levels of nodes, they perform feature refinement by iterative message passing. The message passing procedure is divided into three parallel stages, including object refinement, phrase refinement, and region refinement.

In object feature refinement, the object proposal feature is updated with gated features from adjacent phrases. Given an object i, the aggregated feature \(\hat{\mathbf {x} }_i^{p \rightarrow s}\) from phrases that are linked to object i via subject-predicate edges can be defined as follows:

$$\begin{aligned} \hat{\mathbf {x} }_i^{p \rightarrow s} = \frac{1}{\Vert E_{i, p}\Vert } \sum _{(i, j)\in E_{s, p}} f_{\langle o, p\rangle } (\mathbf {x} _i^{(o)}, \mathbf {x} _j^{(p)}) \mathbf {x} _j^{(p)}, \end{aligned}$$
(9.25)

where \(E_{s,p}\) is the set of subject-predicate connections, and \(\Vert E_{i, p}\Vert \) denotes the number of phrases connected with object i via such edges. \(f_{\langle o, p\rangle }\) is a learnable gate function that controls the weights of information from different sources:

$$\begin{aligned} f_{\langle o, p\rangle } (\mathbf {x} _i^{(o)}, \mathbf {x} _j^{(p)}) = \sum ^K _{k=1} \text {Sigmoid}(\omega ^{(k)}_{\langle o, p\rangle } \cdot [\mathbf {x} _i^{(o)}; \mathbf {x} _j^{(p)}]), \end{aligned}$$
(9.26)

where \(\omega ^{(k)}_{\langle o, p\rangle }\) is a gate template used to calculate the importance of the information from a subject-predicate edge and K is the number of templates. The aggregated feature from object-predicate edges \(\hat{\mathbf {x} }_i^{p \rightarrow o}\) can be similarly computed.

After obtaining information \(\hat{\mathbf {x} }_i^{p \rightarrow s}\) and \(\hat{\mathbf {x} }_i^{p \rightarrow o}\) from adjacent phrases, the object refinement at time step t can be defined as follows:

$$\begin{aligned} \mathbf {x} _{i, t+1}^{(o)} = \mathbf {x} _{i, t}^{(o)} + f^{(p \rightarrow s)}(\hat{\mathbf {x} }_i^{p \rightarrow s}) + f^{(p \rightarrow o)}(\hat{\mathbf {x} }_i^{p \rightarrow o}), \end{aligned}$$
(9.27)

where \(f(\cdot ) = \mathbf {W} \text {ReLU}(\cdot )\), and \(\mathbf {W}\) is a learnable parameter that is not shared between \(f^{(p \rightarrow s)}(\cdot )\) and \(f^{(p \rightarrow o)}(\cdot )\).
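A small sketch of one object-refinement step (Eqs. 9.25-9.27) for a single object is given below; phrase features, gate templates, and weight matrices are placeholders, and batching is omitted for readability.

```python
import torch

def gate(obj_feat, phr_feat, gate_templates):
    """Gate f_<o,p> of Eq. (9.26): sum of K sigmoid responses to the concatenated features."""
    pair = torch.cat([obj_feat, phr_feat], dim=-1)
    return torch.sigmoid(gate_templates @ pair).sum()

def refine_object(obj_feat, subj_phrases, obj_phrases, gate_templates, W_ps, W_po):
    """One refinement step (Eq. 9.27) with gated, averaged messages from adjacent phrases (Eq. 9.25)."""
    def aggregate(phrases):
        if not phrases:
            return torch.zeros_like(obj_feat)
        msgs = [gate(obj_feat, p, gate_templates) * p for p in phrases]
        return torch.stack(msgs).mean(dim=0)               # normalize by the number of connected phrases
    x_sub = aggregate(subj_phrases)                        # messages where this object acts as subject
    x_obj = aggregate(obj_phrases)                         # messages where this object acts as object
    return obj_feat + W_ps @ torch.relu(x_sub) + W_po @ torch.relu(x_obj)

d, K = 256, 4
obj = torch.randn(d)
phrases_as_subject = [torch.randn(d) for _ in range(3)]    # phrase features linked via subject edges
phrases_as_object = [torch.randn(d) for _ in range(2)]     # phrase features linked via object edges
templates = torch.randn(K, 2 * d)                          # K gate templates omega^(k)
W_ps, W_po = torch.randn(d, d), torch.randn(d, d)
obj_next = refine_object(obj, phrases_as_subject, phrases_as_object, templates, W_ps, W_po)
```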

The refinement scheme for phrases and regions is similar to that for objects. The only difference is the information sources: phrase proposals receive information from adjacent objects and regions, and region proposals receive information from phrases.

After feature refinement via iterative message passing, the features of the different levels of nodes can be used for the corresponding tasks. Region features can be used as the initial state of a language model to generate region captions. Phrase features can be used to predict the visual relations between objects, which compose the scene graph of the image.

In comparison with scene graph generation methods that model the dependencies between relation instances by attention mechanisms or message passing, [47] decomposes the scene graph task into a mixture of two phases: extracting primary relations from the input, and completing the scene graph with reasoning. The authors propose a Hybrid Scene Graph generator (HRE) that combines these two phases in a unified framework and generates scene graphs from scratch.

Specifically, HRE first encodes the object pair into representations and then employs a neural relation extractor resolving primary relations from inputs and a differentiable inductive logic programming model that iteratively completes the scene graph. As shown in Fig. 9.9, HRE contains two units, a pair selector and a relation predictor, and runs in an iterative way.

Fig. 9.9 Framework of HRE that detects primary relations from inputs and iteratively completes the scene graph via inductive logic programming

At each time step, the pair selector takes a look at all object pairs \(P^-\) that have not been associated with a relation and chooses the next pair of entities whose relation is to be determined. The relation predictor utilizes the information contained in all pairs \(P^+\) whose relations have been determined, and the contextual information of the pair to make the prediction on the relation. The prediction result is then added to \(P^+\) and benefits future predictions.

To encode object pair into representations, HRE extends the union box encoder proposed by [41] by adding the object features (what are the objects) and their locations (where are the objects) into the object pair representation, as shown in Fig. 9.10.

Fig. 9.10 Object pair encoder of HRE

Relation Predictor. The relation predictor is composed of two modules: a neural module predicting the relations between entities based on the given context (i.e., a visual image) and a differentiable inductive logic module performing reasoning on \(P^+\). Both modules predict the relation score between a pair of objects individually. The relation scores from the two modules are finally integrated by multiplication.

Pair Selector. The selector works as the predictor's collaborator, with the goal of figuring out the next relation that should be determined. Ideally, the choice \(p^*\) made by the selector should satisfy the condition that all relations that will affect the prediction on \(p^*\) are sent to the predictor ahead of \(p^*\). HRE implements the pair selector as a greedy selector that always chooses from \(P^-\) the entity pair about whose relation the predictor is most confident, and adds it to \(P^+\).

It is worth noting that the task of scene graph generation resembles document-level relation extraction in many aspects. Both tasks seek to extract structured graphs consisting of entities and relations. Also, they need to model the complex dependencies between entities and relations in a rich context. We believe both tasks are worth exploring in future research.

9.5 Visual Question Answering

Visual Question Answering (VQA) aims to answer natural language questions about an image, and can be seen as a single turn of dialogue about a picture. In this section, we will introduce widely used VQA datasets and several typical VQA models.

9.5.1 VQA and VQA Datasets

VQA was first proposed in [46]. They first propose a single-world approach by modeling the probability of an answer a given question q and a world w by

$$\begin{aligned} P({a}|{q}, w) = \sum _{{z}} P({a}|{z}, w) P({z} | {q}), \end{aligned}$$
(9.28)

where z is a latent variable associated with the question, and the world w is a representation of the image. They further extend the single-world approach to a multi-world approach by marginalizing over different worlds w derived from segmentations s of the given image. The probability of an answer a given question q and segmentations s is given by

$$\begin{aligned} P({a}|{q}, {s}) = \sum _{w}\sum _{{z}}P({a}|w, {z}) P(w|s) P({z}|{q}). \end{aligned}$$
(9.29)
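
As an illustration of Eq. (9.29), the following plain-Python sketch marginalizes over worlds and latent variables. The probability tables are toy inputs for a fixed question and segmentation, not the learned model of [46].

```python
def answer_probability(p_a_given_w_z, p_w_given_s, p_z_given_q):
    """P(a|q, s) = sum_w sum_z P(a|w, z) P(w|s) P(z|q).

    p_a_given_w_z[w][z] is a dict answer -> probability,
    p_w_given_s and p_z_given_q are lists of probabilities,
    all for one fixed question q and segmentation s.
    """
    p_a = {}
    for w, p_w in enumerate(p_w_given_s):
        for z, p_z in enumerate(p_z_given_q):
            for a, p in p_a_given_w_z[w][z].items():
                # Accumulate the contribution of world w and latent variable z.
                p_a[a] = p_a.get(a, 0.0) + p * p_w * p_z
    return p_a
```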

They also release DAQUAR, the first VQA dataset, in their paper.

Besides DAQUAR, researchers have also released a number of VQA datasets with various characteristics. The most widely used dataset was released in [4], where the authors provide cases and experimental evidence to demonstrate that answering these questions requires both image features and external knowledge. Figure 9.11 shows examples from the VQA dataset released in [4]. It is also demonstrated that this problem cannot be solved by converting images into captions and answering questions based only on the captions. Experimental results show that the performance of vanilla methods is still far from human performance.

Fig. 9.11 Examples of VQA dataset

There are also other existing VQA datasets, such as Visual7W [85], Visual Madlibs [80], COCO-QA [56], and FM-IQA [19].

9.5.2 VQA Models

Besides proposing datasets, [4, 46] also investigate approaches to specific types of questions in VQA. Moreover, [83] proposes an approach to solve “YES/NO” questions. The model is an ensemble of two similar models, a Q-model and a Tuple-model, whose difference is described below. The overall approach can be divided into two steps: (1) language parsing and (2) visual verification. In the former step, they extract \(\langle \)P, R, S\(\rangle \) tuples from the question by parsing it and assigning an entity to each word. They then summarize the parsed sentence by removing “stop words”, auxiliary verbs, and all words before a nominal subject or passive nominal subject, and further split the summary into P, R, and S arguments according to the part of speech of its phrases. The difference between the Q-model and the Tuple-model is that the Q-model, as used in the authors’ previous work [4], embeds the question into a dense 256-dim vector with an LSTM, while the Tuple-model converts the \(\langle \)P, R, S\(\rangle \) tuple into a 256-dim embedding with an MLP. In the visual verification step, they use the same image features as in [39], encoded into a dense 256-dim vector by an inner-product layer followed by a tanh layer. These two vectors are passed through an MLP to produce the final output (“Yes” or “No”).
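
The verification step can be sketched as follows in PyTorch. The 256-dim sizes follow the description above, while the layer names and the way the two vectors are combined before the final MLP are assumptions for illustration.

```python
import torch
import torch.nn as nn

class YesNoVerifier(nn.Module):
    """Sketch of the Tuple-model plus visual verification step."""
    def __init__(self, tuple_dim, image_dim):
        super().__init__()
        # Tuple-model: map the <P, R, S> tuple embedding to a 256-dim vector
        # (the Q-model would instead run an LSTM over the question words).
        self.tuple_mlp = nn.Sequential(nn.Linear(tuple_dim, 256), nn.Tanh())
        # Visual verification: inner-product layer followed by a tanh layer.
        self.image_proj = nn.Sequential(nn.Linear(image_dim, 256), nn.Tanh())
        # Final MLP producing the "Yes"/"No" scores.
        self.classifier = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, tuple_embedding, image_feature):
        t = self.tuple_mlp(tuple_embedding)    # (batch, 256) text vector
        v = self.image_proj(image_feature)     # (batch, 256) image vector
        return self.classifier(torch.cat([t, v], dim=-1))
```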

Moreover, [61] proposes a method that calculates the attention \(\alpha _j\) from the set of image features \(\mathbf {I} = (\mathbf {I}_1, \mathbf {I}_2, \dots , \mathbf {I}_K)\) and the question embedding \(\mathbf {q}\) by

$$\begin{aligned} \alpha _j = (\mathbf {W}_1 \mathbf {I}_j + \mathbf {b}_1)^\top (\mathbf {W}_2 \mathbf {q} + \mathbf {b}_2), \end{aligned}$$
(9.30)

where \(\mathbf {W}_1\), \(\mathbf {W}_2\), \(\mathbf {b}_1\), and \(\mathbf {b}_2\) are trainable parameters.
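
As an illustration, the following short PyTorch snippet computes Eq. (9.30) for all regions at once; the tensor shapes are assumptions made for the example.

```python
import torch

def attention_scores(I, q, W1, b1, W2, b2):
    """Eq. (9.30): alpha_j = (W1 I_j + b1)^T (W2 q + b2).

    I: (K, d_v) image region features; q: (d_q,) question embedding;
    W1: (d, d_v), W2: (d, d_q), b1, b2: (d,) trainable parameters.
    Returns a vector of K unnormalized attention scores.
    """
    keys = I @ W1.T + b1      # (K, d): one projected key per image region
    query = W2 @ q + b2       # (d,): projected question query
    return keys @ query       # (K,): inner product per region
```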

Attention-based techniques are quite effective at filtering out noise that is irrelevant to the question. However, some questions relate only to small regions, which encourages researchers to use stacked attention to further filter out noise. We refer readers to Fig. 1b in [79] for an example of stacked attention.

Yang et al. [79] further extend the attention-based model used in [61], which employs LSTMs to predict the answer. They take the question as input and attend to different regions in the image to obtain additional input. The key idea is to gradually filter out noise and pinpoint the regions that are highly relevant to the answer by reasoning through multiple stacked attention layers progressively. The stacked attention is calculated by stacking:

$$\begin{aligned} \mathbf {h}_A^k = \tanh (\mathbf {W}_1^k\mathbf {I} \oplus (\mathbf {W}_2^k\mathbf {u}^{k-1}+\mathbf {b}_A^k)). \end{aligned}$$
(9.31)

Note that we denote the addition of a matrix and a vector by \(\oplus \), which is performed by adding the vector to each column of the matrix. \(\mathbf {u}\) is a refined query vector that combines information from the question and the image regions. \(\mathbf {u}^0\) (i.e., the initial query of the first attention layer) is initialized as the feature vector of the question. \(\mathbf {h}_A^k\) is then used to compute \(\mathbf {p}_I^k\), which corresponds to the attention probability of each image region,

$$\begin{aligned} \mathbf {p}_I^k = \text {Softmax} (\mathbf {W}_3^k\mathbf {h}_A^k + \mathbf {b}_P^k). \end{aligned}$$
(9.32)

\(\mathbf {u}^k\) is then updated iteratively by

$$\begin{aligned} \tilde{\mathbf {I}}^k = \sum _i \mathbf {p}_i^k \mathbf {I}_i, \end{aligned}$$
(9.33)
$$\begin{aligned} \mathbf {u}^k = \mathbf {u}^{k-1} + \tilde{\mathbf {I}}^k. \end{aligned}$$
(9.34)

That is, in every layer, the model uses the combined question-and-image vector \(\mathbf {u}^{k-1}\) as the query vector for attending to the image regions and progressively obtains a new query.
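
The following PyTorch-style sketch implements one attention layer according to Eqs. (9.31)–(9.34); the tensor shapes and the function name are assumptions for illustration.

```python
import torch

def stacked_attention_layer(I, u_prev, W1, W2, W3, bA, bP):
    """One stacked attention layer following Eqs. (9.31)-(9.34).

    Assumed shapes:
      I: (d, K) image region features, one column per region;
      u_prev: (d,) query from the previous layer (u^0 = question vector);
      W1, W2: (k, d); bA: (k,); W3: (1, k); bP: (1,).
    """
    # Eq. (9.31): broadcast the question term over the columns (regions) of W1 I.
    h_A = torch.tanh(W1 @ I + (W2 @ u_prev + bA).unsqueeze(1))           # (k, K)
    # Eq. (9.32): attention probabilities over the K image regions.
    p_I = torch.softmax((W3 @ h_A + bP.unsqueeze(1)).squeeze(0), dim=0)  # (K,)
    # Eq. (9.33): attended image vector as a weighted sum of region features.
    I_tilde = I @ p_I                                                    # (d,)
    # Eq. (9.34): refine the query vector for the next layer.
    return u_prev + I_tilde
```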

The above models attend only to the image, but the question should also be attended to. [44] calculates co-attention by

$$\begin{aligned} \mathbf {Z} = \tanh (\mathbf {Q}^{\top }\mathbf {W} \mathbf {I}), \end{aligned}$$
(9.35)

where \(\mathbf {Z}_{ij}\) represents the affinity of the ith word and jth region. Figure 9.12 shows the hierarchical co-attention model.
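
A minimal sketch of the affinity computation in Eq. (9.35) is given below, with assumed tensor layouts (one column per word and per region).

```python
import torch

def coattention_affinity(Q, I, W):
    """Eq. (9.35): Z = tanh(Q^T W I).

    Q: (d_q, T) word features, I: (d_v, K) region features,
    W: (d_q, d_v) trainable parameters.
    Z[i, j] is the affinity between the i-th word and the j-th region,
    which can then be used to attend in both directions.
    """
    return torch.tanh(Q.T @ W @ I)   # (T, K)
```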

Fig. 9.12 The architecture of hierarchical co-attention model

Another intuitive approach is to use external knowledge from knowledge bases, which helps to better explain the implicit information behind the image. Such an approach is proposed in [75], which first encodes the image into captions and into vectors representing different attributes of the image, and then uses them to retrieve documents about different parts of the image from knowledge bases. The documents are encoded with doc2vec [36]. The representations of the captions, attributes, and documents are transformed and concatenated to form the initial state of an LSTM, which is trained in a Seq2seq fashion. Details of the model are shown in Fig. 9.13.
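
A minimal sketch of this dataflow is given below; all dimensions, module names, and the exact fusion scheme are illustrative assumptions rather than the configuration of [75].

```python
import torch
import torch.nn as nn

class KnowledgeAugmentedVQA(nn.Module):
    """Caption, attribute, and retrieved-document embeddings initialize an
    answer LSTM (a schematic sketch of the dataflow described above)."""
    def __init__(self, cap_dim, attr_dim, doc_dim, hidden_dim, vocab_size):
        super().__init__()
        self.transform = nn.Linear(cap_dim + attr_dim + doc_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, cap_emb, attr_emb, doc_emb, question_emb_seq):
        # Fuse the three sources into the initial hidden state of the LSTM.
        h0 = torch.tanh(self.transform(
            torch.cat([cap_emb, attr_emb, doc_emb], dim=-1))).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        # question_emb_seq: (batch, T, hidden_dim) question word embeddings
        # (assumed to already match the LSTM input size).
        output, _ = self.lstm(question_emb_seq, (h0, c0))
        return self.out(output)   # per-step vocabulary scores for the answer
```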

Fig. 9.13 The architecture of VQA incorporating external knowledge bases

Neural Module Network is a framework for constructing deep networks with a dynamic computational structure, first proposed in [3]. In this framework, every input is associated with a layout that provides a template for assembling an instance-specific network from a collection of shallow network fragments called modules. The proposed method processes the input question in two separate ways: (1) parsing it and laying out several modules, and (2) encoding it with an LSTM. The corresponding image is processed by the modules laid out according to the question. The module types are predefined: find, transform, combine, describe, and measure. The authors define find as a transformation from an image to an attention map, transform as a mapping from one attention map to another, combine as a combination of two attention maps, describe as a description relying on the image and an attention map, and measure as a measurement relying only on an attention map. The model is shown in Fig. 9.14.
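
To make the module types concrete, the following toy PyTorch functions mimic their input/output signatures. The real modules in [3] are learned network fragments; the concrete computations and shapes here are only illustrative.

```python
import torch

# Toy stand-ins for the five module types. Shapes: image (K, d) holds K
# region features of dimension d; an attention map is a (K,) distribution.

def find(image, concept_filter):
    # find: Image -> Attention. Score each region against a concept filter (d,).
    return torch.softmax(image @ concept_filter, dim=0)

def transform(attention, relation_matrix):
    # transform: Attention -> Attention. Shift attention along a (K, K) relation.
    return torch.softmax(relation_matrix @ attention, dim=0)

def combine(att1, att2):
    # combine: Attention x Attention -> Attention (here, an "and" combination).
    joint = att1 * att2
    return joint / joint.sum().clamp(min=1e-8)

def describe(image, attention, label_weights):
    # describe: Image x Attention -> Label. Classify the attended content
    # with label_weights of shape (L, d).
    return torch.softmax(label_weights @ (image.T @ attention), dim=0)

def measure(attention, measure_weights):
    # measure: Attention -> Label. Decide, e.g., existence from the attention
    # map alone with measure_weights of shape (L, K).
    return torch.softmax(measure_weights @ attention, dim=0)
```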

Fig. 9.14 The architecture of the neural module network model

A key drawback of [3] is that it relies on a parser to generate the module layout. [22] proposes an end-to-end model that generates a Reverse-Polish expression describing the module network, as shown in Fig. 9.15. The overall architecture is shown in Fig. 9.16.
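
A Reverse-Polish (postfix) layout can be assembled into a module network with a simple stack, as in the following sketch. The module arities follow the schematic types above, and passing the image to every module is a simplification for illustration.

```python
# Each token pops its attention arguments from the stack and pushes its output.
ARITY = {"find": 0, "transform": 1, "combine": 2, "describe": 1, "measure": 1}

def assemble_and_run(layout, modules, image):
    """layout: postfix token list, e.g. ["find[cat]", "describe[color]"];
    modules: dict mapping each token to a callable taking (image, *args)."""
    stack = []
    for token in layout:
        name = token.split("[")[0]
        args = [stack.pop() for _ in range(ARITY[name])]
        # Reverse the popped arguments so they are applied in layout order.
        stack.append(modules[token](image, *args[::-1]))
    return stack.pop()   # output of the root module (the answer scores)
```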

Fig. 9.15 The architecture of Reverse-Polish expression and corresponding module network model

Fig. 9.16 The architecture of end-to-end module network model

Graph Neural Networks (GNNs) have also been applied to VQA tasks. [68] builds graphs over both the scene and the question. The authors describe a deep neural network that takes advantage of such structured representations. As shown in Fig. 9.17, the GNN-based VQA model can capture the relationships between words and objects.
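
The following PyTorch sketch shows one question-conditioned message-passing step over scene-graph objects; it is a schematic illustration rather than the exact model of [68].

```python
import torch
import torch.nn as nn

class SceneGraphMessagePassing(nn.Module):
    """One round of message passing over scene-graph objects,
    conditioned on a pooled question representation."""
    def __init__(self, dim):
        super().__init__()
        self.fuse_question = nn.Linear(2 * dim, dim)
        self.message = nn.Linear(2 * dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, node_feats, adjacency, question_vec):
        # node_feats: (N, d) object features; adjacency: (N, N) 0/1 matrix;
        # question_vec: (d,) pooled question representation.
        n = node_feats.size(0)
        # Condition every object node on the question.
        q = question_vec.unsqueeze(0).expand(n, -1)
        h = torch.tanh(self.fuse_question(torch.cat([node_feats, q], dim=-1)))
        # Compute pairwise messages and mask them with the adjacency matrix.
        pair = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                          h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        messages = torch.relu(self.message(pair)) * adjacency.unsqueeze(-1)
        agg = messages.sum(dim=1)       # (N, d) summed incoming messages
        return self.update(agg, h)      # refined node representations
```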

Fig. 9.17 The architecture of GNN-based VQA models

9.6 Summary

In this chapter, we first introduce the concept of cross-modal representation learning, which aims to exploit the links between modalities and enable better utilization of information from different modalities. Cross-modal learning is essential since many real-world tasks require the ability to understand information from different modalities, such as text and image. We then overview existing cross-modal representation learning methods for several representative cross-modal tasks, including zero-shot recognition, cross-media retrieval, image captioning, visual relationship detection, and visual question answering. These methods either fuse information from different modalities into unified embeddings, or build embeddings for different modalities in a common semantic space, allowing the model to compute cross-modal similarity. Cross-modal representation learning is drawing more and more attention and can serve as a promising connection between different research areas.

For further understanding of cross-modal representation learning, there are also some recommended surveys and books including:

  • Skocaj et al., Cross-modal learning [64].

  • Spence, Crossmodal correspondences: A tutorial review [66].

  • Wang et al., A comprehensive survey on cross-modal retrieval [72].

In the future, for better cross-modal representation learning, the following directions require further effort:

(1) Fine-grained Cross-modal Grounding. Cross-modal grounding is a fundamental ability in solving cross-modal tasks, which aims to align semantic units in different modalities. For example, visual grounding aims to ground textual symbols (e.g., words or phrases) into visual objects or regions. Many existing works [27, 74, 76] have been devoted to cross-modal grounding, but they mainly focus on coarse-grained semantic units (e.g., grounding sentences to images). Better fine-grained cross-modal grounding (e.g., grounding words to objects) could promote the development of a broad variety of cross-modal tasks.

(2) Cross-modal Reasoning. In addition to recognizing and grounding semantic units in different modalities, understanding and inferring the relationships between semantic units are also crucial to cross-modal tasks. Many existing works [37, 41, 82] have investigated detecting visual relations between objects. However, most visual relations in existing visual relation detection datasets do not require complex reasoning. Some works [81] have made preliminary attempts at cross-modal commonsense reasoning. Inferring the latent semantic relationships in a cross-modal context is critical for cross-modal understanding and modeling.

(3) Utilizing Unsupervised Cross-modal Data. Most current cross-modal learning approaches rely on human-annotated datasets. The scale of such supervised datasets is usually limited, which also limits the capability of data-hungry neural models. With the rapid development of the World Wide Web, cross-modal data on the Web have become larger and larger. Some existing works [42, 67] have leveraged unsupervised cross-modal data for representation learning: they first pretrain cross-modal models on large-scale image-caption pairs and then fine-tune the models on downstream tasks, which yields significant improvements in a broad variety of cross-modal tasks. It is thus promising to better leverage the vast amount of unsupervised cross-modal data for representation learning.