
1 Introduction

Exploring the relationship between images and natural language has recently attracted great interest among researchers, due to its importance in various applications, such as bi-directional image and text retrieval [22, 44], natural language object retrieval [10], image captioning [35, 43], and visual question answering (VQA) [1, 18]. A critical task for these applications is to measure the similarity between visual data and textual descriptions. Existing deep learning approaches either attempt to learn joint embeddings [21, 39, 40, 44] for image and text in a shared latent space, or build a similarity learning network [11, 15, 16, 22, 40] to compute the matching score for image-text pairs. Joint embedding learning based methods have shown great potential in learning discriminative cross-modal representations and offer computational efficiency at the test stage.

Fig. 1. Deep image-text embedding learning

Generally, the joint embedding learning framework for image-text matching adopts a two-branch architecture [21, 39, 40, 44] (as shown in Fig. 1), where one branch extracts the image features and the other encodes the text representations, and the discriminative cross-modal embeddings are then learned with designed objective functions. The most commonly used objectives include canonical correlation analysis (CCA) [44] and the bi-directional ranking loss [21, 39, 40]. Compared with CCA based methods, the bi-directional ranking loss produces better stability and performance [40] and is increasingly used in cross-modal matching [21, 39]. Nevertheless, it suffers from the difficulty of sampling useful triplets and selecting appropriate margins in real applications.

Despite the great success of these deep learning techniques in matching image and text with only the pair correspondence, some recent works [15, 16, 28] explore more effective cross-modal matching algorithms with identity-level annotations. These research efforts demonstrate that the discrimination ability of the learned image-text embeddings can be greatly enhanced by introducing a category classification loss, either as an auxiliary task [28] or as pre-trained initialization [15, 16]. Considering the fact that independent classification may not fully exploit the identity information for cross-modal feature learning, [15] developed the Cross-Modal Cross-Entropy (CMCE) loss, which employs the cross-modal sample-to-identity affinity for category prediction; however, this strategy requires an additional identity feature buffer, which can incur large memory consumption when there are a large number of subjects.

To address these problems, we propose a cross-modal projection matching (CMPM) loss and a cross-modal projection classification (CMPC) loss, which introduce the cross-modal feature projection operation for learning discriminative image-text embeddings. The CMPM loss attempts to minimize the KL divergence between the projection compatibility distributions and the normalized matching distributions, in order to increase the variance between unmatched samples and the association between matched ones. The CMPM loss function does not need to select specific triplets or tune the margin parameter, and exhibits great stability with various batch sizes. For the auxiliary classification task with identity labels, the CMPC loss attempts to classify the vector projection of the features from one modality onto the matched features from the other modality, instead of independently categorizing the original features. Extensive experiments and analysis demonstrate the superiority of the proposed approach for efficiently learning discriminative image-text embeddings.

2 Related Work

2.1 Deep Image-Text Matching

Most existing approaches for matching image and text based on deep learning can be roughly divided into two categories: (1) joint embedding learning [15, 21, 39, 40, 44] and (2) pairwise similarity learning [11, 15, 22, 28, 40].

Joint embedding learning aims to find a joint latent space under which the embeddings of images and texts can be directly compared. These approaches usually associate features from the two modalities with a correlation loss [44] or the bi-directional ranking loss [21, 39, 40]. Deep canonical correlation analysis (DCCA) [44] aims to learn nonlinear transformations of two views of data with deep networks such that the resulting representations are highly linearly correlated, but the major caveat of DCCA is the eigenvalue problem brought by unstable covariance estimation in each mini-batch [23, 40]. The bi-directional ranking loss [21, 39, 40] extends the triplet loss [29], requiring the distance between matched samples to be smaller than that between unmatched ones by a margin, for both image-to-text and text-to-image ranking. However, the bi-directional ranking loss inherits the disadvantages of selecting negative samples and margins from the triplet loss.

Pairwise similarity learning focuses on designing a similarity network that predicts the matching score for image-text pairs. Apart from the efforts [40] to measure the global similarity between image and text, many research works [11, 15, 22, 26, 28] attempt to maximize the alignments between image regions and textual fragments. However, this strategy may lack efficiency at the test stage, since it involves preparing all the image-text pairs in order to predict their matching scores.

For image-text matching with identity-level annotations, Reed et al. [28] proposed to learn discriminative image-text joint embeddings with the indication of class labels, and collected two datasets of fine-grained visual descriptions, while [16] attempted to search for persons with language descriptions under the assistance of identity classification. As an improvement, Li et al. [15] developed a two-stage learning strategy for textual-visual matching: stage-1 pre-trains the network with the cross-modal cross-entropy (CMCE) loss under the supervision of identity labels, and stage-2 retrains the network under a latent co-attention restriction with the supervision of pairwise labels.

2.2 Discriminative Feature Learning

Recent years have witnessed the advance of deep neural networks for learning discriminative features, which has great importance in many visual tasks, such as face recognition [19, 20, 29, 32, 41], face verification [33, 37], and person re-identification [2, 8, 42]. Intuitively, discriminative features should be able to maximize both the inter-class separability and the intra-class compactness.

As the most widely used supervision loss for learning strong representations, the cross-entropy loss (or softmax loss) [32, 33, 42] has achieved significant success in various applications. Nevertheless, many research works have focused on improvements that generate more discriminative features. Wen et al. [41] proposed the center loss to assist the softmax loss for face recognition, where the distance between samples and the corresponding class centres is minimized to improve intra-class compactness. Liu et al. developed the L-softmax [20], which introduces an angular margin into the softmax loss to further increase feature separability, and refined it to the A-softmax [19] by adding normalization of the classification weights. It is notable that the A/L-softmax imposes feature discriminativeness by incorporating the angular margin and achieves remarkable results in face recognition. However, the strong restrictions on angles and weights make the models difficult to converge [3, 36, 38] in real applications, especially when the training data contains too many subjects. Ranjan et al. [27] proposed to normalize the features to strengthen the verification signal and better model the difficult samples. Wang et al. [37] modified the softmax loss by normalizing both the features and the classification weights, which achieves performance improvements with a much easier implementation.

On the other hand, deep metric learning has gained increasing popularity by learning general distance metrics, under which the distance between relevant samples is smaller than that between irrelevant ones. Hadsell et al. [5] proposed the contrastive loss to minimize the distance between similar points and to push the distance between dissimilar points to be larger than a margin. Schroff et al. [29] designed the triplet loss to encourage a relative distance constraint between matched face pairs and unmatched ones, and it has proved effective for matching pedestrians across different cameras in [8]. Recently, the quadruplet loss [2] added a negative pair constraint to the triplet loss such that intra-class variations and inter-class similarities are further reduced. It also introduced an adaptive margin to compute the distance penalization and select negative samples.

Unfortunately, there are two main challenges when applying the above loss functions: sampling useful data units (i.e., positive and negative pairs, triplets, or quadruplets) and determining appropriate margins. Generating all possible triplets would result in heavy computation and slower convergence [29], while sampling the hardest negatives may cause the network to converge to a bad local optimum [29, 31]. [29] proposed to choose semi-hard negative samples online within a mini-batch, but this strategy requires a large batch size to select useful negative samples. Song et al. [31] optimized a smoothed upper bound of the original triplet loss and utilized all the negative samples within a mini-batch, and Sohn et al. [30] proposed the N-pair loss in the form of a multi-class softmax loss, which requires carefully selected impostor examples. To avoid highly sensitive parameters, the Histogram loss [34] was developed to estimate the similarity distributions of all the positive and negative pairs in a mini-batch and then minimize the probability that a random negative pair has a higher similarity than a random positive pair, for which a large batch size is preferred to achieve better performance. Nevertheless, these modifications for learning embeddings that preserve the association relationships of samples are specifically designed for single-modal applications, and may not readily adapt to cross-modal matching problems.

3 The Proposed Algorithm

3.1 Network Architecture

The framework of our proposed method is shown in Fig. 1. We can see that the image-text matching architecture consists of three components: a visual CNN to extract image features, a bi-directional LSTM (Bi-LSTM) to encode text features, and a joint learning module for associating the cross-modal representations.

Given a sentence, we apply basic tokenization to split it into words, and then sequentially process them with a Bi-LSTM. The hidden states of the forward and backward directions are concatenated, and the initial text representations are obtained with a max-pooling strategy. For an image, we employ MobileNet [9] and extract its initial feature from the last pooling layer. In the association module, the extracted image and text features are embedded into a shared latent space, where the compatibility between matched features and the variance between unmatched samples are maximized.
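As a concrete illustration, the sketch below assembles such a two-branch encoder in PyTorch. It is a minimal approximation, not the authors' TensorFlow implementation: the backbone (ResNet-18 in place of MobileNet), the 512-dimensional joint space, and all layer sizes are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models


class TwoBranchEncoder(nn.Module):
    """Sketch of the two-branch image-text encoder (assumed 512-d joint space)."""

    def __init__(self, vocab_size, word_dim=300, hidden_dim=512, embed_dim=512):
        super().__init__()
        # Image branch: a CNN backbone (the paper uses MobileNet; ResNet-18 here for brevity)
        backbone = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # ends at global pooling
        self.img_fc = nn.Linear(512, embed_dim)

        # Text branch: word embedding + Bi-LSTM, forward/backward hidden states concatenated
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.bilstm = nn.LSTM(word_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.txt_fc = nn.Linear(2 * hidden_dim, embed_dim)

    def forward(self, images, tokens):
        x = self.cnn(images).flatten(1)          # initial image feature from the pooling layer
        x = self.img_fc(x)                       # image embedding in the shared space

        h, _ = self.bilstm(self.embed(tokens))   # (batch, time, 2*hidden) hidden states
        z = h.max(dim=1).values                  # max-pooling over time steps
        z = self.txt_fc(z)                       # text embedding in the shared space
        return x, z
```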

In this paper, we focus on learning discriminative features in the association module, and describe the proposed cross-modal projection matching (CMPM) and cross-modal projection classification (CMPC) loss functions in the following sections.

3.2 Cross-Modal Projection Matching

We introduce a novel image-text matching loss termed as Cross-Modal Projection Matching (CMPM), which incorporates the cross-modal projection into KL divergence to associate the representations across different modalities.

Given a mini-batch with n image and text samples, for each image \(\varvec{x}_{i}\) the image-text pairs are constructed as \(\{(\varvec{x}_{i}, \varvec{z}_{j}), y_{i,j}\}_{j=1}^{n}\), where \(y_{i,j}=1\) means that \((\varvec{x}_{i}, \varvec{z}_{j})\) is a matched pair, while \(y_{i,j}=0\) indicates the unmatched ones. The probability of matching \(\varvec{x}_{i}\) to \(\varvec{z}_{j}\) is defined as

$$\begin{aligned} \begin{matrix} p_{i,j} = \frac{\exp (\varvec{x}_{i}^{\top }\varvec{\bar{z}}_{j})}{\sum _{k=1}^{n} \exp (\varvec{x}_{i}^{\top }\varvec{\bar{z}}_{k})} \; \; \; s.t. \; \varvec{\bar{z}}_{j} = \frac{\varvec{z}_{j}}{\Vert \varvec{z}_{j}\Vert } \end{matrix} \end{aligned}$$
(1)

where \(\varvec{\bar{z}}_{j}\) denotes the normalized text feature. Geometrically, \(\varvec{x}_{i}^{\top }\varvec{\bar{z}}_{j}\) represents the scalar projection of the image feature \(\varvec{x}_{i}\) onto the text feature \(\varvec{z}_{j}\), and \(p_{i,j}\) can be viewed as the proportion of the scalar projection of \((\varvec{x}_{i}, \varvec{z}_{j})\) among all pairs \(\{(\varvec{x}_{i}, \varvec{z}_{j})\}_{j=1}^{n}\) in a mini-batch. Figure 2(a) shows the geometrical explanation of the cross-modal projection. We can see that the more similar the image feature is to the text feature, the larger the scalar projection becomes. Note that the scalar projection can be negative if the two vectors point in opposite directions, such as \(\varvec{x}_{i}^{\top }\varvec{\bar{z}}_{k}\) in the figure.

Considering the fact that there might be more than one matched text sample for \(\varvec{x}_{i}\) in a mini-batch, we normalize the true matching probability of \((\varvec{x}_{i}, \varvec{z}_{j})\) as

$$\begin{aligned} \begin{matrix} q_{i,j} = \frac{y_{i,j}}{\sum _{k=1}^{n} y_{i,k}} \end{matrix} \end{aligned}$$
(2)
Fig. 2. Interpretation of cross-modal projection and matching. (a) The image feature \(\varvec{x}_{i}\) is projected onto different text directions, and the scalar projection of \(\varvec{x}_{i}\) onto the matched text \(\varvec{z}_{i}\) is larger than that onto the unmatched texts \(\varvec{z}_{j}\) and \(\varvec{z}_{k}\). (b) For the image \(\varvec{x}_{1}\) with \(\varvec{z}_{1}\) and \(\varvec{z}_{3}\) as matched candidates (green arrowed lines) in a mini-batch, and the other texts as unmatched samples (red arrowed lines), the CMPM loss attempts to find a distribution \(\varvec{p}_{1}\) having low probability where the true matching distribution \(\varvec{q}_{1}\) has low probability (Color figure online)

The matching loss of associating \(\varvec{x}_{i}\) with correctly matched text samples is defined as

$$\begin{aligned} \mathcal {L}_{i} = \sum _{j=1}^{n} p_{i,j} \log \left( \frac{p_{i,j}}{q_{i,j}+\epsilon }\right) \end{aligned}$$
(3)

where \(\epsilon \) is a small number to avoid numerical problems, and the matching loss from image to text in a mini-batch is computed by

$$\begin{aligned} \mathcal {L}_{i2t} = \frac{1}{n}\sum _{i=1}^{n} \mathcal {L}_{i} \end{aligned}$$
(4)

Note that Eq. 3 actually represents the KL divergence from distribution \(\varvec{q}_{i}\) to \(\varvec{p}_{i}\), and minimizing \(KL(\varvec{p}_{i}\Vert \varvec{q}_{i})\) attempts to select a \(\varvec{p}_{i}\) that has low probability where \(\varvec{q}_{i}\) has low probability [4]. Figure 2(b) illustrates the proposed matching loss with mini-batch data. We can see that the true matching distribution \(\varvec{q}_{1}\) for image \(\varvec{x}_{1}\) has multiple modes when there is more than one matched text candidate in the mini-batch, and the proposed matching loss attempts to select a single-mode distribution \(\varvec{p}_{1}\) to avoid putting probability mass in the low-probability areas between the modes of \(\varvec{q}_{1}\), such that the compatibility of unmatched image-text pairs is minimized while the relevance of matched pairs is maximized. Note that, given an image, all the positive and negative text candidates in a mini-batch are taken into consideration when computing the matching loss, thereby avoiding the dedicated sampling procedures of the traditional bi-directional ranking loss.

One might raise concerns about using \(KL(\varvec{q}_{i}\Vert \varvec{p}_{i})\) to maximize the compatibility of matched pairs for learning discriminative embeddings. As explained in [4], \(KL(\varvec{q}_{i}\Vert \varvec{p}_{i})\) would try to find a \(\varvec{p}_{i}\) that blurs across multiple modes, generating high probability wherever \(\varvec{q}_{i}\) has high probability. This may cause difficulties in distinguishing matched and unmatched pairs when there are multiple positive pairs in a mini-batch. The advantages of \(KL(\varvec{p}_{i}\Vert \varvec{q}_{i})\) over \(KL(\varvec{q}_{i}\Vert \varvec{p}_{i})\) will be further demonstrated in the experiments.

In image-text embedding learning, the matching loss is often computed in two directions [21, 39, 40]: the image-to-text matching loss requires the matched text to be closer to the image than unmatched ones, and conversely the text-to-image matching loss constrains the related text to rank before unrelated ones. Similarly, the matching loss \(\mathcal {L}_{t2i}\) from text to image can be formulated by exchanging \(\varvec{x}\) and \(\varvec{z}\) in Eqs. 1–4, and the bi-directional CMPM loss is calculated by

$$\begin{aligned} \mathcal {L}_{cmpm}= \mathcal {L}_{i2t} + \mathcal {L}_{t2i} \end{aligned}$$
(5)
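A compact sketch of the bi-directional CMPM loss (Eqs. 1–5) is given below. It is written in PyTorch rather than the authors' TensorFlow, and the way the matching indicator \(y_{i,j}\) is built from identity labels is an assumption of this illustration; with purely pairwise data it reduces to the identity matrix.

```python
import torch
import torch.nn.functional as F


def cmpm_loss(image_feats, text_feats, labels, eps=1e-8):
    """Bi-directional CMPM loss (Eqs. 1-5): KL(p_i || q_i) in both directions.

    image_feats, text_feats: (n, d) mini-batch embeddings (unnormalized).
    labels: (n,) identity labels used to build the indicator y_{i,j};
            with purely pairwise data this reduces to an identity matrix.
    """
    y = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()   # y_{i,j}
    q = y / y.sum(dim=1, keepdim=True)                         # Eq. 2: normalized true matching

    def one_direction(a, b):
        b_bar = F.normalize(b, dim=1)                          # normalized features of modality b
        scores = a @ b_bar.t()                                 # scalar projections a_i^T b_bar_j
        p = F.softmax(scores, dim=1)                           # Eq. 1
        # Eqs. 3-4: mean KL divergence between p_i and q_i over the mini-batch
        return (p * (torch.log(p + eps) - torch.log(q + eps))).sum(dim=1).mean()

    return one_direction(image_feats, text_feats) + one_direction(text_feats, image_feats)  # Eq. 5
```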

3.3 Cross-Modal Projection Classification

For image-text matching with identity-level annotations, the classification loss applied to each modality helps to learn more discriminative features. However, the matching relationships of image-text pairs may not be sufficiently exploited in separate classification tasks. In this section, we develop a novel classification function where the cross-modal projection is integrated into the norm-softmax loss to further enhance the compactness of the matched embeddings.

Norm-Softmax. First we revisit the traditional softmax loss by looking into the decision criteria of softmax classifiers. Given the extracted image features \(\mathcal {X}=\{\varvec{x}_{i}\}_{i=1}^{N}\) from visual CNN, text features \(\mathcal {Z}=\{\varvec{z}_{i}\}_{i=1}^{N}\) from Bi-LSTM, and the label set \(\mathcal {Y}=\{y_{i}\}_{i=1}^{N}\) from M classes, the original softmax loss for classifying images can be computed as

$$\begin{aligned} \mathcal {L}_{softmax} = \frac{1}{N} \sum _{i} -\log \left( \frac{\exp (\varvec{W}_{y_{i}}^{\top }\varvec{x}_{i} + b_{y_{i}})}{\sum _{j} \exp (\varvec{W}_{j}^{\top }\varvec{x}_{i} + b_{j})}\right) \end{aligned}$$
(6)

where \(y_{i}\) indicates the label of \(\varvec{x}_{i}\) , \(\varvec{W}_{y_{i}}\) and \(\varvec{W}_{j}\) represent the \(y_{i}\)-th and j-th column of weight matrix \(\varvec{W}\), and \(b_{y_{i}}\) and \(b_{j}\) respectively denote the \(y_{i}\)-th and j-th element of bias vector \(\varvec{b}\).

To improve the discriminative ability of the image feature \(\varvec{x}_{i}\) during classification, we impose weight normalization on the softmax loss as with [19, 37], and reformulate Eq. 6 as

$$\begin{aligned} \mathcal {L}_{image}= \frac{1}{N} \sum _{i} -\log \left( \frac{\exp (\varvec{W}_{y_{i}}^{\top }\varvec{x}_{i})}{\sum _{j} \exp (\varvec{W}_{j}^{\top }\varvec{x}_{i})}\right) \; \; \; s.t. \; \Vert \varvec{W}_{j}\Vert = 1 \end{aligned}$$
(7)

Compared with the original softmax loss, the norm-softmax loss normalizes all the weight vectors to the same length in order to reduce the impact of weight magnitude in distinguishing different samples. Here we omit the bias \(\varvec{b}\) to simplify the analysis; in fact, we found it makes no difference, as in [19, 20].

The intuitive explanation of the norm-softmax loss is shown in Fig. 3. We can see that, for the original softmax, the classification result depends on \(\Vert \varvec{W}_{k}\Vert \Vert \varvec{x}\Vert \cos (\theta _{k}), (k=1, 2)\), where \(\theta _{k}\) indicates the angle between \(\varvec{x}\) and \(\varvec{W}_{k}\). For the norm-softmax, all the weight vectors are normalized to the same length, and the classification result depends only on \(\Vert \varvec{x}\Vert \cos (\theta _{k})\). This restriction encourages the feature \(\varvec{x}\) to distribute more compactly along the weight vector in order to be correctly classified.

Fig. 3. Geometric interpretation of softmax and norm-softmax
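A minimal sketch of the norm-softmax of Eq. 7 (bias omitted, class weights renormalized to unit length) is given below; the weight matrix shape (d, M) and the use of a standard cross-entropy call are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F


def norm_softmax_loss(feats, labels, W):
    """Eq. 7: softmax classification with unit-norm class weights and no bias.

    feats: (N, d) features, labels: (N,) class indices, W: (d, M) weight matrix.
    """
    W_norm = F.normalize(W, dim=0)        # enforce ||W_j|| = 1 for every column
    logits = feats @ W_norm               # (N, M); depends only on ||x|| cos(theta_j)
    return F.cross_entropy(logits, labels)
```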

Cross-Modal Projection. In this paper, we attempt to classify the projection of image features onto the corresponding text features instead of categorizing the original feature representations. The cross-modal projection integrates the image-text similarity into classification and thus strengthens the association within matched pairs.

By incorporating the cross-modal projection into the norm-softmax, we reformulate Eq. 7 as

$$\begin{aligned} \mathcal {L}_{ipt}= \frac{1}{N} \sum _{i} -\log \left( \frac{\exp (\varvec{W}_{y_{i}}^{\top }\varvec{\hat{x}}_{i})}{\sum _{j} \exp (\varvec{W}_{j}^{\top }\varvec{\hat{x}}_{i})}\right) \; \; \; s.t. \; \varvec{\hat{x}}_{i} = (\varvec{x}_{i}^{\top }\varvec{\bar{z}}_{i})\,\varvec{\bar{z}}_{i}, \; \Vert \varvec{W}_{j}\Vert = 1 \end{aligned}$$
(8)

where \(\varvec{\hat{x}}_{i}\) denotes the vector projection of the image feature \(\varvec{x}_{i}\) onto the normalized text feature \(\varvec{\bar{z}}_{i}\). Intuitively, all the matched text samples need to lie in the direction of \(\varvec{W}_{y_{i}}\) for the image feature \(\varvec{x}_{i}\) to project onto, in order to promote correct categorization. The text classification loss function can be written as

$$\begin{aligned} \mathcal {L}_{tpi}= \frac{1}{N} \sum _{i} -\log \left( \frac{\exp (\varvec{W}_{y_{i}}^{\top }\varvec{\hat{z}}_{i})}{\sum _{j} \exp (\varvec{W}_{j}^{\top }\varvec{\hat{z}}_{i})}\right) \; \; \; s.t. \; \varvec{\hat{z}}_{i} = (\varvec{z}_{i}^{\top }\varvec{\bar{x}}_{i})\,\varvec{\bar{x}}_{i}, \; \Vert \varvec{W}_{j}\Vert = 1 \end{aligned}$$
(9)

The final CMPC loss can be calculated with

$$\begin{aligned} \mathcal {L}_{cmpc}= \mathcal {L}_{ipt} + \mathcal {L}_{tpi} \end{aligned}$$
(10)
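The sketch below combines the norm-softmax with the cross-modal vector projection to approximate the bi-directional CMPC loss (Eqs. 8–10). A single classifier weight matrix shared by both modalities is assumed; this is an illustrative PyTorch reading of the equations, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def cmpc_loss(image_feats, text_feats, labels, W):
    """Bi-directional CMPC loss (Eqs. 8-10); W: (d, M) classifier weights, assumed shared."""
    z_bar = F.normalize(text_feats, dim=1)
    x_bar = F.normalize(image_feats, dim=1)

    # Vector projections onto the matched feature from the other modality
    x_proj = (image_feats * z_bar).sum(dim=1, keepdim=True) * z_bar   # \hat{x}_i
    z_proj = (text_feats * x_bar).sum(dim=1, keepdim=True) * x_bar    # \hat{z}_i

    W_norm = F.normalize(W, dim=0)                                    # unit-norm columns
    loss_ipt = F.cross_entropy(x_proj @ W_norm, labels)               # Eq. 8
    loss_tpi = F.cross_entropy(z_proj @ W_norm, labels)               # Eq. 9
    return loss_ipt + loss_tpi                                        # Eq. 10
```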

3.4 Objective Functions

For matching tasks with only pairwise correspondence, we can utilize the proposed CMPM loss for learning discriminative image-text embeddings. If identity labels are available, we adopt the combination of the proposed CMPM loss and CMPC loss for more accurately associating the cross-modal representations. The overall objective function is formulated as

$$\begin{aligned} \mathcal {L} =\mathcal {L}_{cmpm} + \mathcal {L}_{cmpc} \end{aligned}$$
(11)

At the test stage, given an image and text, we first extract the image feature \(\varvec{x}\) and text feature \(\varvec{z}\) with the visual CNN and Bi-LSTM network, respectively. Then the cosine distance between \(\varvec{x}\) and \(\varvec{z}\) is computed for image-to-text and text-to-image retrieval evaluation.
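The test-stage retrieval thus reduces to ranking by cosine similarity, as in the sketch below (the function name and tensor shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F


def rank_by_cosine(query_feats, gallery_feats):
    """Rank gallery items by cosine similarity to each query (higher = more similar)."""
    q = F.normalize(query_feats, dim=1)
    g = F.normalize(gallery_feats, dim=1)
    sims = q @ g.t()                                  # (num_queries, num_gallery)
    return sims.argsort(dim=1, descending=True)       # ranked gallery indices per query
```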

4 Experiments

4.1 Datasets and Settings

Datasets. Five datasets are used in our experiments. The Flickr30K [45] dataset contains 31,783 images, each annotated with five text descriptions. We adopt the data split in [12] and use 29,783 images for training, 1,000 images for validation, and 1,000 images for testing. The MSCOCO [17] dataset consists of 123,287 images, each also described by five sentences. Following the protocol of [12], we split the data into 82,783 training, 30,504 validation, and 5,000 test images, and report evaluation results on both the 5K and 1K (5-fold) test images. The CUHK-PEDES [16] dataset contains 40,206 pedestrian images of 13,003 identities, with each image described by two textual descriptions. The dataset is split into 11,003 training identities with 34,054 images, 1,000 validation persons with 3,078 images, and 1,000 test individuals with 3,074 images. The Caltech-UCSD Birds (CUB) [28] dataset consists of 11,788 bird images from 200 different categories. Each image is labelled with 10 visual descriptions. The dataset is split into 100 training, 50 validation, and 50 test categories. The Oxford-102 Flowers (Flowers) [28] dataset contains 8,189 flower images of 102 different categories, and each image has 10 textual descriptions. The data splits provide 62 training, 20 validation, and 20 test categories.

Evaluation Metrics. We adopt Recall@K (K = 1, 5, 10) [12] and AP@50 [28] for retrieval evaluation. Recall@K (or R@K) indicates the percentage of queries for which at least one ground-truth item is retrieved among the top-K results, and AP@50 represents the percentage of top-50 scoring images whose class matches that of the text query, averaged over all the test classes.
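Given gallery indices ranked by cosine similarity (as in the sketch at the end of Sect. 3.4) and a binary relevance matrix, Recall@K can be computed as below; this is a simplified sketch, and the AP@50 protocol of [28] is not reproduced here.

```python
import torch


def recall_at_k(ranked_indices, relevance, k):
    """Percentage of queries with at least one ground-truth item among the top-k results.

    ranked_indices: (num_queries, num_gallery) gallery indices sorted by similarity.
    relevance:      (num_queries, num_gallery) binary (0/1) ground-truth matrix.
    """
    topk = ranked_indices[:, :k]                                   # top-k gallery indices
    hits = torch.gather(relevance, 1, topk).sum(dim=1) > 0         # any ground truth retrieved?
    return 100.0 * hits.float().mean().item()
```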

Implementation Details. All the models are implemented in TensorFlow on an NVIDIA GeForce GTX 1080 GPU. For all the datasets, we use MobileNet [9] and a Bi-LSTM for learning visual and textual features, respectively. The Adam optimizer [13] is employed for optimization with \(lr=0.0002\). For Flickr30K and MSCOCO, we also report results with ResNet-152 [7] as the image feature extractor, where we start training with \(lr=0.0002\) for 15 epochs with a fixed image encoder and then train the whole model with \(lr=0.00002\) for 30 epochs.

Table 1. Comparison of bi-directional retrieval results (R@K(%)) on Flickr30K

4.2 Results on the Flickr30K Dataset

We summarize the comparison of retrieval results on the Flickr30K dataset in Table 1. We can see that with MobileNet as the image encoder, the proposed CMPM loss achieves competitive results of R@1 = \(37.1\%\) for image-to-text retrieval and R@1 = \(29.1\%\) for text-to-image retrieval. The performance can be improved to \(48.3\%\) and \(35.7\%\), respectively, by employing ResNet-152 as in RRF-Net [21] and DAN [26]. We also explore the auxiliary effect of the CMPC loss by training the classifiers with a single category per image, and we observe that the retrieval results can be further improved by around \(1.3\%\), demonstrating the effectiveness of cross-modal projection learning for image-text matching.

4.3 Results on the MSCOCO Dataset

We compare the proposed approach with state-of-the-art methods on the MSCOCO dataset in Table 2. We can see that for 1K test images the proposed CMPM loss achieves R@1 = \(56.1\%\) and \(44.6\%\) with image and text as queries, respectively. For 5K test images the algorithm achieves R@1 = \(31.1\%\) and \(22.9\%\), outperforming the second best by \(7.0\%\) and \(5.3\%\), which further verifies the superiority of the proposed loss functions.

Table 2. Comparison of bi-directional retrieval results (R@K(%)) on MSCOCO
Table 3. Comparison of text-to-image retrieval results (R@K(%)) on CUHK-PEDES

4.4 Results on the CUHK-PEDES Dataset

Table 3 compares the proposed method against existing approaches on the CUHK-PEDES dataset. We can see that the proposed CMPM loss achieves \(44.02\%\) R@1 and \(77.00\%\) R@10, outperforming the second best performer [15] by a large margin. When we add the CMPC loss supervised by the identity-level annotations, the text-to-image retrieval performance is further improved to \(49.37\%\) R@1 and \(79.27\%\) R@10. This illustrates the effectiveness of the CMPM loss for person search applications, and the additional benefit of the CMPC loss when category labels are available in real applications.

4.5 Results on the CUB and Flowers Dataset

The comparison of image-to-text and text-to-image retrieval results on the CUB and Flowers datasets is shown in Table 4. Considering that the bi-directional losses are implemented in our approach, we choose the symmetric results [15] of the existing methods for a fair comparison. We can see that the proposed algorithm outperforms the state of the art, achieving 64.3% R@1 for image-to-text retrieval and 67.9% AP@50 for text-to-image retrieval on CUB, and reporting the best R@1 of 68.90% for image-to-text retrieval and the second-best AP@50 of 69.70% for text-to-image retrieval on Flowers.

Table 4. Comparison of image-to-text (R@K(%)) and text-to-image (AP@K(%)) retrieval results on the CUB and Flowers dataset

5 Ablation Studies

To investigate the effect of each component of the proposed CMPM and CMPC losses, we perform a series of ablation studies on the CUHK-PEDES dataset. We conduct further comparative experiments in three aspects: comparison of the CMPM loss with other matching losses under various batch sizes, the impact of cross-modal projection and weight normalization on the CMPC loss, and the cross-modal feature distributions learned with different losses.

5.1 Analysis of Cross-Modal Matching

Table 5 compares the proposed CMPM loss with the commonly used bi-directional ranking (Bi-rank) loss [21, 39, 40], the closely related N-pair loss [30], and the Histogram loss [34] under different batch sizes on the CUHK-PEDES dataset. We add the image-to-text retrieval evaluation for a more comprehensive analysis of the learned embeddings, since good cross-modal embeddings should be able to perform bi-directional matching tasks. Note that all the loss functions are implemented in the bi-directional mode and the triplets are sampled online.

Table 5. R@1 (%) comparison of cross-modal matching functions with different batch sizes on the CUHK-PEDES dataset

From the table we can see that the existing matching losses fluctuate greatly when the batch size varies between 16 and 128. The bi-directional ranking loss depends on a larger batch size to generate competitive matching accuracies, due to its negative sampling requirements [29]. The Histogram loss [34] performs much worse than the other methods for cross-modal matching. The N-pair loss [30] produces better text-to-image retrieval results with moderate batch sizes, while its image-to-text matching performance is much worse. This might be due to the scalar gap between image and text embeddings produced by different networks. The \(KL(\varvec{q}_{i}\Vert \varvec{p}_{i})\) variant discussed in Sect. 3.2 generates satisfying results when the batch size is small, but deteriorates with a larger batch size of 128. This further verifies the analysis that, when there are more positive pairs in larger mini-batches, the inappropriate KL direction, which blurs the multiple modes, can cause ambiguities for image-text matching. In contrast, the proposed CMPM loss produces much more stable matching results across different batch sizes (R@1 remains above 42% for text-to-image retrieval), and its advantages are more obvious when the batch size is very small or very large, exhibiting great superiority and broad applicability.

Table 6. R@1 (%) comparison of different components of the cross-modal projection learning on the CUHK-PEDES dataset

5.2 Analysis of Cross-Modal Classification

Table 6 illustrates the impact of the softmax loss, weight normalization (normW) and cross-modal projection (CMP) in image-text embedding learning on the CUHK-PEDES dataset. We can see that adding the supervision loss indeed improves the matching performance, while the original softmax loss offers limited assistance. By adding the weight normalization, the R@1 rates are increased from 45.38% to 47.12% for image-to-text retrieval, and 55.14% to 56.51% for text-to-image retrieval. The cross-modal projection further improves the bi-directional retrieval results by \(2.25\%\) and \(1.20\%\). We also notice that the CMPC loss alone achieves competitive results for image-text matching and weight normalization brings significant improvements. This indicates the effectiveness of weight normalization and cross-modal projection in learning discriminative cross-modal representations.

Fig. 4. Comparison of feature distribution learned with the proposed approach

5.3 Feature Visualization

To better understand the effect of the proposed cross-modal matching loss and cross-modal classification loss for learning discriminative image-text embeddings, we show the t-SNE [24] visualization of the test feature distribution learned using the CMPM loss and the CMPM+CMPC loss on the CUHK-PEDES dataset. From Fig. 4(a) we can see that the CMPM loss learns image-text embeddings distributed along radial spokes, where the image and text features from the same class approximately lie in the same direction. This type of angular distribution is consistent with the traditional softmax loss [19], and therefore the added CMPC loss naturally improves the compactness of the features along each spoke, as shown in Fig. 4(b). We can also observe that the radius of the image feature area is smaller than that of the text features, which indicates the scalar gap brought by the different networks (i.e., the CNN for images and the Bi-LSTM for text). In experiments we obtain an average length (\(\ell _2\) norm) of 52.62 for image features and 128.92 for text features. The cross-modal distribution shows the importance of feature normalization in cross-modal projection for bridging the scalar gap in image-text embedding learning.
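A visualization of this kind can be reproduced with scikit-learn's t-SNE along the lines of the sketch below; the plotting choices (markers, sizes, default perplexity) are assumptions, as the paper does not specify them.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_tsne(image_feats, text_feats, labels):
    """Project image and text embeddings into 2-D with t-SNE and color by identity."""
    feats = np.concatenate([image_feats, text_feats], axis=0)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)

    n = len(image_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], c=labels, marker="o", s=8, label="image")
    plt.scatter(emb[n:, 0], emb[n:, 1], c=labels, marker="^", s=8, label="text")
    plt.legend()
    plt.show()
```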

6 Conclusions

In this paper, we proposed a novel cross-modal projection matching (CMPM) loss and cross-modal projection classification (CMPC) loss for learning deep discriminative image-text embeddings. The CMPM loss utilizes the KL divergence to minimize the compatibility of unmatched image-text pairs while maximizing the relevance between matched ones. It shows great stability and superiority for associating image and text under various batch sizes, without the triplet sampling and margin selection that hamper the traditional bi-directional ranking loss. The CMPC loss incorporates the matching relationship into the auxiliary classification task, which further enhances the representation compactness of each category. In the future, we will investigate how to better integrate the matching and classification tasks in identity-aware matching problems.