1 Introduction

Image–text matching (ITM) bridges the gap between visual scenes and their textual annotations to achieve an effective understanding of the two modalities. It measures image–text similarity based on visual and semantic information [1]. For instance, given an image, it is feasible to search for relevant sentences (I2T) and vice versa (T2I), since the same notion can be represented in several forms such as text, image, and audio. Matching between images and texts is employed in various multimodal tasks such as image captioning (IC), cross-modal retrieval (CMR), text-to-image synthesis, and visual question answering (VQA) [2].

Many statistical algorithms have been applied to matching, such as Canonical Correlation Analysis (CCA) [3], Cluster CCA [4], and Partial Least Squares (PLS) [5]. These algorithms determine the correlation among data and learn low-dimensional joint embeddings. In recent years, deep learning (DL) approaches have demonstrated notable results in many domains, such as natural language processing (NLP) and computer vision (CV), because of their ability to represent data without requiring handcrafted features. This has motivated researchers to explore the strength of DL in multimodal tasks.

A few overviews discuss the multimedia retrieval problem; these reviews did not focus on image–text retrieval (ITR) alone but also included the audio and video modalities. Peng et al. [6] and Aygun et al. [7] provided cross-media reviews covering image, text, video, and audio retrieval. However, these overviews did not focus deeply on ITR and did not give sufficient information about the ITM/ITR task. Lately, ITR has attracted increasing attention because of the growth of multimodal data. Chen et al. [8] therefore wrote a short review of ITR approaches proposed between 2018 and 2019, classifying the retrieval models according to the embedding method into four groups: pairwise, adversarial, attributes, and interaction. However, that review covered only some existing methods and neglected hybrid methods. Later, Abdullah et al. [2] presented a review that also focuses on ITR, but they classified the existing approaches based on the alignment between image and text parts.

In this study, we provide a survey of ITM methods; our focus is on bi-directional ITM models that utilize DL with different learning approaches. Bi-directional ITM models are able to retrieve from image to text (I2T) and vice versa (T2I). We propose classifying these methods according to the learning approach into the following categories: (1) deep canonical correlation analysis; (2) rank; (3) interaction; (4) adversarial; (5) cycle-consistent; (6) few-shot; (7) hybrid; and (8) vision-language pre-trained models (VL-PTMs). In contrast to the previous overviews, we explore the link between learning and alignment methods, as in Fig. 2. Furthermore, we summarize the structures of the existing models in Table 1, highlighting the encoders, loss functions, and optimizers used, to illustrate the main variations in the architectures. In addition, we discuss the key challenges and future directions.

Table 1 Architectural differences between the existing works

The rest of this article is arranged as follows: Sect. 2 presents a general background and an overview of the ITM task. Sect. 3 reports a taxonomy of bi-directional ITM approaches, summarized in Fig. 7, where each approach is described briefly. Sect. 4 presents datasets and evaluation methods. Sect. 5 gives a discussion and new directions. Finally, Sect. 6 concludes the paper.

2 Background

In this section, we give an overview of the basic structure of ITM models. Generally, a bi-directional CMR architecture consists of three main components, as illustrated in Fig. 1: (1) an image feature branch, (2) a text feature branch, and (3) a common (or latent) space [2, 8]. Although the feature vectors of an image and its reference text have similar semantics, they are distributed in distinct spaces. Hence, a common (shared) space is used to embed these vectors in one space so that they can be compared, typically with a Euclidean or cosine distance [9], and optimized through a loss function. The matching task is often hard since a deep understanding of images and sentences is required. The image–text retrieval model aims to assign a similarity score to each image–sentence pair: if the score is high, the sentence sufficiently describes the image; if the score is low, the sentence is unrelated to the input image. Consequently, the similarity scores are used to determine the top-ranked images and sentences in both retrieval directions [10].

Fig. 1
figure 1

Basic CMR architecture for the ITM task. The model takes images and their descriptions as inputs; the input features are extracted using encoders. The encoded vectors are then passed to the common/latent space, where similarity scores are computed via the loss function to retrieve matching pairs
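To make the pipeline of Fig. 1 concrete, the following minimal PyTorch sketch (our own illustration rather than any specific published model; the feature and embedding dimensions are assumptions) projects pre-extracted image and text features into a shared space and scores pairs with cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceModel(nn.Module):
    """Minimal bi-directional CMR skeleton: two projection branches plus cosine similarity."""
    def __init__(self, img_dim=2048, txt_dim=768, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # image-feature branch
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # text-feature branch

    def forward(self, img_feats, txt_feats):
        # Project both modalities into the common (latent) space and L2-normalize,
        # so that the dot product below equals cosine similarity.
        v = F.normalize(self.img_proj(img_feats), dim=-1)
        t = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return v @ t.t()                                # [n_images, n_texts] similarity scores

model = SharedSpaceModel()
scores = model(torch.randn(4, 2048), torch.randn(4, 768))  # higher score = better match
```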

Before the learning process, modality features must be extracted using suitable representation methods, which significantly influence performance. Here, we present the popular feature extractors for the text and image modalities, without going deeply into technical specifics.

2.1 Image Representation

Many powerful DL models based on convolutional neural networks (CNNs) have been constructed for image feature extraction, such as LeNet [11], AlexNet [12], GoogleNet [13], the visual geometry group network (VGGNet) [14], the region-based convolutional neural network (R-CNN) [15], Faster R-CNN [16], and the residual network (ResNet) [17]. These models can be integrated directly into multimodal models and trained simultaneously. However, the common choice for feature representation, especially in multimodal learning, is an existing pre-trained CNN, due to limited computational resources and the amount of data required for sufficient training. Recently, multi-step feature extraction has been suggested to obtain better representations. For instance, Lee et al. [18] and Wang et al. [19] used Faster R-CNN and fed its output to ResNet to obtain the image features.
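As a hedged illustration of the common pre-trained-CNN route, the sketch below (assuming a recent torchvision; the batch size is arbitrary) extracts global visual features with an ImageNet-pretrained ResNet by removing its classifier head. Region-based pipelines such as [18, 19] would instead use Faster R-CNN detections as region features.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Load an ImageNet-pretrained ResNet and drop its classification head,
# keeping the pooled 2048-d vector as the global image representation.
resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)  # older torchvision: pretrained=True
resnet.fc = nn.Identity()   # remove the final classifier layer
resnet.eval()

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)   # a batch of preprocessed images
    global_feats = resnet(images)          # [8, 2048] global visual features
```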

2.2 Text Representation

Since sentences and images have different natures, specific representation methods are used to encode words. The most popular methods are bag of words (BOW) [20], term frequency–inverse document frequency (TF-IDF) [21], latent Dirichlet allocation (LDA) [22], Word2Vec [23], GloVe [24], and, more recently, BERT [25]. These embedding methods map words into a vector space under the same distribution so that similarity between words can be measured. Many DL networks have been established to deal with the varying length of sentences by treating a sentence as a sequence of words. Recurrent neural networks (RNNs) [26] are a robust tool for sequential data such as sentences because they have an internal memory, but they cannot capture long-term dependencies since they suffer from vanishing gradients. For that reason, modified versions of the RNN, such as long short-term memory (LSTM) [27] and the gated recurrent unit (GRU) [28], are utilized for better performance. The GRU is a simplified variant of the LSTM with two gates instead of the LSTM's three, so it needs less computational power. Additionally, bi-directional versions of the RNN [29], LSTM [30], and GRU [26] have been proposed and are widely utilized for capturing the semantics of text. Practically, the weights of these networks are often initialized using Word2Vec, GloVe, BERT, or randomly. In addition to these methods, the CNN has also shown remarkable results when used for extracting textual features, as in [31, 32].
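The following sketch (a simplified illustration; the vocabulary size and dimensions are assumptions) shows the typical text branch described above: a word embedding table, often initialized from Word2Vec or GloVe vectors, followed by a bi-directional GRU whose hidden states are pooled into a sentence vector.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Embed word indices and run a bi-directional GRU; mean-pool the hidden states."""
    def __init__(self, vocab_size=10000, word_dim=300, hidden_dim=512):
        super().__init__()
        # In practice the embedding table is often initialized from Word2Vec/GloVe vectors.
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.gru = nn.GRU(word_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        words = self.embed(token_ids)     # [batch, seq_len, word_dim]
        states, _ = self.gru(words)       # [batch, seq_len, 2*hidden_dim]
        return states.mean(dim=1)         # sentence vector via average pooling

encoder = TextEncoder()
sentence_vec = encoder(torch.randint(0, 10000, (8, 12)))   # [8, 1024]
```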

2.3 Alignment

Alignment between the image (as a whole or as objects) and the text (as a whole or as words) plays an important role in understanding these different modalities; it shows how the input vectors interact with each other. The main alignment categories, illustrated in Fig. 2, are the following: (1) global alignment, in which the whole-image and whole-text features are passed directly to the common space to compute the similarity of each pair; (2) local alignment, which focuses on the correlation between image regions and text snippets; and (3) hybrid alignment, which combines the global and local approaches to achieve a more precise understanding and better matching results.

Fig. 2
figure 2

Alignment methods: (1) global (blue boxes), (2) local (orange box), and (3) hybrid (gray area)

In addition, some works incorporate relational alignment with one of the above categories, as in [33, 34], to enhance matching scores; relational alignment concerns the relations between image objects (e.g., for "a boy is holding a ball above the footpath", the relations are (a boy, holding), (holding, a ball), (above, footpath)). In the matching task, most existing works follow the global approach, where significant improvements have been achieved. However, the global approach cannot take advantage of the interaction between image and text to find their common semantics. Consequently, the local approach is used in several studies to link image regions or objects to text snippets (see Fig. 3), but it does not cover non-object elements of the image such as the background or snow. Recently, the attention mechanism has been used to associate words with image regions and compute a local similarity for each pair, and it also allows focusing on salient regions (e.g., sky, ground), as in [18, 19]. Since either alignment level alone still misses cues that could help matching, hybrid alignment has been suggested in many studies to achieve better visual–textual matching: Wang et al. [35] employed the local and global approaches in separate branches, and Li et al. [36] fused global and local similarities to obtain accurate matching. However, hybrid alignment increases memory requirements and computation time. A small sketch contrasting the three levels is given below.
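The sketch below is our own simplification, not the formulation of any cited work; the fusion weight alpha is an assumption. It computes a global cosine score, a local score that matches each word to its best region, and a hybrid score that fuses the two.

```python
import torch
import torch.nn.functional as F

def global_similarity(img_vec, txt_vec):
    """Cosine similarity between whole-image and whole-sentence vectors."""
    return F.cosine_similarity(img_vec, txt_vec, dim=-1)

def local_similarity(regions, words):
    """For each word, take its best-matching region and average the scores.
    regions: [n_regions, d], words: [n_words, d]."""
    sim = F.normalize(words, dim=-1) @ F.normalize(regions, dim=-1).t()  # [n_words, n_regions]
    return sim.max(dim=1).values.mean()

def hybrid_similarity(img_vec, txt_vec, regions, words, alpha=0.5):
    # alpha balances the two alignment levels; its value here is illustrative only.
    return alpha * global_similarity(img_vec, txt_vec) + (1 - alpha) * local_similarity(regions, words)

score = hybrid_similarity(torch.randn(512), torch.randn(512),
                          torch.randn(36, 512), torch.randn(10, 512))
```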

Fig. 3
figure 3

Interaction learning

2.4 Loss Functions

Several formulations of the loss function have been suggested; they differ in factors such as the feature distance space, label relations, and the similarity measurement. The triplet loss [37] is commonly used in image classification and retrieval. A triplet is usually expressed as (anchor (a), positive (p), negative (n)); the loss aims to decrease the distance between an anchor (given) and a positive (similar) sample while increasing the distance to a negative (dissimilar) sample. It can be expressed as the hinge triplet loss in Eq. 1, where \(I\) is the image embedding, \(T\) is the text embedding, \(N\) is the set of negative samples, \(\alpha\) is the margin, \(S\) is the similarity function, and \({\left[m\right]}_{+}=\mathrm{max}(0,m)\):

$$L(I,T) = \sum_{{T}^{\prime}\in {N}_{I}}{\left[\alpha -S\left(I,T\right)+S\left(I,{T}^{\prime}\right)\right]}_{+}+\sum_{{I}^{\prime}\in {N}_{T}}{\left[\alpha -S\left(I,T\right)+S\left({I}^{\prime},T\right)\right]}_{+}$$
(1)

Faghri et al. [38] used the hard-negative concept to enhance the triplet loss, where the hard negatives (HN) \({T}_{h}^{\prime}\) and \({I}_{h}^{\prime}\) are defined as \({T}_{h}^{\prime}={\mathrm{argmax}}_{c\ne T}\,S(I,c)\) and \({I}_{h}^{\prime}={\mathrm{argmax}}_{c\ne I}\,S\left(c,T\right)\), giving the loss in Eq. 2:

$${L}_{h}\left(I,T\right)=\mathrm{max}\left[0,\alpha -S\left(I,T\right)+S\left(I, {T}_{h}^{\prime}\right)\right]+\mathrm{max}\left[0,\alpha -S\left(I,T\right)+S\left({I}_{h}^{\prime},T\right)\right]$$
(2)
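A batched sketch of Eqs. 1 and 2 is given below, following the usual in-batch-negatives implementation of the hinge triplet loss (the margin value is illustrative). The flag switches between summing over all negatives (Eq. 1) and keeping only the hardest negative per anchor (Eq. 2, as in [38]).

```python
import torch

def triplet_loss(scores, margin=0.2, hard_negative=True):
    """scores: [n, n] similarity matrix with scores[i, j] = S(image_i, text_j);
    the diagonal holds the positive (matched) pairs."""
    n = scores.size(0)
    diag = scores.diag().view(n, 1)
    cost_txt = (margin + scores - diag).clamp(min=0)       # image as anchor, texts as negatives
    cost_img = (margin + scores - diag.t()).clamp(min=0)   # text as anchor, images as negatives
    mask = torch.eye(n, dtype=torch.bool)
    cost_txt = cost_txt.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)
    if hard_negative:                                      # Eq. 2: keep only the hardest negative
        return cost_txt.max(dim=1).values.sum() + cost_img.max(dim=0).values.sum()
    return cost_txt.sum() + cost_img.sum()                 # Eq. 1: sum over all negatives

loss = triplet_loss(torch.randn(32, 32))
```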

3 Retrieval Approaches’ Categorization

There is a variety of multimodal embedding learning approaches; they may share similar constructions, while the methods used to capture the features of each modality can be quite different, as illustrated in Table 1. In this section, we describe the DL-based image–text matching approaches according to our proposed categorization and discuss their pros and cons, as summarized in Table 2. Figure 7 summarizes the literature based on two factors, the learning category and the alignment method used within each category. The learning categories are:

Table 2 Advantages and disadvantages of the learning approaches

3.1 Deep Canonical Correlation Analysis

Deep canonical correlation analysis (DCCA) focuses on learning composite non-linear transformations of different data modalities by maximizing the total correlation through deep networks, such that the resulting representations are linearly correlated [39], as in the common space shown in Fig. 1. Yan et al. [40] proposed an extension of DCCA for image–text retrieval that works under specific constraints to avoid overfitting and the eigenvalue decomposition problem. Shao et al. [41] later integrated DCCA with progressive learning to reduce the data required for training; they used hypergraph learning to extract semantic information from the textual features, and related image–text pairs were then clustered in the latent space according to this semantic information. However, the models in [40] and [41] did not directly account for dissimilar pairs, which produces false-positive outputs in the retrieval results. To solve this issue, Hua et al. [42] recently designed a metric-learning-based loss that learns distance metrics between text and image using inter- and intra-modal correlation knowledge; in addition, they represented similarity as a multi-scale real value instead of a binary decision (similar or not). Even though DCCA aims to maximize the correlation among different modalities, it requires a large amount of memory and struggles to capture non-linear relations across modalities; as a result, it is not easy to apply DCCA when the number of modalities is greater than two. In addition, its loss is sensitive to the batch size.

3.2 Rank Learning

Rank learning has three approaches: pointwise, pairwise, and list-wise. They differ in how they handle the input data: pointwise learning takes one input and calculates the score between that element and the query; pairwise learning takes a pair of elements and then ranks all available pairs in descending order; and list-wise learning takes the entire input list and optimizes its order. The rank of an element is based on its loss: if the loss is low, its rank is high [43].

In ITM, pairwise learning is widely used. This approach attempts to find a loss function that calculates the distance between image and text pairs in the common space, reducing the distance between related images and texts while increasing it between unrelated samples. The use of DL for matching began with CNNs for extracting visual features, as in [44], [45], [46], which applied different CNN structures to capture visual features and LDA to represent textual features. Instead of LDA, Karpathy et al. [33] employed dependency trees (DT) to encode word relations within a sentence, using these relations as sentence fragments; local similarity is measured between image regions and sentence snippets, and the global similarity is obtained by accumulating the similarity scores of all region–word pairs. Karpathy et al. [47] later modified their work by using a BRNN as the sentence encoder instead of the DT in [33] to achieve better performance, since DT parsers may be trained on unrelated text corpora, which affects performance. Furthermore, CNNs have been used to capture annotation features: Ma et al. [31] used a CNN for both sentence and image representation.

Dealing with different modalities during learning is a critical task; therefore, many approaches have been suggested to embed the modalities jointly. For instance, Frome et al. [48] proposed a joint embedding model that addressed the limited number of categories in the available data by using image labels and unannotated data to recognize objects; this embedding method subsequently became common in CMR. Instead of using joint representation directly in one step, Peng et al. [49] proposed a hierarchical combination of the modalities' representations, aiming to acquire intra- and inter-media information and capture the correlation between media forms. Another approach was proposed by Mithun et al. [50], who used web data to develop CMR based on web-supervised learning to reach a strong joint embedding; they used the hard-negatives loss proposed by Faghri et al. [38] for the CMR task.

Basically, the loss function has a valuable impact since it reflects the error level during learning, so selecting or designing the loss is essential for reaching the desired output. Accordingly, several works have introduced new loss functions for ITM. Zhang et al. [51] introduced the cross-modal projection matching (CMPM) loss, which aims to boost the correlation between matched pairs, together with a classification loss called CMPC to obtain discriminative features. Jian et al. [52] then proposed a bi-triplet loss that decreases the gap between images and sentences using label information, where the similarity is computed with the Euclidean distance; in this work, the features are extracted by a two-layer network. To improve data quality, Zhen et al. [53] introduced an invariance loss that aims to eliminate inconsistency between modalities and a discrimination loss that preserves discrimination among semantic classes based on label information. Liu et al. [54] proposed a K-nearest-neighbor (KNN) loss to avoid overfitting and handle noisy data, extending the hard-negatives loss of Faghri et al. [38]. Wang et al. [32] proposed a discriminative embedding by representing the different modalities hierarchically, together with a semantic discrepancy (SD) loss to deal with multiple semantic levels; the model can share information between input and output through reverse connections. Recently, Biten et al. [55] modified the HN loss by introducing a semantic adaptive margin (SAM), where the sentences are used to update the margins to find the most similar samples, unlike the HN loss, which relies on the ground truth. From another perspective, Chen et al. [56] proposed integrating an intra-modal loss with HN to improve learning with intra-modal information.

To enhance feature embedding in rank learning, Wang et al. [57] extended the matching model of Faghri et al. [38] by incorporating consensus knowledge about scenes and their captions into the embedding using a graph, whereas Shi et al. [58] incorporated scene information to improve the embedding. In addition, Liu et al. [59] introduced a neighbor-aware loss to increase the distance between different neighbors based on their semantics. Another way to improve CMR results is to enhance how the similarity is computed: Wang et al. [35] built a matching model that serves two separate tasks, image–text retrieval and phrase localization, by combining an embedding network with neighborhood constraints and a similarity network, but the similarity network failed on the image–sentence task. Lately, Yang et al. [60] suggested computing the similarity with the Wasserstein distance instead of the cosine or Euclidean distance, where the similarity is measured over the distribution of samples; the model extracts mutual information to improve matching.

Instead of pairwise ranking, Xu et al. [61] employed list-wise ranking in CMR: for a given image, the loss is computed over all available annotations at once. This avoids a drawback of pairwise ranking, where the number of unrelated annotations is much larger than the related ones, which sometimes leads to ranking irrelevant annotations before relevant ones and consequently decreases model accuracy. Recently, as a post-processing step to mitigate this pairwise drawback at test time, a set of re-ranking approaches has been introduced. For example, Wang et al. [62] proposed a re-ranking approach that improves testing results without additional training, and Niu et al. [63] proposed another re-ranking approach. These models first create a fusion of the image and annotation modalities and then apply the re-ranking method to obtain improved results.

3.3 Interaction Learning

In this approach, information is transferred between the modalities before entering the common space, as in Fig. 3. Lou et al. [64] proposed a CMR model based on multitask learning with a correlation network to learn the common information and to distinguish unassociated image–text pairs; in addition, a relation-enhanced autoencoder correlates the hidden embeddings, where the interaction takes place between the modalities. Similarly, Wang et al. [65] proposed message passing between the two modalities, where salient information from one modality is aggregated and the aggregation result is passed to the other modality.

Recently, many methods have addressed the matching task by adding attention layers to transfer information between the image and text modalities, in order to detect salient regions in the image and achieve better understanding. Lee et al. [18] designed a network with stacked attention layers to obtain a full alignment between image and text; this model also attends to salient stuff regions such as snow and ground, using bottom-up attention [66] to extract scene features. Following the same idea, Wang et al. [19] extended the model of Lee et al. [18] by adding an attention layer to detect object positions and enhance the matching results. In contrast to Lee et al. [18] and Wang et al. [19], Huang et al. [34] focused on the relations between objects and words and disregarded such stuff regions. In addition, many studies employing interaction learning were motivated by Lee et al.'s work, such as Wu et al. [67], Ji et al. [68], Diao et al. [69], Li et al. [36], and Qi et al. [70]. These studies proposed two-level networks that handle global and local characteristics and compute the final similarity after obtaining the similarities from the hybrid alignments; they guide the learning process with the important words in the input sentence and the objects in the image. Furthermore, Ji et al. [68] presented a method to produce pairwise representations for image–text pairs consistent with their semantic association, and they also employed intra-modal interaction between image regions. Nam et al. [71], by contrast, proposed a matching network with dual attention, where the model concentrates on certain regions and words to capture the fine-grained interactions between them. Another attention network was designed by Liu et al. [72]; it differs from the previous studies in that it focuses on learning weights, with unrelated fragments eliminated from the shared space. Zhang et al. [73] used confidence scores to express the degree of consistency of each region with the global text, enhancing the alignment, especially for salient regions. Generally, the previous attention models apply the alignment across all words and all regions, which consumes time and memory when computing the overall similarity. Lately, to reduce this cost, Zhang et al. [74] proposed incorporating a relevance threshold between fragments into the embedding learning network, distinguishing relevant and irrelevant fragments to obtain a better alignment.
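To give an idea of how such interaction works, the sketch below implements a simplified text-to-image attention step in the spirit of [18], without claiming to reproduce it: each word attends over the region features, and the word-level similarities are averaged into an image–sentence score. The temperature value and the average pooling are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def text_to_image_attention(words, regions, temperature=9.0):
    """For each word, build an attended visual vector as a softmax-weighted sum of regions,
    then score the pair by averaging word/attended-vector cosine similarities.
    words: [n_words, d], regions: [n_regions, d]."""
    w = F.normalize(words, dim=-1)
    r = F.normalize(regions, dim=-1)
    attn = F.softmax(temperature * (w @ r.t()), dim=1)   # [n_words, n_regions] attention weights
    attended = attn @ regions                            # one visual context vector per word
    word_scores = F.cosine_similarity(words, attended, dim=-1)
    return word_scores.mean()                            # aggregate to an image-sentence score

score = text_to_image_attention(torch.randn(12, 1024), torch.randn(36, 1024))
```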

By contrast, instead of attention layers, graphs have been used to align image objects and words. Liu et al. [75] built a graph network that treats objects (cat, person, etc.), relations (playing, eating, etc.), and words as a structured phrase. Li et al. [76] proposed a multi-level similarity measurement based on a graph network. Recently, Long et al. [77] introduced a dual graph representation for text and image to bridge the gap between modalities, where text semantics are used to improve the visual representation and visual information is used to improve the textual representation through an attention mechanism; they used graph convolutional networks (GCNs) to implement their approach. Dong et al. [78] also used GCNs, proposing a hierarchical aggregation of features in which the GCN extracts relations between objects, and the object features are then integrated with the global feature of the other modality to narrow the modality gap and facilitate the fusion of cross-modal features; the proposed aggregation combines object attributes and object relations hierarchically in both modalities. The idea behind using graphs is to embed the relations between objects and words to enhance learning.

However, this learning approach increases the computational cost and makes learning more complicated, since information is transferred between the modalities before projecting the features into the latent space.

3.4 Adversarial Learning

The adversarial concept is a learning approach based on the generative adversarial network (GAN). A GAN is composed of a generative network, which generates samples close to the given data based on the data distribution, and a discriminative network, which aims to distinguish between real data and the examples produced by the generator [79, 80]. Figure 4 illustrates how a GAN works.

Fig. 4
figure 4

The adversarial learning

Lately, several attempts have employed GANs in multimodal retrieval. The first, by Park et al. [81], focuses on category information instead of image–sentence pairs: the embedding is learned through category prediction and domain classification, where the category predictor characterizes the features from the different modalities and the domain classifier pushes the features extracted from images and sentences toward the same distribution, using a gradient reversal layer (GRL) to create an adversarial relation between the domain classifier and the embedding network. However, since this model depends on category tags, the learning performance may suffer if the prediction is not accurate. Following the same idea, Wang et al. [82] and Sarafianos et al. [83] modified the model in [81] by using different image and text representation methods and different optimization techniques. Gu et al. [84] incorporated a GAN into the embedding phase to capture local and global features, and Zhu et al. [85] and Wang et al. [86] used GANs to learn matching between food recipes and images.
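The gradient reversal layer used by such domain classifiers can be sketched as follows (a standard GRL implementation, not code from [81]): it is the identity in the forward pass and flips the sign of the gradient in the backward pass, so the shared embedding network is pushed to produce modality-indistinguishable features.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda in the backward pass."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

# Usage: the shared embedding passes through the reversal layer before the domain classifier,
# e.g. domain_logits = domain_classifier(grad_reverse(shared_embedding))
```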

Recently, in [9], an integration of adversarial learning and information theory was proposed for cross-modal retrieval. This framework employs information theory to reduce the semantic gap between image and text: information entropy is combined with modality classification in an adversarial way, based on the relation between modality uncertainty and information entropy.

3.5 Cycle-Consistent Learning

Cycle consistency between text and image means that the retrieval model is able to translate text features into suitable image features and vice versa. This translation is based on reconstruction constraints that guarantee the reconstructed text or image is equivalent to the original [87], as in Fig. 5. The advantage of this approach appears when it is difficult to gather many image–sentence pairs, since the models learn to map directly between the different modalities [80].

Fig. 5
figure 5

The cycle-consistency learning

Cornia et al. [87] introduced a cycle-consistent network for the image–text matching task by transforming textual data to an appropriate representation in the visual domain and visual data to the textual domain, where this mapping is synchronized for both modalities through a similarity condition; the model applies a hinge-based triplet loss instead of a cycle-consistency loss to reduce its complexity. Liu et al. [88] suggested a more complex cycle-learning model for the same task, where the retrieval model employs cycle embedding for image–text retrieval by incorporating intra-modal consistency and inter-modal correlation to achieve better translation between the modalities; its loss is defined as the sum of three losses in both directions: the hard-negative, reconstruction, and latent losses. Simultaneously, Chen et al. [89] used semantic consistency to learn the modality embedding spaces jointly, incorporating a consistency constraint into the loss function.
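A minimal sketch of the cycle-consistency idea is given below (our own simplification with hypothetical linear mapping networks and assumed dimensions, not the loss of [87] or [88]): each modality is translated to the other space and back, and the reconstruction error is penalized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical mapping networks between the two embedding spaces (dimensions are assumptions).
txt_to_img = nn.Linear(512, 512)   # maps a text embedding into the visual space
img_to_txt = nn.Linear(512, 512)   # maps a visual embedding into the textual space

def cycle_consistency_loss(img_emb, txt_emb):
    """Translate each modality to the other space and back; penalize the reconstruction error."""
    img_rec = txt_to_img(img_to_txt(img_emb))   # image -> text space -> back to image space
    txt_rec = img_to_txt(txt_to_img(txt_emb))   # text -> image space -> back to text space
    return F.l1_loss(img_rec, img_emb) + F.l1_loss(txt_rec, txt_emb)

loss = cycle_consistency_loss(torch.randn(8, 512), torch.randn(8, 512))
```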

The models in [87, 88] do not utilize local relations between image parts and the given text. In general, such models are not simple to implement without regularization based on reconstruction constraints to ensure the quality of the translation.

3.6 Few-Shot Learning

Few-shot learning (also known as attribute learning) simulates human thinking and learns the features (attributes) of objects, such as shape, color, and texture, from a few instances. It generalizes well because attribute information can be shared between known and unknown classes during representation learning [90], as in Fig. 6. The main advantage of few-shot learning is reducing the large amount of data required to train DL models; in addition, it can handle unseen classes [91].

Fig. 6
figure 6

Few-shot learning

In contrast to the statistical CMR model proposed by Yuan et al. [92], and inspired by Ji et al. [93], who proposed an image retrieval model based on text or image using the attribute learning approach, a number of CMR models have been proposed. Chakraborty et al. [94] utilized textual (self-)attention to enhance CMR performance in zero-shot cases; the model uses a simple recurrent unit (SRU) to represent the sentences, and an inter-modality fusion is applied between the modalities to focus on the important areas of the images. Simultaneously, Huang et al. [91] proposed an aligned memory for CMR to preserve rarely occurring content, using a graph convolutional network to control the memory; in addition, a bi-directional GRU is used to store the sequence order of the sentences.

Compared with the statistical model of Yuan et al. [92], the DL models achieve better results. However, this learning approach faces several challenges, such as inadequate training examples for instances unseen during the training phase, a lack of class knowledge due to the inconsistency between seen and unseen classes, and distribution heterogeneity due to the multimodal data.

3.7 Hybrid Learning

A set of studies combines several learning approaches to achieve better results or to solve the issues of one learning approach with another. To address the absence of paired data, Wu et al. [95] suggested combining the cycle approach with adversarial learning. In the proposed model, called CycleGAN, a cycle-consistency loss is utilized to encourage the codes generated by the hash functions to carry semantic information consistent between inputs and outputs; using a GAN increases the consistency between the given inputs and the corresponding outputs, since the model can translate each modality into the other. The disadvantage of this model is its complexity. In addition, Xu et al. [96] designed an improved GAN structure based on the Wasserstein distance and developed three alignment methods with cycle-consistency constraints instead of triplet ranking; they also considered zero-shot (ZS) scenarios during the testing phase.

Regarding insufficient training data in ZS settings, ZS learning has been integrated with GANs to generate more data, as in Xu et al. [90] and Xu et al. [97]. Lin et al. [98] integrated ZS with cycle learning, using a variational autoencoder instead of a GAN to produce the latent embeddings, which yields more stable training and enhances the retrieval results. In another attempt to link GANs with ZS, Xu et al. [99] designed a CMR model with three subnetworks: two for capturing semantic features and a third, self-supervised one for transferring knowledge to unknown labels. In addition, Huang et al. [100] proposed a gated visual-semantic embedding to enhance the modality representations: the model first learns two parallel visual-semantic embeddings (VSE), one for uncommon and one for common instances, with separate metrics for measuring common and uncommon similarities, and then fuses them with the proposed gated fusion matrix to produce the final representation. However, the gated model needs more training epochs than normal models. Furthermore, interaction learning has been combined with other methods: Wei et al. [101] proposed a CMR model that combines a GAN with attention to enhance the embeddings in ITM, and Ma et al. [102] proposed solving DCCA drawbacks using selective attention, incorporating local and global features as well as intra-modality knowledge. Although these attempts achieve improvements in ITM, they suffer from complexity, especially in the loss structure, which is reflected in the learning time and resources.

3.8 Vision-Language Pre-trained Models

All the previous learning methods have achieved success in the ITM task through models designed around them; most of these models focus on the ITM task only and are trained on ITM datasets. In vision-language tasks such as ITM, visual question answering (VQA), or image annotation (IA), working on a single specific task results in poor transferability. Transfer learning in DL reuses a model developed for one task (e.g., classification) as the starting point for a model on a second task (e.g., ITM, VQA), where pre-trained models are commonly used as that starting point (e.g., for feature representation) in CV and NLP tasks. This helps researchers tackle tasks that require high resources, since integrating pre-trained models into new models reduces the required time and resources. The idea of pre-training first appeared in the CV field, where it showed effective results [14], as with the pre-trained models used to capture visual features such as VGGNet and the others mentioned in Sect. 2.1. The success of pre-trained models in CV extended to NLP after the release of the transformer [103], BERT [25], and GPT-3 [104]; the transformer is a DL model that adopts self-attention (SA) to handle long-range dependencies by differentially weighting the input segments [103], and it is the backbone of the pre-trained language models BERT [25] and GPT-3 [104]. Inspired by this success in NLP and CV, several attempts have been made to pre-train large-scale models on the vision and language modalities by training them on large, general image–text datasets. These models are known as vision-language pre-trained models (VL-PTMs). VL-PTMs aim to achieve better performance on image–text tasks such as ITR and VQA by acquiring general representations of the task modalities; a VL-PTM is then used for a specific task by fine-tuning it for that task.

Basically, building a VL-PTM usually passes through three stages: (1) defining the image and text encoders to obtain the input representations; (2) defining the interaction schema that links the different modalities, i.e., designing the pre-trained model architecture; and (3) determining the pre-training tasks used to train the model on a general dataset. In the encoding stage, BERT [25] is frequently used as the text encoder in VL-PTMs, as in ViLBERT [105], while Faster R-CNN [16] and ResNet [17] are used for image encoding. The text and image representations are then used by an encoder that integrates the information of both modalities to achieve the interaction between image and text, which is the second stage of a VL-PTM. To align the image parts and text snippets, the model needs to integrate the information extracted from the input texts and images and then localize the corresponding regions/objects in the image–text pairs. According to the way the extracted information from the different modalities is integrated, the encoder types can be classified as fusion encoders (single stream or dual stream), dual encoders, and sometimes a combination of both.

After the SA or cross-attention procedure, the hidden states of the last attention layer are utilized as the fused representation of the different modalities. There are two fusion-encoder variants: single stream and dual stream. In the single-stream variant, the text and image representations are concatenated and passed to a single transformer (SA) encoder to produce the fused representation. This approach is implemented in VL-BERT [106], which uses segment encodings for the inputs instead of a global image–text pair encoding, and OSCAR [107], which uses the objects detected in the image as tags to achieve better alignment, presenting the image–text pair as a word-tag-image triple. However, the single-stream method does not take intra-modality information into account; the dual-stream variant was proposed to solve this issue. The dual stream does not use SA as in the single stream but adopts a cross-attention layer, where the query vectors come from one modality and the key and value vectors from the other. The cross-attention layer contains two sub-layers, one per modality, to exchange information between the modalities, allow intra-interaction within each modality, and separate intra-modal from cross-modal interaction. This schema is employed in many VL-PTMs such as ViLBERT [105], LXMERT [108], and ALBEF [109]; recently, BLIP [110] was proposed and achieves high performance. However, applying the fusion encoder to ITR requires encoding all available pairs to calculate the similarity scores, and it relies on the transformer, so the time complexity increases and the inference speed becomes quite slow.

In contrast to the fusion encoder, the dual encoder employs single-modal encoders to encode each modality individually. It then adopts an attention layer, as in [18], or uses a dot product, as in ALIGN [111] and CLIP [112], to project the image and text embedding vectors into the latent space and compute the similarity scores without the complex cross-attention layers of the fusion encoder. Recently, MACK [113] was proposed as a re-ranking method that enhances the performance of ALIGN [111] and CLIP [112] based on a cycle-consistent loss. The dual encoder is more efficient for ITR than the fusion encoder, since the embedding vectors of both images and sentences can be pre-computed and stored.
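The efficiency argument for the dual encoder can be seen in the following CLIP/ALIGN-style sketch (the encoders are placeholders and the dimensions are assumptions): gallery embeddings are pre-computed once, and retrieval reduces to a normalized dot product followed by a top-K selection.

```python
import torch
import torch.nn.functional as F

# Placeholders for independently applied single-modal encoders (e.g., a ViT and a text transformer).
image_encoder = torch.nn.Linear(2048, 512)
text_encoder = torch.nn.Linear(768, 512)

with torch.no_grad():
    # Dual encoder: embeddings for the whole image gallery can be pre-computed and stored offline ...
    gallery = F.normalize(image_encoder(torch.randn(1000, 2048)), dim=-1)   # 1000 gallery images
    # ... and a text query only needs one forward pass plus a matrix product at search time.
    query = F.normalize(text_encoder(torch.randn(1, 768)), dim=-1)
    scores = query @ gallery.t()                 # [1, 1000] similarity scores
    top10 = scores.topk(10, dim=-1).indices      # indices of the best-matching images
```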

In addition, combining the fusion and dual encoders is applied in FLAVA [114] and VLMO [115]: since the fusion encoder performs better on vision-language understanding while the dual encoder performs better on ITR, it is natural to combine the benefits of the two architecture types. Table 3 shows the differences between the VL-PTMs used for the ITM/ITR task. Once the inputs are encoded as vectors and the interaction is defined, the next stage is to design pre-training tasks, such as ITM, VQA, and IC, for the VL-PTMs; these tasks influence what the VL-PTM can learn from the input data. In the training phase, some models, such as BLIP [110] and GLIP [116], consider ZS evaluation, i.e., evaluating the model without fine-tuning on the evaluation dataset. In addition, adversarial data samples have been used to enhance the pre-trained model and overcome the overfitting issue, as proposed in [117]; this shows how VL-PTMs can incorporate the other learning approaches. In general, after pre-training, the VL-PTMs are fine-tuned on a specific task using the evaluation dataset.

Table 3 Comparison between the current VL-PTMs

Finally, to summarize our work, Table 2 illustrates the advantages and disadvantages of each learning approach. In addition, Table 1 summarizes the techniques used in each existing work in terms of the text and image representations, the loss function used in learning, and the optimization technique, such as stochastic gradient descent (SGD) and Adam, which are commonly used in ITM. Figure 7 shows a taxonomy of the literature based on the learning approaches and alignment methods.

Fig. 7
figure 7

ITM approaches taxonomy, where blue refers to global alignment, green to local alignment, and orange to hybrid alignment in each learning category. The VL-PTMs have different architectures, but all of them apply interaction between the input data

4 Datasets and Evaluation Methods

4.1 Evaluation

There are two popular evaluation methods for CMR. The first is the Recall@K (R@K) score, which computes the fraction of queries for which a correct item (text or image) appears in the top K results; K usually takes the values 1, 5, and 10 in retrieval tasks. The second is the mean average precision (MAP) score over all returned results for both retrieval directions; the MAP value is the mean of the average precision (AP) computed over all retrieval queries. In image–text retrieval, R@K is more commonly used than MAP.
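For the simplest one-to-one case (each query has exactly one ground-truth item; the standard protocol on Flickr30K/MSCOCO additionally handles five captions per image), R@K can be computed from a similarity matrix as in the sketch below.

```python
import torch

def recall_at_k(scores, ks=(1, 5, 10)):
    """scores: [n_queries, n_items] similarity matrix where item i is the ground truth for query i."""
    n = scores.size(0)
    ranks = scores.argsort(dim=1, descending=True)                            # items ordered by score
    gt_rank = (ranks == torch.arange(n).view(-1, 1)).float().argmax(dim=1)    # position of the ground truth
    return {f"R@{k}": (gt_rank < k).float().mean().item() * 100 for k in ks}

print(recall_at_k(torch.randn(100, 100)))
```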

4.2 Datasets

There are many benchmark datasets for the CMR domain, summarized in Table 4, particularly for image–text annotation and search tasks, such as:

  • Wiki [124]: it consists of 2866 images representing 10 categories, with 2173 image–text pairs for training and 693 for testing. The data are selected from Wikipedia articles, where each text represents an article about events, places, or people and the related image clarifies the article content, as shown in Fig. 8.

  • Pascal Sentences [125]: also called PASCAL1K, it contains 1 K images, each with five captions collected using Amazon's Mechanical Turk (AMT), as shown in Fig. 9. The images are taken randomly from PASCAL VOC 2008 [126], which contains 20 classes.

  • Flickr8K [127] and Flickr30K [128]: Flickr30K is an extension of Flickr8K. Flickr8K contains nearly 8 K images, each with 5 clear annotations of the image content; the images were selected manually from 6 different Flickr groups. Flickr30K was released later and contains nearly 31 K images and 155 K sentences, where each image has 5 corresponding descriptions.

  • MSCOCO [129]: the MSCOCO (2014) version contains approximately 164 K images, with about 83 K for training, 41 K for validation, and the rest for testing; each image has 5 corresponding descriptions. Another version, MSCOCO (2017), was released with approximately 118 K images for training, 5 K for validation, and the rest for testing.

Table 4 Datasets for image–text retrieval
Fig. 8
figure 8

Samples from Wiki dataset

Fig. 9
figure 9

Samples from Pascal sentences dataset

Table 5 shows the evaluation approaches and datasets used in the literature. In addition, Tables 6 and 7 report the publicly recorded performance on the two common datasets, Flickr30K and MSCOCO, using the R@K evaluation method.

Table 5 The used datasets and evaluation methods in the existing works
Table 6 Some examples for the retrieval results for Flickr30K dataset 
Table 7 Some examples for the retrieval results for MSCOCO dataset 

5 Discussion and New Directions

5.1 Discussion

In this survey, we considered 66 previous studies related to ITM that are trained on the ITM datasets in Table 4. For a fair comparison and discussion, we exclude four previous studies, [55], [63], [85], and [86]: Biten et al. [55] evaluated their approach on the models proposed in [1, 57, 69], and similarly Niu et al. [63] evaluated their re-ranking method on the models proposed in [38, 48, 100], while Zhu et al. [85] and Wang et al. [86] used a dataset related to food recipes. In addition, we exclude the VL-PTMs used for the ITM task from the statistics computation, since they follow a different architecture, but we compare their results with the other learning approaches.

In Tables 6 and 7, we show some recent results on Flickr30K and MSCOCO for all learning approaches to give an idea of the performance, since these datasets are the most commonly used, as shown in Fig. 10(c). From Tables 6 and 7, we observe that the quality of bi-directional retrieval is not the same for I2T and T2I: some ITM models perform well on I2T and worse on T2I, or vice versa, depending on the feature representation of each modality in the common space. In addition, the performance varies from one dataset to another, depending on the data variation.

Diving deeper into the different structures of the ITM methods in Table 1, Fig. 10 presents statistics that help to understand the distribution of alignment and learning approaches in the previous studies. The percentage of studies that use the global approach is 55%, versus 22% for local and 23% for hybrid, as in Fig. 10(b). In addition, Fig. 10(a) shows that the ranking method has the highest share at 35%, with interaction learning in second place. To understand the relation between learning and alignment approaches, which influences the model structures, Fig. 7 illustrates the relation between them.

Fig. 10
figure 10

Statistical information based on Tables 1 and 5: (a) percentage of use of each learning approach, (b) percentage of use of each alignment approach, (c) frequency of use of each dataset, and (d, e) frequency of use of each text/image encoder

According to Table 1, we summarize the embedding methods used for images and text in Fig. 10(d, e); here, we count how many times a given method is used in the previous work, taking into consideration that the embedding may involve more than one method. In addition, Fig. 10(c) shows the frequency of dataset usage, where Flickr30K has the largest value, followed by MSCOCO. Building an ITM model requires not only selecting an appropriate modality representation and learning approach but also knowing the available resources, such as memory and storage space, since feature extraction and the learning process consume resources. From Table 1, we can see the similarity in the construction of ITM models such as [18, 19, 65, 67, 69] and [72,73,74,75] in interaction learning. We can also observe that VGGNet is mostly used in adversarial and hybrid learning, whereas R-CNN and Faster R-CNN are used in interaction learning. In addition, a dataset may require a specific treatment to obtain better results, as with Wiki, which contains long texts; because of that, most studies use Doc2Vec to embed its text. The previous studies can be divided into two groups: one uses Flickr and MSCOCO (with R@K for evaluation) and the other uses Wiki and Pascal (with MAP for evaluation); this division is due to the available resources. Furthermore, the loss is an essential factor in learning, and we find that the triplet loss, especially the HN variant, is the most commonly used.

Comparing the VL-PTMs with the other learning approaches in ITR, it is clear that the VL-PTMs outperform the others, as shown in Tables 6 and 7. This is because the VL-PTMs are trained on general, large-scale data comprising millions of image–text pairs, and they are evaluated on a small data partition compared to the training data. However, they need huge resources compared to the other learning approaches, and the performance of any VL-PTM can change depending on the amount of pre-training data.

5.2 New Directions

Although some encouraging achievements have been accomplished in the area of bi-directional image–text matching, there is still a set of open issues that requires more investigation. In this subsection, we highlight the prospective research opportunities, which are:

  • Similarity or correlation measurement: searching for matched texts and images is a hard task, since images and words live in heterogeneous spaces; therefore, it is difficult to measure the similarity before projecting the features into a common space.

  • Datasets: the available datasets face issues such as limited size or uncategorized data. For instance, the Pascal Sentences and Wiki datasets are small compared to the rest, and it is challenging to make automatic decisions with scarce data. From another perspective, a small, real dataset could serve as a pilot dataset, which is currently not available for the matching task, instead of starting with large datasets, saving time and resources. Unlike Wiki and Pascal Sentences, Flickr30K is uncategorized, so details such as whether the data are balanced cannot be obtained. Consequently, it is preferable to design new general datasets or modify the existing ones for upcoming research. In addition, the images in the available datasets mostly do not include scene text, which could help the learning process achieve more robust performance; as a first step in this direction, Malfa et al. [130] established a dataset containing images with scene-text examples for CMR. Nowadays, the easiest way to create a dataset is from social media, keeping in mind the mix of formal and informal text descriptions.

  • Evaluation: in the standard datasets, each image has only 5 annotations, and sentences are labeled as relevant or not, ignoring the degree of relevance. Therefore, the existing evaluation methods in ITM need enhancements to represent a deeper semantic interpretation between modalities. This motivated Biten et al. [55] to propose new evaluation metrics for ITM inspired by those used in the image captioning task.

  • Features Representation: the embedding approaches play a vital role in the retrieval task, so it is important to select a suitable approach to represent words and images in order to reach the desired results. Some studies now apply multi-step feature extraction to improve modality understanding and to create powerful language models that understand images better. In addition, most studies ignore intra-modal information, which can provide useful signals alongside inter-modal information during learning. Another issue is that using the same encoders for dissimilar datasets may not be effective; for example, Jian et al. [52] used different encoders for different datasets to achieve better results.

  • Features Dimensions: in multimodal retrieval, the data dimensionality is high, so it is promising to apply compression techniques to compact the modality representations, decreasing hardware requirements such as storage, memory, and GPUs, and reducing training time. For instance, Serra et al. [131] proposed a compact embedding approach for one-hot data. Another option is network compression, which is needed to decrease the size of the created models, as in [132].

  • Loss Functions: minimizing the output error is important, yet some loss functions have issues. For instance, with the triplet loss, it is hard to find suitable triplets and to select proper margins. In addition, loss functions with complex structures, especially in hybrid models, consume more time to compute.

  • Statistical Information: incorporating statistical information with features extracted using DL may lead to deeper understanding, as in [57, 60].

6 Conclusion

This paper presents an overview of the image–text matching task using deep learning, summarizes the current learning methods, and categorizes them into DCCA, ranking, interaction, adversarial, cycle-consistent, few-shot, hybrid, and VL-PTMs. In addition, we classify the existing works based on the alignment method into global, local, and hybrid (global–local). For further illustration, we summarize the techniques used in the existing architectures to show the main differences between them. We then present the commonly used benchmark datasets and empirically assess the performance of some existing works. Finally, we discuss the challenges and future trends in image–text matching. Although remarkable studies have been conducted on the matching task, more work is still needed to achieve performance that can mimic human behavior. We hope this survey helps junior researchers understand the state of the art in image–text matching and motivates them toward more significant work.