Introduction

Given an image, humans can immediately comprehend the visual content and describe it in natural language, while it remains quite challenging for computers to do the same. Image captioning [1] aims to equip machines with the ability to automatically produce sensible sentence descriptions for an input image. Generating acceptable captions relies largely on semantic understanding of the image, which poses challenges for both computer vision and natural language processing. Inspired by neural machine translation [5], encoder–decoder frameworks have been widely adopted for this task over the last few years [6,7,8,9]. Typically, a CNN is employed as the encoder to encode the image into fixed-length visual features, and an RNN is utilized as the decoder to produce valid and descriptive captions from those features. To achieve a more fine-grained understanding of the image, the attention mechanism [7, 10, 11] is used to focus on the most relevant image regions when predicting each word, and introducing attention into the encoder–decoder architecture has greatly boosted captioning performance. More recently, Transformer [28] has reshaped the artificial intelligence domain, and Transformer-based image captioning models have since led the state of the art.

Generally, visual features play a crucial part in the semantic understanding of the image, and great efforts have been made to learn effective visual representations [2, 4]. For example, local region-based and global grid-based convolutional features are broadly used in existing image captioning models. Specifically, the region-based visual feature is usually extracted with a pre-trained object detector, such as Faster RCNN [12], which captures rich information about objects and other important image regions. However, the object detector is bulky and feature extraction is time-consuming; therefore, cached region-based visual representations are more popular in current image captioning models. The global grid-based visual feature, in contrast, treats all parts of an image equally by directly dividing it into uniform grids, which makes it harder for the decoder to generate high-quality sentences. Nevertheless, a series of works [13, 14, 61] has shown that grid-based visual features can also perform well on VQA when they are obtained from the same layer of the region-based object detector. To improve captioning performance based on flattened grid visual representations, RSTNet [14] absorbed spatial information into the grid-based feature by introducing a grid-augmented module. RSTNet is therefore sensitive to the geometric relationships between grid positions, but the augmented grid features ignore information along the channel dimension, limiting the model’s ability to fully understand the visual content. Additionally, offline grid-based visual features are usually not aligned with the linguistic features, making it difficult to directly learn the interaction between them.

In this paper, to address the aforementioned issues, learnable feature queries and a feature interaction module are designed to capture expressive information from the grid-based visual representation and improve the performance of the Transformer-based image captioning model. The feature queries are a group of learnable tokens exploited to query useful visual clues from the global grid representations. They serve as local signals that interact with the overall image context to obtain discriminative visual features, thus facilitating the alignment of visual and textual features during decoding and enhancing the quality of the generated descriptions. A feature interaction module is then introduced to enable interaction between the feature queries and the grid features and output refined visual representations; it is implemented with the cross-attention mechanism and linear mapping networks. Finally, adaptive attention is leveraged to balance how visual and linguistic signals influence the generation of accurate descriptions. With the proposed feature queries and feature interaction module, more expressive and relevant visual information can be mined to generate high-quality captions for the given image. To evaluate the effectiveness of the proposed approach, named FeiM, we carry out extensive experiments on the MSCOCO benchmark, where FeiM achieves superior performance.

Overall, the main contributions of the paper are threefold:

  • We design feature queries, which work as the local signals to query powerful visual details from the context of grid representations, by introducing a small amount of learnable continuous vectors. Benefiting from these flexible tokens, discriminative global visual information can be easily aggregated for better image captioning.

  • We propose a feature interaction module to capture the interdependencies on the spatial and the channel dimensions, respectively, by cross-attending the feature queries and the augmented grid visual features. It can provide refined and expressive visual features for the subsequent Transformer-based multi-modal fusion module to produce multi-modal signals.

  • We build a complete image captioning framework, namely FeiM (which is short for feature queries and feature interaction module), by applying the two designs in a Transformer-based captioning architecture. Extensive experiments on the MSCOCO benchmark demonstrate that FeiM enables remarkable image captioning capability with grid visual representations.

Related works

Image captioning

There have been two main phases in the development of image captioning [2,3,4]. Early captioning models [15,16,17,18,19,20] are based on retrieval and templates, attempting to automatically generate simple descriptions of certain scenes. In particular, retrieval-based methods [15,16,17] generate descriptions by retrieving the sentences most relevant to the image from a predefined image–caption pool. Template-based models [18,19,20] fill detected words from visual concepts into the slots of a predesigned template. Although these traditional methods enable simple caption generation, they rely on hand-crafted visual features, resulting in generated sentences with poor linguistic, syntactic, and semantic quality.

Recent years have witnessed the rapid evolution of deep neural networks, and deep learning-based methods [3, 4, 6,7,8,9] have become the standard for image captioning. Currently, many image captioning methods adopt the encoder–decoder structure, where a CNN performs feature extraction and an RNN performs caption generation. For example, Neural Image Captioning (NIC) [4] treated image captioning as a translation problem from vision to text and was the first captioning model to exploit the encoder–decoder paradigm. Since then, attention mechanisms [7, 21, 22], reinforcement learning-based training strategies [23,24,25], and large-scale pre-trained vision-language models [26, 27] have further enriched deep learning-based captioning methods.

More recently, illuminated by the successful application of self-attention in natural language processing, Transformer [28] has made great breakthroughs in computer vision, and the CNN–RNN architecture has gradually been replaced by Transformer-based captioning models [14, 29,30,31,32, 55]. Transformer is also an encoder–decoder framework, and its encoder and decoder layers are built from similar components, including residual connections, self-attention layers, layer normalization, and feed-forward networks. Compared with previous CNN–RNN-based methods, the self-attention mechanism enables Transformer-based models to capture long-range dependencies in the sequence and ensures effective contextual information interaction. For example, Cornia et al. [30] introduced the Meshed-Memory Transformer (\(M^2\) Transformer) for better image captioning with a fully attentive network. Li et al. [29] extended the attention mechanism connecting the encoder and decoder of the Transformer to maximize the use of visual and semantic information. Zhang et al. [14] utilized grid-augmented visual features to incorporate more spatial information via a relationship-sensitive Transformer-based image captioning model (RSTNet).

Convolutional representation learning

The convolutional visual features currently used in the image captioning model generally fall into two groups. One is the local region feature, and the other is the global grid feature, as shown in Fig. 1. For example, Karpathy et al. [2] observed that the descriptions for the image usually contained rich objects and attributes; thus, they generated captions with information from image regions. Anderson et al. [33] obtained a set of salient region features via Faster RCNN. Furthermore, Datta et al. [34] used the Faster RCNN pre-trained on VisualGenome to extract Regions-of-Interest (RoIs) for the image. The most significant advantage of region features is that they contain fine-grained information from the objects and are beneficial for the comprehensive understanding of image content. However, the region feature is typically captured through the pre-trained object detector. The detection process is time-consuming and the detector cannot be trained end-to-end in the captioning model.

Fig. 1
figure 1

The region-based features (left) and grid-based features (right)

The global grid features divide the entire image into uniform cells. In NIC [4], GoogleNet [35] was employed to extract fixed-length visual representations. Instead of using the image feature output from the last fully connected layer, Xu et al. [7] focused on the output of a lower convolutional layer. To generate more reliable language descriptions, Wu et al. [36] not only captured the global grid representation with a pre-trained CNN but also learned explicit high-level concepts from the image. In contrast to the region feature, the grid feature is easier to obtain, but it ignores distinctive information that is important for image semantic understanding and description generation. Recently, several works [13, 14, 61] have revealed that grid visual features obtained from the same layer of the pre-trained object detector are as competitive as region features. RSTNet [14] augmented the grid features by introducing spatial information; however, it does not capture meaningful visual content from both the spatial and channel dimensions, which restricts further performance gains. In this paper, based on the Transformer architecture, we introduce feature queries and a feature interaction module to take advantage of the expressive information embedded in grid-based visual representations for better image captioning.

Methods

We aim to automatically generate textual descriptions for a given image with a Transformer-based architecture. Previous Transformer-based image captioning models [14, 29,30,31,32] mainly boost performance from the perspective of the encoder, decoder, or attention mechanism. Instead, we concentrate on building a Transformer-based captioning model that maximizes the use of the grid visual information. Figure 2 depicts the overall framework of our image captioning model FeiM, which can be conceptually divided into semantic understanding and description generation. For semantic understanding, FeiM begins by extracting grid-based image features and text features; similar to RSTNet [14], the image features are then augmented with spatial information. We introduce a small number of learnable continuous tokens as feature queries to capture expressive visual details from the augmented overall grid visual features. Both the feature queries and the spatial-augmented grid features are then fed into the feature interaction module. Next, the features of the word sequence and the output of the feature interaction module go through the Transformer-based multi-modal fusion module, where the \(M^2\) Transformer [30] is exploited: the Transformer encoder encodes multi-level relationships among the outputs of the feature interaction module, while the decoder cross-attends the word sequence features and the latent representations of all Transformer encoder layers. For description generation, adaptive attention [14, 37] is used to generate words conditioned on vision, language, and vision–language signals.

Essentially, FeiM is a Transformer-based captioning framework to generate descriptive sentences for given images. In this section, we first formulate the image captioning task and introduce how visual and textual features are obtained. Then, we give the details of the designed feature queries and feature interaction module, respectively. Finally, we briefly introduce the Transformer-based multi-modal fusion and captioning module.

Fig. 2
figure 2

The overall framework of the proposed FeiM. FeiM starts with extracting grid-based image features and text features, and the image features are augmented with spatial information. We design feature queries to capture useful knowledge from the augmented global grid visual representations. Then, both feature queries and spatial-augmented grid representations are fed into the feature interaction module, to refine expressive visual features from the interaction of spatial and channel dimensions. Next, the features of the word sequence and the output of the feature interaction module go through the Transformer-based multi-modal fusion module. Finally, adaptive attention is used to generate words conditioned on multi-modal, visual, and textual signals

Visual and textual features

The standard image captioning task can be stated as follows: given a dataset D of image–text pairs \(\{I^i, y^i\}\) and a test image I, the goal is to generate a fluent and plausible description \(y = \{y_1^I,...,y_L^I\}\) corresponding to I, where \(y_i^I\) represents the ith word of the caption sequence y generated for image I.

During training, the original grid features of images are obtained following the operation in [13]. Recent works usually flatten the grid features and then encode them directly with a Transformer encoder, which fails to capture the relationship between the image grids. Therefore, to take advantage of the more readily available grid features without losing position information, the augmented grid image feature [14] is exploited as the visual representation in our work.

Specifically, the image encoder outputs the original grid feature \(F_o\in {{\mathbb {R}}^{(h\times {w})\times {2048}}}\) with grid size \(h\times {w}\). To adapt the original grid feature to our method, \(F_o\) is projected to \(F_d\in {{\mathbb {R}}^{(h\times {w})\times {d}}}\) \((d = 512)\) via a linear network with a ReLU activation and a dropout layer, which can be written as

$$\begin{aligned} f_{d(i)} = max\left( 0, W_{o(i)}\cdot B(p) \cdot f_{o(i)} + b_{(i)}\right) , \end{aligned}$$
(1)

where \(f_{d(i)}\in F_d\), \(f_{o(i)}\in F_o\) are the ith elements of \(F_d\) and \(F_o\), respectively. p is the dropout rate, and B(p) denotes a binomial probability distribution function with a probability of p. \(W_o\in {\mathbb {R}}^{2048\times d}\) is the weights matrix, and b is the bias term.
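As a concrete illustration, the projection of Eq. (1) can be sketched in PyTorch as follows. This is a minimal sketch rather than the authors' released code: the module name, tensor layout, and the dropout rate p = 0.1 are assumptions of ours.

```python
import torch
import torch.nn as nn


class GridFeatureProjection(nn.Module):
    """Project 2048-d grid features to d dimensions with dropout and ReLU (Eq. 1)."""

    def __init__(self, in_dim: int = 2048, d: int = 512, p: float = 0.1):
        super().__init__()
        self.drop = nn.Dropout(p)           # Bernoulli mask B(p) applied to f_o
        self.proj = nn.Linear(in_dim, d)    # weights W_o and bias b
        self.relu = nn.ReLU(inplace=True)   # max(0, .)

    def forward(self, F_o: torch.Tensor) -> torch.Tensor:
        # F_o: (batch, h*w, 2048) -> F_d: (batch, h*w, d)
        return self.relu(self.proj(self.drop(F_o)))


# Example: a batch of two images with 7x7 grids.
F_d = GridFeatureProjection()(torch.randn(2, 49, 2048))   # -> (2, 49, 512)
```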

Following [14, 56], each grid m can be specified with a triple \(\left( (x_m, y_m), w_m, h_m\right) \), where \((x_m, y_m)\) denotes the center coordinates of grid m, and \(w_m\) and \(h_m\) denote its width and height. The relational vector \(R(m,n)\in {\mathbb {R}}^{(h\times w)\times (h\times w)\times 4}\) between grids m and n is determined with

$$\begin{aligned} \Delta x&= |x_m-x_n| \\ \Delta y&= |y_m-y_n| \\ R(m,n)&=\left( \log \left( \frac{\Delta x}{w_m}\right) ,\log \left( \frac{\Delta y}{h_m}\right) , \log \left( \frac{w_m}{w_n}\right) ,\log \left( \frac{h_m}{h_n}\right) \right) . \end{aligned}$$
(2)

Then, the 4-dimensional relational vector R(m,n) is cast to \(R_h(m,n)\in {\mathbb {R}}^{(h\times w)\times (h\times w)\times d_h}\) with a higher dimension \(d_h\) through a linear projection layer with learnable matrix \(W_h^T\) and bias \(b_h\) as

$$\begin{aligned} R_h(m,n) = W_h^T \cdot R(m,n)+b_h. \end{aligned}$$
(3)

Then, we compute the relational features \(R_f(m,n)\in {\mathbb {R}}^{(h\times w)\times (h\times w)}\) between grids m and n with the weights matrix \(W_g^T\) and the ReLU activation function as

$$\begin{aligned} R_f(m,n) = max\left( 0, W_g^T \cdot R_h(m,n) \right) . \end{aligned}$$
(4)

To capture the relative position information between the grids, the relational features \(R_f\) are delivered to the attention mechanism as a bias term to compute the relation-aware attention features \(A_R\), which can be defined as

$$\begin{aligned} A_R = softmax\left( \frac{QK^T}{\sqrt{d_k}}+ R_f\right) \cdot V \end{aligned}$$
(5)
$$\begin{aligned} softmax(x)_i = \frac{e^{x_i}}{\sum _i{e^{x_i}}} \end{aligned}$$
(6)
$$\begin{aligned} Q = F_d \cdot W_Q, \quad K = F_d \cdot W_K, \quad V = F_d \cdot W_V, \end{aligned}$$
(7)

where Q, K, and V are the relational query, key, and value matrices obtained by projecting the processed grid feature \(F_d\). \(W_Q\), \(W_K\), and \(W_V\) are learnable weight parameters, and \(\sqrt{d_k}\) is a scaling factor. Then, the relational attention features \(A_R\) of the image grids are added to \(F_d\) to obtain the augmented grid feature \(F_f\)

$$\begin{aligned} F_f = F_d + A_R. \end{aligned}$$
(8)
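The whole augmentation step of Eqs. (2)–(8) can be sketched as a single PyTorch module. This sketch assumes unit-sized grid cells (so the log width/height ratios in Eq. (2) vanish) and single-head attention; the hidden dimension \(d_h = 64\) and all module names are illustrative choices of ours, not values from the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalGridAugment(nn.Module):
    """Sketch of Eqs. (2)-(8): geometry-biased self-attention added onto F_d."""

    def __init__(self, d: int = 512, h: int = 7, w: int = 7, d_h: int = 64):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.W_h = nn.Linear(4, d_h)               # Eq. (3)
        self.W_g = nn.Linear(d_h, 1, bias=False)   # Eq. (4)
        self.scale = d ** -0.5
        # Pairwise 4-d relational vectors of Eq. (2); with unit grid cells the
        # width/height ratio terms are log(1) = 0.
        ys = torch.arange(h).float().repeat_interleave(w)
        xs = torch.arange(w).float().repeat(h)
        cx, cy = xs + 0.5, ys + 0.5                # grid centers
        dx = (cx[:, None] - cx[None, :]).abs().clamp(min=1e-3)
        dy = (cy[:, None] - cy[None, :]).abs().clamp(min=1e-3)
        R = torch.stack([dx.log(), dy.log(),
                         torch.zeros_like(dx), torch.zeros_like(dy)], dim=-1)
        self.register_buffer("R", R)               # (h*w, h*w, 4)

    def forward(self, F_d: torch.Tensor) -> torch.Tensor:
        # F_d: (batch, h*w, d) projected grid features from Eq. (1)
        R_f = F.relu(self.W_g(self.W_h(self.R))).squeeze(-1)          # (h*w, h*w)
        Q, K, V = self.q(F_d), self.k(F_d), self.v(F_d)
        A_R = torch.softmax(Q @ K.transpose(-1, -2) * self.scale + R_f, dim=-1) @ V
        return F_d + A_R                                              # F_f, Eq. (8)


# Usage: F_f = RelationalGridAugment()(F_d)
```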

For the text, there are two ways to characterize the word sequence depending on the subsequent modules. For the input of the Transformer-based multi-modal fusion module, a linear embedding layer combined with positional embedding is utilized to turn the word sequence into a d-dimensional word embedding \(S_e\in {\mathbb {R}}^{L\times d}\). For the input of the captioning module, we leverage a text encoder based on the BERT language model [26] for linguistic feature extraction. Figure 2 shows the structure of the text encoder in our captioning model, where the pre-trained BERT BERT(.) is employed to extract the textual features, and the masked multi-head attention MMHA(.) prevents information leakage from subsequent words. To train the text encoder, an additional feed-forward network is added after the masked multi-head attention module for sequence prediction.

During training, for a given word sequence \(S = (<bos>, s_1,s_2,...,s_L)\) of text T, the text encoder is fine-tuned with cross-entropy loss to predict the target sequence \(S^* = (s^*_1,s^*_2,...,s^*_L, <eos>)\). This operation can be expressed by

$$\begin{aligned} L_f = BERT(S) \end{aligned}$$
(9)
$$\begin{aligned} T_f = MMHA\left( (W_{s1}^T \cdot L_f + b_{s1}) + P_s\right) \end{aligned}$$
(10)
$$\begin{aligned} S^* = \log \left( softmax(W_{s2}^T \cdot T_f + b_{s2})\right) , \end{aligned}$$
(11)

where \(L_f\) is the language feature output by the pre-trained BERT and \(P_s\) is the positional embedding of the word sequence S. \(W_{s1}\) and \(W_{s2}\) are learnable weights matrices, and \(b_{s1}\) and \(b_{s2}\) are bias terms. In this paper, the output \(T_f\) of the tuned text encoder is used as the linguistic feature to represent the text T.
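A hedged sketch of this text encoder is given below, using the Hugging Face bert-base-uncased checkpoint as a stand-in for the BERT model of [26]; the projection dimensions, positional embedding, maximum sequence length, and vocabulary size are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel


class TextEncoder(nn.Module):
    """Sketch of Eqs. (9)-(11): BERT features, masked MHA, and a prediction head."""

    def __init__(self, d: int = 512, vocab_size: int = 10201,
                 n_heads: int = 8, max_len: int = 54):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.bert.config.hidden_size, d)   # W_s1, b_s1
        self.pos = nn.Embedding(max_len, d)                      # P_s
        self.mmha = nn.MultiheadAttention(d, n_heads, batch_first=True)
        self.head = nn.Linear(d, vocab_size)                     # W_s2, b_s2

    def forward(self, input_ids, attention_mask):
        # Pre-trained BERT features L_f (Eq. 9)
        L_f = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        L = input_ids.size(1)
        x = self.proj(L_f) + self.pos(torch.arange(L, device=input_ids.device))
        # Causal mask: True entries are positions a token may not attend to.
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool,
                                       device=input_ids.device), diagonal=1)
        T_f, _ = self.mmha(x, x, x, attn_mask=causal)            # Eq. (10)
        logits = self.head(T_f)                                  # Eq. (11), for fine-tuning
        return T_f, logits
```

During fine-tuning the logits are supervised with cross-entropy against the shifted target sequence, and at inference only \(T_f\) is kept as the linguistic feature.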

Feature queries

Fig. 3
figure 3

The structure of the feature interaction module. We first align the feature queries c and the augmented grid features \(F_f\) through a linear layer. Then, to capture the interdependencies of the inputs, we cross-attend c and \(F_f\) in both the spatial and channel domains to obtain the position and scale factors for information interaction. Finally, the feature queries c are projected into the parameter space to incorporate the joint embedding \(A_{FeiM}\)

Although the augmented grid features enable global representations to carry spatial information, they lose most of the local details. In this paper, we investigate employing a small number of learnable vectors as feature queries, which act as local visual signals to compensate for the inability of the augmented grid features to capture relevant visual details in both the spatial and channel domains.

The feature query is derived from the idea of the prompt, which is used to induce pre-trained language models to produce the next word tokens [39]. Earlier works [40, 41] attempted to develop hand-crafted prompts, which is quite time-consuming and requires domain-specific knowledge. Compared with hand-crafted prompts, automatic prompts are more flexible and efficient. For example, Jiang et al. [41] employed mining- and paraphrasing-based methods to generate high-quality and diverse candidate prompts, and the one with the highest accuracy on the training dataset was selected as the final prompt. Shin et al. [42] developed an automated prompt-generation method based on gradient search, which can be applied to a variety of tasks. Taking inspiration from CoOp [43], we resort to learnable tokens as feature queries to query task-relevant visual clues from the global grid features, as expressed in Eq. (12)

$$\begin{aligned} c=\{c_1, c_2,...,c_n\}\in {\mathbb {R}}^{n\times d}, \end{aligned}$$
(12)

where \(c_i\in {\mathbb {R}}^d\), and n is the number of tokens in the feature queries to be learned. Note that there are three groups of feature queries in FeiM that share the same parameters, as Fig. 3 shows. Two groups associated with augmented grid features pass through the feature interaction module to calculate the cross-attention maps. The other group is transformed to the parameter space via a linear function in the backend of the feature interaction module.
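Because the feature queries are nothing more than learnable parameters, they can be instantiated in a few lines; the sketch below uses our own variable names and an assumed initialization scale.

```python
import torch
import torch.nn as nn

# Feature queries of Eq. (12): n learnable d-dimensional tokens. The same
# parameter is read three times in the forward pass: two copies enter the
# feature interaction module, and one is projected into the parameter space.
n, d = 1, 512                                               # the paper's final setting uses n = 1
feature_queries = nn.Parameter(torch.randn(n, d) * 0.02)    # c = {c_1, ..., c_n}
```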

Feature interaction module

We introduce a feature interaction module to insert feature queries into the Transformer-based captioning architecture, which cross-attends the feature queries and the augmented grid visual features to learn the interconnections on the spatial and channel dimensions. Therefore, the refined visual features can be obtained for the subsequent Transformer-based multi-modal fusion module to output the multi-modal signals.

Inspired by [44], to achieve feature interaction between the feature queries and augmented grid visual representations, we present a feature interaction module consisting of the cross-attention mechanism and linear mapping networks, as Fig. 3 illustrates. In the feature interaction module, to enable the model to process feature queries with different semantic meanings, we align the feature queries c and augmented grid features \(F_f\) through a linear layer. Then, we cross-attend c and \(F_f\) in both the spatial and channel domains to obtain the position and scale factor for information interaction.

Recall that the self-attention and cross-attention mechanisms in the Transformer block involve the Q, K, and V embedding matrices. Q and K are employed to explore the interaction between tokens, determining the probability distribution over V. Based on the scaled dot-product similarity between the queries and the keys, the attention mechanism computes a weighted average of the value vectors.

In the spatial domain, the feature queries are regarded as the query \(Q_{sp}\in {\mathbb {R}}^{n\times d}\), while the key \(K\in {\mathbb {R}}^{(h\times w)\times d}\) and value \(V\in {\mathbb {R}}^{(h\times w)\times d}\) stem from augmented grid features \(F_f\)

$$\begin{aligned} Q_{sp} = c \cdot W_{sq}, K = F_f \cdot W_{sk}, V = F_f \cdot W_{sv}. \end{aligned}$$
(13)

Then, the cross-attention map \(Att_{sp}\in {\mathbb {R}}^{n\times (h\times w)}\) in the spatial domain is calculated with the softmax operation as

$$\begin{aligned} Att_{sp} = softmax\left( Q_{sp}\cdot K^T\right) . \end{aligned}$$
(14)

The scale vector \(S\in {\mathbb {R}}^{n\times 1}\) is obtained by a linear layer with the learnable weights \(W_{sp}\in {\mathbb {R}}^{(h\times w)\times 1}\)

$$\begin{aligned} S = Att_{sp} \cdot W_{sp}, \end{aligned}$$
(15)

where \(W_{sp}\in {\mathbb {R}}^{(h\times w)\times 1}\), \(W_{sq}\), \(W_{sk}\), and \(W_{sv}\in {\mathbb {R}}^{d\times d}\). Then, \(M_{FeiM}\) embedded with scale knowledge can be calculated via multiplication between the scale vector S and value V by

$$\begin{aligned} M_{FeiM} = S\cdot V. \end{aligned}$$
(16)

In the channel domain, a new query \(Q_{ch}\in {\mathbb {R}}^{(h\times w)\times d}\) corresponding to the feature queries is calculated using a linear function with parameter \(W_{cq}\), and we keep K and V the same as the ones in the spatial domain. Then, we cross-attend the feature queries c and augmented grid features \(F_f\) to calculate the channel-wise cross-attention map \(Att_{ch} \in {\mathbb {R}}^{d\times d}\)

$$\begin{aligned} Att_{ch} = softmax\left( Q_{ch}\cdot K^T\right) \end{aligned}$$
(17)
$$\begin{aligned} Q_{ch} = c^T \cdot W_{cq}, \quad K = F_f \cdot W_{sk}, \quad V = F_f \cdot W_{sv}. \end{aligned}$$
(18)

The updated value \(V'\) is obtained by multiplying \(Att_{ch}\) and V

$$\begin{aligned} V' = Att_{ch} \cdot V. \end{aligned}$$
(19)

Then, an adaptive average pooling AP(.) is applied to transform \(V'\) into a position vector \(P_{FeiM}\in {\mathbb {R}}^{1\times d}\), which can be formulated as

$$\begin{aligned} P_{FeiM} = AP(V'). \end{aligned}$$
(20)

The joint embedding \(A_{FeiM}\) incorporated with position and scale knowledge captured from the local query signals and augmented grid representations is obtained through the following calculation:

$$\begin{aligned} A_{FeiM} = P_{FeiM} + M_{FeiM}. \end{aligned}$$
(21)

Then, the feature queries c are projected into the parameter space and incorporated with \(A_{FeiM}\) via learnable factors \(\alpha \in {\mathbb {R}}^{1\times d}\) and \(\beta \in {\mathbb {R}}^{1\times d}\) in the back end of the feature interaction module, with \( O_{FeiM}\) as the final output of the proposed feature interaction module. This process can be defined as

$$\begin{aligned} O_{FeiM} = \alpha + (1+\beta ) \cdot LN(A_{FeiM}) \end{aligned}$$
(22)
$$\begin{aligned} \alpha = W_{\alpha }^T \cdot c + b_{\alpha }, \quad \beta = W_{\beta }^T \cdot c + b_{\beta } \end{aligned}$$
(23)
$$\begin{aligned} LN(A_{FeiM}) = \frac{ A_{FeiM} - E[A_{FeiM}]}{\sqrt{Var[A_{FeiM}] + \epsilon }} \cdot \gamma + \lambda , \end{aligned}$$
(24)

where \(LN(A_{FeiM})\) represents the layer normalization of \(A_{FeiM}\), and \(E[A_{FeiM}]\) and \(Var[A_{FeiM}]\) are the mean and variance of \(A_{FeiM}\), respectively. \(\epsilon \) is a small constant, and \(\gamma \) and \(\lambda \) are learnable affine transform parameters. The incorporation parameters \(\alpha \) and \(\beta \) for the feature queries are learned with linear functions. In practice, we stack N feature interaction modules in FeiM, and the output of the last module is used to reduce the modality gap by propagating the refined and expressive features from the feature queries and the augmented grid visual representation into the Transformer-based multi-modal fusion module.

This cross-attention-based feature interaction method models the interactions between augmented grid features (\(F_f\)) and feature queries (c) to obtain the refined and expressive visual representation (\(A_{FeiM}\)), which absorbs the information from both the global visual features and the feature queries with local properties, thus enabling the captioning model to generate accurate and high-quality captions. Moreover, the feature interaction module is plug-and-play and can be easily and flexibly used in other captioning models.
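The following PyTorch sketch assembles one feature interaction block for the paper's final setting of a single query token (n = 1). Since Eqs. (17)–(19) leave some tensor shapes ambiguous, the channel path below follows one coherent reading of them (an outer product between the channel-projected query and the spatially pooled key); all module and variable names are ours, not the authors'.

```python
import torch
import torch.nn as nn


class FeatureInteractionModule(nn.Module):
    """Sketch of one feature interaction block (Eqs. 13-24), assuming n = 1."""

    def __init__(self, d: int = 512, num_grids: int = 49):
        super().__init__()
        self.align = nn.Linear(d, d)                       # aligns c with F_f
        self.W_sq = nn.Linear(d, d, bias=False)            # Eq. (13)
        self.W_sk = nn.Linear(d, d, bias=False)
        self.W_sv = nn.Linear(d, d, bias=False)
        self.W_sp = nn.Linear(num_grids, 1, bias=False)    # Eq. (15)
        self.W_cq = nn.Linear(d, d, bias=False)            # Eq. (18)
        self.to_alpha = nn.Linear(d, d)                    # Eq. (23)
        self.to_beta = nn.Linear(d, d)
        self.norm = nn.LayerNorm(d)                        # LN in Eq. (24)

    def forward(self, c: torch.Tensor, F_f: torch.Tensor) -> torch.Tensor:
        # c: (B, 1, d) feature query; F_f: (B, h*w, d) augmented grid features.
        c = self.align(c)
        K, V = self.W_sk(F_f), self.W_sv(F_f)

        # Spatial path (Eqs. 13-16): a scale factor from query-to-grid attention.
        Att_sp = torch.softmax(self.W_sq(c) @ K.transpose(-1, -2), dim=-1)  # (B, 1, h*w)
        S = self.W_sp(Att_sp)                                               # (B, 1, 1)
        M = S * V                                                           # (B, h*w, d)

        # Channel path (Eqs. 17-20): a d x d channel map, applied to V and pooled.
        Q_ch = self.W_cq(c)                                                 # (B, 1, d)
        Att_ch = torch.softmax(
            Q_ch.transpose(-1, -2) @ K.mean(dim=1, keepdim=True), dim=-1)   # (B, d, d)
        P = (V @ Att_ch).mean(dim=1, keepdim=True)                          # (B, 1, d)

        # Fusion (Eqs. 21-24): combine, normalize, modulate with alpha and beta.
        A = P + M
        alpha, beta = self.to_alpha(c), self.to_beta(c)
        return alpha + (1 + beta) * self.norm(A)                            # O_FeiM
```

In FeiM, N such blocks are stacked and the output of the last block is passed to the multi-modal fusion module.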

Transformer-based multi-modal fusion module

Transformer performs well in incorporating the knowledge from different modalities, because it can capture long-range dependencies in the data. Consequently, referring to \(M^2\) Transformer [30], we apply a Transformer-based module for multi-modal fusion. Typically, the module is an encoder–decoder framework with stacked encoder and decoder layers.

The encoder takes the output \(O_{FeiM}\) of the feature interaction module as its input to learn the high-level relationships between visual features via self-attention. The decoder integrates the output of the last encoder layer and the text embedding with position encoding via cross-attention, fusing the information from the visual and textual modalities and improving the understanding of vision and language. In FeiM, we take the output \(M_f\) of the Transformer decoder \(TMFM_{dec}(.)\) in the multi-modal fusion module and the output \(V_f\) of the last Transformer encoding layer \(TMFM_{enc}(.)\) as the multi-modal and visual signals, respectively

$$\begin{aligned} V_f = TMFM_{enc}\left( O_{FeiM}\right) \end{aligned}$$
(25)
$$\begin{aligned} M_f = TMFM_{dec}\left( S_e+P_s+V_f\right) . \end{aligned}$$
(26)
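As a simplified stand-in, the data flow of Eqs. (25)–(26) can be sketched with vanilla PyTorch Transformer layers; the actual model uses the meshed-memory connectivity of the \(M^2\) Transformer, so this sketch only illustrates how the visual tokens and word embeddings are routed.

```python
import torch
import torch.nn as nn


class MultiModalFusion(nn.Module):
    """Vanilla Transformer stand-in for the M^2-based fusion module (Eqs. 25-26)."""

    def __init__(self, d: int = 512, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)

    def forward(self, O_FeiM: torch.Tensor, S_e: torch.Tensor):
        # O_FeiM: (B, h*w, d) refined visual features;
        # S_e: (B, L, d) word embeddings with positional encoding already added.
        V_f = self.encoder(O_FeiM)                                 # Eq. (25)
        L = S_e.size(1)
        causal = torch.full((L, L), float("-inf"),
                            device=S_e.device).triu(diagonal=1)    # mask future words
        M_f = self.decoder(S_e, V_f, tgt_mask=causal)              # Eq. (26)
        return V_f, M_f
```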

Captioning module

Typically, the hidden state of the current word is generated by incorporating visual information under the guidance of linguistic representations. However, the contributions of the vision and language signals to the currently generated token are entangled, which can lead to inconsistency between the generated word and the signal that should receive more attention. Therefore, we apply adaptive attention in the captioning module to address this problem by simultaneously attending to the visual and linguistic signals. The captioning module transforms the visual \(V_f\), linguistic \(T_f\), and multi-modal \(M_f\) signals into the adaptive attention score \(S_{aam}\) for word generation, which can be expressed by

$$\begin{aligned} Q_{aam}^{(i,t)} = M_f \cdot W_{aq}^i, \quad K_{aam}^{(i,t)} = C_f \cdot W_{ak}^i, \quad V_{aam}^{(i,t)} = C_f \cdot W_{av}^i \end{aligned}$$
(27)
$$\begin{aligned} C_f = \left[ V_f;T_f^t\right] , \quad T_f = \left[ T_f^1, T_f^2,...,T_f^m\right] \end{aligned}$$
(28)
$$\begin{aligned} S_{aam} = \left[ Aam^1;...;Aam^h\right] \cdot W_{aam}. \end{aligned}$$
(29)

The attention weights \(Aam^i\) of the word sequence for the ith head is written as

$$\begin{aligned} Aam^i = \left[ Aam^{(i,1)};...;Aam^{(i,m)}\right] . \end{aligned}$$
(30)

\(Aam^{(i,t)}\) is the attention weights of the tth time step for the ith attention head

$$\begin{aligned} Aam^{(i,t)} = softmax\left[ Q_{aam}^{(i,t)} \cdot \left( K_{aam}^{(i,t)} \right) ^T \right] \cdot V_{aam}^{(i,t)}, \end{aligned}$$
(31)

where \(Q_{aam}^{(i, t)}\) is the query of the ith head in the multi-head attention for the tth time step, obtained from the multi-modal signal \(M_f\) with parameter \(W_{aq}^i\). \(K_{aam}^{(i, t)}\) and \(V_{aam}^{(i, t)}\) are the key and value of the ith head in the multi-head attention for the tth time step, respectively, both derived from the combination \(C_f\) of visual and linguistic representations. [; ] refers to the concatenation operation, and \(T_f^t\) is the linguistic representation in the tth time step.
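For illustration, a single-head sketch of the adaptive attention of Eqs. (27)–(31) is given below (the paper uses multi-head attention, and the dot product is left unscaled to match Eq. (31)); the class and parameter names are ours.

```python
import torch
import torch.nn as nn


class AdaptiveAttention(nn.Module):
    """Single-head sketch: the multi-modal signal queries [visual; language_t]."""

    def __init__(self, d: int = 512):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)    # W_aq
        self.W_k = nn.Linear(d, d, bias=False)    # W_ak
        self.W_v = nn.Linear(d, d, bias=False)    # W_av
        self.W_out = nn.Linear(d, d)              # W_aam

    def forward(self, M_f, V_f, T_f):
        # M_f: (B, L, d) multi-modal; V_f: (B, h*w, d) visual; T_f: (B, L, d) language.
        outputs = []
        for t in range(M_f.size(1)):                            # one step per word
            C_f = torch.cat([V_f, T_f[:, t:t + 1]], dim=1)      # Eq. (28)
            Q = self.W_q(M_f[:, t:t + 1])                       # (B, 1, d)
            K, V = self.W_k(C_f), self.W_v(C_f)
            att = torch.softmax(Q @ K.transpose(-1, -2), dim=-1) @ V   # Eq. (31)
            outputs.append(att)
        return self.W_out(torch.cat(outputs, dim=1))            # S_aam, (B, L, d)
```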

Training and inference

We optimize the proposed captioning model FeiM following the standard practice. First, cross-entropy loss [45, 46] is employed to calculate the probability distribution difference between the target word sequences and the predicted ones. The calculation of the cross-entropy loss can be written as

$$\begin{aligned} L_{xe}(\theta )=-\sum _{t=1}^{T}\log \left( P(y^*_t|y^*_{{1:t-1}})\right) , \end{aligned}$$
(32)

where \(\theta \) denotes the parameters of FeiM, \(y^*= \{y^*_1,..., y^*_T\}\) is the ground-truth caption, and the probability distribution P is given by the language model. Although the cross-entropy loss maximizes the probability of each word in the target caption at every training time step, it operates at the word level and ignores the long-range dependencies between the generated words.

The CIDEr metric [51] is commonly leveraged to evaluate the informativeness and fluency of sentences; thus, it is taken as the reward, and FeiM is further optimized with self-critical sequence training [23]. This sentence-level training strategy remedies the flaw of the word-level cross-entropy loss. Following [30], we use the average reward of the sampled captions, instead of the reward of a greedily decoded caption, as the reward baseline. The loss gradient is defined as follows:

$$\begin{aligned} \nabla _{\theta } L_{RL}(\theta )&=-\frac{1}{n}\sum _{i=1}^{n}\left( re(y^i)-bs\right) \nabla _{\theta } \log \left( P(y^i)\right) \\ bs&=\frac{1}{n}\sum _{i=1}^{n}re(y^i), \end{aligned}$$
(33)

where \(y^i\) is the ith sampled sentence, bs is the reward baseline of the sentence, and re(.) is the reward function depending on the CIDEr score.
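A minimal sketch of this sampled-baseline objective is shown below; the CIDEr reward function and the caption sampler are assumed to be provided elsewhere, and the variable names are ours.

```python
import torch


def scst_loss(log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Self-critical loss of Eq. (33) with a mean-reward baseline (as in [30]).

    log_probs: (n,) sum of per-word log-probabilities of each sampled caption.
    rewards:   (n,) CIDEr score of each sampled caption.
    """
    baseline = rewards.mean()                                   # bs in Eq. (33)
    # Minimizing this surrogate yields the policy gradient of Eq. (33).
    return -((rewards - baseline).detach() * log_probs).mean()
```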

Experiment

To validate the effectiveness of the proposed approach FeiM, we first evaluate it on the MSCOCO benchmark. Then, a series of ablation studies is performed to confirm the contribution of each designed part of FeiM.

Dataset

MSCOCO [47] is one of the most widely used large-scale datasets for image captioning. The dataset comprises more than 120,000 images with about five different English annotations each, including 82,783 images for training, 40,504 for validation, and 40,775 for testing. Because the annotations of the testing images are not public, the training and validation images are usually re-divided according to the Karpathy split [2]: 5000 images are used for testing, 5000 for validation, and the remainder for training. In our experiments, we follow the Karpathy split for a fair comparison with existing captioning approaches. In addition, the annotated captions are converted to lowercase, all punctuation marks and words that appear fewer than five times are removed, and a vocabulary of 10,201 words is formed.
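A rough sketch of this preprocessing is given below; the special tokens other than bos/eos and the simple regular expression tokenizer are our own assumptions, not details taken from the paper.

```python
import re
from collections import Counter


def build_vocab(captions, min_count: int = 5) -> dict:
    """Lowercase, strip punctuation, keep words seen at least min_count times."""
    counts = Counter()
    for cap in captions:
        tokens = re.sub(r"[^\w\s]", "", cap.lower()).split()
        counts.update(tokens)
    vocab = {"<pad>": 0, "<bos>": 1, "<eos>": 2, "<unk>": 3}   # special tokens (assumed)
    for word in sorted(w for w, c in counts.items() if c >= min_count):
        vocab[word] = len(vocab)
    return vocab


# Example: build_vocab(["A man riding a horse.", "A man rides a horse."])
```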

Evaluation metrics

We adhere to the standard evaluation protocol, where the BLEU (B) [48], METEOR (M) [49], ROUGE (R) [50], CIDEr (C) [51], and SPICE (S) [64] metrics are adopted to measure image captioning performance. Specifically, BLEU (including BLEU-1, BLEU-2, BLEU-3, and BLEU-4) enables quick and simple automatic evaluation. METEOR analyzes word order in depth and is more aligned with human judgments. ROUGE is extensively applied to examine the sufficiency and authenticity of the generated sentence. CIDEr is based on consensus and attaches more significance to descriptions of salient image content. SPICE analyzes the semantic content of the predicted captions to determine generation quality. In general, the higher these evaluation metrics, the better the generated language descriptions.

Experimental setup

We mainly follow the implementation of [13, 14] and [30]. Visual features with a grid size of \(7\times 7\) \((h=w=7)\) and a dimension of 2048 are first extracted for the input images, and then, following [14], we obtain the 512-dimensional augmented grid representations. In our experiments, the number of learnable tokens in the feature queries is set to 1, and the number of feature interaction modules is 4. In addition, we stack 4 Transformer blocks in FeiM, and the dimensionality d of each Transformer layer is 512 with 8 attention heads. We employ the Adam optimizer [65], dropout with probability 0.1, a batch size of 50, and a beam size of 5 to train FeiM. Evolutionary algorithms [66] also help us determine some hyperparameters.

For cross-entropy training, we apply the epoch decay schedule of [14], as defined in Eq. (34), with base learning rate \(lr_b = 1\times 10^{-4}\); the learning rate lr changes with the current training epoch number \(n_e\). For self-critical sequence training based on reinforcement learning, we use the CIDEr score as the reward and a fixed learning rate of \(5\times 10^{-6}\)

$$\begin{aligned} lr = \left\{ \begin{aligned}&\frac{n_e\times lr_b}{4}, \quad n_e \le 3 \\&lr_b, \quad 3< n_e \le 10 \\&\frac{lr_b}{5}, \quad 10< n_e \le 12 \\&\frac{lr_b}{5^2}, \quad \text {otherwise}. \\ \end{aligned} \right. \end{aligned}$$
(34)
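The schedule of Eq. (34) amounts to a short piecewise function of the epoch number; the sketch below simply mirrors the four cases.

```python
def xe_learning_rate(n_e: int, lr_b: float = 1e-4) -> float:
    """Epoch-based learning-rate decay of Eq. (34) for cross-entropy training."""
    if n_e <= 3:
        return n_e * lr_b / 4
    if n_e <= 10:
        return lr_b
    if n_e <= 12:
        return lr_b / 5
    return lr_b / 25
```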

We start optimizing FeiM with the cross-entropy loss and switch to self-critical sequence training once the CIDEr metric declines for 5 consecutive epochs. When the CIDEr metric falls for 5 consecutive epochs during self-critical sequence training, the entire training process is finished.

Quantitative analysis

Comparisons with state-of-the-arts

The comparison methods include Up-Down [33], GCN-LSTM [52], SGAE [53], AoANet [54], Transformer [32], X-Transformer [31], and RSTNet [14]. Specifically, the Up-Down [33] approach utilized two LSTM layers combined with bottom-up and top-down attention mechanisms to boost performance. GCN-LSTM [52] leveraged a graph convolutional neural network to obtain visual representations from the spatial and semantic relationships of scene graphs, while SGAE [53] used a scene graph auto-encoder. AoANet [54] extended traditional attention methods with Attention on Attention (AoA) to further analyze the relevance between the queries and the attention results. Among Transformer-based captioning models, Sharma et al. [32] demonstrated that a captioning model based on the Transformer architecture for sequence modeling was able to achieve the best performance. Furthermore, X-Transformer [31] applied bilinear pooling to the attention module of a basic Transformer, and Zhang et al. [14] developed a relationship-sensitive Transformer-based image captioning model (RSTNet) to fuse spatial information into the visual representations.

Table 1 The performance comparisons with state-of-the-art captioning approaches on the MSCOCO Karpathy test split under the cross-entropy training
Table 2 The performance comparisons with state-of-the-art captioning approaches on the MSCOCO Karpathy test split under the self-critical sequence training
Table 3 The performance comparisons with state-of-the-art captioning approaches that leverage the same ResNeXt101 grid-based visual features on the MSCOCO Karpathy test split

The performance comparisons between FeiM and a series of strong image captioning baselines on the MSCOCO Karpathy split under cross-entropy training are reported in Table 1. Under this setting, FeiM outperforms all the other approaches on all metrics except METEOR. We then compare FeiM with more state-of-the-art models on the MSCOCO Karpathy test split under self-critical sequence training, where the expressive information embedded in the grid representations is exploited with CIDEr score optimization, as summarized in Table 2. Our approach outperforms all the other advanced approaches across all the evaluation metrics. It should be noted that FeiM achieves a CIDEr score of 135.2, surpassing the previous Transformer-based approaches that employ the cross-attention mechanism to strengthen the interaction between vision and language. Specifically, FeiM outperforms the \(M^2\) Transformer [30], X-Transformer [31], and RSTNet [14] by 4.0%, 2.4%, and 1.9% in CIDEr score, respectively. These results demonstrate the benefit of incorporating the feature queries and feature interaction module into the Transformer-based captioning model. FeiM is also superior to RNN-based captioning models such as Up-Down [33] and GCN-LSTM [52], improving the CIDEr metric by 15.1% and 7.6%, respectively.

Notably, FeiM not only comprehensively outperforms its counterpart RSTNet with the same grid features, but also outperforms the \(M^2\) Transformer and X-Transformer, which use fine-grained region features, by a decent margin across all metrics. These comparisons indicate the effectiveness of FeiM for caption generation. We attribute the advantage of our grid-level Transformer-based model over region-level ones to two aspects. On the one hand, region-based visual features are usually extracted from the prominent parts of the image; thus, meaningful information about the overall context is missing from region features, which fail to capture robust global representations and accurate relationships between regions. On the other hand, although global grid representations are employed in FeiM, we incorporate a small number of learnable feature queries and the feature interaction module to capture more expressive visual concepts.

Comparisons with state-of-the-arts on the same ResNeXt101 grid features

We compare FeiM against other state-of-the-art approaches on the same ResNeXt101 grid features to exclude the impact of different visual representations. As illustrated in Table 3, FeiM still outperforms the state-of-the-art approaches on most evaluation metrics, which shows its strong ability to understand the image and generate high-quality descriptions.

Fig. 4
figure 4

Examples of image captioning results generated by the proposed FeiM and the corresponding ground-truth captions

Qualitative analysis

Figure 4 shows several examples of the original image, three human-annotated ground-truth captions (GT), and the captions produced by our proposed approach FeiM. Intuitively, the descriptions generated by FeiM are very close to the human-annotated captions, demonstrating the effectiveness and superiority of FeiM from a qualitative perspective.

Table 4 Ablation studies on the effect of the feature queries and feature interaction module

Ablative analysis

Table 5 Ablation studies on the number of the feature queries tokens
Table 6 Ablation studies on the stacked number of the feature interaction module

To explore the impact of the feature queries, the feature interaction module, and the best parameters for FeiM, we perform a series of ablation studies to assess the performance of our approach on the MSCOCO Karpathy test split.

Fig. 5
figure 5

Ablation studies on the number of the attention heads

Effect of the feature queries and feature interaction module

We begin by varying the components of FeiM to investigate the impact of the feature queries and the feature interaction module. A captioning model with the same configuration but without the feature queries and feature interaction module is regarded as the baseline. As shown in Table 4, in setting (a) we insert only the feature interaction module into the baseline captioning model; compared with the baseline, setting (a) shows that the feature interaction module is beneficial for generating better descriptions. In setting (b), we remove the feature interaction module and add only the feature queries to the baseline model. Compared with setting (a), setting (b) demonstrates that the feature queries play a more powerful role in the Transformer-based captioning model than the feature interaction module. Setting (c) is our full model FeiM, which achieves significantly better performance than its variants, with gains of 2.3% and 9.6% on the BLEU-1 and CIDEr scores over the baseline. This indicates that the feature queries and the feature interaction module reinforce each other to enable better captioning performance.

In general, when adding different components, the proposed FeiM presents consistent improvements over the baseline, suggesting that both the feature queries and feature interaction module are effective for achieving superior captioning generation.

Effect of the number of the feature queries tokens

We then conduct experiments with varying numbers of learnable tokens to explore the best length of the feature queries. Due to limited computing resources, we set the number of feature query tokens to at most the grid size \(7\times 7=49\), with a minimum of 1. As reported in Table 5, the captioning performance does not keep improving as the number of feature query tokens increases, which is also observed in [62] and [63]. In particular, when the number of feature query tokens is 49, noisy information is more likely to be absorbed into the visual representations, causing a performance drop. Therefore, we set the number of learnable tokens in the feature queries to 1 in our experiments.

Effect of the stacked number of the feature interaction module

The feature interaction module is stacked to propagate expressive messages from the local query signals and the global augmented grid visual representation, incorporating the feature queries into the aligned hidden space along the spatial and channel dimensions via the learnable factors \(\alpha \) and \(\beta \). To study its influence, we change the stacked number N of feature interaction modules. Again due to limited computing resources, we only perform six sets of experiments on the number of stacked modules. From Table 6, we observe that all the evaluation metrics gradually increase as N grows from 1 to 4, and then drop when N increases to 5 and 6. The main reason may be that an appropriate number of feature interaction modules encourages the expression of the grid visual features, whereas too many lead to overfitting and degrade the captioning performance. Therefore, we stack four feature interaction modules in FeiM to make the most of the grid features.

Effect of the number of the attention heads and transformer blocks in our FeiM

We evaluate FeiM with different numbers of attention heads and Transformer blocks. We first fix the number of Transformer blocks to determine the best number of attention heads; the results are shown in Fig. 5. As the number of attention heads increases from 4 to 12, the performance of FeiM first improves and then decreases, and FeiM obtains the best results on all metrics with 8 attention heads. To choose the best number of Transformer blocks, we carry out a series of experiments with the number of Transformer blocks ranging from 1 to 6. From Fig. 6, we observe that captioning performance gradually improves as the number of Transformer blocks increases from 1 to 4, but too many Transformer blocks lead to performance degradation. Since stacking 4 Transformer blocks shows superior performance on the CIDEr metric, we stack 4 Transformer blocks in FeiM unless otherwise specified.

Fig. 6
figure 6

Ablation studies on the number of Transformer blocks

Conclusion

In this paper, we design feature queries and a feature interaction module for Transformer-based image captioning, which exploit feature interaction and fusion to improve the understanding of the image with grid features. Specifically, we introduce a small number of learnable tokens as feature queries to leverage the expressive information embedded in the grid-based visual representations. The feature queries serve as local signals that query relevant visual clues from the global grid features and promote the interaction of different modalities. Additionally, we present a feature interaction module in which the feature queries and the augmented grid features interact along the spatial and channel dimensions. By incorporating the feature queries and the feature interaction module into a Transformer-based captioning model, FeiM achieves results on the MSCOCO dataset that demonstrate its effectiveness and superiority over current cutting-edge models. We also carry out extensive ablation experiments to further validate the impact of the two designed components, the number of feature query tokens, and the stacked number of feature interaction modules. The design of the feature queries and feature interaction module is simple and plug-and-play, and can be easily applied to other visual tasks as well.