Towards Explanatory Interactive Image Captioning Using Top-Down and Bottom-Up Features, Beam Search and Re-ranking

Image captioning is a challenging multimodal task. Significant improvements could be obtained by deep learning. Yet, captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim at improving the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting their attention mechanism using additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and the low-level features obtained from the object specific salient regions of the input image. We embed the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance, while it provides explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state-of-the-art in image captioning.


Introduction
The goal of image captioning is to automatically generate descriptions for a given image, i.e., to capture the relationship between the objects present in the image, generate natural language expressions (see an example in Fig. 1), and judge the quality of the generated descriptions. The problem, therefore, is seemingly more difficult than popular computer vision tasks, e.g., object detection or segmentation, where the emphasis is solely on identifying the different entities present in the image. With recent advancements in training neural networks [26], the availability of GPU computing power, and large datasets [31], neural network driven approaches are the most popular choice for handling the caption generation problem. However, humans are still better at interpreting images and constructing useful and meaningful captions, with or without a particular application context, which renders it an interesting application for interactive machine learning (IML) [10,43] and explainable artificial intelligence (XAI) [11]. Promising technologies include active learning [41], which was already applied for automating the assessment of image captioning [4,5], IML methods to incrementally train, e.g., re-ranking models for selecting the best caption candidate similar to [3,39], and XAI methods that can improve the user's understanding of a model and, eventually, enable the user to provide better feedback for a second IML process.
In this work, we adopt and extend the architecture proposed in [49] since it is the most cited seminal work in the area of image captioning. It introduced the encoder-decoder architecture and the visual attention mechanism for image captioning in a simple yet powerful approach. Compared to [49], other captioning approaches are task specific, more complex and derivative in nature. Moreover, we believe the simplicity of the Show, Attend and Tell model, compared to its counterparts, would help to add explainability to the captioning task. This approach uses a transparent attention mechanism which, in the domain of sequence-to-sequence tasks, translates to the ability to dynamically select important features in the image instead of maintaining one feature representation for the image at all times. However, the selected image features on which the attention mechanism works are obtained from a deep convolutional encoder which mostly captures high-level image abstractions and not the low-level object specific details. These high-level features are top-down in nature since their primary purpose is to provide context for the decoder in producing the next word based on the partially generated caption. In doing so, these features often fail to attend to and provide direct visual cues, i.e., specific object details, to the decoder. We explore the possibilities to make this attention mechanism more effective, in particular towards novel IML and XAI approaches, by implementing a combination of top-down image features and bottom-up object specific details.
We discuss a novel augmentation of the attention mechanism in [49] with bottom-up features, in terms of localization maps encoded in the feature space obtained by the deep convolutional encoder (see Fig. 2). For each input image, we embed the content of a constant number of localized bounding boxes from a pre-trained Mask R-CNN model [13] for augmenting the attention mechanism from [49]. We use Resnet-101 [14] pre-trained on the Imagenet dataset [7] for extracting fixed size feature vectors per bounding box. The resulting set of vectors represents object specific salient regions of the input image.
Further, we compute visual attention on the joint embedding space formed by the union of high-level features obtained from the encoder of the caption generator and the low-level features obtained from our object specific encoding of salient regions of the input image. We show that with our approach we obtain better Bleu scores compared to the original scores in [49], specifically, we obtain higher scores in Bleu-2, Bleu-3, Bleu-4 metrics. In a separate experiment, we use beam search to expand the search space of the associated natural language generation problem [32]. Beam search is a greedy tree search algorithm which sorts possible language generations based on a heuristic and keeps the best k options with k being the beam width. We show that effective re-ranking of caption candidates from a beam search decoder has a huge potential for improving results. Further, we discuss how interactive model improvements and explainability can be obtained.
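The beam search procedure described above can be illustrated with a minimal sketch. It is not the decoder used in this work: the function `beam_search`, the callback `step_fn` and the toy probability table `toy_model` are hypothetical names introduced only for illustration, and the heuristic is assumed to be the cumulative log-probability of a partial caption.

```python
import math

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """Return up to beam_width hypotheses as (tokens, log_prob), best first.

    step_fn(prefix) is an assumed interface: it maps a token prefix to a
    dict {next_token: probability}.
    """
    beams = [([start_token], 0.0)]  # partial hypotheses with their scores
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, prob in step_fn(tokens).items():
                if prob > 0.0:
                    candidates.append((tokens + [token], score + math.log(prob)))
        if not candidates:
            break
        # Sort all expansions by the heuristic and keep the best k.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end_token:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len, if any
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]

def toy_model(prefix):
    """Hypothetical next-word distribution that depends on the last token only."""
    table = {
        "<s>": {"a": 0.6, "the": 0.4},
        "a": {"dog": 0.5, "cat": 0.5},
        "the": {"dog": 0.9, "cat": 0.1},
        "dog": {"</s>": 1.0},
        "cat": {"</s>": 1.0},
    }
    return table[prefix[-1]]

for tokens, score in beam_search(toy_model, "<s>", "</s>", beam_width=2):
    print(" ".join(tokens), round(math.exp(score), 2))
```

In this toy example a beam width of 1 (greedy decoding) commits to "a" in the first step, whereas a beam width of 2 keeps "the" alive and ends with the more probable sequence.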
To summarize our contributions: first, we implement a novel image captioning architecture that augments the visual attention mechanism introduced by [49]. We show that our approach achieves comparable or better results on the image captioning task compared to Show, Attend and Tell, while it offers more explanatory cues for XAI at the same time. Second, we show the potential of an implemented beam search generation process and how it improves the resulting captions by interactively re-ranking the candidates. Third, we discuss how our architecture can be used for novel IML and XAI approaches. We describe several directions of future work, concerning the bottom-up features and beam search results and the challenges towards explanatory interactive image captioning.

Related Work
Recently there has been renewed interest in the problem of image captioning in spite of considerable focus in the recent past on language grounding in perceptual data [12,33,38]. This is due to a wider push to investigate the intersection between vision and language. In this work, the caption generation method employs the neural framework proposed in [6] where instead of translating text from one language to another, an image is translated into a caption or sentence that describes it. In general, the image caption generator is a neural architecture consisting of a deep convolutional network [17] and a recurrent network [16]. Kiros et al. [24,25] are credited with the first attempt in this direction, where the authors develop a joint multimodal embedding space and provide a natural way of performing both ranking and generation. As a slight modification, the works of [9,47] employ LSTMs (Long Short-Term Memory) instead of regular recurrent neural networks. Karpathy et al. [22], on the other hand, advocate learning a joint embedding space for both ranking and generation. As a matter of fact, their model learns to score sentence and image similarity as a function of convolutional network object detection with outputs of a bidirectional RNN (Recurrent Neural Network).
The caption generation problem is also a structured learning problem since both the input and output of this problem have a rich structure. That is, the image of a natural scene is made up of multiple random variables, such as the position of objects and their interrelationships, and all of them have a rich joint distribution. Moreover, there needs to be an alignment between the output words of a caption and the spatial regions of the input image. So, to properly address the structured nature of this problem, we make use of an attention mechanism in our work. Hence, we have adopted the Show, Attend and Tell architecture by Xu et al. [49] which uses attention to generate the captions for images. The attention mechanism tries to learn the latent alignments between the objects in the image and output words of the caption or sentence from scratch. Thus, it learns to attend to the higher level dependencies between different entities present in the image. It is worthwhile to note that the use of an attention mechanism with neural networks is not entirely new. In fact, in the computer vision community there exist some works, such as [8,27], which employed attention with neural networks to handle different vision tasks.
In general, the attention mechanism operates on a grid of image features obtained from a layer of a convolutional neural network, where each feature represents a high-level abstraction of a region in the image, and provides a weighting for each spatial region. Thereby, a higher weight translates to more importance for the corresponding image region. However, it is often difficult to find the optimal number of image regions which should capture all the relevant details in the image. Additionally, the high-level image features may fail to capture the finer object specific details or low-level salient regions in the image. So in our approach, we try to augment the attention mechanism by combining low-level fine details in the image with the high-level image abstractions. Previously, only a couple of works [19,36] have tried to use salient image regions. The work in [19] utilizes a search technique proposed in [45] to identify salient image regions which are subsequently used in image captioning. Pedersoli et al. [36], on the other hand, use spatial transformer networks [18] or edge-based boxes [50] for generating image features which are processed using a model based on three bilinear pairwise interactions. In our work, for the purpose of generating object specific localized maps or salient regions, we utilize the Mask R-CNN [13], a close variant of the Faster R-CNN [37] technique. We extract the image regions inside the bounding boxes and embed them into the feature space learned by a pre-trained deep convolutional network.
Image caption generation, in addition to being an important task in computer vision, is also a major problem in the area of natural language generation which requires proper evaluation. The common criteria here include readability or fluency, which refer to the linguistic quality of the text, and also accuracy or relevance relative to the input which shows the natural language generating system's ability to satisfactorily reproduce content. In our evaluation we use standard metrics, such as Bleu [34], ROUGE [30], METEOR [2], CIDEr [46] and SPICE [1] which try to emulate human judgement.

Implementation
In this section, we describe our implementation of the neural encoder-decoder architecture for generating image captions based on [49] and our extension of the visual attention mechanism: we use bounding boxes from Mask R-CNN [13] to encode object specific bottom-up features which complement the currently used top-down representation. Further, we describe our beam search decoder and two heuristic approaches for re-ranking its generated caption candidates.

Image Caption Generation
For generating the image captions, we use our own implementation of the Show, Attend and Tell method [49] as depicted in Fig. 3, with several modifications for extensions. Xu et al. [49] suggested using a set of fixed-dimensional vectors from a lower convolutional layer of the CNN (Convolutional Neural Network) architecture instead of using a single fixed-dimensional vector to represent the image. This helps to maintain a fine-grained correspondence between the different portions of a 2D image represented through the corresponding vectors. With this the decoder becomes more powerful as it can focus selectively on different parts of an image during the generation process by selecting a subset of the feature vectors. The detailed operations of the LSTM-based decoder, used in [49] for generating the captions, are described through the following equations, given here in the form used in [49]:

$$\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix} T_{D+m+n,\,n} \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}$$

$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$

$$h_t = o_t \odot \tanh(c_t)$$

The variables $i_t$, $f_t$, $c_t$, $o_t$, $h_t$ denote the input gate, forget gate, memory, output gate and hidden state, respectively; $g_t$ is the candidate update written to the memory and $y_{t-1}$ is the previously generated word. $T$ represents a mapping of the form $f_{s,t}: \mathbb{R}^{s} \to \mathbb{R}^{t}$. Thus, $T_{D+m+n,\,n}$ is a mapping from $\mathbb{R}^{D+m+n}$ to $\mathbb{R}^{n}$. $\hat{z} \in \mathbb{R}^{D}$ denotes the context vector responsible for capturing the visual information related to a specific location in the input image. $E$ denotes the embedding matrix and has the dimension $m \times k$. The dimension of the embedding vector is given by $m$ while the dimension of the LSTM hidden state is denoted by $n$. Furthermore, $\sigma$ and $\odot$ represent the logistic sigmoid and element-wise multiplication, respectively. The model implementation and training details are as follows:
• We use the MSCOCO dataset [31] for training the model, adopting the data splits proposed in [21]: the training set contains 113,287 images with 5 corresponding captions each, and the validation and test sets contain 5000 images each with 5 ground truth captions per image.

Augmented Attention Mechanism
The visual attention mechanism used in image captioning models can be described as the expectation of an annotation function. In general, this expectation is computed over a set of image features and the previous history of the generation process. This form of attention works primarily with high-level abstractions captured by the convolutional network which may or may not include specific objects and salient regions in the image. We propose a strategy to enrich the present attention mechanism by incorporating object-specific localized maps from the region proposal network Mask R-CNN [13] (see Fig. 4 for examples). We represent an input image I as a set which includes a constant number of fixed-size feature vectors. Each feature vector represents the encoding of a bounding box detected by Mask R-CNN that is encoded using the Resnet-101 model. At every spatial location of an image, Mask R-CNN predicts an objectness score accompanied by refinement of anchor boxes of varying scales and aspect ratios which result in tighter bounding boxes. These bounding boxes are further refined using non-maximum suppression:
1. We extract the image regions inside the final bounding boxes and embed them into the feature space learned by Resnet-101 pre-trained on the Imagenet dataset.
2. We re-train Resnet-101 on MSCOCO images and set a high threshold for the classification probability for the regions to be selected.
3. In contrast to the original architecture in [49], we compute visual attention on the joint embedding space formed by the union of high-level features obtained from the encoder of the caption generator and the low-level features obtained from the object specific salient regions of the input image, i.e., the embedded bounding boxes.
Fig. 3 Neural caption generation mechanism based on [49] depicting the processing of a red double decker bus
The augmented attention mechanism is shown in Fig. 2. For every image we use 10 additional feature vectors of dimension 2048 to represent the salient regions. Thus, at every time-step, our attention model produces a mask over 206 spatial locations. This mask is applied to a set of image features and then the result is spatially averaged to produce a 2048 dimensional representation of the attended portion of the image. Most hyper-parameters of the training procedure stay the same. The initial learning rate for this model is 4 × 10 −4 which is annealed by a factor of 0.8 every three epochs. Further, we use batch size of 32. We evaluate the model at each epoch on the development set.
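As a rough illustration of attending over the joint feature set, the following sketch concatenates 196 top-down grid vectors (206 minus the 10 box vectors) with 10 bottom-up box vectors, computes a softmax mask over all 206 locations and returns the mask-weighted sum as context vector. Several points are assumptions made for brevity: `score_fn` stands in for the learned alignment model conditioned on the decoder state, the features are random, and the feature dimension is reduced from 2048 to 16 so that the pure-Python example runs quickly.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def augmented_soft_attention(grid_feats, box_feats, score_fn):
    """Attend over the union of grid (top-down) and box (bottom-up) features.

    score_fn is a placeholder for the learned alignment model; it maps one
    feature vector to an unnormalized relevance score.
    """
    feats = grid_feats + box_feats                      # 196 + 10 = 206 locations
    alpha = softmax([score_fn(f) for f in feats])       # attention mask
    dim = len(feats[0])
    # Expectation of the features under the attention mask.
    context = [sum(a * f[d] for a, f in zip(alpha, feats)) for d in range(dim)]
    return alpha, context

random.seed(0)
DIM = 16  # reduced for the demo; the text above uses 2048
grid_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(196)]
box_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(10)]
query = [random.gauss(0, 1) for _ in range(DIM)]  # stand-in for the decoder state

def dot_score(feature):
    return sum(q * x for q, x in zip(query, feature))

alpha, context = augmented_soft_attention(grid_feats, box_feats, dot_score)
print(len(alpha), len(context))
```

Because the box vectors enter the same softmax as the grid vectors, the weights on the last 10 positions of `alpha` can be read directly as the importance of each detected object at the current time-step.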

Beam Search and Re-ranking
Beam search [32] as a decoding technique allows for the generation of a more diverse set of caption candidates. A previous investigation [4,5] has shown that beam search is to be preferred over other techniques such as [28,29] for generating diverse captions. We use a beam width of k = 20 to generate caption candidates that can be re-ranked in an additional step. To estimate the potential improvement for our caption generation method through re-ranking, we compute the upper bound of Bleu scores using the scores of all generated candidates. Our long-term goal is to leverage the objects, or their respective embeddings, detected by Mask R-CNN for such a re-ranking. In this work, we implement and test two heuristic re-ranking methods that rely on the similarity between the generated captions and the corresponding object classes: we estimate the similarity using the Euclidean distance with (1) bag-of-words and (2) TF-IDF based text representations.
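The two heuristic re-ranking methods can be sketched as follows. This is an illustrative approximation rather than the evaluated implementation: the function names, the whitespace tokenization, the small stop-word list and the smoothed IDF computed over the candidates plus the object-class query are all assumptions that are not specified in the text.

```python
import math
from collections import Counter

# Assumed stop-word list; without it, frequent function words dominate the distance.
STOP_WORDS = {"a", "an", "the", "and", "on", "with", "of", "in"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def bow_vectors(docs, vocab):
    """Raw term counts per document."""
    return [[float(Counter(d)[w]) for w in vocab] for d in docs]

def tfidf_vectors(docs, vocab):
    """Term counts scaled by a smoothed inverse document frequency."""
    n = len(docs)
    idf = {w: math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1.0 for w in vocab}
    return [[Counter(d)[w] * idf[w] for w in vocab] for d in docs]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rerank(candidates, object_classes, use_tfidf=False):
    """Order captions by Euclidean distance to the detected object classes."""
    docs = [tokenize(c) for c in candidates] + [tokenize(" ".join(object_classes))]
    vocab = sorted({w for d in docs for w in d})
    vecs = tfidf_vectors(docs, vocab) if use_tfidf else bow_vectors(docs, vocab)
    query = vecs[-1]
    order = sorted(range(len(candidates)), key=lambda i: euclidean(vecs[i], query))
    return [candidates[i] for i in order]

candidates = [
    "a man riding a horse",
    "a dog and a cat on a couch",
    "a kitchen with a stove",
]
detected = ["dog", "cat", "couch"]  # hypothetical Mask R-CNN class labels
print(rerank(candidates, detected))
print(rerank(candidates, detected, use_tfidf=True))
```

Both representations only reward lexical overlap between a caption and the class labels, which is one plausible reason why such heuristics offer limited gains.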

Evaluation
In this section, we evaluate the image caption generation process, the extended attention mechanism, and the beam search and re-ranking approach. We compare the performance of our approach with and without beam search to the scores reported in [49]. Further, we investigate the potential improvement that can be achieved by the re-ranking of caption candidates of our beam search decoder. We compute a set of common metrics as dependent variables: Bleu, METEOR, ROUGE-L, CIDEr and SPICE, which primarily focus on the n-gram overlap between the generated and ground truth captions. To be more specific, we provide short descriptions for each metric.
Fig. 4 Object-specific salient regions highlighted with corresponding bounding boxes as bottom-up features
Bleu is an automatic metric for evaluating the quality of a machine generated text. Bleu scores are computed for individual machine generated sentences by directly comparing them with a set of good quality references or ground truth references. The score is always between 0 and 1 and indicates the similarity between the generated captions and the ground truth. So, a score of 0 indicates no overlap whereas 1 indicates complete overlap. Depending on the size of the n-grams we want to match between the candidate caption and the ground truth captions we have different Bleu scores, i.e., Bleu-1, Bleu-2, Bleu-3, Bleu-4.
METEOR is a metric for evaluating outputs from a machine translation system. The metric is based on the harmonic mean of unigram precision and recall, where recall is weighted higher than precision. METEOR uses features such as stemming and synonymy matching along with the standard exact word matching. ROUGE-L measures the longest matching sequence of words. An advantage of it is that it does not require consecutive matches but in-sequence matches that reflect sentence level order. One does not require a predefined n-gram length since it automatically includes longest in-sequence common n-grams. CIDEr denotes Consensus-based Image Description Evaluation. It measures the similarity of a generated sentence against a set of ground truth sentences composed by humans and shows high agreement with consensus as assessed by humans. SPICE stands for Semantic Propositional Image Caption Evaluation.
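To make the n-gram matching behind Bleu concrete, the following sketch computes a sentence-level, unsmoothed Bleu-N from clipped n-gram precisions and a brevity penalty. It is a simplified illustration and not the evaluation code used for the reported scores, which are typically computed at corpus level; the function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=4):
    """Sentence-level Bleu-max_n without smoothing (0 if any precision is 0)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (ties: shorter one).
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "a person rides a horse".split()
candidate = "a man is riding a brown horse".split()
print(bleu(candidate, [reference], max_n=1))  # unigram overlap only
print(bleu(candidate, [reference], max_n=4))  # no shared 4-gram
```

The second call returns zero because the two sentences share no 4-gram, although they describe the same scene; this is the kind of zero Bleu-4 case that arises for some beam search candidates.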
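The in-sequence matching used by ROUGE-L corresponds to the longest common subsequence (LCS) of candidate and reference. The sketch below computes the LCS length by dynamic programming and combines LCS-based precision and recall into an F-measure. The weighting parameter `beta` and its default of 1.0 are assumptions made here for illustration; evaluation toolkits may use a different value.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference, beta=1.0):
    """LCS-based F-measure; beta > 1 weights recall higher than precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(lcs_length(candidate, reference))
print(rouge_l(candidate, reference))
```

The words "the cat" and "on the mat" are matched in order even though they are interrupted by a differing word, which is why no n-gram length has to be fixed in advance.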
The upper bounds for our architecture are estimated by sorting the generated caption candidates from the beam search by their Bleu-1 to Bleu-4 scores, i.e., assuming we had access to a perfect re-ranking. For all tests, we use the MSCOCO dataset [31] using the data splits as described above.
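The idea of an upper bound under perfect re-ranking can be sketched as an oracle that, for every image, picks the best-scoring caption among the top-i beam candidates. The sketch makes simplifying assumptions: `unigram_precision` is a toy stand-in for a Bleu scorer, per-image scores are averaged rather than aggregated at corpus level, and the data are invented for illustration.

```python
def unigram_precision(candidate, references):
    """Toy scorer: fraction of candidate words that occur in any reference."""
    ref_tokens = {tok for ref in references for tok in ref.split()}
    tokens = candidate.split()
    return sum(tok in ref_tokens for tok in tokens) / len(tokens)

def oracle_upper_bound(dataset, score_fn, top_i):
    """Mean of the best achievable score among the first top_i candidates.

    dataset is a list of (beam_candidates, references) pairs, with candidates
    ordered as returned by the beam search decoder.
    """
    best = [max(score_fn(c, refs) for c in candidates[:top_i])
            for candidates, refs in dataset]
    return sum(best) / len(best)

# Invented example data: two images with three beam candidates each.
dataset = [
    (["a dog", "a cat sits", "the cat sits"], ["the cat sits"]),
    (["a red bus", "a bus", "a car"], ["a red bus on a street"]),
]
for i in (1, 2, 3):
    print(i, oracle_upper_bound(dataset, unigram_precision, top_i=i))
```

Since the maximum is taken over a growing candidate set, the oracle score cannot decrease as i increases.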
We hypothesize that our approach improves the caption generation process and, hence, outperforms the scores reported in Xu et al. [49].
Another hypothesis is that re-ranking of beam search candidates has a high potential for improving image captions and that our heuristic approaches support this conjecture.
Finally, we expect that our approach paves the way for novel IML and XAI methods that can be used to further improve the image captioning results. We qualitatively discuss this topic based on the results of this experiment. Table 1 shows the scores of the three approaches we evaluated. Our approach without beam search obtains higher scores for Bleu-2, Bleu-3 and Bleu-4, which measure bi-gram, tri-gram and 4-gram overlaps, than the baseline approach [49]. This is a significant improvement because the Bleu metric computation does not remove stop words and so higher scores should lead to more natural and pertinent generations. The Bleu-1 and METEOR scores are on par with the baseline approach. Additionally, we obtain high scores in ROUGE-L and CIDEr; unfortunately, we cannot compare these results with those from the baseline approach [49] since the original baseline does not report on these metrics.

Results
Naturally, the results for the top-1 captions (beam search approach) are worse than the scores for the version without beam search. In particular, the scores for Bleu-3 and Bleu-4 are significantly worse (see Sect. 5 for an extensive discussion about this).
In addition to these quantitative results, we visually inspect generated captions from our approach without beam search and our baseline implementation based on [49] (see Fig. 5). A third caption is shown, which was selected from those of the 20 beam search candidates that have a zero Bleu-4 score (beam candidate). Evidently, the Bleu metric, which the international leaderboard uses, does not work properly on its own gold standard. To summarize, all generated captions distinctly describe the objects and their interrelationships in the corresponding images in natural language text. Our approach correctly aligns the image concepts, i.e., the objects, with the output words in the generated captions. A more detailed qualitative analysis of generated captions can be found in the discussion in Sect. 5. Further examples from the beam search decoder are shown in the Appendix in Table 3, where we show all 20 generations for randomly selected images from the test set, along with the 5 ground truth captions.
The upper bounds for Bleu-1 to Bleu-4 metrics and the results from our heuristic re-ranking methods are reported in Table 2. The upper bounds are reported for top-i candidates from our beam search decoder with $i \in \{3, 5, 10, 20\}$. The re-ranking methods perform slightly better compared

Qualitative Analysis
Our approach without beam search outperforms the state-of-the-art method from [49] for several metrics (see Table 1). In particular, we achieve better Bleu scores with long n-grams, i.e., 3-grams and 4-grams, showing a better alignment to formulations in the ground truth captions. This latent alignment is important because neural caption generation is often regarded as translating an image into a natural language description. Together with the qualitative analysis (visual inspection) of generations, this shows that our architecture can effectively produce meaningful image captions. Results, as shown in Table 1 and compared to [49], suggest that particular localized information in conjunction with the high-level features obtained from the deep convolutional encoder improves the correspondence resolution problem (i.e., image and word entity alignment) at the heart of this multimodal task. The results underline the positive influence of bottom-up features (or object specific localized maps) for the image captioning task; they also deliver explainable features. We note that the application of Mask R-CNN [13] for obtaining the localized maps or salient regions in our work is an important step towards better exposition of the object features involved in the caption generation process compared to previous approaches, including [49], which only use high-level image abstractions, i.e., the top-down features. The specific object masks in addition to the bounding boxes provide important explanatory cues for the generated text describing the corresponding image, as is shown in Fig. 6 where bounding boxes help localize the white refrigerator and stove present in the generated caption. Similarly, for the other image in Fig. 6 the context for the generated caption is provided by bounding boxes localizing the man, woman and wine glasses. We believe the proposed approach is a good step in the direction of infusing image caption generation with explainable AI.
In addition, we use our beam search decoder for generating a more diverse set of caption candidates. Averaging over all test images, we computed the upper bounds for all Bleu metrics. For i = 20, i.e., including all caption candidates from the beam search with beam width k = 20, all Bleu scores potentially outperform the state-of-the-art method and our method without beam search by 0.196 (average over all Bleu metrics). We find that the scores increase with higher values of k, which is probably caused by a higher recall due to more captions from which the Bleu score can be selected. Of course, this gain in Bleu scores motivates an optimal method for selecting from the 20 candidates and indicates a high potential of re-ranking methods. The results from our evaluation show that simple heuristics-based re-ranking methods do not improve the caption selection process considerably. This leaves the challenge to future research. Without a selection or re-ranking, the top-1 candidate from beam search yields worse results compared to all other methods. This phenomenon is well understood: beam search expands the search space for the natural language generation, which does not guarantee that the first generation has the best overlap with the ground truth captions. This overlap, however, is what the Bleu metric measures, and it can therefore affect the corresponding scores. The examples in Figs. 5 and 7 show semantically meaningful and fluent generations originating from the beam search with zero Bleu-4 score, which further demonstrates some shortcomings of the Bleu metric in this regard. More examples can be seen in the Appendix in Table 3. Compared to greedy decoding, which is locally optimal, candidates obtained from beam search may contain different words compared to the corresponding ground truths, which can dramatically harm Bleu scores since they measure only the overlap with the ground truths without taking into account the semantic meaning of the generations.
However, neither beam search alone nor re-ranking with the two heuristic methods (based on bag-of-words and TF-IDF) improves the overall scores. Still, we believe our approach, with its use of Mask R-CNN for producing bottom-up features, provides new opportunities for making image captioning an IML task, beyond only generating captions which achieve higher metric scores.

Towards Interactive and Explanatory Captioning
Based on our findings, we believe that our image captioning system with its augmented attention mechanism and the beam search decoder has the potential to facilitate interactive improvement of the captioning system and to improve the explainability of the caption generation process. In the following, we describe opportunities and challenges for future research in this direction.

Interactive Re-ranking
Effective re-ranking can leverage the inherent potential of the beam search decoder to improve generated image captions. Besides the output of diverse image captions, our architecture yields several opportunities for user interaction such as interactive training of a caption re-ranking model: the additional bottom-up features can be used as input to a re-ranking model that learns from continuous user feedback to score the 20 generated caption candidates. Corrective feedback to the model can be realized by selecting relevant areas of the image that are important for generating the caption, based on the Mask R-CNN bounding boxes. This enables users to easily change the focus for the generation process, e.g., if the model wrongly puts emphasis on an irrelevant object. The challenge lies in the development of interactive machine learning (IML) mechanisms that facilitate efficient and effective model training, i.e., model training that requires low annotation effort, is scalable and, yet, converges to a model that improves image captioning. Active learning can be used to reduce the annotation effort for the humans involved in that process or, due to a better selection of training samples from an unlabelled pool, to improve the overall quality of the model [41]. Including active learning techniques was shown to be effective for different natural language processing tasks, e.g., for reducing the number of training samples in machine translation without a loss in quality [20] and for training quality assessment models for image captioning [5]. The latter model for caption quality assessment can also be used as a baseline for a future re-ranking system. Crowdsourcing can scale up the annotation process as shown for, e.g., dialogue systems [3,39,48] and in the context of image captioning [4,5]. Promising techniques for improving caption generation can also be found in coactive learning [42].

XAI Methods for Image Caption Generation
We discuss different future extensions of our work pertaining to the domain of XAI methods, particularly through deep explanations. The field of deep explanations subsumes methods that introduce more transparency in how black box models, in particular neural network models, work. A prominent approach is to generate visual explanations which describe how the objective is achieved by the neural model. Our approach provides a relation between image captions, bounding boxes, and pixel-wise segmentations from Mask R-CNN that localize regions that are important to the generation process. An interesting direction of future work is to develop segmentation-based visual explanation methods and to compare them with state-of-the-art approaches like Grad-CAM [40]. The specific object masks in addition to the bounding boxes provide important explanatory cues for the generated text describing the corresponding image as is shown in Fig. 6 where bounding boxes help to localize the white refrigerator and stove present in the generated caption. Similarly, for the other image in Fig. 6 context for the generated caption is provided by bounding boxes localizing the man, woman and wine glasses. This can also be used as an extension to interactive re-ranking, e.g., as a part of explanatory interactive machine learning interfaces [44].
Fig. 7 One of the correctly generated beam candidate captions with zero Bleu-4 score

Conclusion
In this work, we presented a new architecture for image captioning that incorporates a top-down attention mechanism with bottom-up features of a scene: we encoded the object specific bounding boxes provided by the Mask R-CNN model [13] using the Resnet-101 architecture [14]. We showed that our approach achieves scores on par with the state-of-the-art, Show, Attend and Tell [49], for the Bleu-1 and METEOR metrics, and better scores for the Bleu-2, Bleu-3 and Bleu-4 metrics using the MSCOCO dataset, while at the same time providing explanatory features. In addition, we showed that using our beam search decoder has great potential for further improvements of the image captioning process. We discussed opportunities in interactive machine learning for leveraging this potential, in particular by interactively training re-ranking models that effectively select the best options from the generated caption candidates. Further, we discussed how XAI methods can be developed based on our image captioning system to better understand the image captioning process, which in turn delivers valuable feedback to users of such intelligent user interfaces for incremental model improvements.