1 Introduction

The global-village phenomenon grows stronger by the day. Technological advancement, the abundance of devices, automation, and the widespread availability of the internet have connected people like never before. People exchange texts, images, and videos for communication, resulting in a massive amount of textual and visual data. When processed accurately, these abundantly available videos can help address numerous real-world challenges across many areas of life. Understanding visual content and the intricacies of language is an inherent human capability; for machines to reason comparably, however, a proper understanding of images and their consequent interpretation is essential for content description. The primary objective of video description is to provide a concise and accurate textual alternative to visual content. Researchers have put considerable effort into understanding visual characteristics and generating eloquent interpretations; a video description is thus a blend of vision and language, encompassing the two prominent domains of CV and NLP (Bhatt et al. 2017). Scientists from both areas have worked jointly on extracting appropriate insights from images and videos, and then interpreting them accurately and precisely by considering all the elements appearing in the video frames, such as objects, actions, interactions, backgrounds, and overlapping scenes with localization information, and, most importantly, their temporal sequence. Table 1 lists the abbreviations used in this paper with their full forms.

Table 1 List of Abbreviations

The importance of video description is evident from its practical, real-time applications, including efficient searching and indexing of videos on the internet, human-robot interaction in industrial zones, and facilitation of autonomous vehicle driving; video descriptions can also outline procedures in instructional/tutorial videos for industry, education, and the household (e.g., recipes). The visually impaired can gain useful information from a video that incorporates audio descriptions. Long surveillance videos can be transformed into short texts for quick previews. Sign language videos can be converted to natural language descriptions. Automatic, accurate, and precise video/movie subtitling is another important and practical application of the video description task.

1.1 Classical approach

Video description research began with the classical approach (Rohrbach et al. 2013; Kojima et al. 2002; Khan and Gotoh 2012; Barbu et al. 2012; Das et al. 2013; Hakeem et al. 2004), in which the subject, verb, and object (SVO) were first identified in a constrained-domain video and then fitted into a standard predefined template. These classical methods were effective only for short video clips with a limited number of objects and minimal interactions. For semantic verification, Das et al. (2013) developed a hybrid model addressing the issues in Khan and Gotoh (2012) and Barbu et al. (2012), combining the best aspects of bottom-up and top-down exploitation of the rich semantic spaces of both visual and textual features; it produced high-relevance content beyond simple keyword annotations. SVO-tuple-based methods can be split into two phases for better video captioning performance: Phase I, content identification, and Phase II, sentence generation for the objects/events/actions identified in Phase I. Methods for identification (Phase I) include edge detection/color matching (Kojima et al. 2002), the Scale-Invariant Feature Transform (SIFT) (Lowe 1999), and context-based object recognition (Torralba et al. 2003), whereas sentence generation (Phase II) has relied on the HALogen representation (Langkilde-Geary and Knight 2002) and Head-driven Phrase Structure Grammar (HPSG) (Levine and Meurers 2006).

Fig. 1 Hierarchical structure of this paper

The methods adopted for the task of image/video captioning can be divided into two broad categories: retrieval-based and template-based approaches. In retrieval-based methods, captions are retrieved from a set of existing captions. These methods first find candidate captions, i.e., the captions provided with visually similar frames in the training dataset, and then select the most appropriate and suitable caption from the candidates. Although retrieval-based captions are grammatically correct, generating frame/video-specific captions this way is very challenging. Template-based approaches have fixed templates with a number of blank slots for the generated caption's subject, verb, and object. These methods are also capable of generating grammatically correct captions, but they cannot generate variable-length captions because of their dependence on fixed, predefined templates, which cannot produce semantically rich natural language sentences and hence are not analogous to human annotations.

1.2 Video captioning

The deep learning models employed for video description tasks primarily follow the Encoder–Decoder structure, which is the most productive sequence-to-sequence modeling technique. Describing a video can also be defined as a sequence-to-sequence task, since it takes a sequence of visual representations as input and produces a sequence of generated words as output. The ED architecture gained considerable attention in earlier research on neural machine translation, where it was implemented for text translation from one language domain to another. The task of describing videos can be partitioned into two major sections: the visual model, for understanding visual content correctly (without missing any information), and the language model, for transforming the learned visual information into grammatically correct natural language sentences. Since computers only understand numbers, arrays, and matrices, the learned visual representations are stored as a context vector. The context vector is a collection of numbers communicating visual information to the language model. The language model then extracts the connotation of each context vector and accordingly generates semantically aligned words, one by one. Represented mathematically, the language model establishes the probability of generating word \(w_t\) at time step t conditioned on the previously generated words \(w_1, \ldots , w_{t-1}\), i.e., \(P(w_t \mid w_1, \ldots , w_{t-1})\), where \(w_{i}\) represents the word generated at time step i.
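Over a whole caption, this word-by-word conditioning amounts to the standard chain-rule factorization of the sentence probability given the video's visual representation V; the following equation is simply a restatement of the description above rather than the formulation of any particular cited model:

$$\begin{aligned} P(w_1, w_2, \ldots , w_n \mid V) = \prod _{t=1}^{n} P(w_t \mid w_1, \ldots , w_{t-1}, V) \end{aligned}$$

Training typically maximizes the logarithm of this product over the reference captions in the training set.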

Figure 2 demonstrates the basic deep learning-based model employing visual and language models for video description. Following the ED architecture for video description, the standard ED structure employs a combination of the convolutional neural network and the recurrent neural network, or the RNN variants LSTM and GRU, as the encoder and decoder blocks. RNNs have demonstrated competitive results for sequential data processing, but their use for long sequences is less attractive. The associated vanishing and exploding gradient problems, as well as the recurrent nature that involves previous-step computations in the next step, hinder parallel processing of the sequence and hence degrade overall performance. To improve the performance of the standard ED architecture, it can be equipped with an attention mechanism, reinforcement learning, or a transformer mechanism. Attention mechanisms focus on specific areas of the frame and achieve high-quality results. RL employed within the ED architecture can progressively deliver state-of-the-art captions through its own agent-environment interactions. The transformer mechanism is an efficient architecture for robust output: it contains no convolution or recurrence and is built solely on self-attention. The transformer allows parallelization along with training on massive amounts of data, with the capability to fully utilize the GPUs available for most machine learning tasks. Its parallel processing capability drastically reduces training time and enables efficient, highly accurate model training. With the recent emergence of several transformer variants, handling long-term dependencies is no longer an issue.

Fig. 2 The basic model for video description (dense video captioning) examines a long video comprising multiple scenes or events, i.e., Event 1, Event 2, up until the last identified event. After localization (identification of start and end times) of each event, the paragraph-like (multi-sentence) description is generated by coherently combining the captions generated for each event, catering to concurrent and overlapping events

1.3 Dense video captioning/ video description

Comprehending the localized events of a video appropriately and then transforming them accurately into a textual format is called dense video captioning, or simply, video description. This task of describing complex and diverse visual perceptions establishes a connection between the two world-leading realms of computer vision and natural language processing. Capturing the scenes, objects, and activities in a video, as well as the spatial-temporal relationships and the temporal order, is crucial for precise and grammatically correct multi-line text narration.

Nevertheless, the task of automatically describing video is challenging. The model employed to generate a caption characterizing a long-duration video, or a short clip consisting of a significant number of frames, requires not only an understanding of sequential visual data but also the capability to provide a syntactically and semantically accurate translation of that understanding into natural language. Similarly, a considerable number of objects, events, and actions, and their interactions in the video (as well as their relationships and the order in which they happen), must be captured accurately and explained properly using natural sentences. Whether they belong to an open or a constrained domain, videos mostly contain numerous scenes or events. The dependencies between the events are captured by using contextual information from previous (past) and upcoming (future) events, and then all events are jointly described accordingly in natural language. This is analogous to dense image captioning, which localizes and then describes regions in space; similarly, with the help of transformers, Yuan et al. (2022) transformed 2D images into 3D objects with color- and texture-aware information for dense captioning. Dense video captioning (Krishna et al. 2017) localizes events in time and afterwards expresses them. These events can intersect with other events, and hence are challenging to describe appropriately. Dense video captions capture the details of event localization and their co-occurrence (Aafaq et al. 2022).

Terminologies associated with video description have their specific implications. In light of current research, the task of video captioning can be divided into two categories: mono-sentence caption generation and multi-sentence (paragraph) caption generation. The mono-sentence caption is supposed to be a precise, yet fully informative, abstractive sentence representing the whole video, whereas a multi-sentence (dense) caption is supposed to temporally localize and describe all events in the video, including intersecting and overlapping events. Here, event localization refers to the identification of each event in the video with its start and end times; event description means expressing each localized event temporally and in much more detail, resulting in the generation of multiple sentences or paragraphs (like a dense summary of the whole video). Generating such fine-grained captions requires a mechanism that is expressive and subtle: it must capture the temporal dynamics of the visuals present in the video and then join them with syntactically and semantically correct natural language representations.

Problem setup: video captioning/description

  1.

    For video captioning (single-sentence): Suppose we have a video, V, containing N frames such that \(V=\{f_{1},f_{2},\ldots ,f_{N}\}\) (f representing a frame), and our aim is to generate a single-sentence textual caption, T, representing the video content and comprising n words such that \(T=\{w_{1},w_{2},\ldots ,w_{n}\}\) (w representing a word), where semantically aligned words are generated one by one, conditioned on previously generated words. At time t, the word \(w_t\) is generated according to the probability \(P(w_t \mid w_1, \ldots , w_{t-1})\), where \(w_{i}\) represents the word generated at time step i.

  2.

    For video description (dense captioning): Particular to videos containing multiple scenes or events, event localization (Krishna et al. 2017) is the identification of the start and end times of a particular event in the video. These localized events must be comprehended semantically and transformed into precise and grammatically correct multi-sentence natural language explanations. For a video V containing N events such that \(V=\{E_{1},E_{2},\ldots ,E_{N}\}\) (E representing an event), each event needs to be identified such that \(E_{1}=\{EST, w_{1},w_{2},\ldots ,w_{A}, EET\}\), with event start time (EST) and event end time (EET). A certain number of words, A, expresses event \(E_{1}\); similarly, localized \(E_{2}=\{EST, w_{1},w_{2},\ldots ,w_{B}, EET\}\) has a certain number of words, B, to express event \(E_{2}\), and so on, until all events in the video are understood. Every event can be expressed with a different number of words (A, B, ...), depending on the duration of the event. The aim is to gather all localized event descriptions and generate a semantically and grammatically correct, coherent, paragraph-like description of the video while avoiding redundancy (a minimal sketch of this setup is given after this list).
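The following is a minimal Python sketch of the data structures implied by this problem setup; the Event class, its field names, and the toy captions are illustrative assumptions rather than components of any cited system.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Event:
    """One localized event: start/end times in seconds (EST/EET) and its caption words."""
    start: float
    end: float
    words: List[str]

def assemble_description(events: List[Event]) -> str:
    """Combine per-event captions into a coherent, paragraph-like description.

    Events are ordered by start time; overlapping events simply follow one another
    here, whereas real systems also model cross-event (past/future) context.
    """
    ordered = sorted(events, key=lambda e: (e.start, e.end))
    sentences = [" ".join(e.words).capitalize() + "." for e in ordered]
    return " ".join(sentences)

# Hypothetical usage with two overlapping localized events.
events = [
    Event(0.0, 12.5, "a man slices vegetables on a cutting board".split()),
    Event(10.0, 25.0, "he puts the vegetables into a boiling pot".split()),
]
print(assemble_description(events))
```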

This survey aims to present inclusive insights into the deep learning-based techniques implemented for video description, supported by the most recent research. During the past few years, the field of captioning (image/video) has exhibited remarkable success and achieved impressive state-of-the-art results. However, a thorough discussion of the techniques/methodologies adopted over time is lacking in the current literature. The key motivation behind this work is to fill this research gap and give researchers a clear understanding of the approaches employed. Our contributions are as follows.

  1.

    We provide an elaborate view of the latest deep learning-based techniques for video description, with up-to-date supporting articles from the literature.

  2.

    Besides the standard ED architecture, a detailed exploration of deep RL, attention mechanisms, and transformer mechanisms for video descriptions is performed.

  3.

    We categorize and compare the key components of the models and highlight the most crucial information for in-depth insight and quick understanding, enabling researchers involved in video processing for summarization and description, or for autonomous-vehicle, surveillance, and instructional purposes, to find the state of the art in a single go.

  4.

    Finally, we identify future research directions for further improvement in video description systems.

Outline of the survey This paper is organized as shown in Figure 1. The next section, Section 2, offers a brief discussion of the available surveys on the topic; these surveys primarily focus on simple Encoder–Decoder based models. Section 3 details the deep learning-based techniques employed for video description. First, the standard encoder-decoder architecture employing CNN-RNN, RNN-RNN, and CNN-CNN compositions is explored, followed by a thorough discussion. Second, we describe the fusion of the attention mechanism into the encoder-decoder system, which lets video captioning models focus on specific distinctive content. Third, we present recent transformer-based state-of-the-art methods and analyze them for video description generation. Finally, successful strategies for optimizing the generated descriptions through deep reinforcement learning are discussed in detail. The limitations and challenges of each technique are also presented, along with its working strategy, computational concept, and literature review, in the respective subsections. In Section 4, we analyze and compare the benchmark results produced by state-of-the-art methods, organized chronologically for each dataset. A brief overview of the evaluation metrics and datasets used for video description is also provided in this section. Finally, Section 5 concludes the review with a few future directions.

2 Literature review

Computer vision mainly deals with classification, detection, and segmentation tasks (Rafiq et al. 2020; Agyeman et al. 2021). The first part of video captioning, i.e., temporal action recognition, belongs solely to computer vision, whereas the second part (caption generation) bridges computer vision and natural language processing. Captioning is again split into two types: one for actions that are simple to recognize and describe, and the other for actions too complex to be described with simple, short natural language sentences.

The selection of appropriate components plays a substantial role in the generation of accurate and truthful output. A thorough empirical analysis of each component in the ED framework was presented in Aafaq et al. (2019b). Significant performance gains were demonstrated through careful selection of an efficient and capable mechanism for each of the four major constituent components: feature extraction, feature transformation, word embedding, and language modeling. The authors emphasized which efficient mechanisms can be adopted for these four factors and how to generate state-of-the-art results. For feature extraction, five different CNN models (3D-CNN, VGG-16, VGG-19, Inception-v3, and Inception-ResNet-v2) were analyzed, and the authors concluded that C3D is a common choice because of its ability to process both individual frames and short video clips; among the 2D-CNN models, Inception-ResNet-v2 performed best. For feature transformation, temporal encoding was favored over mean pooling, since mean pooling results in a considerable loss of information; in contrast, temporal encoding can capture highly reliable temporal dynamics of the whole video without any noteworthy loss of information, creating a positive counterbalance for system performance. In the literature, two methods are commonly used for word embedding: the first randomly initializes the embedding vector and then learns a task-specific embedding, which is unable to capture rich semantics, whereas the second makes use of pre-trained embeddings. The authors examined four pre-trained embeddings (Word2Vec, FastText, and GloVe: glove6B and glove840B) as well as randomly initialized embeddings; FastText, with effective word embedding, performed most prominently. Finally, in language modeling, the depth (number of layers) of the system is crucial for superior performance, along with various hyperparameters, e.g., internal state size, the number of processed frames, fine-tuned word embeddings, and dropout regularization.
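As a minimal PyTorch sketch of the two word-embedding strategies compared above, the snippet below contrasts a randomly initialized, task-specific embedding with an embedding layer initialized from pre-trained vectors; the random tensor stands in for actual FastText/GloVe vectors, whose loading is assumed to happen elsewhere.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10000, 300

# Option 1: randomly initialized embedding, learned from scratch for the task.
random_embedding = nn.Embedding(vocab_size, embed_dim)

# Option 2: embedding initialized from pre-trained vectors (e.g., FastText or GloVe)
# loaded into a [vocab_size, embed_dim] tensor; a random placeholder is used here.
pretrained_vectors = torch.randn(vocab_size, embed_dim)
pretrained_embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

word_ids = torch.tensor([[2, 45, 7]])        # a toy batch of word indices
print(pretrained_embedding(word_ids).shape)  # torch.Size([1, 3, 300])
```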

Table 2 A literature review of video captioning/description

This research work features deep learning-based frameworks for video description, the ED framework in particular. It is clear from Table 2 that the available surveys on video description have primarily focused on simple ED-based frameworks. Several among them, notably (Li et al. 2019a; Chen et al. 2019b; Aafaq et al. 2019c; Amaresh and Chitrakala 2019; Su 2018) and (Wu 2017), briefly discussed the application of an attention mechanism, and (Li et al. 2019a; Chen et al. 2019b; Aafaq et al. 2019c) only gave an overview of reinforcement learning within the encoder-decoder, but none of them elaborated on these architectures or the related articles from the literature in detail, or explored the employment of transformers for video captioning. In this survey, all four approaches are described in detail, with state-of-the-art articles demonstrating their worth.

To take full advantage of advanced state-of-the-art hardware, i.e., GPUs, it is essential to adopt models/mechanisms that can fully exploit these hardware structures. The sequential nature of RNNs cannot utilize the parallelization offered by GPUs, resulting in inferior performance and slow training. The transformer provides an efficient alternative to recurrence and convolution: it is capable of parallel processing, accelerated training, and handling long-term dependencies, and it is space-efficient, much faster, and based solely on self-attention, making it the model of choice for current advanced hardware.

3 Techniques/approaches

Inspired by technological advancements, researchers have experimented with deep neural networks for the automatic caption generation task. The early frameworks comprised the standard ED structure, but as the field matured, new high-tech approaches were fused with the standard structure to produce more expressive and flexible natural language sentences with richer semantics. In this paper, we classify the adopted techniques into four categories according to their technological evolution over time: the standard ED approach, the fusion of attention mechanisms into the standard ED structure, the adoption of the transformer mechanism for robust performance, and decision-based DRL approaches, which are prominent in accurate natural language caption generation and optimization. We discuss these techniques one by one in detail in this section.

Fig. 3 The standard Encoder–Decoder architecture

3.1 Standard encoder–decoder approaches

The ED approach is a neural network configuration, as shown in Figure 3. The architecture is partitioned into two components, namely the encoder and the decoder, and has proven to be a cutting-edge technology. This modern approach has been employed by the research community around the globe to solve sophisticated tasks, e.g., image captioning, video description, text and video summarization (Rafiq et al. 2020), visual question-answering systems, conversational modeling, and movement classification.

The ED framework comprises two neural networks (NNs). The encoder, \(\varphi\), maps the input sequence F to an internal representation, R:

$$\begin{aligned} \varphi (F) = R \end{aligned}$$
(1)

where

$$\begin{aligned} R=\{r_{1},r_{2},\ldots , r_{n} \} \end{aligned}$$
(2)

The vector R in (2) is an internal representation that captures the context and meaning of the input, and is known as a context vector or thought vector. The choice of encoder structure mainly depends on the type of input; e.g., for text, the best encoder architecture is the RNN, whereas for an image/frame or video clips as input, the CNN structure has proven best suited for context-vector/visual-feature extraction (Xu et al. 2015). However, deliberation regarding the selection of CNNs versus RNNs (Yin et al. 2017), and their behavioral differences for NLP, is still under way among researchers. The fusion of these two architectures has accomplished outstanding results, since they process information through different techniques and complement one another.

The context vector R generated by the encoder is input to the second neural network in the system, i.e., the decoder, which generates the corresponding output. Selection of the decoder architecture depends on the type of output. In the video description task, where meaningful textual information is required as output from the input video, the RNN is the architecture most commonly employed for this purpose. RNN variants like long short-term memory and the gated recurrent unit are popular in research involving natural language processing because of their ability to handle long-term dependencies. Decoder RNN \(\theta\) functionality at any given time t is

$$\begin{aligned} \left[ \begin{array}{c} O_t\\ h_t \end{array} \right] = \theta \left( h_{t-1}, O_{t-1}, R \right) \end{aligned}$$
(3)

where \(O_t\) represents the output at time t, and \(h_t\) is the internal/hidden state of the RNN, whereas \(h_{t-1}\) and \(O_{t-1}\) represent the hidden state and the output of the previous time step, \(t-1\). The RNN runs repeatedly until the end-of-sequence \(\langle EOS\rangle\) token is generated. LSTMs and GRUs, with improved performance, replace the basic RNN structure.
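The following is a minimal PyTorch sketch of this formulation under simplifying assumptions: frame features are assumed to be pre-extracted by a CNN, the encoder reduces them to a single context vector R by mean pooling and a linear projection, and a GRU decoder plays the role of \(\theta\), emitting word logits step by step. Module and variable names are illustrative, not those of any cited system.

```python
import torch
import torch.nn as nn

class SimpleEncoderDecoder(nn.Module):
    """Minimal ED captioner: pooled frame features -> GRU decoder over words."""

    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, hidden_dim)      # visual model: F -> R
        self.embed = nn.Embedding(vocab_size, embed_dim)    # word embedding
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)  # language model
        self.out = nn.Linear(hidden_dim, vocab_size)        # word logits O_t

    def forward(self, frame_feats, word_ids):
        # frame_feats: [batch, num_frames, feat_dim], e.g., pre-extracted 2D/3D-CNN features
        # word_ids:    [batch, seq_len], previous-word inputs to the decoder
        context = self.encoder(frame_feats.mean(dim=1))      # context vector R
        h0 = context.unsqueeze(0)                            # initialize decoder state with R
        outputs, _ = self.decoder(self.embed(word_ids), h0)  # h_t from h_{t-1} and w_{t-1}
        return self.out(outputs)                             # per-step logits over the vocabulary

# Toy usage: 4 videos, 20 frames each, captions of length 12.
model = SimpleEncoderDecoder()
logits = model(torch.randn(4, 20, 2048), torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 12, 10000])
```

At inference time, the decoder would be run autoregressively, feeding each predicted word back in until the \(\langle EOS\rangle\) token is produced.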

Fig. 4 Composition of the standard ED architecture for video description based on the literature explored for this research work

Specific to video description, the encoder can be treated as the visual model of the system, whereas the decoder is responsible for language modeling. Two-dimensional (2D) or 3D convolutional neural networks are mostly used as the encoder for computing a context vector of fixed or variable length. The context vector can be called a vector representation or a visual feature. After extraction, certain transformations are applied to these visual features, i.e., mean/max pooling or temporal encoding. The resulting transformed visual features are then passed to the language model for description generation. The ED framework has been the most popular paradigm for video description tasks in recent years, so the authors in Aafaq et al. (2019b) partitioned the ED structure for video description into four essential components: the CNN model for visual feature extraction, the types of transformation applied to the extracted visual features, the language model, and the word embedding within the language model. Since each of these components is highly important to the performance of the system, intelligent selection is essential; by keeping in mind the pros and cons of each selected component, one can straightforwardly anticipate the overall performance of the description system. Blohm et al. (2018) explored the behavioral variance between CNNs and RNNs using the MovieQA dataset, training 11 models with different random initializations for both an RNN-LSTM and a CNN, and observed that the RNN-LSTM models outperformed the CNN models by a large margin, although both model types share the same weaknesses. To investigate these weaknesses, they tested the transferability of adversarial examples across models, i.e., whether adversarial examples optimized to fool the CNN models also fool the RNN models, and vice versa. Degradation in performance was observed for both CNNs and RNNs, and it was mitigated by including some adversarial examples in the training data.
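As a minimal illustration of the feature-transformation choice mentioned above (a plain GRU standing in for a temporal encoder, not the specific scheme of any cited model), the snippet below contrasts mean pooling, which collapses the frame axis and discards temporal order, with encoding the full frame sequence so that order is preserved; names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

frame_feats = torch.randn(1, 20, 2048)   # [batch, frames, feat_dim] from a 2D/3D CNN

# Mean pooling: a single vector per video; temporal order is lost.
pooled = frame_feats.mean(dim=1)                 # [1, 2048]

# Temporal encoding (here simply a GRU over the frame sequence): order is preserved,
# and the final hidden state summarizes the whole sequence.
temporal_encoder = nn.GRU(2048, 512, batch_first=True)
_, last_hidden = temporal_encoder(frame_feats)   # [1, 1, 512]

print(pooled.shape, last_hidden.shape)
```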

Table 3 Standard encoder–decoder based models for video captioning with their visual and language components; short contributions and shortcomings (if any) are also mentioned for each approach

Three compositions of encoder and decoder for video description are available in the literature (CNN-RNN, RNN-RNN, and CNN-CNN); they are summarized in Table 3 for convenience, along with their visual and language components, contributions, and shortcomings (if any). Figure 4 shows these compositions as percentages. In recent research works, transformers have also been exploited as the visual or language component of the ED structure: Seo et al. (2022) employed ViViT (video vision transformer) (Arnab et al. 2021) and BERT as the encoder and a GPT-2 based decoder, while Zhao et al. (2022) used an encoder composed of transformer encoder blocks to extract video features with a global view, reducing the loss of intermediate hidden-layer information.

3.1.1 CNN–RNN

The conventional ED pipeline typically comprises a CNN as the visual model for extracting visual features from each frame of the video, and an RNN as the language model for generating the captions word by word. VSJM-Net (Aafaq et al. 2022) presented a visual and semantic joint embedding network that is employed to detect proposals as well as to learn the visual and semantic space. vc-HRNAT (Gao et al. 2022), using hierarchical representations, is capable of learning in a self-supervised environment with multi-level semantic representation learning of video concepts; however, the system lacks the ability to visualize concepts of objects and actions that are absent or unclear in videos. VNS-GRU (Chen et al. 2020), a semantic GRU model with variational dropout and layer normalization, is trained using professional learning. For feature generation, the system utilizes ResNeXt-101 pre-trained on ImageNet (Deng et al. 2009) at the frame level, and an Efficient Convolutional Network (ECN) (Zolfaghari et al. 2018) pre-trained on Kinetics-400 at the video level. The model can learn unique words and delicate grammar based on vocabulary and tagging mechanisms. Similarly, a system comprising 2D and 3D ConvNets with a semantic detection network (SDN) as the encoder and a semantic-assisted LSTM as the decoder was proposed in Chen et al. (2019a) to overcome the limitations of short and inappropriate descriptions, deprived training approaches, and the non-availability of critical semantic features. Static spatial as well as dynamic spatio-temporal features are involved, along with a scheduled sampling strategy for self-learning of long sentences. A proposed sentence-length-modulated loss encourages optimization as well as thorough and detailed captions. To enhance the visual encoding mechanism for captioning purposes, GRU-EVE (Aafaq et al. 2019a) was the first to emphasize feature encoding for semantically robust descriptions, using a 2D/3D CNN with the short Fourier transform as the visual model and a two-layered GRU as the language model for capturing spatio-temporal video dynamics. A 2D-CNN (Inception-ResNet-v2; Szegedy et al. 2017) pre-trained on the ImageNet dataset and a 3D-CNN (C3D; Tran et al. 2015) pre-trained on the Sports-1M dataset (Karpathy et al. 2014) are used for feature extraction. The extracted features are then processed hierarchically with the short Fourier transform, and the visual features are semantically enriched. The approach showed that applying the short Fourier transform to the 2D-CNN features produces better results than the 3D-CNN. Feature extraction techniques play a significant role in the generation of an accurate caption. Both static and dynamic feature extraction were explored in SEmantic Feature Learning and Attention-based Caption Generation (SeFLA) (Lee and Kim 2018), which suggested a multi-modal feature learning system with an attention mechanism. This research explains the prominence of semantics acquired using an LSTM, along with broad-spectrum visual features extracted using a ResNet CNN, for generating accurate descriptions. Semantics was further categorized as static or dynamic, where static (a noun in the description) refers to the object, person, and background, whereas dynamic (a verb in the description) corresponds to the action taking place within the input video, as shown in Figure 5.

Fig. 5 Semantic feature categorization as either static or dynamic, where static refers to the object, the person, and/or the background, and dynamic corresponds to the action taking place within the input video. Sample video frames and reference captions were taken from the Microsoft Video Description (MSVD) dataset

Systems with multiple independently trained models from different domains utilized in a pipeline fashion, focusing only on the input and output and skipping all the intermediate steps to obtain the required output, are called end-to-end systems. In video description, we have visual and language models for vision and language processing; if we train them independently and then plug them into a pipeline, they form an end-to-end system. The first end-to-end trainable deep RNN (Zhang et al. 2017) proposed a description model employing the Caffe CNN (Jia et al. 2014), a variant of AlexNet, fused with a two-layered LSTM and accompanied by transfer learning, forming the ED to describe videos efficiently. The model is trained on the popular ImageNet dataset, and the trained weights are utilized to initialize the LSTM-based language model, boosting training speed. Feature extraction, aggregation, and caption generation are all steps in the process that require memory for computation and evaluation. The limitations associated with memory requirements while generating captions are addressed in EtENet-IRv2 (Olivastri 2019), also an end-to-end trainable ED architecture, which proposes a gradient-accumulation strategy employing Inception-ResNet-v2 (Szegedy et al. 2017) and GoogLeNet (Szegedy et al. 2015) with two-stage training for encoding. Evaluation on benchmark datasets (Rafiq et al. 2021) showed significant improvement, but with a limitation on the computational resources required for end-to-end training.

Long Short-Term Memory with Transferred Semantic Attributes (LSTM-TSA) (Pan et al. 2017) emphasizes the fusion of jointly exploited semantic attributes for both images and video, along with the significance of injecting them into the extracted visual features for automatic sentence generation. A transfer unit modeling the jointly associated attributes extracted from images and videos was proposed for integrating semantic attributes into sequence learning. The visual representation, accompanied by semantic attributes mined from both images and video, is fed into an LSTM for caption generation. Similarly, ResNet50 and VGG-16 CNN architectures coupled with the LSTM structure were exploited in Rivera-soto and Ordóñez (2013) for sequence-to-sequence video description models. Three types of model were proposed: mean pooling, a single-layer ED, and a stacked ED. Extensive experimentation performed on the Microsoft Video Description (MSVD) dataset showed that a single-layer ED network performs best for machine translation but complicates network convergence for video description; instead, two stacked LSTM networks concentrate efficiently on both visual encoding and natural language decoding.

Both global and local features play a role when captioning a video. The object-aware aggregation with bidirectional temporal graph (OA-BTG) description model (Zhang et al. 2019a) captures in-depth temporal dynamics for significant objects in a video and learns particular spatio-temporal representations by performing object-aware local feature aggregation on the detected object-aware regions and frames. A bi-directional graph is designed to capture both forward and backward temporal trajectories of a specific object. For learning these representations, the global frame sequence and the object spatio-temporal trajectories are aggregated. The influence of objects at a particular time is differentiated using a hierarchical attention mechanism. Understanding the global contents of a video, as well as the in-depth object information, is essential for generating flawless, fine-grained automatic captions. Likewise, RecNet (Wang et al. 2018a), a novel encoder-decoder-reconstructor architecture, also exploits the global and local structure of the video by employing two types of reconstructors and a bi-directional flow. The relationship between video frames and the generated natural sentences is established and enhanced by incorporating a reconstruction network for video captioning. Global structure is captured by mean pooling, while an attention mechanism is included in the local part of the model to exploit local temporal dynamics for the reconstruction of each frame.

CVC (Yan et al. 2010) proposed a system using the ED approach to describe numerous characteristics of an off-site audience or crowd, such as the number of people in the crowd, the movement conditions, and the flow direction. The model employs a 2D/3D CNN for crowd feature extraction from video, which then feeds into an LSTM-GRU-based language model for captioning. The authors created their own crowd-captioning dataset based on WorldExpo10. Built upon the famous S2VT model, the CVC model showed improvement, aided by the small dataset and simple captions. To deal with the uncertainties arising from inappropriate data-driven static fusion methods employed in video description systems, TDDF (Zhang et al. 2017) established a task-driven dynamic fusion method. VGG-19 and the GoogLeNet CNN were employed for the extraction of appearance features, whereas C3D was utilized for motion feature extraction. The proposed method achieved the best METEOR and CIDEr scores when evaluated on the MSVD and Microsoft Research Video to Text (MSR-VTT) datasets, compared to a single-feature system.

One of the significant characteristics required in a generated description is its diversity. Lexical-FCN (Shen et al. 2017) was proposed for generation of multiple diverse and expressive captions based on weak video-level sentence annotations. Although the model is trained with a weakly supervised signal, it produces multiple diverse and meaningful captions with the sequence-to-sequence language model. A convolution-based lexical FCN forms the visual part of the model, whereas the language model follows the state-of-the-art S2VT (Venugopalan et al. 2015) mechanism with a bi-directional LSTM to improve the quality of automatically generated captions. Diversity, coherence, and informativeness of the generated captions ensure the supremacy of the proposed model.

3.1.2 RNN–RNN

In early research, employing an RNN in both encoding and decoding for neural machine translation demonstrated very efficient performance. Researchers explored the horizons for video description by exploiting the RNN for both feature extraction and language modeling. Long-term recurrent convolutional networks (LR-CNs) (Donahue et al. 2017) were proposed with an ED architecture for long sequences with time-varying input and output. Video description is carried out using three variants of the architecture: an LSTM encoder and decoder with a conditional random field (CRF) max, an LSTM decoder with a CRF max, and an LSTM decoder with CRF probabilities. For a broader scope, the research focuses on activity recognition, image captioning, and video description.

A state-of-the-art sequence-to-sequence video-to-text generator, S2VT (Venugopalan et al. 2015), follows the ED architecture and uses a stacked two-layer LSTM ED model that takes a sequence of RGB frames as input and produces a corresponding sequence of words. The encoding and decoding of the frame and word representations are learned jointly from a parallel corpus. To model the temporal aspects of activities typically shown in videos, optical flow (Brox et al. 2014) between pairs of consecutive frames is computed; the flow images pass through a CNN and are provided as input to the encoding LSTM. Employing a single LSTM for both encoding and decoding allows parameter sharing between the two stages. Sequential processing at both stages is incorporated because both input and output are of variable, potentially different, lengths. Loss is computed on the decoding side for optimization of the video description system. The model was taken as a basis by many researchers, as in S2VT with knowledge (S2VTK) (Wang and Song 2017), which follows a detect, fetch, and combine approach: it first detects an object in the video, fetches object-related information from the DBPedia knowledge base, and creates a vector using Doc2Vec. Both elements, i.e., the extracted visual features and the related information regarding the detected object, are then input to the LSTM-based language model for caption generation. Another model based on S2VT (Venugopalan et al. 2015), a meaning-guided system (Babariya and Tamaki 2020), was proposed in connection with the object detection module YOLOv3 (Redmon and Farhadi 2018) to generate correct captions having a similar meaning. The proposed model picks the object having the highest objectness score in the YOLO detector and, after detection, searches for the nearest string describing the detected object. Word2Vec (Demeester et al. 2016), pre-trained on part of the Google News dataset, is used for string embedding. Semantic similarity, or caption meaning, is considered for optimization during training instead of the conventional word-by-word loss. Following the object detection approach, tube features for video description were proposed in Zhao et al. (2018). Trajectories of objects in input videos are captured by employing a Faster R-CNN (Wallach 2017) to extract region proposals, and afterwards the regions from different frames (but belonging to the same objects) are associated as tubes. A similarity graph is created among the detected bounding boxes, and a similarity score is assigned to each pair of bounding boxes in adjacent frames. A bi-directional LSTM encoder encodes both forward and backward dynamic information of the tubes and converts each tube into a fixed-size visual vector, whereas a single LSTM decoder, with an attention mechanism to monitor the most correlated tubes, generates the captions.

Dealing with multiple and diverse caption generation, the Diverse Captioning Model (DCM) (Xiao and Shi 2019z) is a conditional Generative Adversarial Network (GAN) with an ED model to describe video content with multiple descriptions. It can describe video content with great accuracy, and can capture both forward and backward temporal relationships to encode the extracted visual features. For a given video, the intermediate latent variables of the conventional encode-decode process are utilized as input to the conditional GAN (CGAN) to generate diverse sentences. Generators comprising different CNNs generate diverse descriptions while the discriminator inspects the worth or quality of the formed captions. Combining the reasonableness and differences between the generated sentences, a diverse captioning evaluation metric (DCE) was also proposed.

Feature extraction from pre-trained models and the sensible arrangement of the extracted features can considerably affect the quality of generated captions. These extracted features, or modalities, and their detailed effects were discussed in Hammad et al. (2019) on an S2VT (Venugopalan et al. 2015) basis. The different video modalities can be recognized as the frame or image, the scene, the action, and the audio (Ramanishka et al. 2016). All these modalities have their own significance when generating the description, and the inclusion of essential features, accompanied by a decoder with an attention mechanism, can help the model extract the most pertinent information related to the scene and can have a substantial effect on the quality of the generated description. A human-like ability to extract the most relevant information from a scene can be incorporated through intelligent selection and accurate concatenation of features.

3.1.3 CNN–CNN

TDConvED (Chen et al. 2019b) was the first and (so far) the only ED approach fully employing CNNs for both visual and language modeling. To address the limitations of vanishing/exploding gradients, as well as the recurrent dependency of the RNN that prevents parallelization during sequence training, a system with convolutions in both the encoder and the decoder was proposed. Feed-forward convolutional networks are free from recurrent functions, and previous-step computations are not required in the next step, so parallelization of sequence training can be achieved. The proposed model also exploits a temporal attention mechanism for sentence generation. In the encoder, the convolutional block is equipped with temporal deformable convolutions to capture dynamics across the temporal extents of actions or scenes. The significant contribution of this research is the use of convolutions for sequence-to-sequence learning and for enhancing the quality of video captioning.
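As a rough sketch of convolutional sequence encoding in general (a plain 1D temporal convolution, not TDConvED's temporal deformable block), the snippet below processes all frame positions in parallel, in contrast to the step-by-step recurrence of an RNN; module names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TemporalConvEncoder(nn.Module):
    """Encodes a sequence of frame features with stacked 1D convolutions.

    Every output position sees a local temporal window of frames, and all
    positions are computed in parallel (no recurrence across time steps).
    """

    def __init__(self, feat_dim=2048, hidden_dim=512, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2   # keep the temporal length unchanged
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size, padding=pad),
            nn.ReLU(),
            nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=pad),
            nn.ReLU(),
        )

    def forward(self, frame_feats):
        # frame_feats: [batch, frames, feat_dim]; Conv1d expects [batch, channels, frames]
        x = frame_feats.transpose(1, 2)
        return self.conv(x).transpose(1, 2)   # [batch, frames, hidden_dim]

encoder = TemporalConvEncoder()
print(encoder(torch.randn(2, 20, 2048)).shape)  # torch.Size([2, 20, 512])
```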

3.2 Discussion - ED based approaches

The famous Encoder–Decoder structure for video description configures two neural networks, one for visual information extraction and the other for generating the textual narration corresponding to that visual content. This composition of neural networks involves CNNs, RNNs, LSTMs, GRUs, and transformers as the encoding and decoding modules. CNNs are proficient in automatically identifying relevant features without human intervention (Zhang et al. 2019b). According to Goodfellow et al. (2016), the key characteristics of CNNs are sparse interactions, equivariant representations, and parameter sharing. Scanning regions instead of the whole image results in fewer parameters, a simplified and faster training process, and enhanced generalization capability that helps avoid overfitting (Alzubaidi et al. 2021). RNNs are applied mostly in speech and language processing contexts; they use sequential data and convey information while respecting the order of the sequence. They offer recurrent connections to memory blocks in the network, and the flow of information is controlled through gated units. The algorithm's sensitivity to exploding and vanishing gradients is its main limitation when dealing with long-range dependencies. In comparison with RNNs, CNNs are considered more powerful because RNNs offer less feature compatibility than CNNs (Alzubaidi et al. 2021). The RNN variants LSTM and GRU are further enhancements that use fewer training parameters and less memory while achieving higher accuracy and faster execution. Researchers have employed these deep-learning model compositions, i.e., CNN-RNN, RNN-RNN, and CNN-CNN, for the video description task and demonstrated their findings. A thorough empirical analysis by Aafaq et al. (2019c) concluded that C3D is the most commonly employed model for visual feature extraction from images and short clips, and that Inception-ResNet-v2 and temporal encoding achieved comparable results in feature transformation. For language modeling, the depth (number of layers) of the decoder module, the internal state size, the number of processed frames, and the choice of word embedding with dropout regularization are crucial for high-quality description generation. If the visual and language modules are trained independently and then plugged into a pipeline, they form end-to-end systems; such systems are pre-trained on large-scale datasets and then fine-tuned on video description datasets for the downstream task of video description. The use of deep learning to caption video has been extensively researched, but numerous challenges remain to be resolved, including accurate identification of objects and their interactions, generating improved event proposals for dense captioning, and the utilization of task-specific transformers for accurate vision and language comprehension.

3.3 Attention mechanism

An attention mechanism can be characterized as an act of cautiously focusing on the directed, relevant, and important parts in an image, frame, or scene, i.e., considering only the salient contents to be described, while ignoring others. The general structure of the video captioning model supported by the attention mechanism is grounded on various types of cues from the video. These cues are integrated into the basic framework of the ED to get the decoding process to concentrate on specific parts of the video at each time step to generate an appropriate description.

Fig. 6 Distribution of various attentions for video descriptions based on the literature explored for this research work

Before the attention mechanism was established in the standard ED architecture, the encoder block of the employed model converted image or frame features into a single context vector, which was then fed to the decoder unit for word-by-word caption generation. For images loaded with multiple or complicated objects, one intermediate vector is unable to adequately convey all of the image features, causing the loss of important information and substandard caption generation. Fusing in the attention mechanism empowers the encoder to concentrate on the various essential parts of the frame with distinct intensity, generating multiple context vectors and resulting in enhanced quality of the generated natural language sentences.

Let us suppose the video description system takes a video and generates caption Y for that video such that

$$\begin{aligned} Y= \{W_1,W_2,W_3, ... ,W_c\}, W_i \in R^K \end{aligned}$$
(4)

where K is the size of the vocabulary, and C is the number of words in the caption. Using a 2D/3D CNN to extract the features from each frame/clip of the video, we have an annotation vector as a collection of all the intermediate context vectors or feature vectors, expressed as

$$\begin{aligned} CV= \{CV_1,CV_2,CV_3, ...,CV_L\}, CV_i \in R^D \end{aligned}$$
(5)

where L is the number of feature vectors, each of which is a D-dimensional representation corresponding to the relevant part of the frame/clip of the video (Xu et al. 2015; Bahdanau et al. 2015). The attention mechanism permits more direct dependence between the states of the model at different points in time (Raffel and Ellis 2015). A model produces an intermediate context vector, or hidden state \(CV_t\), at time step t. Attention-based models compute a single context vector at time t, \(SCV_t\), as the weighted mean of the state sequence \(CV_i\), expressed in (6) and simplified as (7):

$$\begin{aligned} SCV_t= \sum _{i=1}^{L} CV_i \cdot \alpha _{ti} \end{aligned}$$
(6)

or

$$\begin{aligned} SCV_t=CV_1 \cdot \alpha _{t1} + CV_2 \cdot \alpha _{t2} + \cdots + CV_L \cdot \alpha _{tL} \end{aligned}$$
(7)

Each \(CV_i\) contains information about the whole input sequence with a strong focus on the parts surrounding the i’th word of the input sequence, which is the main essence of the attention mechanism, to find mappings between an input element and its corresponding output. The attention weight computed at each time step t for each feature vector \(CV_i\) using Softmax is

$$\begin{aligned} \alpha _{ti}= \frac{ e^{SCORE_{ti}}}{ \sum \nolimits _{k=1}^{L}e^{SCORE_{tk}}} \end{aligned}$$
(8)

where

$$\begin{aligned} SCORE_{ti}=FUNC_{ATT}(CV_i, W_{t-1}) \end{aligned}$$
(9)

\(SCORE_{ti}\) in (9) is the attention scoring function, which indicates how well the input around the i'th position matches the output generated at time t. The score is computed from hidden state \(CV_i\) and the decoder's output at the previous time step, i.e., \(W_{t-1}\). \(SCV_t\) is then concatenated with the word output from the decoder's previous time step, resulting in a concatenated context vector carrying the weighted feature information that conveys where to focus more attention while generating the word at this particular position. This process continues until the decoder outputs the \(\langle END \rangle\) token. The generic review network in Yang et al. (2016) also proposed similar review steps, concatenating feature vectors with attention weights to produce a thought vector after each review as input to the decoder attention mechanism. Figure 7 depicts the attention process carried out inside the ED architecture.
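The following is a minimal PyTorch sketch of the soft attention computation in Eqs. (6)-(9), using an additive (MLP) score as one common choice of \(FUNC_{ATT}\); tensor names and dimensions are illustrative assumptions rather than a reproduction of any cited model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftTemporalAttention(nn.Module):
    """Additive attention: scores each feature vector CV_i against the previous
    decoder state/output, softmaxes the scores into weights alpha_ti, and
    returns the weighted sum SCV_t (Eqs. 6-9)."""

    def __init__(self, feat_dim=512, query_dim=512, attn_dim=256):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_query = nn.Linear(query_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)   # FUNC_ATT

    def forward(self, cv, w_prev):
        # cv:     [batch, L, feat_dim]   feature vectors CV_1..CV_L
        # w_prev: [batch, query_dim]     previous decoder state/output W_{t-1}
        scores = self.score(torch.tanh(
            self.proj_feat(cv) + self.proj_query(w_prev).unsqueeze(1)))  # [batch, L, 1]
        alpha = F.softmax(scores, dim=1)                                  # Eq. (8)
        scv = (alpha * cv).sum(dim=1)                                     # Eq. (6)
        return scv, alpha.squeeze(-1)

attn = SoftTemporalAttention()
scv, alpha = attn(torch.randn(2, 20, 512), torch.randn(2, 512))
print(scv.shape, alpha.shape)   # torch.Size([2, 512]) torch.Size([2, 20])
```

At each decoding step, \(SCV_t\) would then be concatenated with the previous word's representation before being fed to the decoder, as described above.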

Strategies for model optimization during training include the teacher forcing technique (Williams and Zipser 1989), curriculum learning (Bengio et al. 2009), and RL-based optimization techniques. Teacher forcing is a simple way to train RNN-based models while constituting the concatenated context vector: the word from the reference annotation, rather than the word actually generated at the previous time step, is provided to guide word generation. This improves the model's learning capability and produces better results in the testing phase. Later, Huszár (2015) proved the biased learning tendency of teacher forcing and curriculum learning, and professor forcing (Goyal et al. 2016) was proposed for RNN optimization, adopting an adversarial domain method to align the behavior of the RNN during the training and testing phases (Chen et al. 2019a).
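Below is a minimal sketch contrasting teacher forcing with free-running decoding in a word-by-word decoder loop; the toy decoder, its dimensions, and the special-token convention are hypothetical placeholders rather than any cited model's design.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """A toy GRU-cell decoder, defined only to make the sketch runnable."""

    def __init__(self, vocab_size=50, embed_dim=32, hidden_dim=64, ctx_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.GRUCell(embed_dim + ctx_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def step(self, prev_word, hidden, context):
        x = torch.cat([self.embed(prev_word), context], dim=-1)
        hidden = self.cell(x, hidden)
        return self.out(hidden), hidden

def decode(decoder, context, reference, teacher_forcing=True):
    # reference: [batch, max_len] ground-truth word indices; index 0 is assumed to be <BOS>.
    batch, max_len = reference.shape
    prev_word = torch.zeros(batch, dtype=torch.long)
    hidden, outputs = None, []
    for t in range(max_len):
        logits, hidden = decoder.step(prev_word, hidden, context)
        outputs.append(logits)
        # Teacher forcing feeds the ground-truth previous word from the reference caption;
        # otherwise the decoder consumes its own previous prediction, as it must at test time.
        prev_word = reference[:, t] if teacher_forcing else logits.argmax(dim=-1)
    return torch.stack(outputs, dim=1)  # [batch, max_len, vocab_size]

decoder = ToyDecoder()
logits = decode(decoder, torch.randn(2, 64), torch.randint(0, 50, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 50])
```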

Fig. 7 The attention mechanism in the Encoder–Decoder architecture at time t

Table 4 The attention mechanism employed for video captioning with visual and language components; short contributions and shortcomings (if any) are also mentioned for each approach

Different types of attention can be applied depending on the nature of the problem and the situation. Figure 6 shows the adoption of different types of attention mechanism for video captioning. In Gella et al. (2020), the authors proposed two categories of temporal attention mechanism (local and global temporal structures) for the task of video description. The local temporal structure symbolizes fine-grained, detailed, in-depth information, like picking up a spoon or lying on a bed, whereas the global temporal structure refers to the sequence of events, objects, shots, and persons in the video. For a video description system to be state of the art, it must selectively concentrate on the most prominent features of the sequence in the video, exploiting both global and local temporal information.

To generate high-quality captions, the model needs to integrate the fine-grained visual clues from the image/frame. Lu et al. (2017) proposed a novel adaptive attention model with a visual sentinel. The spatial and adaptive attention-based model was capable of automatically making decisions on when to count on visual signals and which part of the image to focus on at a particular time, and vice versa. The combination of spatial and adaptive attention with the employed LSTM produced an additional visual sentinel providing a fallback option to the decoder. The sentinel gate helps the decoder get the required information from the image. Supporting that idea, researchers in Gao et al. (2019) and Song et al. (2017) suggested a system of hierarchical LSTM (hLSTMat) based on adaptive and temporal attention to enrich the representation ability of the LSTM. The model’s capability to adapt to low-level visual or high-level language information at a certain time step demonstrated robust description for videos.

Two design strategies commonly available in the captioning-related literature are top-down and bottom-up. The top-down (or modern) strategy starts from the essence, gist, or central idea of the image/frame and then transforms that gist into appropriate words, whereas the bottom-up (or classical) approach first abstracts words for the various dynamics of the frame and then combines those words in a coherent manner. Both approaches suffer from certain limitations: top-down is unable to attend to fine-grained details, and end-to-end training is not possible for the bottom-up approach. To get the benefits of both strategies, You et al. (2016) developed a model that combines the two approaches through semantic attention. The proposed model is capable of selectively attending to semantic ideas and regions, guiding when and where to pay more attention. The fusion of attention with the employed RNN structure leads to more efficient and robust performance. Likewise, to tackle the correlation between caption semantics and visual content, Gao et al. (2017) proposed an end-to-end attention LSTM with semantic consistency to automatically generate captions with rich semantics. Attention weights computed from the video's spatial dynamics are fed into the LSTM decoder, and finally, to bridge the semantic gap between visual content and generated captions, a multi-word embedding methodology is integrated into the system.

The role of spatial and temporal attention exploited for the task of video captioning is very important. Temporal attention refers to the specific case of visual attention that involves focusing attention on a particular instant in time, whereas spatial attention involves some specific location in space. Most of the recent models have adopted spatial-temporal attention to upgrade the accuracy of the model. The studies in Lowell et al. (2014) and Laokulrat et al. (2016) presented early approaches, exploiting temporal attention for sequence-to-sequence learning. To attend to both spatial and temporal information present in video frames, Chen et al. (2018b) presented a visual framework based on saliency spatio-temporal attention (SSTA) to extract the informative visual information better and then transform it into natural sentences using an LSTM decoder. The designed spatial mechanism facilitates capturing the dominant visual notions from salient regions, and the semantic context from the non-salient regions of the video frame. Experimentation on spatial attention demonstrated that employing residual learning for spatial attention feature generation can improve performance. Models with their approach, visual, and language components are summarized in Table 4 for convenience.

Temporal attention commonly captures global features, whereas spatial attention captures local features. Xu et al. (2020) proposed channel attention along with spatial and temporal attention to ensure consistency in the visual features when generating natural language descriptions. Channel features refer to the multiple feature maps generated by each CNN layer. Spatial (S), temporal (T), and channel (C) attention weights are used to compute the fused features for decoding and caption generation. Eight different combinations of the three attentions were investigated, and S-C-T was the best performing combination, defining the order in which the attentions should be applied while capturing features. For end-to-end learning, Chen and Jiang (2019) developed Motion Guided Spatial Attention (MGSA), a spatial attention system that exploits motion between video frames, together with a Gated Attention Recurrent Unit (GARU).

Considering attention while incorporating external linguistic knowledge in a captioning system, Zhang et al. (2020) proposed a combination of the object-relational graph (ORG) model with teacher recommended learning (TRL) by Williams and Zipser (1989). The explored external language model (ELM) produces semantically more analogous captions for long sequences. Appearance, motion, and object features are extracted by employing 2D and 3D CNNs, reflecting the temporal and spatial dynamics of the given video. The STAT captioning system decoder (Yan et al. 2020) automatically selects important regions for word prediction depending on the local, global, and motion features extracted, exploiting the spatial and temporal structures in the video. The end-to-end semantic-temporal attention (STA-FG) model (Gao et al. 2020) integrated global semantic visual features of a video into the attention network to enhance the quality of generated captions. The hierarchical decoder comprises a semantic-based GRU, a semantic-temporal attention block, and a multi-modal decoder for word-by-word, semantically rich, and accurate caption generation. SibNet (Liu et al. 2020) employs a dual-branch structure for video encoding where the first branch deals with visual content encoding, and the second branch captures semantic information in the video, exploiting visual-semantic joint embeddings. The two branches were designed using temporal convolution blocks (TCBs) and fused employing soft attention for caption generation. OmniNet (Pramanik et al. 2019), employing transformer and spatio-temporal cache mechanisms, supports multiple input modalities and can perform parts-of-speech tagging, video activity recognition, captioning, and visual question answering. Because the self-attention mechanism in the transformer architecture efficiently captures global temporal dependencies in sequential data, simultaneous shared learning from multiple input domains is possible for accurate and superior performance.

The intrinsic multi-modal nature of video (i.e., static or appearance features, motion features, and audio features) contributes while generating captions. Learning most of these features increases the model’s ability to better understand and interpret the visuals, thus improving the overall captioning quality. The video description systems proposed in Wang et al. (2018c), Li et al. (2017), Xu et al. (2017), and Hori et al. (2017) exploit a multi-modal attention mechanism for automatic natural language sentence generation.

More recently, dense video captioning (sports-related) was proposed in Yan et al. (2019) to segment distinct events in time and then describe them in a series of coherent sentences, focusing particularly on fine-grained details of teams at multiple levels of granularity. The model auto-narrates inter-team, intra-team, and individual actions, plus group interactions and all the interactive actions, in a progressive manner. A dense multi-granular attention block exploits spatio-temporal granular feature selection to generate the description. The authors also developed a Sports Video Narrative (SVN) dataset comprising 6k sports videos from YouTube.com, and designed an evaluation metric, Fine-grained Captioning Evaluation (FCE), to measure the accuracy of the generated linguistic description, demonstrating fine-grained action details along with the complete spatio-temporal interactional structure for dense caption generation.

3.4 Discussion—attention based approaches

The attention mechanism, a general notion of memory, was first implemented to improve Encoder–Decoder models in the machine translation domain (Bahdanau et al. 2015). Its key idea is to combine all the encoded input vectors in a weighted manner, with the most salient vectors receiving the highest weights. The mechanism forms a direct connection to each encoder time step, enabling the decoder to flexibly use the most relevant parts of the input sequence. It was created primarily to address the bottleneck of the ED's fixed-length encoding vector, which cannot retain long and complex sequences and therefore hinders system performance on long-range dependencies. In implicit attention, the system tends to ignore some parts of the input while concentrating on others; in contrast, explicit attention weighs each part of the input based on previous inputs and concentrates accordingly. The proposed attention variants include soft (Vaswani et al. 2017; Liu et al. 2018), hard, self (Pramanik et al. 2019), adaptive, semantic, temporal, spatial (Chen and Jiang 2019), spatio-temporal (Chen et al. 2018b; Yan et al. 2020; Zhang et al. 2020), semantic-temporal (Gao et al. 2020), residual (Li et al. 2019b), global, and local (Peng et al. 2021) attention. Attention alleviates the vanishing gradient problem by providing a direct connection between the visual and language modules. The memory of the mechanism is encapsulated in attention scores computed over time; the attention score acts as a magnifier, directing where to focus in the input for accurate output generation. Several optimization techniques, i.e., teacher forcing, curriculum learning, and reinforcement learning, have also been combined with the attention mechanism in the ED structure to further boost system performance. Despite the easy-to-understand nature of attention, more theoretical studies are needed to explain how attention behaves in complex scenarios.

3.5 Transformer mechanism

The transformer, the first sequence transduction model based entirely on attention, has promptly become the model of choice in language processing; it is a deep learning architecture introduced in 2017. It transforms one sequence into another following the ED architecture with an attention mechanism, but it differs from the formerly explained ED mechanism in that it does not employ any recurrent networks, i.e., an RNN, a GRU, or an LSTM. Transformers are designed to handle ordered sequences of data; however, unlike RNNs, they do not require ordered processing of the data, resulting in effective and efficient parallelization during training compared to recurrent architectures. Having become a fundamental building block of most natural language-related tasks, the transformer facilitates more parallelization during training, along with training on huge amounts of data. Table 5 lists some transformer-based approaches.

Self-attention, or intra-attention, is an attention mechanism that relates different positions of a single sequence in order to compute a representation of that sequence. The transformer revolves around self-attention, following the Encoder–Decoder architecture and encoding each position without a recurrent network. Its multiple layers of self-attention address the prohibitive sequential nature, computational complexity, and memory utilization of CNNs and RNNs; they allow for parallelization and thus accelerated training, handle global/long-range dependency learning with minimal inductive bias (prior knowledge), and facilitate domain-agnostic processing of multiple modalities, i.e., text, images, and videos. Because of these performance-boosting characteristics, the transformer has become the model of choice in NLP, CV, and the cross-modal tasks combining these two leading realms. In contrast, self-attention added to recurrent architectures still relies on sequential processing of the input at the encoding step, resulting in computational inefficiency because the processing cannot be parallelized (Vaswani et al. 2017). Undoubtedly, self-attention-based transformers are a considerable improvement over recurrence-based sequential modeling.
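As a concrete illustration of the self-attention computation described above, the following minimal sketch (in PyTorch; tensor names and shapes are illustrative assumptions, not tied to any cited implementation) computes single-head scaled dot-product self-attention over a sequence:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (B, T, d_model).
    w_q/w_k/w_v are (d_model, d_k) projection matrices; a bare-bones sketch of
    the mechanism from Vaswani et al. (2017)."""
    q = x @ w_q                                      # queries  (B, T, d_k)
    k = x @ w_k                                      # keys     (B, T, d_k)
    v = x @ w_v                                      # values   (B, T, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # pairwise compatibilities (B, T, T)
    weights = F.softmax(scores, dim=-1)              # each position attends to all positions
    return weights @ v                               # (B, T, d_k) contextualised representations
```

Multi-head attention simply runs several such projections in parallel and concatenates the results, which is what allows processing of the whole sequence at once rather than step by step.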

The strong motive behind the development of the transformer was to get rid of the problem faced when learning long-range dependencies in sequences, and to allow for more parallelization by eliminating convolution and recurrence. The transformer does not rely as heavily on the prohibitive sequential nature of input data as the CNNs and RNNs do. Ordered sequential processing in the former deep models is a considerable obstacle in parallelization of the process. If the sequences are too long, it is either difficult to remember the content of distant positions in the sequence, or too difficult to ensure correctness. Although CNNs are much less sequential than RNNs, even then, the number of steps required to collect information from far-off positions, as the sequence grows, increases the computational cost and causes the long-range dependencies issue.

Fig. 8 The standard/vanilla transformer architecture (Vaswani et al. 2017)

Similar to the ED structure, a transformer also has two key components: an encoder and a decoder. Both encoder and decoder contain a stack of identical units. Each encoder consists of two layers: the self-attention layer and the feed-forward neural network layer. The self-attention layer helps the encoder connect a specific part of the input sequence with other parts. The embedding is performed only in the bottom-most encoder, because only that encoder receives the raw embedding vectors; every other encoder takes its input from the previous encoder, i.e., the output of one encoder becomes the input of the next. After embedding, the parts of the input sequence flow through the two layers of the encoder, allowing parallel execution. Both the attention and feed-forward neural network layers have a residual connection around them and are followed by a normalization layer. The decoder contains the same layers with an additional ED attention layer to help the decoder focus on the relevant parts of the input sequence while generating captions.
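A minimal sketch of one encoder unit as just described (self-attention plus a position-wise feed-forward network, each with a residual connection and layer normalization) might look as follows in PyTorch; the hyper-parameter values are illustrative assumptions:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder unit: self-attention + feed-forward, each wrapped with a
    residual connection and layer normalization. Sizes are assumptions."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        # x: (B, T, d_model) embedded input sequence
        attn_out, _ = self.self_attn(x, x, x)        # each position attends to every other position
        x = self.norm1(x + self.drop(attn_out))      # residual connection + layer norm
        x = self.norm2(x + self.drop(self.ffn(x)))   # position-wise feed-forward sub-layer
        return x
```

Stacking several such blocks reproduces the encoder side of Fig. 8; the decoder adds the ED attention layer on top of the same structure.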

Multi-head attention (a refinement of self-attention) improves the performance of the attention layer by extending the model's ability to focus on different positions of the input sequence and by providing the attention layer with multiple representation sub-spaces. In order to determine the position or distance of each word in the input sequence, the transformer adds a vector to each input embedding, i.e., a positional encoding. Because this vector follows a specific pattern, it facilitates efficient learning in the model, and the positional encoding can also scale to unseen input lengths.

The output of the top encoder is transformed into attention vectors, K (Keys) and V (Values), and is fed into each decoder's ED attention layer. The self-attention layer in the decoder can only attend to earlier positions in the output sequence, which is enforced by masking future positions before the Softmax calculation. The working mechanism of the ED attention layer in the decoder is the same as that of the multi-head attention layer, except that it creates its Queries vector from the layer below it and accepts the Keys and Values vectors from the top encoder. Linear (logit) and Softmax layers at the end of the decoder choose the word with the highest probability.
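The restriction to earlier positions is typically implemented with a causal mask applied to the attention scores before the Softmax; a small illustrative sketch (tensor names assumed) is:

```python
import torch

def causal_mask(t):
    # Upper-triangular boolean mask: True marks future positions that must not be attended to.
    return torch.triu(torch.ones(t, t), diagonal=1).bool()

# Usage sketch: blocked positions get -inf so the softmax assigns them zero weight.
scores = torch.randn(1, 5, 5)                               # (B, T, T) raw attention logits
scores = scores.masked_fill(causal_mask(5), float("-inf"))
weights = torch.softmax(scores, dim=-1)                     # each row weights only current/earlier positions
```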

Table 5 Transformer-based approaches for video description

To solve the issues related to the computational complexity, memory utilization, and long-term dependencies of sequence-to-sequence modeling, several variants of transformers have been proposed in the literature over time. Since video description is a sequence-to-sequence modeling task, these updated versions of transformers can be utilized for reduced complexity and superior performance.

3.5.1 Standard/Vanilla transformer

A simple transduction architecture for sequence modeling is entirely based on an attention mechanism (Vaswani et al. 2017) with the objectives of parallelization (the ability to process all input symbols simultaneously) and a reduction in sequential computation, i.e., a constant number of operations is required to determine the dependency between two input symbols without considering their positional distance in the sequence. The commonly used recurrent layers in the ED architecture are replaced with multi-head self-attention layers where self-attention is about computing the sequence representation by relating different positions or parts of it.

A positional encoding vector, used to provide word-position context, is combined with the input embedding in both the encoder and decoder stacks; since no recurrence or convolution is involved, the positional encoding serves to determine a word's relative or absolute position in the sequence. The dimensions of the input embedding and the positional encoding are the same. Sinusoids (sine and cosine functions of different frequencies) are used to compute these positional encodings. Although learned positional encodings (Gehring et al. 2017) and sinusoidal positional encodings produce nearly identical results, sinusoids are preferred because they extrapolate to longer sequence lengths. Along with the multi-head attention layers, each layer in the encoder and decoder contains a feed-forward network (FFN) consisting of two linear transformations with a rectified linear unit (ReLU) activation.
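A minimal sketch of the sinusoidal positional encoding described above (sine on even dimensions, cosine on odd ones, with geometrically increasing wavelengths) is given below; the function and variable names are assumptions for illustration:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Positional encodings as in Vaswani et al. (2017): sin on even dimensions,
    cos on odd dimensions, with wavelengths forming a geometric progression."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(d_model)[None, :]                      # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe   # added element-wise to the input embeddings of the same dimension
```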

Three main requirements motivated the use of self-attention for mapping input and output sequences: the total computational complexity per layer, the amount of computation that can be parallelized, and the path length between long-range dependencies in the network. The number of operations is fixed while computing the representation. Moreover, self-attention layers are considered faster than recurrent layers and are capable of producing models with increased interpretability. Figure 8 represents the architecture of the standard/vanilla transformer.

3.5.2 Universal transformer

The Universal Transformer (UT) (Uszkoreit and Kaiser 2019), proposed in 2019 by Google, introduced recurrence into the transformer to address the issue that the standard/vanilla transformer is not computationally universal. The UT, a generalized form of the standard transformer, is a parallel-in-time recurrent self-attentive sequence model based on the ED architecture, and recurrently refines the representation of every position in both the input and output sequences. The recurrence is over the depth, not over the position in the sequence. These representations are revised in parallel in two steps: first, a self-attention mechanism is used for information exchange; second, a transition function is applied to the output of self-attention. For the standard transformer and RNNs, the depth (the number of sequential steps in the computation) is fixed because of the fixed number of layers, whereas there is no limit to the number of transition-function applications in a UT, making its depth variable. This depth is the main difference between the standard transformer and the UT.

In the encoder section of the UT, representations are computed by applying multi-head soft attention at each time step for all positions in parallel, followed by a transition function, with residual connections, dropout, and layer normalization applied. The transition function can be either a separable convolution or a fully connected neural network. A dynamic per-position halting mechanism based on Adaptive Computation Time (ACT) is also incorporated to select the number of computational steps required to refine each symbol, resulting in enhanced accuracy for many structured algorithmic as well as linguistic tasks. Bilkhu et al. (2019) employed a UT for single, as well as dense, video captioning tasks, utilizing a 3D CNN for video feature extraction, and reported promising results.

3.5.3 Masked transformer

Dense video captioning is about detecting and describing temporally localized events in a video. An end-to-end masked transformer model was proposed in Zhou et al. (2018) for dense video captioning. The proposed model consists of three parts. The video encoder is composed of multiple self-attention layers; since events involve long-range dependencies, self-attention is used instead of RNNs for more effective learning. A proposal decoder following ProcNets (Zhou and Corso 2016) (an automatic procedure segmentation method) decodes the start and end times of events with a confidence score. A captioning decoder takes input from both the video encoder and the proposal decoder, converting the event proposals into a differentiable mask that restricts attention to the proposed event. Both decoders learn during training to adjust for the best caption generation. Zhou et al. (2018) proposed this differentiable masking scheme to ensure training consistency between the proposal and captioning decoders. A standard transformer is employed for both encoder and decoder because of its fast self-attention mechanism, implemented for accurate and useful performance.

3.5.4 Two-view transformer

Two-view Transformer (TvT) is a video captioning technique derived from the standard transformer and accompanied by two fusion blocks in the decoder layer to combine different modalities effectively. Parallelization, the primary quality of a transformer, leads to efficient and robust training activity, and instead of simple concatenation, two types of fusion blocks are proposed to explore information from frame features, motions, and previously generated words.

TvT (Chen et al. 2018) contains two views of visual representations extracted by the encoder block, i.e., a frame representation obtained using a 2D-CNN (ResNet-152 and NasNet pre-trained on ImageNet) on every frame individually, and a motion representation obtained by employing a 3D-CNN (I3D pre-trained on Kinetics) on consecutive frames.

The decoder block contains two types of fusion block: add-fusion and attentive-fusion. The add-fusion block simply combines the frame and motion representation with a fixed weight between 0 and 1. The attentive-fusion block combines the two representations in a learnable way such that these two representations, with previously generated words, can jointly guide the model to accurately generate a description.
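The two fusion ideas can be sketched as follows; the layer names, sizes, and gating choice are illustrative assumptions rather than the published TvT implementation:

```python
import torch
import torch.nn as nn

class AddFusion(nn.Module):
    """Add-fusion sketch: mix frame and motion representations with a fixed weight."""
    def __init__(self, lam=0.5):
        super().__init__()
        self.lam = lam                                  # fixed weight in [0, 1], an assumption

    def forward(self, frame_repr, motion_repr):
        return self.lam * frame_repr + (1 - self.lam) * motion_repr

class AttentiveFusion(nn.Module):
    """Attentive-fusion sketch: a learnable gate conditioned on the previously
    generated words (word_state) decides how to mix the two representations."""
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(3 * d_model, 1)

    def forward(self, frame_repr, motion_repr, word_state):
        g = torch.sigmoid(self.gate(torch.cat([frame_repr, motion_repr, word_state], dim=-1)))
        return g * frame_repr + (1 - g) * motion_repr
```

The add-fusion variant treats the mixing weight as a hyper-parameter, whereas the attentive variant learns it jointly with the decoder, which is the distinction emphasized above.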

3.5.5 Bidirectional transformer

Bidirectional Encoder Representations from Transformers (BERT) (Kenton et al. 2019; Sun et al. 2019b), a conceptually simple yet powerful fine-tuning-based bidirectional language representation model, is the state of the art for several NLP-specific tasks. BERT uses a bidirectional self-attention mechanism to carry out the tasks of masked language modeling and next-sentence prediction. VideoBERT (Sun et al. 2019b), based on BERT, was proposed primarily for text-to-video generation or future prediction, and can be utilized for automatic illustration of instructional videos, such as recipes. VideoBERT has also been applied to video captioning following the masked transformer (Zhou et al. 2018) with a transformer ED, but the inputs to the encoder are replaced with features extracted by VideoBERT. VideoBERT reliably outpaces the S3D baseline (Xie et al. 2018), particularly on the CIDEr score. Furthermore, by combining VideoBERT and S3D, the proposed model demonstrated outstanding performance on all metrics. VideoBERT is capable of learning high-level semantic representations and, hence, achieved substantially better results on the YouCookII dataset. Vision & Language BERT (ViLBERT) (Lu et al. 2019) extended BERT to jointly represent text and images; it consists of two parallel streams (visual processing and linguistic processing) interacting through co-attentional transformer layers. ViLBERT with co-attentional transformer blocks outperformed its ablations and surpassed state-of-the-art models when transferred to multiple established vision-and-language tasks, e.g., visual question answering (VQA) (Antol et al. 2015), visual commonsense reasoning (VCR) (Zellers et al. 2019), grounding referring expressions (Kazemzadeh et al. 2014), and caption-based image retrieval (Young et al. 2014).

3.5.6 Sparse transformer

Even with transformers, processing lengthy sequences demands more time and memory, resulting in poor performance and inefficient systems. The sparse transformer (Child et al. 2019) introduced several sparse factorizations of the attention matrix, as well as restructured residual blocks and weight initialization for better training of deeper networks, and a reduction in memory usage. Unlike the standard transformer, where training with many layers is difficult, the sparse transformer facilitates hundreds of layers by using pre-activation residual blocks. Instead of fixed positional encoding, learned embeddings are used, which proved useful and efficient. Gradient checkpointing is incorporated to effectively reduce the memory required to train deep neural networks. Dropout is applied once, at the end of the residual attention, instead of within the residual block. Experiments with the sparse transformer demonstrated better performance on long-sequence modeling with less computational complexity.

3.5.7 Reformer (the efficient transformer)

To improve the efficiency of the transformer on long sequences, the Reformer (Kitaev et al. 2020) was proposed, with reduced complexity and reversible residual layers (Gomez et al. 2017) so that activations are stored only once during training. Inside the FFN layer, the activations are split and processed in chunks to save memory. The inclusion of locality-sensitive hashing (LSH) in attention strongly influences training, depending on the total number of hashes employed. It was observed that regular attention becomes slower for lengthy sequences, whereas the speed of LSH attention remains stable. Experiments performed on text- and image-generation tasks produced the same results as the standard transformer, but with higher speed and more efficient memory usage.

3.5.8 Transformer-XL

Transformer-XL (Dai et al. 2020) is based on the standard transformer architecture and deals with better learning of long-range dependencies. Its key technological contributions include introducing recurrence into a fully self-attentive model and developing a dedicated positional encoding scheme. It introduces a simple but more effective relative positional encoding design that generalizes to attention lengths longer than the ones observed during training. For both character-level and word-level modeling, Transformer-XL is the first self-attention model that achieves significantly better results than RNNs. In speed evaluations against the 64-layer standard transformer proposed in Al-Rfou et al. (2019), Transformer-XL was up to 1,874 times faster.

3.6 Discussion—transformer based approaches

In the wake of Vaswani et al.'s (2017) successful implementation of the transformer in natural language processing, the transformer has become increasingly popular in a wide range of fields, including computer vision and speech analysis. Compared to the vanilla model, transformers have been improved in several variants from the perspectives of generalization, parallelization, adaptation, and efficiency. The first application in NLP, machine translation (Vaswani et al. 2017), initiated this journey, and with its robust representation capabilities, the transformer is now proving its worth in the computer vision domain. Specific to video description, Zhou et al. (2018) introduced the first video paragraph captioning model using a masked transformer. Owing to the sequential nature of the captioning task, transformers, unlike RNNs that unroll the sequence one step at a time, can process the entire sequence in parallel at both ends, resulting in efficient and accurate captioning. A transformer enhanced with an external memory block further facilitates maintaining a history of the visual and language information and augmenting the current segment. Dependencies among different sequence segments are learned through the self-attention mechanism inside the transformer, so long-range dependencies, which are hard for RNNs to resolve on extensive sequences, are no longer an issue. The vision transformer (ViT) (Hussain et al. 2022) recognized human activities in surveillance videos by adopting a CNN-free approach and capturing long-range dependencies in time to accurately encode relative spatial information. Likewise, the video vision transformer (ViViT) (Arnab et al. 2021) factorized the spatial and temporal dimensions of the input video to handle the long token sequences encountered in video. Models employing modern transformers have demonstrated comparable results in handling long-range dependencies on the video description task, yet developing efficient transformer models for computer vision tasks remains an open problem. Transformer models are usually huge and computationally expensive; despite their success in various applications, they require large amounts of computing and memory resources, which limits their use on resource-constrained devices such as mobile phones (Han et al. 2022). Research into designing efficient transformer models for visio-linguistic tasks on resource-limited devices therefore needs attention.

3.7 Deep reinforcement learning (DRL)

Trial and error, i.e., gaining experience and learning from it, is the core of reinforcement learning (RL). RL is about taking appropriate actions in a certain environment and accommodating the reward/penalty by following a policy. Deep RL approaches have shown efficient performance in real-world games; in particular, in 2013, Google DeepMind (Mnih et al. 2013, 2015) took the initiative and demonstrated that a single architecture could successfully learn control policies in a range of different environments with minimal prior knowledge, showing a successful integration of RL with deep network architectures. Although DRL models face many adversities compared to conventional learning, DRL has shown extraordinarily proficient performance in the field of captioning. When employing DRL for description, the evaluation metrics considered for the reward function are optimized, with attention to the readability of the generated captions, training stability, and system convergence. Figure 9 shows the RL agent-environment interaction for video descriptions, and some of the well-known DRL approaches and their components (agent, action, environment, reward, and goal) are summarized in Table 6 for a quick view.

An efficacious combination of RL (He et al. 2019) with supervised learning was presented in a multi-task learning framework. The goal of the system is to learn the policy to correctly ground the specific descriptions in the video. Hence, as a reward, the model encourages the agent to better match clips gradually, which is carried out by helping the agent get precise information from the environment, and to maximize the reward by exploring or exploiting the whole environment, forming a sequential decision case. The actor-critic algorithm (Montague 1999) is employed to generate policy and take appropriate action. The agent is responsible for the iterative adjustment of temporal boundaries until specified conditions are met. After an action is accomplished, the environment (a combination of video, description, and temporally grounded boundaries) is modified accordingly. State vectors combine a description with global, local, and location features, which are then fed into a GRU-FC-based actor-critic module for policy and state-value learning. A penalty mechanism is also defined to keep computational costs within limits. As the name indicates, the agent (model) reads the description, watches the video and localization, and after that iteratively moves the temporal grounding boundaries for best clip matching, according to the description.

The ED architecture intrinsically obstructs the use of end-to-end training because of lengthy sequences in both input and output for the model. Therefore, a multi-task RL model (Li and Qiu 2020) to avoid over-fitting was proposed for end-to-end training. The primary job of the model is to mine or extract as many tasks as possible from human-annotated videos, which can regulate the search space of the ED network. After that, end-to-end training is carried out for video captioning. The auxiliary assignment of the model is to predict the characteristics mined from reference captions and, based on these predictions, maximize the reward defined in the RL system. Specific to RL, the objective of the model is to train an agent to accomplish the tasks in an environment by performing a sequence of actions. For video captioning, the model aims to automatically generate a precise and meaningful sentence after processing the provided video. The agent’s action is to predict the next word in the sequence at each time step. The model’s reward is defined as the evaluation metric used in the test phase. The CIDEr score functions as a reward signal. Finally, evaluation of multi-task training revealed that domain-specific video representation is more influential than generic image features.

Sequence-to-sequence models optimize word-level cross-entropy loss during training, whereas the video captioning model proposed in Pasunuru and Bansal (2017) optimizes sentence-level, task-based metrics using policy gradients and mixed loss methods for RL. Moreover, an entailment enhanced reward, CIDEnt, was proposed that adjusts phrase-matching-based metrics and, on achieving a low entailment score, penalizes the phrase-matching metric (CIDEr-based) reward. An automatically generated caption gets a high entailment score only when the generated caption has logical matching with the ground truth annotation, instead of word matching.

Table 6 DRL models with their system components

Cross-entropy loss and reward-based loss are combined as a mixed loss to maintain output fluency and resolve the exposure bias issue. At first, the CIDEr reward demonstrated significant improvement, and after that, the CIDEnt reward further enhanced system performance.
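A minimal sketch of such a mixed objective is shown below, assuming the sentence-level rewards (e.g., CIDEr or CIDEnt scores) and log-probabilities have already been computed per caption; the weighting factor and baseline choice are tunable assumptions, not the exact published formulation:

```python
import torch

def mixed_loss(log_probs_xe, log_probs_sampled, reward_sampled, reward_baseline, gamma=0.9):
    """Sketch of a mixed training objective: a weighted sum of cross-entropy loss
    and a REINFORCE-style loss whose reward could be CIDEr (or CIDEnt)."""
    # log_probs_xe: (B,) summed log-probabilities of the ground-truth words (teacher forcing)
    # log_probs_sampled: (B,) summed log-probabilities of a sampled caption
    # reward_sampled / reward_baseline: (B,) sentence-level metric scores
    loss_xe = -log_probs_xe.mean()
    advantage = (reward_sampled - reward_baseline).detach()   # baseline reduces gradient variance
    loss_rl = -(advantage * log_probs_sampled).mean()         # policy-gradient surrogate loss
    return gamma * loss_rl + (1 - gamma) * loss_xe            # gamma trades fluency vs. metric reward
```

Keeping a cross-entropy term in the loss is what preserves output fluency, while the reward term directly pushes the sentence-level metric, addressing the objective mismatch and exposure bias discussed below.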

Most captioning systems are trained by maximum likelihood estimation (MLE), i.e., maximizing the similarity between the generated and reference captions, or by minimizing the cross-entropy (XE) loss. However, the MLE/XE approach suffers from two inadequacies: objective mismatch and exposure bias. Recent research demonstrated that, for the captioning task, evaluation metrics can be optimized directly using RL, keeping in mind the associated computational cost and designing the reward for system convergence. The Self-Consensus Baseline (SCB) model proposed in Phan et al. (2017) trains concurrently on multiple descriptions of the same video and employs human-annotated captions as a baseline for reward calculation, instead of creating a new baseline for each generated caption. Following the ED approach, the encoder uses ResNet for static image features, C3D for short-term motion, and MFCC for acoustic features; GloVe is used for word embedding, and an LSTM decoder is employed for language generation. Taking the LSTM language model as an agent in the environment of video features and words, the action is to predict the next word, attaining the dual goal of producing an accurate textual alternative to the given video and minimizing the negative expected reward of the model. Compared to the MIXER approach (Ranzato et al. 2016), where RL training gradually mixes into XE training to stabilize the learning, CST trains both RL and XE simultaneously. A connection between RL and XE training is established in this research, utilizing consensus among multiple reference captions for training improvement and for eliminating objective mismatch and exposure bias.

Fig. 9 DRL agent-environment interaction for video description

A fully differentiable deep neural network, comprising a higher- and a lower-level sequence model, was proposed in Wang et al. (2018b) for video description. Employing a hierarchical-RL (HRL) framework, the agent, environment, action, reward, and goal are defined to efficiently learn semantic dynamics adopting the ED architecture. Each video is sampled at 3 fps, extracting ResNet-152 (Zhang et al. 2017) features from the sampled frames. The extracted features are fed into a two-phased encoder, i.e., a low-level bidirectional LSTM (Schuster and Paliwal 1997) and a high-level LSTM (Cascade-correlation and Chunking 1997). In the decoding phase, the HRL agent serves as the decoder. To better capture the temporal dynamics, an attention strategy is employed. The HRL agent is composed of three components: (1) a low-level worker that selects certain actions at each time step to achieve the goal, (2) a high-level manager that sets goals, and (3) an internal critic (an RNN structure) that checks accomplishment of the task and informs the manager accordingly. Both worker and manager are equipped with the attention mechanism. A strong convergence policy is a challenging area in RL implementation, and the proposed HRL model achieved high convergence by applying cross-entropy loss optimization. The model is able to capture in-depth details of the video content and can generate more detailed and accurate descriptions.

To avoid redundant visual processing and to lower computational costs, a plug-and-play PickNet model (Chen et al. 2018a) was proposed to perform informative frame selection. The solution comprises two parts: first, PickNet for efficient frame selection, and second, a standard encoder (LSTM) and decoder (GRU) architecture for caption generation. RL-based PickNet selects the informative frames without having full details of the environment, i.e., it decides to pick or drop a frame only on the basis of the current state and the history. The agent selects a subset of frames retaining the maximum visual content (six to eight frames selected, on average, from a video), while other models commonly need up to 40 frames for analysis. In pursuit of flexibility, efficiency, and effectiveness, the selected keyframes increase visual diversity and decrease textual inconsistency; visual-diversity and language rewards are defined accordingly, along with a negative reward to discourage selecting too many (or too few) frames. Model training is performed in three phases: first, a supervision phase, where the ED is pre-trained; second, a reinforcement phase, where PickNet is trained by employing RL; and third, an adaptation phase in which both PickNet and the ED are jointly trained.
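A toy sketch of the pick/drop decision in the spirit of PickNet is shown below; the architecture, sizes, and inputs are assumptions for illustration, not the published model:

```python
import torch
import torch.nn as nn

class FramePicker(nn.Module):
    """Illustrative pick/drop policy: given the current frame feature and a history
    vector, output the probability of keeping the frame (a sketch, not PickNet itself)."""
    def __init__(self, feat_dim, hist_dim):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(feat_dim + hist_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, frame_feat, history):
        p_pick = torch.sigmoid(self.policy(torch.cat([frame_feat, history], dim=-1)))
        action = torch.bernoulli(p_pick)    # 1 = keep the frame, 0 = drop it
        return action, p_pick               # p_pick is reused in the policy-gradient update
```

During the reinforcement phase described above, the visual-diversity and language rewards would be back-propagated through `p_pick` with a policy-gradient estimator.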

3.8 Discussion—deep reinforcement learning (DRL)

In recent years, the Encoder–Decoder structure has demonstrated promising results when fused with attention and transformer mechanisms. However, due to difficulties in handling long-range dependencies and the semantic gap between the visual and language domains, the generated descriptions still contain numerous inaccuracies. These errors can be handled by adopting optimization through deep reinforcement learning. The polishing network (Xu et al. 2021) follows a human proofreading mechanism, evaluating and gradually improving the generated captions to revise word errors and grammatical errors. The choice of evaluation metric as the reward function also plays a role in robust performance; the CIDEr score is the choice in most articles. A deep reinforcement learning framework includes an environment, agent, action, reward, and goal. For video captioning, the goal is to generate an accurate description aligned with the visual information of the video. The generative language model acts as the agent and takes the action of predicting the next word. The provided video and the ground truth descriptions play the role of the environment, rewarding the selected evaluation metric on successful word generation or penalizing the metric score otherwise. The environment updates the state of the attention weights or hidden states according to the employed mechanism. This cycle of the agent's actions and the environment's state and reward updates continues to gradually improve the generated description, as shown in Fig. 9. There has been growing interest in DRL and hierarchical RL based methods in recent years, which have shown comparable results in video description.

4 Results comparison & discussion

The benchmark results generated by various models in the recent past are discussed in this section. Techniques are grouped by dataset and further categorized in chronological order according to the approach/mechanism adopted for experimentation.

Video description models are mostly evaluated on MSVD (Chen and Dolan 2011) and MSR-VTT (Xu et al. 2016) datasets because of the wide-ranging and diverse nature of the videos, the availability of multiple ground truth captions for model training and evaluation, and most importantly, task specificity. For models having multiple variants during experimentation, the best performing variant is reported here. Scores shown in bold were the best performing.

4.1 Evaluation metrics

Most of the metrics commonly used to evaluate automatically generated captions, namely BLEU@(1,2,3,4), METEOR, ROUGE, and WMD, originate from NLP tasks such as NMT and document summarization. CIDEr and SPICE evolved as a result of the increased demand for task-specific (captioning) metrics. It is essential for the description to possess the qualities of acceptability, consistency, and expression fluency, particularly when considering the evaluations made by humans (Sharif et al. 2018). An evaluation metric is considered best when it exhibits a significant correlation with human scores (Zhang and Vogel 2010). A short description of the metrics most used to evaluate automatically generated descriptions is given below; for the detailed computational concepts along with their limitations, please refer to Rafiq et al. (2021).

4.1.1 BLEU

(Bi-Lingual Evaluation Understudy): This evaluation metric, proposed by Doddington (2002), measures the numerical proximity between generated captions and their referenced counterparts. It computes the unigram (single-word overlap) or n-gram (overlap of n adjacent words) precision between the two texts, i.e., reference and generated. Multiple reference annotations for a single video help achieve a reliable BLEU score. The metric is based on precision, which is also its main limitation; the research work by Lavie et al. (2004) demonstrated that a considerably higher correlation with human judgment can be achieved by emphasizing recall more than precision.
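For illustration, sentence-level BLEU can be computed with NLTK as in the following sketch; the captions are toy examples:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a man is slicing a tomato".split(),
    "someone is cutting a tomato into pieces".split(),
]
candidate = "a man is cutting a tomato".split()

# BLEU@4 with equal n-gram weights; smoothing avoids zero scores on short sentences.
score = sentence_bleu(references, candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU@4 = {score:.3f}")
```

Note how the score benefits from having multiple references: an n-gram counts as matched if it appears in any of them.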

4.1.2 METEOR

(Metric for Evaluation of Translation with Explicit ORdering): This metric (Lavie and Agarwal 2007) requires an explicit word-level match between the predicted translation and one or more reference annotations. It supports matching of identical words, synonyms, and words with identical stems, and considers the order of words in the reference and predicted sentences. The score is computed from the harmonic mean of the precision and recall of unigram matches between the sentences (Kilickaya et al. 2017). Moreover, the METEOR score correlates more closely with human judgment (Elliott and Keller 2014).

4.1.3 ROUGE

(Recall-Oriented Understudy for Gisting Evaluation): This metric (Lin 2004) belongs to the family of NLP (document summarization) evaluation metrics. ROUGE has multiple variants used to determine how closely the generated and reference summaries correspond. Among these, ROUGE-N (n-gram co-occurrence) and ROUGE-L (longest common sub-sequence) are relevant to image and video captioning evaluation. ROUGE-N is the n-gram recall between the predicted summary and one or more reference summaries, whereas ROUGE-L computes a similarity score based on the recall and precision of the longest common sub-sequence between the generated and reference sentences.
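A small sketch of the ROUGE-L computation from the longest common sub-sequence (LCS), following the precision/recall formulation described above; the beta weighting is a commonly used assumption:

```python
def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L from the longest common subsequence (LCS) between token lists.
    beta > 1 weights recall more heavily than precision."""
    m, n = len(reference), len(candidate)
    dp = [[0] * (n + 1) for _ in range(m + 1)]      # LCS dynamic-programming table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if reference[i - 1] == candidate[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    recall, precision = lcs / m, lcs / n
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

print(rouge_l("a man is cutting a tomato".split(),
              "a man is slicing a tomato".split()))
```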

4.1.4 CIDEr

(Consensus-Based Image Description Evaluation): CIDEr (Vedantam et al. 2015) is an image description evaluation metric based on human consensus. When comparing a generated sentence with the set of reference human annotations provided for an image, CIDEr captures the underlying notions of salience, accuracy, and grammaticality. Computationally, it relies on the cosine similarity between the reference and generated captions for a given image. The CIDEr-D variant is widely used for image and video description evaluation; it removes verb stemming from the basic CIDEr metric to ensure the correct form of the verb is used, and it exhibits a high Spearman's rank correlation with the original CIDEr score.

All the evaluation metrics follow a "the higher, the better" strategy, i.e., higher scores are better for BLEU, METEOR, ROUGE, and CIDEr. For the models computing BLEU@1, BLEU@2, BLEU@3, and BLEU@4, only BLEU@4 is reported here because its characteristics are most analogous to human annotations.

Fig. 10 Performance evaluation of BLEU, METEOR, ROUGE-L and CIDEr on the MSVD & MSR-VTT datasets for the standard encoder–decoder approach

4.2 Datasets for evaluation

A dataset, i.e., a collection of video clips with their respective annotations or descriptions, forms the basis for training, validating, and testing a model. Domain-specific datasets cover areas such as cooking, movies, social media, in-the-wild scenes, and human actions, whereas open-domain datasets contain a wide variety of videos. Following is a brief description of the most widely used benchmark datasets in recent video description research.

4.2.1 MSVD—the microsoft video description dataset

Table 7 summarizes the results from popular models using the MSVD dataset. MSVD (Chen and Dolan 2011) is one of the earliest available corpora, frequently used by the research community around the globe. It is a collection of 1,970 YouTube video clips provided with human annotations. The collection of these clips was carried out by requesting them from Amazon Mechanical Turk (AMT) workers, who were guided to pick short snippets depicting a single activity and were asked to mute the audio. Each video clip is 10 to 25 seconds long, on average. Afterward, these snippets were labeled with multilingual, single-sentence captions provided by annotators. The frequently used split of the dataset for training, validation, and testing comprises 1,200, 100, and 670 video clips, respectively. Figure 10 shows histograms of BLEU, ROUGE-L, METEOR, and CIDEr scores for standard Encoder–Decoder structures evaluated on the MSVD and MSR-VTT datasets. Figure 11 demonstrates the performance evaluation of transformer-based models on the MSVD and MSR-VTT datasets. Figure 12 graphically presents the performance evaluation of DRL-based methods employing the MSVD and MSR-VTT datasets. Figure 13 depicts the results obtained by attention-based approaches evaluated on the MSVD and MSR-VTT datasets.

Table 7 Video description performance evaluation on the MSVD dataset for all four approaches
Fig. 11 Performance evaluation of BLEU, METEOR, ROUGE-L and CIDEr on the MSVD & MSR-VTT datasets for the transformer mechanism based approach

Considering the standard ED mechanism, benchmarking with MSVD revealed that SeFLA Lee and Kim (2018), a semantic feature learning-based caption generation model, showed a better BLEU score, and VNS-GRU (Chen et al. 2020) achieved best performance results from METEOR, ROUGE, and CIDEr scoring. Advancements in the field of neural machine learning have demonstrated encouraging improvements on the video description task, but models trained using word-level losses cannot correlate well with sentence-level metrics, although all the evaluation metrics are sentence-level. So, metric optimization is critically needed for high-quality caption generation. Deep reinforcement learning is employed for optimization of many techniques. DRL approaches evaluated on the MSVD dataset concluded with the best performance from Pasunuru and Bansal (2017) in all metrics (BLEU, METEOR, ROUGE, and CIDEr).

Fig. 12 Performance evaluation of BLEU, METEOR, ROUGE-L and CIDEr on the MSVD & MSR-VTT datasets for the reinforcement learning based approach

Models employing a transformer mechanism are progressing at a good pace. Among the transformer-based models, Two-view Transformer (Chen et al. 2018) performed the best in BLEU scoring, whereas Non-Autoregressive Video Captioning with Iterative Refinement (Yang et al. 2019) performed excellently under METEOR scoring, and the recently proposed SBAT (Jin et al. 2020) outperformed all previous models based on the ROUGE and CIDEr metrics. For attention-based approaches, SemSynAN (Perez-Martin et al. 2021b) outperformed the existing methods based on BLEU@4, METEOR, ROUGE, and CIDEr scores with the MSVD dataset.

Table 8 Video description performance on the MSR-VTT dataset for all four approaches
Fig. 13 Performance evaluation of BLEU, METEOR, ROUGE-L and CIDEr on the MSVD & MSR-VTT datasets for the attention mechanism based approach

For the overall performance evaluation from all four mechanisms on the MSVD dataset, SeFLA (Lee and Kim 2018), a semantic feature learning-based caption generation model, demonstrated an excellent BLEU score; SemSynAN (Perez-Martin et al. 2021b) produced the top METEOR and ROUGE scores, and VNS-GRU (Chen et al. 2020) achieved the best CIDEr score. It is clear from Table 7 that for short video clips comprising a single activity, the standard ED mechanism and the attention-based mechanism achieved top results.

4.2.2 MSR-VTT—microsoft research video to text

Table 8 demonstrates the results reported from using the MSR-VTT dataset (Xu et al. 2016), which is an open-domain, large-scale benchmark with 20 broad categories and diverse video content bridging vision and language. It comprises 10,000 clips that originated from 7,180 videos. Being open-domain, it includes videos from categories like music, people, gaming, sports, news, education, vehicles, beauty, and advertisement. The duration of each clip, on average, is 10–30 seconds, resulting in a total of 41.2 h of video. To provide good semantics for the clips, 1,327 AMT workers were engaged to annotate each one with 20 natural sentences. The data split suggested in Xu et al. (2016) uses 6,513 videos for training, 497 for validation, and 2,990 for testing.

Considering the standard ED mechanism, benchmarking on the MSR-VTT dataset demonstrated that VNS-GRU (Chen et al. 2020), a variational-normalized semantic GRU-based caption generation model, showed better BLEU and CIDEr scores, while DCM (Xiao and Shi 2019z), a diverse captioning model with a conditional GAN, achieved the best performance on the METEOR and ROUGE metrics. Among the DRL-based methods, Consensus-based Sequence Training (CST) (Phan et al. 2017) was trained concurrently on multiple descriptions of the same video; it employed human-annotated captions as a baseline for reward calculation, instead of creating a new baseline for each generated caption, thereby directly optimizing the evaluation metrics, and it performed well on the BLEU, METEOR, ROUGE, and CIDEr metrics with MSR-VTT. Among the approaches based on a transformer mechanism, the recently proposed SBAT (Jin et al. 2020) outperformed all previous models on all four metrics. In the attention-based approaches with the MSR-VTT dataset, the recently proposed SemSynAN (Perez-Martin et al. 2021b) outperformed the existing methods on the METEOR and ROUGE metrics, whereas MSAN (Sun et al. 2019b), a multi-modal semantic attention network, performed excellently on BLEU and CIDEr.

Considering the overall performance evaluations from all four mechanisms with the MSR-VTT dataset, MSAN (Sun et al. 2019b) demonstrated an excellent BLEU score, DCM (Xiao and Shi 2019z) (a diverse captioning model with a conditional GAN) achieved the best results for METEOR and ROUGE metrics, and the DRL-based CST model (Phan et al. 2017) achieved the best score from CIDEr.

Table 9 Video description performance evaluation on the ActivityNet Captions dataset for three approaches
Fig. 14 Performance evaluation of BLEU, METEOR, ROUGE-L and CIDEr on the ActivityNet Captions dataset for transformer mechanism & standard Encoder–Decoder based approaches

4.2.3 ActivityNet Captions

Results reported from the ActivityNet Captions dataset are presented in Table 9. ActivityNet Captions (Krishna et al. 2017) is a dataset specific to dense event captioning. It covers a wide range of categories and comprises 20k videos taken from ActivityNet, centered around human activities, with a total duration of 849 hours and 100k descriptions. Overlapping events occurring in a video are provided, and each description uniquely describes a dedicated segment of the video, so events are described over time. Temporally localized descriptions are used to annotate each video; on average, each video is annotated with 3.65 sentences and 40 words. Event detection is demonstrated in small clips as well as in long video sequences.

Considering the standard ED mechanism, benchmarking on the ActivityNet Captions dataset demonstrated that Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning (JSRL-VCT) (Hou et al. 2019) produced the top METEOR, ROUGE, and CIDEr scores, whereas Video Captioning of Future Frames (VC-FF) (Hosseinzadeh et al. 2021) achieved the best results on the BLEU metric. Among the transformer-based approaches, the recently proposed COOT (Ging et al. 2020), a cooperative hierarchical transformer model, outperformed all previous models for all four metrics. Although the universal transformer approach (Bilkhu et al. 2019) demonstrated the highest BLEU score, evaluation based on a single metric alone cannot guarantee whole-system performance. For attention-based approaches, the pioneer and creator of the ActivityNet Captions dataset (Krishna et al. 2017) reported scores for all four metrics. Related to dense video captioning, in the overall performance evaluation of all mechanisms on ActivityNet Captions, COOT (Ging et al. 2020) outperformed all others for all four metrics. Figure 14 presents a graphical illustration of the results obtained by the standard Encoder–Decoder and transformer-based models on the ActivityNet Captions dataset.

4.2.4 YouCookII

Table 10 presents results reported with the YouCookII (Zhou and Corso 2016) dataset, another dataset mostly utilized to evaluate dense video captioning systems. This dataset comprises 2k YouTube videos that are almost uniformly distributed over 89 recipes from major cuisines all over the world, using a wide variety of cooking styles, components, instructions, and appliances. Each video in the dataset contains 3–16 temporally localized segments annotated in English. There are 7.7 segments per video, on average. About 2,600 words are used while describing the recipes. The data split was 67% videos for training, 23% for validation, and 10% for testing purposes.

Table 10 Video description performance evaluation on the YouCookII dataset for standard encoder–decoder and transformer-based approaches

Only transformer-based model evaluation results are reported for the YouCookII dataset, where COOT (Ging et al. 2020) again produced excellent scores on all four metrics.

Table 11 Video description performance evaluation on the TV show Caption (TVC) dataset for transformer-based approaches

4.2.5 TVC - TV show caption

Results reported from using the TV show Caption dataset are presented in Table 11. The TVC dataset (Lei et al. 2020b) is a multi-modal captioning dataset with 262k captions, extended from the TV show Retrieval (TVR) dataset by collecting additional descriptions for every annotated video clip or moment. It involves utilizing both video and subtitles for collecting the required information and generating appropriate descriptions. The TVC dataset contains 108k video clips paired with 262k descriptions, and on average, there are two to four descriptions per video clip. Human annotators were engaged to write descriptions for video only, and for video+subtitle where subtitles already existed. The transformer-based MMT model (Lei et al. 2020b), evaluated on TVC for both video and subtitle modalities, outperformed the models using only one of the modalities, establishing that videos and subtitles are equally valuable for concise and appropriate description generation. Unlike previous video description datasets focusing on captions illustrating visual content, the TVC dataset targets captions that also describe subtitles.

As the creators of the TVC dataset, the MMT authors (Lei et al. 2020b) reported comparable results; however, HERO (Li et al. 2020) achieved the highest BLEU, METEOR, ROUGE, and CIDEr scores, facing tough competition from the MMT (Lei et al. 2020b) model.

4.2.6 VATEX - video and TEXt

Table 12 shows the results reported from using the VATEX dataset in both English and Chinese.

Table 12 Video description performance evaluation on the Video And TEXt (VATEX) dataset for standard encoder–decoder and transformer-based and attention-based approaches

VATEX (Wang et al. 2019b) is a large, complex, and diverse multilingual dataset for video description. It contains over 41,269 unique videos covering 600 human activities from Kinetics-600 (Kay et al. 2017). Every clip has 10 English and 10 Chinese captions, with at least 10 words per English caption and 15 words per Chinese caption. In total, VATEX comprises 413k English captions and 413k Chinese captions for 41.3k unique videos. The Chinese descriptions for each video clip are divided into two parts: half of the descriptions directly describe the video content, while the other half are translation-paired with the English captions of the same clip (produced through Google, Microsoft, and a self-developed translation system).

For VATEX English and Chinese evaluations of the transformer model, only the X-linear+transformer model is reported, considering it had the highest scores for all metrics. For attention-based systems, Multi-modal Feature Fusion with Feature Attention (FAtt) (Lin et al. 2020) outperformed the baseline with a significant gap, and recorded the highest results for both English and Chinese captioning. However, if we consider the overall performance comparison, the X-linear+transformer model achieved the highest scores for both English and Chinese captioning based on all four metrics.

In Table 13, demonstrating results for miscellaneous datasets, none of the results is highlighted because they were all evaluated on different datasets with different diversities and complexities, so we cannot compare them directly.

Table 13 Video description performance evaluation on misc datasets

From all the above results, we conclude that for simple, single-sentence caption generation, the standard ED & attention mechanisms provide excellent performance, whereas for dense video captioning, the transformer mechanism outperformed the others. For the models to better correlate with sentence-level losses, DRL-based metric optimization is critically needed for high-quality caption generation.

5 Conclusions

Vision and language are the two fundamental systems of human representations, and combining these two into one intelligent and smart system has long been a dream of artificial intelligence.

This survey investigated in detail the four main approaches to video description systems. These deep learning techniques, primarily employing the ED architecture, further accommodate the attention mechanism, the transformer mechanism, and DRL for efficient and accurate output. Owing to the diverse and complex intrinsic structure of video, capturing all the fine-grained detail and complicated spatio-temporal information present in the video context has not yet been achieved. Building on the accomplishments of image captioning, a lot of research is in progress across the globe on the task of creating video descriptions; even so, further improvement is required in diverse visual information extraction and accurate description generation.

Deep learning video description mostly revolves around recurrence for sequential data processing, but the main bottleneck from long-term dependencies remains. As an alternative to recurrence, the transformer mechanism is capable of parallel processing, accelerated training, and handling long-term dependencies; it is space-efficient and much faster than purely recurrence-based methods, and is the model of choice for current advanced hardware. Researchers worldwide have put their efforts into improving generated video descriptions using different state-of-the-art methodologies, but still, even the best performing method cannot match human-generated descriptions. Despite tremendous improvements, generated descriptions are not yet analogous to human interpretations. So, we can say that the upper bound is still far away, and there is a lot more room for research in this area.

I. There is a need for incorporation of rational expertise in the models to improve the generated captions.

II. The intrinsic multi-modal nature of video contributes to generating captions. Learning multiple features, like visuals, audio, and subtitles (if available in the video), increases the model's ability to better understand and interpret (Ramanishka et al. 2016; Iashin and Rahtu 2020; Wang et al. 2018c; Xu et al. 2017; Hori et al. 2017), thus improving the overall captioning quality. There is a need to explore this research direction further.

III. The design and development of diversity-measuring evaluation metrics to facilitate diverse, efficient, and accurate caption generation is indispensable.

IV. For optimization of video captioning systems, extensive exploration of DRL is required.

V. The unprecedented breakthrough of data-hungry deep learning in various challenging tasks is due to a large number of publicly annotated datasets. The currently available video description datasets lack the visual diversity and language intricacies required to generate human-analogous captions. In particular, for dense video captioning, task-specific dataset creation for improved performance is indispensable. Since the acquisition of high-quality annotations is costly, as an alternative to passive learning (training on a massive labeled dataset), active learning (which attempts to maximize a model's performance while annotating the fewest samples possible) can be explored.

We hope this paper will not only facilitate better understanding of video description techniques, but will also accommodate scientists in future research and developments in this specific area.