1 Introduction

Describing video content in natural language has emerged as a formidable challenge that has attracted considerable attention in recent years. Despite the advancements reviewed in Sect. 2, the field continues to grapple with substantial challenges, with deep learning-based methodologies predominantly driving progress.

Fig. 1

An example of the performance of the hybrid method in producing detailed, multiple- and single-sentence descriptions of a screwing scenario [8] containing 8 video snippets (1,...,8). Detailed sentences: For left hand: (1) The left hand touches the top of a screwdriver on the table. (2) They untouch the table. (3) They move together above the table. (4) They touch a hard disk on the table. (5) They move together inside of a hard disk on the table. (6) They untouch the top of the hard disk and move together above the table. (7) They touch the table. (8) The left hand untouches the top of the screwdriver on the table. For right hand: (1, 2) Idle, (3) The right hand touches near (“around”-relation) the hard disk on the table. (4) The right hand continues to touch near the hard disk on the table. (5) The right hand untouches near the screwdriver from the table. (6, 7, 8) Idle. Multiple sentences: For left hand: (1.2.3.4) The left hand picks up a screwdriver from the table and places it on a hard disk. (4.5) They perform screwing in the inside of the hard disk on the table. (6.7.8) The left hand leaves the screwdriver on the table. For right hand: (3.4) The right hand keeps touching near the hard disk on the table. (5) The right hand leaves the hard disk on the table. One sentence: For left hand: The left hand performs screwing inside of a hard disk on the table by a screwdriver. For right hand: The right hand holds the hard disk and then lets go

This study specifically targets videos of human manipulation actions, which are complex but crucial for applications in robotics, especially for learning from demonstration. Video- and language-based interaction between humans and robots is becoming more common, making it important for machines to understand such actions and generate clear, relevant sentences about them. Human manipulation actions, however, require a careful approach to arrive at good video descriptions.

In response to this need, we have introduced two distinct methodologies to tackle this issue. The first is a hybrid statistical method, which adeptly combines video analysis techniques with the generation of simplified semantic representations (SRe), seamlessly integrated with a stack of Long Short-Term Memory (LSTM) networks. On the other hand, our second approach is an end-to-end trained LSTM-stack, designed to operate directly on the video data. While the latter demands a substantial volume of training data to function optimally, the hybrid method significantly reduces this requirement through its use of SRe-representations, without compromising on the quality of the generated descriptions. Both methods exhibit exemplary performance, surpassing existing state-of-the-art solutions in the field of video description generation.

Our approach stands out because it can create different levels of description, from very detailed to more general statements about the video. This flexible approach, shown in Figs. 1 and 2, lets users choose the level of detail they need. Whether it is creating a detailed instruction manual from a video for training purposes or a general storyline for a movie, our method adapts to provide the right level of detail.

Fig. 2

An example of the performance of our end-to-end method in producing detailed, multiple- and single-sentence descriptions of a wiping scenario [8] containing 8 video snippets (1,...,8). Detailed sentences: For left hand: (1) The left hand touches a sponge on a table. (2) They move together on the table. (3, 4, 5, 6, 7) They keep moving together on the table. (8) The left hand untouches the sponge on the table. For right hand: (1, 2, 3) Idle. (4) The right hand touches a cup on a table. (5) They untouch the table. (6) They move together above the table. (7) They touch the table. (8) The right hand untouches the cup on the table. Multiple sentences: For left hand: (1) The left hand touches a sponge on a table. (2.3.4.5.6.7) It wipes the table. (8) The left hand leaves the sponge on the table. For right hand: (4.5.6.7) The right hand picks and places a cup on the table. (8) The right hand then leaves the cup on the table. One sentence: For left hand: The left hand wipes the table by a sponge. For right hand: The right hand lifts and puts a cup on the table

Here it is worth mentioning that while Dense Video Captioning [20] aims to identify and describe every significant event within a video through comprehensive, temporally-anchored captions, our approach to Multi Sentence Description focuses on generating a layered narrative that captures the video content at varying levels of detail. This method allows for a broader understanding of the video by providing descriptions that range from granular, moment-to-moment actions to more generalized summaries of the video’s overall theme or storyline. Our methodology is specifically designed to offer a more flexible representation of video content, accommodating different levels of detail according to the user’s needs or the application’s requirements. By integrating diverse descriptive granularity levels, our approach enriches the video annotation landscape, bridging the gap between simple event captioning and comprehensive content summarization. This layered descriptive capability is crucial for applications requiring not just the identification of events but also an understanding of the narrative structure and thematic depth of video content, setting our work apart in the realm of video understanding and description generation.

The capability to generate multi-level descriptions is of paramount importance in robotic applications and learning from demonstration. Robots often operate in diverse and dynamic environments, requiring them to understand and execute complex tasks. The multi-level representations (descriptions) generated by our method enable robots to comprehend the intricacies of human actions at various granularity levels, facilitating a more nuanced understanding and execution of tasks. Naturally, for an intrinsic understanding of an action and its sub-actions, the robot does not have to utter any sentence; it can make use of the descriptive depth per se. This is especially beneficial in learning from demonstration, where robots learn to perform tasks by observing human actions. The multi-level representations provide a rich semantic context, aiding the robots in deciphering the intent and subtleties of human actions, leading to more accurate and efficient task execution. The possibility of uttering a respective phrase will then, however, also allow the machine to engage in discourse, possibly asking detailed questions about different action steps.

In concluding our introduction, it is worth noting that Method 3, the Integrated Method, maintains the streamlined end-to-end style, a decision underpinned by its design to harness the comprehensive strengths of both the Hybrid Statistical method (Method 1) and our foundational end-to-end method (Method 2). This integration ensures a seamless translation from video content to textual descriptions, optimizing for both data efficiency and narrative depth. Henceforth, "Method 1" refers to the Hybrid Statistical approach; "our end-to-end methods" encompasses both Methods 2 and 3 in broader discussions; and "our end-to-end method" refers specifically to Method 2, highlighting its particular contributions within our suite of solutions.

2 Related works

Research on video description has evolved into three main categories: “classical methods,” “statistical algorithms,” and “deep learning-based approaches.”

2.1 Classical methods

The earliest approaches to video description relied on SVO (Subject, Verb, Object) triple-based methods. These methods comprised two fundamental stages: content identification and text generation. Classical computer vision (CV) and natural language processing (NLP) techniques were employed to detect visual entities in videos and map them to standard sentences using handcrafted templates. However, these rule-based systems were effective only in highly constrained environments and for short video clips [14, 18].

2.2 Statistical algorithms

To overcome the limitations of rule-based methods, researchers turned to statistical approaches. These methods utilized machine learning techniques to convert visual content into natural language descriptions using parallel corpora of videos and their associated annotations [13, 32]. They initially involved object detection and action recognition through feature engineering and traditional classifiers, followed by the translation of retrieved information into natural language using Statistical Machine Translation (SMT) techniques [17]. However, this separation of stages made it challenging to capture the interplay between visual features and linguistic patterns, limiting the ability to learn transferable relationships between visual artifacts and linguistic representations [1].

2.3 Deep learning-based approaches

The success of deep learning in both computer vision (CV) and natural language processing (NLP) inspired researchers to adopt deep learning techniques for video description tasks. Vinyals et al. [39] introduced an image description model based on an encoder-decoder architecture, using Convolutional Neural Networks (CNNs) as image encoders and Long Short-Term Memory networks (LSTMs) as language decoders. Building on this, Venugopalan et al. [38] proposed a similar framework for video annotation, where features were extracted from each video frame, and LSTM layers were used to generate textual descriptions. However, this approach struggled to capture the temporal order of events in a video.

Subsequent advancements included Pan et al.’s unified framework with visual semantic embedding [27], which extended to exploit temporal information in videos [26]. Yu et al. [44] introduced hierarchical Recurrent Neural Networks (RNNs) for video captioning. Wang et al. [40] designed a reconstruction framework with a novel encoder-decoder architecture, leveraging both forward (video to sentence) and backward (sentence to video) flows for video captioning.

Krishna et al. [20] suggested a two-stage method for describing short and long events, where events in a video were first detected, and descriptions were generated based on event dependencies. Nian et al. [25] presented an enhanced sequence-to-sequence model for video captioning, incorporating a mid-level video representation method known as the video response map, in addition to CNN features from sampled frames.

Recent advancements in video description have extended beyond traditional deep learning-based methods, ushering in a new era of transformer-based models. These transformer-based approaches have emerged as a distinct branch within the realm of deep learning, offering unique and promising capabilities for generating coherent and contextually relevant text descriptions for videos.

2.4 Transformer-based paradigms

Pioneering this paradigm shift, Ji et al. introduced the groundbreaking Action Genome framework [15]. Within this framework, actions are represented as intricate compositions of spatio-temporal scene graphs. This innovative approach empowers the generation of rich and structured descriptions that encapsulate the nuances of human activities, providing a fresh perspective on video description.

Subsequently, Kim et al. presented “ViLT” [16], a vision-and-language transformer model that redefines the landscape of video captioning. What sets ViLT apart is its ability to autonomously learn the art of generating video descriptions, all without relying on traditional convolutional layers or region supervision. This marks a significant leap towards harnessing the full potential of transformers in the domain of video captioning.

Another noteworthy contender in the transformer-based video captioning arena is “SwinBERT” [21]. This model incorporates sparse attention mechanisms, a novel approach that enhances efficiency without compromising performance. SwinBERT aims to strike an optimal balance between computational efficiency and captioning accuracy, contributing to the ongoing evolution of video description methods.

Looking ahead, the field of transformer-based video description continues to evolve, with ongoing research exploring new directions and potential solutions. Recent developments include Fu et al.'s fully end-to-end VIdeO-LanguagE Transformer (VIOLET), which adopts a video transformer to explicitly model the temporal dynamics of video inputs [10]. They then present an extension of the VIOLET framework [10] in which the supervision from masked visual modeling (MVM) training is backpropagated to the video pixel space [11]. In another study, Gu et al. [12] propose a two-stream transformer model, TextKG, for video captioning. One stream focuses on external knowledge integration from a pre-built knowledge graph to handle open-set words, while the other stream utilizes multi-modal information from videos. Cross-attention mechanisms enable information sharing between the streams, resulting in improved performance.

Transformer-based models excel at generating coherent single-sentence video descriptions but may face challenges when crafting multi-level descriptions of manipulation actions, especially in complex scenarios. Transformers are inherently optimized for single-sentence descriptions and may not naturally accommodate the intricacies of manipulation actions, which often involve decomposing actions into atomic components and organizing them into complex sequences.

By contrast, our proposed hierarchical approach is tailored to address the nuances of manipulation actions, offering the flexibility to generate multi-level descriptions that span from detailed atomic actions to more concise narratives.

While transformer-based methods are strong in language generation and comprehension, direct comparisons with our approach may not be straightforward due to their single-sentence focus. The choice between these approaches should align with the specific task requirements, whether it entails generating single-sentence descriptions or multi-level, detailed descriptions of manipulation actions.

Note that at the end of the Methods section we discuss several additional approaches in greater detail, which we use for comparison with our results.

3 Results and discussion

This section presents a comparative analysis of our proposed methods, the hybrid statistical method and the end-to-end method, against state-of-the-art approaches on related datasets.

3.1 Evaluation metrics

Evaluation of video descriptions is a challenging task, as there is no single ground truth or "right answer" that can serve as a reference for benchmarking accuracy. A video can be correctly described by a wide variety of sentences that differ not only syntactically but also in semantic content [1]. In this work, we report results from two points of view: (1) automatic measures that correlate highly with human judgement, namely "Bilingual Evaluation Understudy (BLEU) [28]", where the BLEU@N metric (N = 1 to 4) counts the N-grams matched between machine-generated and reference sentences, the "Metric for Evaluation of Translation with Explicit Ordering (METEOR) [2]" and the "Consensus based Image Description Evaluation (CIDEr) [36]"; and (2) human evaluation, divided into three measurements: (I) grammatical correctness, (II) semantic correctness and (III) the relevance of the produced video description to what is present in the video.

These metrics are widely acknowledged for assessing the quality of generated textual descriptions in terms of their linguistic accuracy, relevance, and overall fluency.
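To make the automatic measures concrete, the snippet below is a minimal sketch of how BLEU@N can be computed for a single generated sentence using NLTK; the tokenization and smoothing settings shown here are illustrative assumptions rather than the exact configuration of our evaluation pipeline (METEOR and CIDEr are computed analogously with their respective reference implementations).

```python
# Minimal illustration of BLEU@N scoring with NLTK; tokenization and
# smoothing are assumptions, not the exact evaluation pipeline of this work.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the left hand picks up a screwdriver from the table".split()]
candidate = "the left hand takes a screwdriver from the table".split()

smooth = SmoothingFunction().method1  # avoids zero scores for short sentences
for n in range(1, 5):                 # BLEU@1 ... BLEU@4
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu(reference, candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU@{n}: {score:.3f}")
```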

3.2 State of the art methods used for comparison and data sets

Here, we introduce several cutting-edge video annotation methods, with a specific focus on those designed for annotating manipulation actions. These methods will serve as key points of reference for a comparative analysis with our proposed approaches. We categorize these comparisons based on different aspects to provide a comprehensive evaluation of our method. Note that in the Results section we analyse the different methods in comparison to our approaches using the datasets mentioned below. In addition, we analyse all methods using the KIT Bimanual Action Dataset.

3.2.1 Multi-level descriptions

Ji et al. [15] presents an innovative framework, “Action Genome,” for video understanding and description. This method focuses on actions within videos and delves into the intricate details of actions and their representations. It represents actions as compositions of spatio-temporal scene graphs, allowing for rich and structured descriptions of human activities. Our approach shares a common emphasis with “Action Genome” on capturing nuanced actions within videos, albeit with different methodologies. While “Action Genome” emphasizes scene graph-based representations, our approach employs a hierarchical system for generating multi-level descriptions. This allows us to capture fine-grained details within manipulation actions.

Krause et al. [19] presents a hierarchical approach for generating descriptive image paragraphs. This method allows for multi-level descriptions, which aligns with our goal of generating multi-level descriptions for manipulation actions. Similar to our approach, it focuses on capturing fine-grained details within images and organizing them into coherent narratives.

Dataset YouCookII [46] contains cooking videos and is suitable for evaluating multi-level descriptions of manipulation actions. It allows us to assess how well our method can generate descriptions that capture actions at different levels of granularity within cooking tasks.

3.2.2 LSTM-based encoding and decoding

Nabati et al. [23] proposes a framework that includes two LSTM layers for encoding and decoding visual features and enhancing generated descriptions. This architecture bears resemblance to our approach, which also utilizes LSTM layers for both encoding and decoding stages. Additionally, both methods employ a module for word selection, with similarities to our word lattice module. This comparison highlights the commonality in utilizing LSTM-based encoding and decoding mechanisms.

Dataset MSR-VTT [41] is a diverse dataset that includes video descriptions. It can be used to evaluate LSTM-based encoding and decoding methods for video annotation.

3.2.3 Semantic summarization

Singh et al. [34] introduces an approach for summarizing visual information using a boundary-based key frames selection method. This aligns with our semantic approach, where we summarize visual information using semantic representations (SRes) before translating them into human language sentences. Both methods focus on the effective summarization of visual content, albeit through different techniques.

Dataset ActivityNet Captions [20] provides video segments with descriptions and can be used for evaluating semantic summarization. This dataset allows us to assess how well our method summarizes visual information into semantic representations and translates them into human-language sentences.

3.2.4 Robotic manipulation actions

Nguyen et al. [24] is one of the few video description papers that specifically focus on manipulation actions. It translates videos into commands applicable for robotic manipulation using deep recurrent neural networks. Similar to our work, it concentrates on the description of manipulation actions, making it a relevant comparison. This emphasizes the shared focus on annotating videos containing manipulation actions.

Dataset EPIC-Kitchens [6] includes kitchen-focused videos and is suitable for evaluating methods related to robotic manipulation actions. It captures real-world cooking and manipulation actions in a kitchen environment, making it a good choice for assessing the annotation of such actions.

3.2.5 All methods checked with one additional dataset

In tandem with the previously mentioned datasets, we have incorporated the KIT Bimanual Action Dataset [8] as a crucial 3D dataset in our study. This inclusion serves to enhance the comprehensiveness of our research, with a specific focus on manipulation actions. All methods mentioned above have been analyzed qualitatively using this data set.

3.3 Multi-level descriptions

The evaluation of multi-level description generation capabilities was conducted on both a curated subset and the full set of manipulation actions of the YouCookII dataset [46].

The subset comprised approximately 20% of the total dataset, specifically chosen to include a diverse range of cooking activities and ingredient complexities, providing a representative sample of common kitchen tasks. This subset selection aims to simulate real-world scenarios where data may be limited but varied.

We juxtaposed the performance of our Hybrid Statistical Method (Method 1) and our End-to-End Method (Method 2) against the methods detailed by Ji et al. [15] and Krause et al. [19], employing BLEU, METEOR, and CIDEr as our evaluation metrics.

Table 1 Comparison of multi-level description generation performance using BLEU, METEOR, and CIDEr metrics on both a representative subset and the full set of manipulation actions in the YouCookII dataset [46]

The results presented in Table 1 confirm the strengths of our Hybrid Statistical Method (Method 1) in scenarios with constrained data, underscoring its efficacy where data may be limited. This mirrors the few-shot learning achievements demonstrated by Ji et al. [15], where actions are decomposed into spatio-temporal scene graphs to encode activities hierarchically. However, our method refines this approach by capturing action-object interactions with greater granularity, significantly improving recognition accuracy in limited-data contexts. On the other hand, the Integrated Method closely competes with the Hybrid Method on limited data, achieving even better scores in the METEOR metric. This improvement suggests that the Integrated Method is particularly effective at producing narratives that align well with human judgments of linguistic quality and fluency, indicating a refined ability to generate content that resonates more accurately with the semantic intentions of the video content.

By contrast, our End-to-End Methods (Method 2 and Method 3) demonstrate their superior scalability on the full set of manipulation actions in the YouCookII dataset, achieving higher scores across all metrics. While Ji et al. [15] lay the groundwork for hierarchical representation, our end-to-end approach capitalizes on this by incorporating a comprehensive understanding of atomic actions and temporal dynamics, which is crucial for the nuanced generation of video descriptions.

Furthermore, Krause et al. [19] explore narrative generation through dense captioning of image regions, which our approach advances by translating into the video domain, thus managing to weave a coherent story over the sequence of frames. By doing so, our method maintains the intricate details and the broader context of the visual narrative, a challenge not fully addressed by the static image focus of Krause et al. The seamless integration of visual perception with linguistic output in our End-to-End Methods (especially Method 3) distinguishes them from the multi-stage processes typical of previous works. This streamlined pipeline preserves semantic integrity and ensures continuity in the narrative flow, resulting in video captions that are not only contextually accurate but also rich in detail.

3.4 LSTM-based encoding and decoding

To evaluate the LSTM-based encoding and decoding strategies for video captioning, we focused on a manipulation-specific subset of the MSR-VTT dataset [41]. This subset was carefully curated to include a variety of interaction-intensive videos, offering a rigorous testing ground for our methods. Further, we extracted a 20% segment of this manipulation-focused subset to assess the performance of our Hybrid Statistical Method (Method 1) under the constraints of limited data availability. This smaller sample reflects real-world scenarios where comprehensive datasets may not be readily accessible. In these conditions, our Hybrid Statistical Method demonstrated a notable advantage over our End-to-End Method (Method 2), underscoring its suitability for contexts with scarce data resources (see Table 2). The table demonstrates that the Integrated Method aligns closely with Method 1 when data is limited but surpasses it according to the CIDEr metric. This variance implies that both methods are adept with constrained data sets, yet the Integrated Method shines in generating descriptions that are both contextually deeper and more pertinent. This distinction, underscored by superior CIDEr scores, showcases the Integrated Method’s adeptness at crafting text that is not only more informative but also highly specific, underlining its capability to produce content that is significantly richer and more detailed.

Our comparative analysis also included the Boosted and Parallel LSTM (BP-LSTM) architecture proposed by Nabati et al. [23]. While their architecture innovatively employs a boosted and parallel approach to enhance the LSTM’s video captioning capabilities, our End-to-End Method leverages a more integrated processing pipeline. This cohesive approach facilitates a direct flow of information from the video content to the generated captions, optimizing the use of semantic cues and temporal dynamics. This mechanism allows for a nuanced discernment of relevant features across video frames, leading to captions that are not only syntactically and semantically coherent but also contextually richer and more descriptive. The results on the full manipulation subset of the MSR-VTT dataset indicate that our End-to-End Method outperforms the BP-LSTM, especially in generating accurate and detailed descriptions for complex manipulation actions.

Table 2 Comparison of LSTM-based encoding and decoding methods on both a representative subset and the full set of manipulation actions in the MSR-VTT dataset [41]
Table 3 Comparison of semantic summarization performance on the ActivityNet Captions dataset

The comparative evaluation presented in Table 2 illuminates the strengths of our approach in relation to the BP-LSTM architecture proposed by Nabati et al. [23]. Their architecture introduces an ensemble of LSTM layers augmented with a boosting algorithm, representing a notable development in video captioning. Their method employs multiple LSTMs in a parallel configuration to encode and decode video data, aiming to enhance the precision of the generated textual descriptions.

In our work, we adopt a different perspective by simplifying the captioning process with an End-to-End Method. This approach may appear modest in comparison to the parallel and boosted networks utilized by Nabati et al., but it aligns closely with the inherent sequential processing strengths of LSTM networks. By refining the encoding scheme to better capture the temporal flow of video content, we reduce redundancy and focus the LSTM on the critical task of generating contextually rich captions.

Our architecture’s strength lies in its ability to distill and utilize salient video features effectively, thereby enhancing the LSTM’s capacity for predictive accuracy in caption generation. It is particularly adept at handling the intricacies of manipulation actions, which are essential for producing coherent and detailed narratives that resonate with the unfolding events in the videos.

When applied to the expansive dataset of MSR-VTT, our method consistently performs well, demonstrating its capability to scale and maintain performance without the need for complex, multi-stage frameworks.

3.5 Semantic summarization

Semantic summarization of video content is pivotal for generating concise yet comprehensive descriptions. Our evaluation extends to the ActivityNet Captions dataset [20], focusing on a subset of manipulation actions and approximately 20% of this subset to reflect limited-data scenarios.

Our comparison encompassed our Hybrid Statistical Method (Method 1) and our End-to-End Method (Method 2) against the boundary-based keyframes selection approach by Singh et al. [34]. Their approach efficiently reduces computational costs by encoding visual information through selected keyframes, but this can lead to overlooking the temporal relationships essential for fully understanding the sequence of events within a video.

By contrast, our End-to-End Method maintains continuity of the video narrative, capturing the essence of actions and contexts without the information loss that may accompany keyframe reduction. Our Hybrid Statistical Method, tailored for sparse data environments, generalizes from limited examples to yield accurate summaries.

Table 4 Comparative analysis of translation methods for robotic manipulation actions using the EPIC-Kitchens dataset

The results, presented in Table 3, demonstrate that our methods surpass the keyframe-based summarization, especially in scenarios where maintaining narrative flow and temporal dynamics is critical for generating precise video descriptions. This is due to the comprehensive analysis of video content that our methods undertake, ensuring no vital information is missed during the summarization process. As indicated in the table, although the Integrated Method closely parallels Method 1 in performance on limited data, it surpasses it in the CIDEr measure. This suggests that while both methods are competent in handling constrained datasets, the Integrated Method excels in generating contextually richer and more relevant descriptions, a key strength highlighted by the CIDEr metric which assesses the informativeness and specificity of generated text.

3.6 Robotic manipulation actions

Robotic manipulation commands derived from video data require a nuanced understanding of sequential actions and context. Our analysis leveraged the EPIC-Kitchens dataset [6], scrutinizing a comprehensive set of manipulation actions as well as a representative subset of approximately 20% of these actions to reflect limited-data scenarios.

Nguyen et al. [24] introduced a method employing RNNs to translate video sequences into robotic commands. While their method capitalizes on advanced feature extraction and a two-layer RNN for improved accuracy, it may not fully capture the subtleties of manipulation actions within a kitchen environment.

Our Hybrid Statistical Method (Method 1) excels in scenarios with sparse data, effectively capturing relevant features and translating them into accurate commands. Conversely, our End-to-End Method (Method 2) demonstrates its superiority in a data-abundant environment by seamlessly integrating video frame analysis and command generation, thus preserving the contextual flow of actions.

The results, as summarized in Table 4, suggest that both our End-to-End Methods outperform Nguyen et al.'s approach when ample data is available, due to their comprehensive processing of sequential video frames, which is crucial for generating contextually rich commands. The Hybrid Statistical Method, on the other hand, showcases its robustness and adaptability in limited-data scenarios.

The Integrated Method efficiently transforms video data into manipulation commands, showcasing superior performance due to its ability to preserve video content’s integrity and sequence. This method excels in generating contextually and temporally relevant commands, crucial for robotic tasks demanding nuanced action understanding. While the Hybrid Statistical Method proves highly effective in scenarios with limited data, showcasing its adaptability and precision where data is scarce, the Integrated Method closely rivals its performance. The Integrated Method’s strength lies in its sophisticated data analysis and learning algorithms, which allow it to achieve comparable accuracy and adaptability, even when faced with data limitations. This underscores the Integrated Method’s versatility and effectiveness across varying data availability scenarios.

3.7 Evaluating multi-sentence generation for single events

Our approach introduces the novel concept of multi-sentence production for single events in video content, a significant departure from conventional multi-level descriptions as was discussed in Sect. 3.3. This innovative methodology focuses on generating a sequence of sentences that each provide a unique perspective or layer of detail about a single event, thereby offering a more nuanced and comprehensive narrative.

3.7.1 Novelty and lack of direct comparisons

This concept represents uncharted territory in video description research. Owing to its novelty, there is a lack of existing works with which we can directly compare this aspect of our methodology. Consequently, our evaluation focuses on assessing the performance of our three proposed methods: the hybrid method, the end-to-end method, and the integrated approach.

Following the introduction of our novel concept, we have developed a dedicated evaluation strategy tailored to assess the effectiveness of multi-sentence generation for single events. Given the absence of direct benchmarks, we manually annotated the KIT Bimanual Action Dataset with granularity levels of action descriptions, incorporating multi-sentence narratives based on a comprehensive library of action mappings. This manual annotation serves as our ground truth, reflecting the multi-granular narrative depth our methods aim to achieve.

In our observation of this dataset, the longest annotated action sequence encompasses up to 14 sentences, providing a substantial range for evaluating narrative continuity and richness of detail.

3.7.2 Corresponding metrics and evaluations

To comprehensively assess the quality of our multi-sentence descriptions for single events, we first compare each produced sentence with its corresponding ground-truth description in the library of action mappings using the traditional metrics BLEU-4, CIDEr, and METEOR introduced above. These metrics provide a baseline for sentence-level accuracy and fluency.

Given the complexity of multi-sentence narratives, we need to introduce additional measures to evaluate narrative structure and content richness:

Table 5 Performance evaluation of the integrated method for multi-sentence video descriptions on the KIT Bimanual Action Dataset

Cohesion and Coherence To ensure logical flow and connectivity between sentences, we employ BERT-based sentence embedding [7] to calculate semantic similarity across consecutive sentences. This approach allows us to quantitatively assess the narrative’s coherence, ensuring that each sentence logically follows from the one before it, thereby maintaining the integrity of the event’s description. In our analysis, cohesion and coherence metrics, derived from BERT-based sentence embeddings, are quantitatively assessed with values ranging from 0 to 1, where 1 indicates perfect semantic alignment between consecutive sentences, ensuring a logically coherent narrative flow.
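The following sketch illustrates how such a cohesion score can be computed as the average cosine similarity between consecutive sentence embeddings; the specific encoder ("all-MiniLM-L6-v2" from the sentence-transformers library) is an assumption for illustration, not necessarily the model used in our experiments.

```python
# Sketch of the cohesion score: average cosine similarity between consecutive
# sentence embeddings. The chosen encoder is an illustrative assumption.
from sentence_transformers import SentenceTransformer, util

def cohesion_score(sentences):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences, convert_to_tensor=True)
    # similarity between each sentence and its successor, averaged
    sims = [float(util.cos_sim(embeddings[i], embeddings[i + 1]))
            for i in range(len(sentences) - 1)]
    return sum(sims) / len(sims)

narrative = [
    "The left hand touches a sponge on a table.",
    "It wipes the table.",
    "The left hand leaves the sponge on the table.",
]
print(f"cohesion: {cohesion_score(narrative):.2f}")  # close to 1 for a coherent flow
```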

Coverage and Diversity To evaluate the breadth of content covered and the variety within the descriptions, we utilize topic modeling techniques. Specifically, Latent Dirichlet Allocation (LDA) [3] is applied to identify the range of topics covered by the multi-sentence narratives. This method ensures that all relevant aspects of the video event are captured, promoting a diverse and comprehensive depiction. The values of coverage and diversity, evaluated through Latent Dirichlet Allocation (LDA), are also measured on a scale from 0 to 1. Scores closer to 1 demonstrate comprehensive topic representation within the multi-sentence descriptions.
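A minimal sketch of the coverage estimate is given below: an LDA model is fitted on reference annotations and the generated narrative is scored by the fraction of topics it activates. The topic count and activation threshold are illustrative assumptions, not the settings used in our evaluation.

```python
# Sketch of the coverage estimate with scikit-learn LDA; topic count and
# threshold are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reference_corpus = [
    "hand picks screwdriver from table",
    "hand performs screwing inside hard disk",
    "hand leaves screwdriver on table",
]
generated = ["hand picks screwdriver and screws the hard disk"]

vectorizer = CountVectorizer()
X_ref = vectorizer.fit_transform(reference_corpus)
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X_ref)

# topics that the generated narrative activates above a small threshold
doc_topics = lda.transform(vectorizer.transform(generated))
covered = (doc_topics > 0.2).sum()
coverage = covered / lda.n_components
print(f"coverage: {coverage:.2f}")  # closer to 1 means broader topic coverage
```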

It is worth mentioning that given the comprehensive nature of this analysis and the extensive results derived from our evaluation, we have opted to specifically highlight the performance of our integrated method in the forthcoming table. This decision is predicated on the integrated method’s superior efficacy in generating multi-sentence descriptions for single events, as evidenced by its comparative analysis against the hybrid and end-to-end methods. This focus allows us to succinctly present the most impactful findings from our research, demonstrating the integrated approach’s advanced capabilities in addressing the complexities of multi-sentence generation for single events.

Table 5 presents the evaluation metrics for multi-sentence descriptions applied to the KIT Bimanual Action Dataset, spanning a range of 2–14 sentences per description. It includes traditional linguistic quality metrics such as BLEU-4, METEOR, CIDEr, and introduces novel metrics for Cohesion and Coverage, designed to assess the narrative structure and thematic completeness specifically within the context of our generated multi-sentence video descriptions. To ensure a nuanced assessment, we average the scores for BLEU-4, METEOR, and CIDEr across each sentence within the multi-sentence constructs. Moreover, for a detailed assessment of cohesion and coverage in multi-sentence descriptions, we compute these metrics by averaging their values across consecutive sentences. For example, in cases with three sentences, the reported cohesion or coverage value represents the mean of the metric calculated between sentence 1 and sentence 2, and then between sentence 2 and sentence 3. This method ensures a comprehensive evaluation of how well each sentence transitions to the next, capturing the narrative’s overall continuity and thematic breadth.

The observed trends in BLEU-4, METEOR, and CIDEr metrics, alongside the nuanced variations in Cohesion and Coverage, encapsulate the intricate balance our methodology seeks to achieve in narrative construction. As the number of sentences increases, we generally observe an enhancement in linguistic precision and fluency, indicative of the method’s adeptness at generating detailed, coherent narratives. The observed increments in Cohesion and Coverage metrics, as delineated in the table, directly correlate with our method’s underlying principle of enriching single event narratives through multi-sentence descriptions. This methodological approach, by design, fosters a comprehensive narrative framework, where each additional sentence contributes to a more nuanced and layered portrayal of events. Consequently, as the narrative extends, it not only becomes more interconnected, enhancing the overall cohesion, but also broadens the scope of covered themes and details, thereby improving coverage. However, the occasional decreases in these metrics alongside Cohesion and Coverage reflect the inherent challenges in maintaining uniform narrative quality and thematic diversity across more extended sequences. These outcomes underscore our method’s efficacy in leveraging extended narrative structures to produce more coherent and inclusive video event descriptions, reflecting a deeper engagement with the content’s inherent complexities.

3.8 Qualitative analysis on the KIT bimanual actions dataset

3.8.1 Comparison with previously discussed works

A comprehensive qualitative analysis was performed to assess the descriptive capabilities of our methods compared to the state-of-the-art methods mentioned earlier. The KIT Bimanual Actions Dataset was chosen for this analysis due to its size, which made a detailed qualitative evaluation by human raters feasible. Ten evaluators, comprising bachelor and master students aged between 18 and 29 and well-acquainted with the task, participated in the assessment. Each evaluator was briefed about the objectives of the study to ensure an informed and unbiased rating process.

Participants observed 540 video recordings from the KIT Bimanual Actions Dataset, along with the corresponding descriptive sentences generated by each method. They were asked to score the outputs based on three criteria: “grammatical correctness,” “semantic correctness,” and “relevance,” with up to 5 points awarded for each category. These criteria were selected to cover the essential aspects of natural language descriptions that contribute to the overall comprehensibility and utility of the generated text.

Table 6 summarizes the average scores from the raters for each method across the three evaluation dimensions. Each entry in the table provides a score out of 5 for three categories: Grammatical Correctness, Semantic Correctness, and Relevance.

Table 6 Qualitative analysis of description generation on the KIT bimanual actions dataset (Scores Out of 5)

The scores were assigned based on human judgment and expert analysis of the output captions. A higher score in each category reflects a closer approximation to human-like performance. For instance, a score of 4.5 in Grammatical Correctness would imply very few grammatical errors, whereas a score of 4.7 in Semantic Correctness would suggest that the generated descriptions closely match the semantic content of the actions being performed.

The results provide insights into the practical effectiveness of each method in generating descriptions that are not only grammatically sound but also semantically accurate and contextually relevant.

Our Methods 1 (Hybrid), 2 (End-to-End) and 3 (Integrated) have been compared against existing approaches to highlight their relative performance. In particular, we observe that Method 3 (Integrated) typically achieves higher scores in the Relevance category, indicating its superior ability to focus on the most crucial elements of the video content. This observation is further corroborated by the data presented in Tables 2 and 3, underscoring the Integrated Method’s superior capability in crafting contextually rich and relevant descriptions compared to Method 1. Conversely, Method 1 (Hybrid) demonstrates robust performance in Semantic Correctness, indicating its efficacy in producing meaningful captions even within limited-data scenarios. This distinction highlights the nuanced strengths of both methods in addressing different aspects of the description task.

These scores further substantiate the adaptability and effectiveness of our proposed methods, providing qualitative insights that align with the quantitative metrics reported earlier. They reflect the balance between producing grammatically sound sentences, capturing the essence of the visual content semantically, and maintaining relevance to the core actions depicted in the videos.

3.8.2 Comparative analysis of comprehension accuracy with new state-of-the-art approaches

To augment our analysis and deepen our insights, we now introduce a distinct conceptual comparison approach. This analysis aims to situate our research within the context of recent advancements, specifically referencing the contributions of [22, 33, 42], which are recognized as leading state-of-the-art works in the field of video understanding. Each paper explores unique dimensions of video–text interaction, ranging from retrieval to dense captioning, setting a foundation for their inclusion as benchmarks. However, the distinct focus of our methodology on narrative generation precludes a direct comparison due to divergent aims and techniques. Thus, we pivot towards a conceptual discourse to underscore the shared objective of enriching video content comprehension. This comparative analysis not only clarifies the positioning of our work within the research landscape but also illuminates the complementary nature of these diverse approaches in pushing the boundaries of video understanding.

Comprehension Accuracy serves as a pivotal metric in this study, evaluating the precision with which our model interprets and narrates video content. This measure is crucial for assessing the effectiveness of our narrative generation approach in capturing the nuances of video events accurately. To implement this, we compare the narratives generated by our model to expertly curated ground-truth descriptions through human evaluation, which ensures that our generated narratives not only resonate with the actual content but also maintain coherence and contextual relevance. To refine our methodology for assessing comprehension accuracy, we introduce a framework for human evaluation. This process entails a meticulous analysis by human judges across pivotal dimensions: Contextual Relevance (assessing the narrative’s fidelity to the video content), Coherence (evaluating the narrative’s logical flow and sentence connectivity), and Engagement (measuring the narrative’s capacity to engage and captivate the audience). Employing a Likert scale for each criterion, this evaluation offers a detailed perspective on the narrative qualities, ensuring a holistic understanding of each method’s ability to convert video content into compelling narratives.

The human judges employed for the qualitative analysis in the above subsection were also involved in the current examination. This evaluation is likewise carried out on the KIT Bimanual Action Dataset. For simplicity, we report the results of this conceptual analysis only for our integrated approach.

Table 7 Comprehension accuracy analysis on the KIT bimanual action dataset

As shown in Table 7 our approach outperforms the three state-of-the-art methods in Contextual Relevance and Coherence, indicating a superior capability in generating narratives that are closely aligned with the video content and maintaining a logical, cohesive narrative structure. This excellence can be attributed to our method’s focus on providing detailed, granular descriptions for single aspects, ensuring high relevance and coherence. The lower Engagement score reflects the depth of detail, which, while informative, may surpass the usual audience’s need for entertainment, leading to a perception of being less captivating compared to other methods that describe various aspects in a frame, thereby offering a broader, potentially more engaging overview.

Fig. 3

Flow diagram of both methods. Top (blue arrows): Hybrid method, bottom (orange arrows): end-to-end method, where the end-to-end method uses the temporal (start- and end-) information from the segregation into complex actions from the top path. Similar to [37] we unroll the LSTMs to a fixed 80 frames (using zero padding or frame drop in case of too short or too long video snippets) as this number offers a good trade-off between memory consumption and the ability to feed many video frames to the LSTM. The operations in the gray box are repeated n times, where n is the number of atomic actions needed to describe one given complex action. Furthermore, we use a stack of k LSTMs to arrive at different levels of description granularity. The role of both variables is explained in more detail in the text

4 Methods

The goal of the study is to provide a framework for describing manipulation actions in video using human language sentences at different levels of granularity. We will here compare two new methods. One of our methods (Method 1) first computes novel semantic representations capturing spatio-temporal relations between objects from the video stream and then uses them as a front-end to a neural network. The other (Method 2) directly uses an end-to-end-trained network combined with a novel approach for generating action proposals in time. Figure 3 provides an overview of these methods; in the following sections we describe their different components in detail.
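As a small illustration of the fixed-length unrolling mentioned in the caption of Fig. 3, the sketch below pads or drops frames so that every snippet spans exactly 80 time steps; the uniform frame-drop strategy shown here is an assumption for illustration, not necessarily our exact implementation.

```python
# Sketch of the fixed-length unrolling from the Fig. 3 caption: snippets are
# zero-padded or frame-dropped to exactly 80 time steps before the LSTM stack.
import numpy as np

def to_fixed_length(frames: np.ndarray, target: int = 80) -> np.ndarray:
    """frames: (T, feature_dim) array of per-frame features."""
    t, dim = frames.shape
    if t == target:
        return frames
    if t < target:                      # zero padding at the end
        pad = np.zeros((target - t, dim), dtype=frames.dtype)
        return np.concatenate([frames, pad], axis=0)
    idx = np.linspace(0, t - 1, target).astype(int)   # uniform frame drop
    return frames[idx]

snippet = np.random.rand(123, 512)      # e.g. 123 frames of 512-d features
print(to_fixed_length(snippet).shape)   # (80, 512)
```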

Fig. 4

a The list of static spatial relations (SSRs) are “Above (Ab)”, “Below (Be)”, “Around (Ar)”, “Inside (In)”, “Surround (Su)”, “Cross (Cr)”, “Within (Wi)”, “Partial within (Pwi)”, “Contain (Co)” and “Partial contain (Pco)” while “Above”, “Below” and “Around” relations in combination with “Touching” were converted to “Top (To)”, “Bottom (Bo)” and “Touching Around (ArT)”, respectively. The dynamic spatial relations (DSRs) are “Getting close (Gc)”, “Moving apart (Ma)”, “Moving together (Mt)”, “Halting together (Ht)”, “Fixed-moving together (Fmt)” and “Stable (S)”, b A sample of a convex hull, c–e samples of some static spatial relations defined in this paper. If we assume the green cube as \( \alpha \) and the blue cube as \( \beta \), then: c \( SSR(\alpha ,\beta )=Cr \) and \( SSR(\beta ,\alpha )=Cr \), d \( SSR(\alpha ,\beta )=Wi \) and \( SSR(\beta ,\alpha )=Co \), e \( SSR(\alpha ,\beta )=Pwi \) and \( SSR(\beta ,\alpha )=Pco \)

5 Method 1: Hybrid approach

This method embodies a hybrid model that synergistically integrates a statistical semantic pre-analysis of actions with the temporal processing capabilities of Long Short-Term Memory (LSTM) networks. The statistical approach aims at distilling the inherent semantics of various actions, facilitating a structured breakdown of complex actions into simpler, atomic units. Subsequent processing by the LSTM aids in capturing the temporal dependencies and nuances within sequences of actions. A significant advantage of this hybrid approach lies in its efficiency: it demands a relatively limited dataset for effective training, making it feasible for applications where large volumes of annotated data might be scarce or challenging to obtain.

5.1 Object characteristics and pre-processing (Fig. 4)

To craft accurate and descriptive sentences about a given environment, object recognition is pivotal. This happens in two distinct stages:

  1. 2D Pre-processing Using RGB Images:

    • Objects are detected employing YOLO [30], which has been trained on the labeled objects from the pertinent datasets.

    • Given the intricate motion and positions of human hands, OpenPose [4] is employed for their detection. Utilizing key points delineated by OpenPose, a 2D bounding box for each hand is determined.

    The result of this stage is a compilation of 2D bounding boxes: one set representing objects identified by YOLO and another for hands discerned via OpenPose.

  2. 3D Processing and Spatial Relation Inference:

    • Processing with Depth Information: For datasets that contain depth metrics, the 2D data harvested from the preceding stage is amalgamated with point clouds extracted from depth images. This synthesis leads to the generation of 3D bounding boxes for objects, facilitating a more granular comprehension of their spatial inter-relationships.

    • Inferring 3D Relations from 2D Imagery: For datasets that solely offer 2D information without accompanying depth data, our method taps into the potential of the 3D-R2N2 approach [5]. The essence of the 3D-R2N2 method is its capacity to employ deep learning models to reconstruct 3D objects from 2D images. More specifically, it utilizes a recurrent neural network (RNN) that integrates information from multiple views of an object (or even a single view) to produce a consistent 3D shape. This network is trained using a combination of synthetic and real images, allowing it to make predictions about the 3D structure of objects seen in diverse 2D images. Hence, even in the absence of explicit depth information, 3D spatial relationships between objects can be inferred.

The result of this is a three-dimensional representation of the scene, which augments the system’s spatial awareness and allows for the completion of atomic action quintuples, as a three-dimensional spatial relationship is a fundamental element of an atomic action’s definition.
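The sketch below illustrates, under assumed pinhole camera intrinsics, how a 2D bounding box can be lifted to a 3D box by back-projecting the depth values inside the box; it is a simplified stand-in for the point-cloud fusion described above, not the exact implementation.

```python
# Sketch of lifting a 2D detection into a 3D bounding box using a depth map
# and pinhole intrinsics (fx, fy, cx, cy are assumed values for illustration).
import numpy as np

def bbox2d_to_bbox3d(depth, box, fx=525.0, fy=525.0, cx=319.5, cy=239.5):
    """depth: (H, W) depth image in metres; box: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    us, vs = np.meshgrid(np.arange(x1, x2), np.arange(y1, y2))
    z = depth[y1:y2, x1:x2]
    valid = z > 0                                   # ignore missing depth
    x = (us[valid] - cx) * z[valid] / fx            # back-project to camera frame
    y = (vs[valid] - cy) * z[valid] / fy
    pts = np.stack([x, y, z[valid]], axis=1)
    return pts.min(axis=0), pts.max(axis=0)         # axis-aligned 3D box corners

depth_img = np.full((480, 640), 0.8)                # toy constant-depth scene
print(bbox2d_to_bbox3d(depth_img, (100, 120, 180, 200)))
```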

Our method relies on a purely data-driven recognition of actions using computer vision. This is achieved by recognizing the time-chunks and the structure of so-called atomic actions, which are at a later stage concatenated into real action names. Atomic actions are defined by the following quintuple: (Subject, Action-Primitive, Object, Spatial Relation, Place). For example, if a hand touches the top of a cup on a table, its corresponding quintuple will be: (hand, touch, cup, top, table). In the following, we will give a brief reference to each of these five items.

  1. The subject has two states and can be “Hand (H)” or “Merged entity (Me)”. Merged entity is used when the hand and its touched object act as the same entity and perform an action on another object.

  2. The action primitive has four states. Two of those are “Touching (T)” and “Untouching (U)”, which are used when one object starts touching or untouching another object, respectively. The next two states are “Moving together (Mt)” and “Fixed moving together (Fmt)” for the times that two objects move together or one is fixed and the other is moving on the fixed one. These primitive states are enough to create the required semantic representations, where real actions, like cutting, stirring, etc., need to be captured (see below).

  3. The object has four constituents, which are “\((O_{1,2,3}),~G\)” (Object 1,2,3 and Ground) where the latter supports all other objects except the hand in the scene. \(O_1\), \(O_2\) and \(O_3\) are the objects which are the first, second and third to display a change in their T/N relations, respectively. In [48], it has been discussed that there are never more than these four constituents (+ the hand) existing in any manipulation action.

  4. The spatio-temporal relation between two objects (e.g. subject and object) has thirteen states (see Fig. 4). Below we describe how those are computed.

  5. The place has five states. An action can occur on the surface of another object (\(O_1\) to \(O_3\)), on the ground, or in the air (Air).
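For illustration, the quintuple and its admissible states can be encoded as a small data structure; the class below is a sketch of that encoding under the state lists given above, not our actual implementation.

```python
# Illustrative encoding of the atomic-action quintuple
# (Subject, Action-Primitive, Object, Spatial Relation, Place).
from dataclasses import dataclass

SUBJECTS = {"H", "Me"}                       # Hand, Merged entity
PRIMITIVES = {"T", "U", "Mt", "Fmt"}         # Touch, Untouch, Move / Fixed-move together
OBJECTS = {"O1", "O2", "O3", "G"}            # three objects + Ground
PLACES = {"O1", "O2", "O3", "G", "Air"}

@dataclass(frozen=True)
class AtomicAction:
    subject: str          # e.g. "H"
    primitive: str        # e.g. "T"
    obj: str              # e.g. "O1"  (a cup)
    relation: str         # one of the 13 spatial relations, e.g. "To" (top)
    place: str            # e.g. "G"   (the table as ground)

    def __post_init__(self):
        assert self.subject in SUBJECTS and self.primitive in PRIMITIVES

# "the hand touches the top of a cup on the table"
aa = AtomicAction(subject="H", primitive="T", obj="O1", relation="To", place="G")
print(aa)
```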

5.2 Spatio-temporal relations: item 4 from above

To perform this step we determine the convex hulls of all objects recognized in the video using the “gift wrapping algorithm [35]”. Then we compute their spatio-temporal relations by standard set-calculus (see Supplement).

Consequently, a set of static as well as dynamic relations between the different objects can be determined in a straightforward manner, as summarized in Fig. 4.
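A minimal sketch of this idea is shown below: the convex hulls of two objects are compared with basic set operations to decide between a few of the static relations of Fig. 4. Shapely is used here purely for illustration; our implementation relies on the gift wrapping algorithm and the set calculus given in the Supplement, and covers the full list of thirteen relations.

```python
# Sketch: derive a (simplified) static spatial relation from two convex hulls.
from shapely.geometry import MultiPoint

def static_relation(points_a, points_b):
    hull_a = MultiPoint(points_a).convex_hull
    hull_b = MultiPoint(points_b).convex_hull
    if hull_b.contains(hull_a):
        return "Wi"                      # alpha within beta
    if hull_a.contains(hull_b):
        return "Co"                      # alpha contains beta
    if hull_a.intersects(hull_b):
        return "Cr"                      # hulls cross / partially overlap
    return "none"

green_cube = [(0, 0), (1, 0), (1, 1), (0, 1)]
blue_cube = [(-1, -1), (2, -1), (2, 2), (-1, 2)]
print(static_relation(green_cube, blue_cube))   # "Wi": green lies within blue
```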

5.3 Mapping complex actions to atomic actions

Above we described that we use action primitives with only four states as one component in the atomic action tuple, where these four states can all be recognized directly in a video. Real actions, to which one needs to associate a characteristic action word for video description, like cutting, stirring, pick&place, consist of a string of atomic actions. In a pre-processing step we used a context-free grammar (see Supplement) to decompose real actions into their constituting atomic actions. As a result, any complex action, for example “cut”, is then represented as a string of atomic actions \(AA_i\), e.g., \(cut=[AA_1, AA_2,\dots , AA_n]\). This processing step was done offline and the resulting mappings have been stored in a so-called library of action mappings.
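Conceptually, the library of action mappings can be thought of as a lookup table from complex action names to their atomic-action strings; the sketch below uses invented, simplified entries for illustration and does not reproduce the grammar-derived decompositions stored in the actual library.

```python
# Sketch of the "library of action mappings": each complex action name maps to
# a pre-computed string of atomic actions (entries below are illustrative only).
ACTION_LIBRARY = {
    # pick & place: touch object, move together, untouch
    "pick_and_place": ["T(H,O1,To,G)", "Mt(Me,Air)", "U(H,O1,To,G)"],
    # cut: touch tool, move tool and object together repeatedly, untouch
    "cut": ["T(H,O1,To,G)", "Fmt(Me,O2,To,G)", "Fmt(Me,O2,To,G)", "U(H,O1,To,G)"],
}

def decompose(action_name):
    """Look up the atomic-action string for a recognised complex action."""
    return ACTION_LIBRARY.get(action_name, [])

print(decompose("cut"))
```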

5.4 Semantic representations: SRe

Semantic representations use the same 5-tuple structure as the atomic action: (Subject, Action, Object, Spatial Relation, Place). Different from atomic actions, real action names are used in the SRe. This creates a nested system of the following kind, e.g.

$$\begin{aligned} SRe_{cut}=(Subject,\ cut=[AA_1,AA_2,\dots ,AA_n],\\ Object,\ Spatial\ Relation,\ Place) \end{aligned}$$
(1)

where each atomic action takes the structure AA = (Subject, Action Primitive, Object, Spatial Relation, Place). Evidently, entities from atomic actions can reoccur in the SRe. Importantly, atomic actions can be easily recognized from video because of the small number and simple structure of the action primitives (only 4). SRes are then derived from them as a string of AAs. As mentioned above, we used a context-free grammar (see Supplement) to pre-compute and decompose all complex actions in the data set into their constituting atomic actions.

5.5 The core of the hybrid statistical method

As a main advantage, the hybrid method works with a modest amount of data. In the hybrid method, first we generate the semantic representation (SRe) of the visual content including subject (tool), action, object, spatio-temporal relations and place, then we model the semantic relationships between the visual components through learning a Conditional Random Field (CRF) structure. Finally, we propose to formulate the generation of natural language description as a decoding stage using two layers of LSTMs. The LSTM framework allows us to model the video as a variable length input stream and creates output as a variable length sentence.

The start of a manipulation action can be defined as the moment the hand touches something and its end as the moment the hand is free again [48]. This way we can break a long video into snippets that contain only one action. We then associate every component of the SRe 5-tuple with a graph node and create, for every video snippet, a fully connected graph with 5 nodes. After training, the edges between the nodes contain the probabilities of co-occurrence of two nodes. For example, in a cutting-SRe it should be more likely to find “Subject = hand+knife” (i.e., the subject is here a merged entity) together with “Object = apple” than with “Object = lamp”.
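
The following simplified sketch only illustrates the co-occurrence statistics that the edges of this 5-node graph encode; the actual method trains a CRF, which we do not reproduce here, and the training examples are invented.

```python
# Simplified stand-in for the pairwise structure learned over the 5-node graph:
# count co-occurrences of SRe component values and normalize them into edge probabilities.
from collections import Counter
from itertools import combinations

FIELDS = ["subject", "action", "object", "relation", "place"]

def edge_probabilities(sres: list[dict]) -> dict:
    pair_counts, value_counts = Counter(), Counter()
    for sre in sres:
        for f in FIELDS:
            value_counts[(f, sre[f])] += 1
        for f1, f2 in combinations(FIELDS, 2):
            pair_counts[(f1, sre[f1], f2, sre[f2])] += 1
    # conditional probability of the second value given the first, one number per edge instantiation
    return {k: v / value_counts[(k[0], k[1])] for k, v in pair_counts.items()}

train = [
    {"subject": "hand+knife", "action": "cut", "object": "apple", "relation": "top", "place": "table"},
    {"subject": "hand+knife", "action": "cut", "object": "bread", "relation": "top", "place": "table"},
]
probs = edge_probabilities(train)
print(probs[("subject", "hand+knife", "object", "apple")])   # 0.5
```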

The training of the Conditional Random Field (CRF) is central to this method and requires choosing an appropriate dataset. We draw upon the KIT Bimanual Actions Dataset [8], from which the CRF learns to extract relationships and to determine probabilities for all existing Semantic Representations (SRes). We emphasize that this choice is deliberate: we use a quite limited dataset. This constraint keeps the model precise and prevents it from lapsing into an over-generalized state.

However, the decision to employ a limited dataset, while advantageous in terms of focus, introduces a potential challenge. With the restricted data at hand, there is a risk of overestimating the probabilities of connections between nodes within our graphs. Such an overestimation can lead to inaccuracies when the CRF tries to predict connections for novel, previously unseen data. To mitigate this risk, we employ the Word Lattice Model [9].

Formally, the Word Lattice Model is a directed acyclic graph that encodes a set of potential outcomes, each with a corresponding confidence. Traditionally used in machine translation, its utility extends beyond decoding the most probable hypothesis, because it also retains alternative interpretations that carry slightly lower confidence but remain plausible. Within our methodology, the Word Lattice Model serves as a second line of verification: it compares the probabilities produced by the CRF with its own map of potential relationships. If the probabilities derived from the CRF starkly contradict those inferred by the Word Lattice Model, we invoke a re-adjustment process, which reorders the probabilities to bring them into alignment with empirical and logical expectations.

The steps of the re-adjustment process are as follows:

  1.

    Discrepancy Evaluation: Initially, we perform a quantitative analysis to identify any significant discrepancies between the CRF’s predicted probabilities and those of the Word Lattice Model. This analysis includes statistical tests to ascertain whether the differences are beyond acceptable margins, thereby warranting adjustments.

  2.

    Identification of Outliers: Probabilities that exhibit substantial deviations from those proposed by the Word Lattice Model are flagged as outliers. These outliers are then thoroughly examined to understand the underlying causes, such as data sparsity or bias in the training set.

  3.

    Probability Recalibration: The recalibration involves adjusting the outlier probabilities. The steps involved in this recalibration are as follows:

    • Empirical Evidence Gathering: We revisit the training set to gather instances that correspond to the outlier probabilities. This involves analyzing segments where the co-occurrence of nodes is either over-represented or under-represented.

    • Reassessment of Contextual Factors: Each instance is evaluated in its contextual entirety, considering factors such as the frequency of action-object pairings in various contexts and the diversity of the actions performed.

    • Expert Verification: Subject matter experts review the instances to confirm or dispute the validity of the co-occurrence probabilities. Their insights are crucial for determining whether to retain, increase, or decrease specific probability values.

    • Adjustment Based on Consensus: If a consensus is reached that the current probability values are inconsistent with the empirical evidence and expert opinion, we employ a mathematical model to calculate the new probability. This model incorporates the frequency of the observed co-occurrences and the experts’ qualitative assessments to produce a more balanced and representative probability value.

    • Integration and Normalization: The newly calculated probabilities are integrated back into the CRF’s model, followed by a normalization process to maintain the probabilistic model’s consistency.

    These steps ensure that each adjustment is not an arbitrary decision but is instead grounded in actual data and expert validation, leading to a CRF model that is both accurate and reliable.

  4.

    Validation and Iteration: Following recalibration, we conduct a validation phase using a held-out validation dataset, which serves as a new reference point for assessing the accuracy of the adjusted probabilities. If the validation results indicate that the adjustments have not yielded an improvement, we iteratively refine the probabilities. This iterative process is guided by a combination of empirical evidence and the feedback provided by the validation phase until the discrepancies are resolved.

  5.

    Integration and Finalization: Once the adjusted probabilities pass the validation phase, they are integrated back into the CRF model. The final model is then subjected to a comprehensive evaluation to ensure that the adjustments have enhanced its predictive performance and generalization capabilities.

Through this rigorous re-adjustment process, we maintain the integrity and applicability of our CRF model, ensuring that it accurately reflects the complex relationships within the data, despite the limitations imposed by a smaller training set.
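
The sketch below indicates how such a re-adjustment loop could be organized; the discrepancy threshold, the blending rule, and the expert weight are our own assumptions, since the steps above are described only qualitatively.

```python
# Hedged sketch of the re-adjustment loop (threshold, blending rule and expert weight are assumptions).
def readjust(p_crf: dict, p_lattice: dict, empirical: dict, expert: dict,
             threshold: float = 0.3, expert_weight: float = 0.5) -> dict:
    adjusted = dict(p_crf)
    for edge, p in p_crf.items():
        q = p_lattice.get(edge, p)
        if abs(p - q) > threshold:                   # 1. discrepancy evaluation, 2. outlier flagging
            evidence = empirical.get(edge, q)        # 3. empirical evidence from the training set
            opinion = expert.get(edge, evidence)     #    expert verification
            adjusted[edge] = (1 - expert_weight) * evidence + expert_weight * opinion
    total = sum(adjusted.values())                   #    normalization (here simply over all edges)
    return {e: v / total for e, v in adjusted.items()}

p_new = readjust(
    p_crf={"cut-apple": 0.5, "cut-lamp": 0.5},
    p_lattice={"cut-apple": 0.9, "cut-lamp": 0.05},
    empirical={"cut-apple": 0.85, "cut-lamp": 0.02},
    expert={"cut-lamp": 0.0},
)
print(p_new)
```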

For readers seeking a more profound understanding of the nuanced mechanics of the Word Lattice Model, we refer to [9], which provides a comprehensive exploration of its intricacies.

Following this data preparation process, in which we carefully extracted, verified, and fine-tuned our training samples, the curated data are fed into the first layer of the Long Short-Term Memory (LSTM) decoder. This step lays the foundation for the generation of the detailed descriptions.

While most existing video description methods produce only one sentence for each video clip, an important capability of our approach is that it can generate descriptions in the form of several coherent sentences for each action, at different levels of descriptive detail. For this purpose, we use k (Fig. 3, right) parallel levels of LSTMs, where each LSTM level has two layers. To capture the lowest, most detailed descriptive level, the total number of LSTM levels must equal the length of the most complex string of atomic actions used to describe any real action in the data sets. The pre-analysis performed to create the library of action mappings showed that the most complex action here consists of 14 AAs; hence, at most 14 levels of LSTMs are needed. At the first, bottom level, descriptions consist of a set of very detailed sentences, each of which describes just one single atomic action. Such sentences are clearly non-human-like and very awkward, but we can now begin to concatenate atomic actions, guided by the entries in the library of action mappings, leading to sentences that describe concatenated groups of AAs by single action verbs. This is repeated until full concatenation is reached. For every level that results this way one LSTM is trained, where levels are often left out because not all AAs can be concatenated into a meaningful verb. As a result, this algorithm can produce coherent sentences describing manipulation videos at different levels.
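
The sketch below illustrates how levels of granularity arise by concatenating atomic-action sentences according to the library of action mappings; the grouping and sentences are invented for illustration.

```python
# Levels of description: level 1 has one sentence per atomic action; higher levels replace
# groups of atomic-action sentences by a single verb-level sentence from the library.
def build_levels(aa_sentences: list[str], mappings: list[tuple]) -> list[list[str]]:
    levels = [list(aa_sentences)]
    current = list(aa_sentences)
    for span, verb_sentence in mappings:             # spans refer to the current (partially merged) list
        current = current[:span.start] + [verb_sentence] + current[span.stop:]
        levels.append(list(current))
    return levels

aa_sents = ["The left hand touches the screwdriver.",
            "They untouch the table.",
            "They move together above the table.",
            "They touch the hard disk."]
mappings = [(slice(0, 4), "The left hand picks up the screwdriver and places it on the hard disk.")]
for i, level in enumerate(build_levels(aa_sents, mappings), 1):
    print(f"level {i}: {level}")
```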

6 Method 2: end-to-end network

To address the constraints of the hybrid method described above, we propose an end-to-end trainable framework designed to learn not only the temporal structure of the input video sequences but also the sequence modeling required for generating comprehensive multi-level descriptions. In the following sections, we provide a detailed breakdown of the key components and steps of our end-to-end framework.

6.1 Data collection and annotation

Our atomic action recognition model relies on a curated collection of video datasets that spans diverse domains to ensure a rich variety of actions. We selected five datasets: YouCook2, MSR-VTT, ActivityNet Captions, EPIC-Kitchens, and the KIT Bimanual Actions Dataset. Together they cover a wide range of activities and domains, providing a comprehensive representation of human manipulation actions; this diversity is essential for the adaptability and robustness of our atomic action recognition model. To harness the full potential of these datasets and create a reliable foundation for the model, we employed a meticulous annotation process: human annotators systematically analyzed each video, following specific guidelines and the atomic action definitions to label and describe the actions within the videos. This annotation provides precise information about the sequence of atomic actions in each video and forms the basis for training the atomic action recognition model.

6.2 Automatic atomic action recognition

In this section, we present an overview of our automatic end-to-end atomic action recognition and description approach, which consists of three integral stages:

  1.

    Visual Feature Extraction (CNN):

To initiate the process, we employ Convolutional Neural Networks (CNNs) for visual feature extraction. This stage serves as the foundation for our atomic action recognition and description approach. The choice of CNNs is rooted in their proven effectiveness in handling spatial information, a fundamental requirement for accurate atomic action recognition. Atomic actions, as elemental components of video narratives, often exhibit fine-grained spatial intricacies, such as subtle hand movements, object interactions, or body postures. CNNs are renowned for their ability to capture intricate spatial patterns, making them a natural and robust fit for our problem.

The input to this stage is a set of video frames \( F = \{f_1, f_2, \ldots , f_{n_f}\} \), where \( n_f \) represents the number of frames in the video sequence. These frames contain rich visual information, encompassing the dynamic interplay of objects, scenes, and actors in the video. Our objective is to extract meaningful visual features from these frames.

To achieve this, we leverage the Visual Geometry Group (VGG) architecture, specifically exploring variants such as VGG16 and VGG19. VGG architectures are well-regarded for their architectural simplicity and remarkable proficiency in capturing spatial details. They have demonstrated exceptional performance in image classification tasks, making them a compelling choice for feature extraction in our context.

The output of this stage is a sequence of feature vectors \( X = \{x_1, x_2, \ldots , x_{n_f}\} \), where \( x_i \) represents the feature vector extracted from frame \( f_i \). These feature vectors encode the spatial characteristics and patterns observed in the video frames. They encapsulate information related to the distribution of edges, textures, shapes, and object arrangements within each frame.
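
A minimal PyTorch sketch of this frame-wise extraction follows; the torchvision VGG16 model is used as a stand-in, and the input size and the choice of feature layer are assumptions.

```python
# Frame-wise VGG16 feature extraction (weights omitted for brevity; in practice
# pre-trained weights would be loaded).
import torch
from torchvision.models import vgg16

cnn = vgg16(weights=None)
cnn.classifier = cnn.classifier[:-1]      # drop the classification layer, keep 4096-d features
cnn.eval()

frames = torch.randn(8, 3, 224, 224)      # n_f = 8 RGB frames, resized and normalized
with torch.no_grad():
    X = cnn(frames)                       # X: (n_f, 4096), the sequence {x_1, ..., x_{n_f}}
print(X.shape)
```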

  2.

    Temporal Modeling (GRUs):

In this stage, we delve into the modeling of nuanced temporal dynamics within atomic actions using Gated Recurrent Units (GRUs). This is a pivotal step, as atomic actions often involve subtle and temporally precise movements that necessitate sophisticated modeling.

The input to this stage is the sequence of feature vectors \( X = \{x_1, x_2, \ldots , x_{n_f}\} \), which were extracted by the CNNs during the “Visual Feature Extraction (CNN)” stage. Each feature vector \( x_i \) encodes the spatial characteristics of the corresponding frame \( f_i \).

Our objective is to model the temporal evolution of these feature vectors, capturing how atomic actions unfold over time. To achieve this, we employ GRUs, which are recurrent neural networks well-suited for handling sequential data.

The key operations of GRUs can be described as follows:

For each frame \( f_i \), the GRU computes the hidden state \( h_i \) using the input feature vector \( x_i \) and the previous hidden state \( h_{i-1} \). This operation is expressed as:

$$\begin{aligned} h_i = \text {GRU}(x_i, h_{i-1}) \end{aligned}$$

Here, \( h_i \) represents the hidden state at time step \( i \), and \( h_{i-1} \) is the hidden state from the previous time step. This recurrent operation enables the network to capture the temporal dependencies within the sequence of feature vectors.

The output of this stage is a sequence \( Y = \{y_1, y_2, \ldots , y_{n_f}\} \), where \( y_i \) represents the output at time step \( i \). These outputs \( y_i \) encapsulate the temporal dynamics of the input feature vectors. Each \( y_i \) can be thought of as a rich representation that encodes how the features within atomic actions evolve over time.

In essence, the “Temporal Modeling (GRUs)” stage transforms spatial information into a temporal representation, allowing us to capture the temporal dynamics inherent in a sequence of atomic actions. The output sequence \( Y \) serves as a valuable foundation for the subsequent stages of our end-to-end approach, enabling us to generate comprehensive video descriptions.
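
A minimal PyTorch sketch of this stage; the hidden size is an assumption.

```python
# GRU over the frame features: Y collects the hidden states h_i = GRU(x_i, h_{i-1}).
import torch
import torch.nn as nn

gru = nn.GRU(input_size=4096, hidden_size=512, batch_first=True)
X = torch.randn(1, 8, 4096)        # (batch, n_f, feature dim) from the CNN stage
Y, h_last = gru(X)                 # Y: (1, n_f, 512) = {y_1, ..., y_{n_f}}
print(Y.shape, h_last.shape)
```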

  3.

    Description Generation (Stacked LSTMs):

With the temporal modeling stage completed, we now focus on generating descriptive annotations for the video frames. This phase bridges the gap between visual data and textual descriptions.

The input to this stage is the sequence of output vectors \( Y = \{y_1, y_2, \ldots , y_{n_f}\} \) from the GRUs. However, instead of generating a single sentence, we employ a hierarchical approach to create multi-level descriptions.

Our hierarchical description generation system consists of \(k\) levels of Long Short-Term Memory (LSTM) layers, each level comprising two layers, as mentioned above. This hierarchy enables us to produce descriptions at various levels of detail.

Each level of the hierarchy corresponds to a specific level of detail in the descriptions. We train one LSTM for each level, often leaving out levels that do not result in meaningful concatenation due to the nature of the atomic actions.

Formally, the distribution over the output sequence \( W \) given the input sequence \( F \) can still be defined as \( P(W|F) \), where:

$$\begin{aligned} p(w_1, w_2, \ldots , w_{n_w} | f_1, f_2, \ldots , f_{n_f}) = \prod _{t=1}^{n_w} p(w_t | w_{t-1}, h_t) \end{aligned}$$

Here, \( p(w_t | w_{t-1}, h_t) \) represents the distribution over the words in the vocabulary for predicting the next word \( w_t \) given the previous word \( w_{t-1} \) and the corresponding hidden state \( h_t \) of the LSTM at the current hierarchical level.
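
A hedged PyTorch sketch of one hierarchical level of the two-layer LSTM decoder, implementing \(p(w_t \mid w_{t-1}, h_t)\), is given below; the vocabulary size, embedding size, and the way the visual state is injected are assumptions.

```python
# One level of the hierarchical decoder: the previous word embedding is concatenated with
# the current visual state and a two-layer LSTM predicts a distribution over the vocabulary.
import torch
import torch.nn as nn

class LevelDecoder(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=256, visual_dim=512, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + visual_dim, hidden_dim, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_words, visual_states):
        # prev_words: (batch, n_w) ids of w_{t-1}; visual_states: (batch, n_w, visual_dim) for h_t
        inputs = torch.cat([self.embed(prev_words), visual_states], dim=-1)
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)            # logits over the vocabulary for each w_t

decoder = LevelDecoder()
logits = decoder(torch.randint(0, 1000, (1, 5)), torch.randn(1, 5, 512))
print(logits.shape)                        # (1, 5, 1000)
```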

During training in the decoding stage, our objective is to estimate the model parameters to maximize the likelihood of the predicted output sentence given the hidden representation of the visual frame sequence and the previous words it has processed.

This hierarchical approach allows us to provide multi-level descriptions, ranging from detailed atomic actions to more concise and contextually organized narratives, enhancing the expressiveness and adaptability of our description generation system.

In summary, our end-to-end approach integrates VGG-based CNNs for robust visual feature extraction, GRUs for modeling intricate temporal dependencies, and stacked LSTMs for generating descriptive annotations. This multi-stage process enhances the precision and efficiency of our automatic annotation generation system, enabling it to identify subtle, temporally precise, and contextually significant atomic actions within video sequences.

Figure 3 depicts our end-to-end model at the bottom, as well as shows its relation to the hybrid approach.

7 Method 3: integrated method for enhanced video description

Building upon the detailed exposition of both the hybrid statistical and end-to-end methods presented in the preceding sections, this section introduces a novel integration strategy that synergistically combines these approaches to enhance the generation of video descriptions. The necessity for such integration stems from the observation that while each method independently offers unique advantages in analyzing video content, their combined potential to address the multifaceted nature of video description tasks has not been fully realized.

The hybrid statistical method, with its robust capability for extracting precise contextual insights through statistical analysis of object interactions and spatial dynamics, provides a detailed understanding of specific elements within the video. Concurrently, the end-to-end method employs deep learning architectures to capture an extensive array of visual features, offering a comprehensive overview of the video scenes. The integration of these methodologies aims to harness the detailed contextual awareness of the hybrid method alongside the broad visual comprehension of the end-to-end approach, thereby producing video descriptions of appropriate depth and accuracy.

Through this integrated approach, we systematically address the video description generation process in several key steps, as follows:

  1.

    Feature Enhancement and Contextual Integration: Initially, we merge the contextual insights derived from the hybrid statistical method with the dense visual features extracted by the end-to-end method’s deep learning architecture. This merger is facilitated through a custom fusion technique, employing a weighted blending algorithm to harmonize these distinct feature sets into a cohesive representation. This ensures a comprehensive feature set that encapsulates both the micro-level details and the macro-level scene context.

  2.

    Advanced Temporal Feature Fusion: Subsequent to the initial feature enhancement, the integrated feature set undergoes a process of advanced temporal fusion. This stage employs a hybrid of GRU and LSTM networks, designed to capture the nuanced temporal dynamics of video content. The GRU units provide agility in updating features to reflect short-term changes, while the LSTM layers ensure continuity and coherence over longer sequences, effectively modeling both immediate and extended narrative arcs without the reliance on Transformer architectures.

  3.

    Dynamic Multi-granularity Description Generation: The final step involves the generation of video descriptions, where a dynamic granularity adjustment algorithm plays a crucial role. This algorithm intelligently determines the level of detail required for each segment of the video, enabling the generation of descriptions that vary in granularity. This adaptability ensures that each description is contextually appropriate, accurately reflecting the complexity and narrative requirements of the video content.

Each of these steps represents a critical component of our integrated method, working in concert to produce video descriptions that are not only accurate and detailed but also nuanced and contextually rich. The following subsections will provide a more in-depth exploration of these steps, highlighting the technical intricacies and the methodological rationale behind our approach.

7.1 Feature enhancement and contextual integration

The first step in our integrated approach, Feature Enhancement and Contextual Integration, involves the amalgamation of high-level contextual insights from the hybrid statistical method with the visual features extracted by the end-to-end method. This process is formulated as follows:

Let \(F_{HS}\) denote the set of features extracted using the hybrid statistical method, where each feature \(f_{hs} \in F_{HS}\) represents contextual insights derived from object interactions and spatial dynamics within the video. Similarly, let \(F_{EE}\) denote the set of features extracted by the end-to-end method’s deep learning architecture, with each feature \(f_{ee} \in F_{EE}\) encapsulating a broad array of visual information from the video frames.

The goal of Feature Enhancement and Contextual Integration is to produce a unified feature set \(F_{U}\), which harmonizes these two diverse feature sets into a cohesive representation. This is achieved through a weighted blending algorithm, mathematically represented as:

$$\begin{aligned} F_{U} = \alpha (F_{HS}) \oplus \beta (F_{EE}) \end{aligned}$$

where \(\oplus \) denotes the fusion operation, and \(\alpha \) and \(\beta \) are weighting functions applied to \(F_{HS}\) and \(F_{EE}\), respectively. These weighting functions are designed to dynamically adjust the importance of each feature set based on its relevance to the video content’s narrative context.

The fusion operation \(\oplus \) is implemented through a specialized fusion layer, which employs an attention mechanism to assess and prioritize features from both sets. Formally, the attention mechanism can be represented as:

$$\begin{aligned} A(F_{HS}, F_{EE}) = \text {softmax}(W_{a}[F_{HS}; F_{EE}]) \end{aligned}$$

where \(W_{a}\) is a learnable weight matrix that projects the concatenated feature sets \([F_{HS}; F_{EE}]\) onto an attention space, and softmax is applied to normalize the attention weights. The output of this attention mechanism is then used to modulate the contribution of each feature in the final unified feature set \(F_{U}\).
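
A PyTorch sketch of such a fusion layer follows; the feature dimensions and the concrete form of the weighting functions \(\alpha\) and \(\beta\) are assumptions.

```python
# Attention-weighted fusion of hybrid-statistical features F_HS and end-to-end features F_EE.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, dim_hs=128, dim_ee=512, dim_out=512):
        super().__init__()
        self.proj_hs = nn.Linear(dim_hs, dim_out)     # plays the role of alpha(.)
        self.proj_ee = nn.Linear(dim_ee, dim_out)     # plays the role of beta(.)
        self.w_a = nn.Linear(dim_hs + dim_ee, 2)      # W_a applied to [F_HS; F_EE]

    def forward(self, f_hs, f_ee):
        attn = torch.softmax(self.w_a(torch.cat([f_hs, f_ee], dim=-1)), dim=-1)   # A(F_HS, F_EE)
        return attn[..., :1] * self.proj_hs(f_hs) + attn[..., 1:] * self.proj_ee(f_ee)  # F_U

fusion = FusionLayer()
F_U = fusion(torch.randn(1, 8, 128), torch.randn(1, 8, 512))
print(F_U.shape)                                      # (1, 8, 512)
```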

The attention mechanism is particularly suited for this task as it allows the model to dynamically focus on the most informative features at each step of the description generation process. By weighting features based on their relevance, the mechanism facilitates a more nuanced and contextually aligned integration of visual and contextual information, mirroring human cognitive processes in prioritizing salient aspects of a scene.

Through this process, Feature Enhancement and Contextual Integration not only enriches the feature set with both micro-level details and macro-level scene context but also ensures that the integrated features are optimally aligned with the video content’s narrative structure. This foundational step sets the stage for the subsequent advanced temporal feature fusion and dynamic multi-granularity description generation processes.

7.2 Advanced temporal feature fusion

At the heart of our integrated methodology lies the Advanced Temporal Feature Fusion step, which intricately combines the capabilities of Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) networks. This fusion is designed to enhance our model’s understanding of video content’s temporal dynamics, crucial for generating coherent and contextually accurate video descriptions.

7.2.1 Hybrid temporal modeling strategy

The integration of GRU and LSTM networks is predicated on leveraging their complementary strengths to capture a comprehensive temporal profile of video sequences. While both networks are adept at processing time-series data, their combination allows for a nuanced modeling of video content that spans immediate action transitions and extends over longer narrative arcs. The strategy is implemented as follows:

$$\begin{aligned} F_{T} = \text {LSTM}(\text {GRU}(F_{U})) \end{aligned}$$

where \(F_{U}\) represents the unified feature set derived from the previous feature enhancement and contextual integration step. This feature set is first processed through GRU units, which are efficient in capturing the short-term temporal dependencies within the video. The output of the GRU network serves as the input to the LSTM network, which is then responsible for integrating these short-term dynamics into a coherent long-term context, resulting in the temporally fused feature set \(F_{T}\).
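
A minimal PyTorch sketch of this cascade; the hidden sizes are assumptions.

```python
# F_T = LSTM(GRU(F_U)): short-term dynamics from the GRU, long-term context from the LSTM.
import torch
import torch.nn as nn

gru = nn.GRU(input_size=512, hidden_size=512, batch_first=True)
lstm = nn.LSTM(input_size=512, hidden_size=512, batch_first=True)

F_U = torch.randn(1, 8, 512)        # unified feature set from the fusion step
short_term, _ = gru(F_U)
F_T, _ = lstm(short_term)
print(F_T.shape)                    # (1, 8, 512)
```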

7.2.2 Rationale and benefits

The rationale behind this hybrid approach is twofold:

  1.

    Comprehensive Temporal Understanding: By sequentially processing the unified feature set through GRU and then LSTM networks, our model can capture the full spectrum of temporal dynamics in video content. This includes immediate reactions and subtle transitions, as well as the development of events over time, ensuring that the generated descriptions are both accurate in the moment and consistent across the video sequence.

  2.

    Optimized Computational Efficiency: The sequential application of GRU and LSTM networks allows for an optimized computational process where the model can quickly adapt to new information while maintaining essential historical context. This efficiency is crucial for processing complex video datasets where computational resources may be a limiting factor.

This hybrid temporal feature fusion strategy significantly contributes to the model’s ability to generate dynamic, multi-granularity video descriptions. By ensuring that temporal coherence is maintained throughout the description generation process, we set a foundation for producing high-quality video summaries that accurately reflect the intricate interplay of actions and events over time.

7.3 Dynamic multi-granularity description generation

The final cornerstone of our integrated method is the Dynamic Multi-granularity Description Generation. This phase capitalizes on the advanced feature set provided by the preceding stages, applying a refined version of our established description generation mechanism. The innovation here does not lie in a complete overhaul of the generation technology itself but in the sophisticated application and enhancement of existing capabilities, tailored to leverage the enriched context and temporal coherence achieved through the integration of hybrid statistical and end-to-end methods.

7.3.1 Enhanced description generation mechanism

Building on the description generation foundations established in earlier sections, this stage introduces nuanced advancements that enable dynamic adjustment of description granularity. This enhancement is made possible by the integrated feature set \(F_{T}\), which provides a more detailed and temporally-aligned view of the video content than was available to either method independently.

The mechanism for adjusting description granularity operates on the principle of contextual relevance and narrative significance, automatically determining the level of detail required for each segment of the video. This results in a tailored generation of descriptions that can vary from detailed accounts of specific actions to broader summarizations of the scene, depending on the assessed need.
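
The sketch below indicates one possible realization of this granularity adjustment; the relevance score and the mapping to levels are assumptions, since the text states the principle rather than a concrete formula.

```python
# Hypothetical mapping from a per-segment relevance score to one of k description levels
# (1 = most detailed, k = most condensed).
def choose_level(relevance: float, k: int = 14) -> int:
    return max(1, min(k, round(k - relevance * (k - 1))))

for score in (0.95, 0.5, 0.05):
    print(f"relevance {score:.2f} -> level {choose_level(score)}")
```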

7.3.2 Implementation and advantages

The implementation of this nuanced approach involves minor but critical adjustments to our existing description generation algorithms, enabling them to utilize the enriched feature set effectively. These adjustments focus on:

  1.

    Contextual Weighting: Enhancing the algorithm’s ability to weight features based on their narrative significance, ensuring that descriptions focus on the most relevant aspects of the video content.

  2.

    Temporal Alignment: Refining the generation process to better account for the temporal structure of the video, producing descriptions that are coherent over the entire length of the content.

This dynamic and adaptable approach to description generation brings several key advantages:

  • Increased Relevance: Descriptions are more closely aligned with the viewer’s informational needs and the specific context of the video, enhancing the overall utility and engagement of the generated content.

  • Flexibility: The system can more effectively handle a wide range of video types and content complexities, making it more versatile and broadly applicable.

  • Efficiency: By focusing on narrative significance, the generation process becomes more efficient, avoiding unnecessary detail where a broader summary is more appropriate.

While the foundational technologies for description generation remain consistent with those previously detailed, the Dynamic Multi-granularity Description Generation introduces a strategic advancement in their application. By finely tuning the description process to the integrated feature set’s unique strengths, our method achieves a new level of precision and adaptability in video description generation, enhancing both the accuracy and relevance of the produced descriptions.

8 Conclusion

In this paper, we proposed two methods for producing multiple-sentence descriptions of complex manipulation actions, a class of actions that is very common for humans and also prevalent in human-robot interaction. Our central focus has been on the generation of a hierarchically ordered set of different annotations, from detailed and complex to condensed and simple, which was achieved by using stacked LSTMs. The problem of generating multiple sentences has been studied before: Senina et al. [31] modeled intra-sentence consistency by considering the probability of occurrence between pairs of objects and actions in consecutive sentences; Zhang et al. [45] proposed a method for multi-sentence generation by modeling the relational information between objects and events in the video using a graph-based neural network; more recently, Rohrbach et al. [29] presented an adversarial inference for multi-sentence video description. However, the primary emphasis of existing methods has been on describing the sequence of complex actions executed by one or more subjects; they do not pay attention to the constituent sub-actions that make up each action. Different from this, our work is the first that constructs a hierarchy of action descriptions building on the concept of atomic actions, which are directly recognizable in a video by conventional computer vision methods. This way we can produce descriptive sentences at different levels of granularity. Even transformers may not be the ideal choice to this end, as they depend on substantial labeled data and extensive pre-processing procedures; additionally, they may encounter difficulties in capturing the spatial and temporal relationships in videos. Our joint embedding space approach, on the other hand, employs pre-trained visual and linguistic features to match inputs directly, facilitating efficient generation of multi-sentence descriptions for complex manipulation action videos without extensive pre-processing or large amounts of labeled data.

Quantitative analysis has demonstrated that our methods are comparable to the state of the art. Additionally, human raters have confirmed that the resulting descriptions are understandable and generally appropriate. In conclusion, we would argue that hierarchical action description should offer additional functionality, allowing users to adapt the descriptive depth to their individual needs.

8.1 Comparison of our methods: strengths and limitations

When selecting an approach for video description generation, it is crucial to consider the strengths and limitations of both the Hybrid Statistical and End-to-End methods. This comparison aims to provide insights for informed decision-making. Here we summarize the aspects of different methods and discuss their pros and cons. In the Appendix, we provide an itemized version of this to allow for an easier side-by-side comparison.

The Hybrid Statistical Method utilizes a combination of handcrafted features, rule-based modeling, and linguistic knowledge to create interpretable descriptions of actions. Its strength lies in providing multi-level descriptions that are not only clear but also based on well-established rules, making it particularly effective for complex actions that follow known patterns. This method, however, is not without its limitations. It relies heavily on manual engineering and linguistic input, which can make it less adaptable to new, diverse actions that are not covered by the predefined rules. While there are strategies to handle unseen data, such adaptations often require manual intervention, which may not be ideal in rapidly changing environments.

On the other hand, the End-to-End Method leverages the power of Convolutional Neural Networks (CNNs) for feature extraction, Gated Recurrent Units (GRUs) for temporal modeling, and Long Short-Term Memory networks (LSTMs) for generating descriptions. This method is, thus, fundamentally data-driven, learning directly from very large datasets, which allows it to generalize across a wide range of tasks and actions. It is scalable and capable of handling unseen data with more ease than the hybrid approach, especially when employing strategies like data augmentation and transfer learning to enhance its adaptability. The end-to-end method is suitable for applications where the availability of large amounts of training data and the need for diverse action handling are paramount, offering automated, data-driven solutions that minimize the need for manual tweaking.

The Integrated Approach represents a strategic synthesis of the Hybrid Statistical and End-to-End methods, aiming to harness their respective strengths while mitigating their limitations. By combining the interpretability and rule-based precision of the Hybrid Statistical Method with the scalability and data-driven flexibility of the End-to-End Method, the integrated approach offers a comprehensive solution for video description generation. It provides the ability to generate multi-level descriptions with improved adaptability to new actions, leveraging large datasets for learning while maintaining the capacity for detailed, rule-based interpretations where necessary. This makes the integrated approach particularly versatile and suitable for a wide array of applications that benefit from both the depth of interpretability and the breadth of data-driven insights.

When deciding between the two, the Hybrid Statistical Method is the preferred choice for scenarios that require highly interpretable, multi-level descriptions of well-defined actions. It is ideal when dealing with complex actions with known patterns and where interpretability is crucial. Conversely, the End-to-End Method is preferred in situations with abundant training data, a need to handle a diversity of actions, and a preference for automated, data-driven solutions. It excels in adaptability, making it capable of handling new actions more effectively. In some cases, combining both methods in the form of the integrated approach may offer the best of both worlds, with the hybrid approach tackling complex actions and the end-to-end method providing automation for more common tasks.