Video summarization using deep learning techniques: a detailed analysis and investigation

Saini, Parul; Kumar, Krishan; Kashid, Shamal; Saini, Ashray; Negi, Alok

doi:10.1007/s10462-023-10444-0

Published: 15 March 2023

Volume 56, pages 12347–12385, (2023)

Artificial Intelligence Review

Abstract

One of the critical multimedia analysis problems in today’s digital world is video summarization (VS). Many VS methods based on deep learning have been proposed. Nevertheless, these are inefficient at processing, extracting, and deriving information from long-duration videos in the minimum amount of time. A detailed analysis and investigation of numerous deep learning approaches is carried out to determine the root of the problems that different deep learning methods face in identifying and summarizing the essential activities in such videos. Various deep learning techniques are investigated and examined for their capability to detect and summarize multiple activities. Keyframe selection, event detection, categorization, and activity feature summarization correspond to each activity. The limitations related to each category are also discussed in depth. Concerns about detecting low activity using deep networks on various types of public datasets are also discussed. Viable strategies are suggested to evaluate and improve the generated video summaries on such datasets. Moreover, potential applications recommended in the literature are listed. Various deep learning tools for experimental analysis are also discussed in the paper. Future directions are presented for further research in VS using deep learning strategies.

1 Introduction

Videos are the most potent and popular multimedia form as they quickly connect with users. With the arrival of high-speed Internet and low-cost storage, data is being generated at a rapid pace, most of it in the form of visual or video data (Money and Agius 2008). Video hosting, television show hosting, social media, and online news platforms such as Wistia, SproutVideo, YouTube, Netflix, Amazon Prime, Twitter, LinkedIn, and Facebook house a vast amount of video material. YouTube alone produces more than 10 h of video content every second. Video requires more storage and transmission bandwidth than images and text. Moreover, considerable human resources are necessary to analyze such videos. For such large volumes of data, effective methods and tools are required to capture videos and present them more compactly and concisely, so that they can be used in various applications (Kumar et al. 2018).

The primary objective of VS is to analyze the video by dropping the unnecessary or redundant frames and preserving the keyframes (Kumar et al. 2016). Moreover, it helps to accelerate the browsing of extensive collections of video data and to achieve structured access to and representation of the video content. An enormous number of video recordings are generated and shared on the internet around the clock. In this multimedia era, practical use cases for video arise in almost every domain. Therefore, a video summary is convenient whenever a user wants to glance rapidly at video content. Consequently, Automatic VS (AVS) (Binol et al. 2021) is a major and growing research area in this field. Artificial Intelligence (AI) enabled software can perform the task of summarizing lengthy videos.

Various professional and educational applications are based on multiple types of VS. These applications generate or use enormous volumes of multimedia data for tasks such as monitoring, tracking, diagnosing, identifying, investigating, and security analysis. Media organizations, including those that produce sports or entertainment videos, generate teasers or trailers of movies and TV series (Emon et al. 2020) and can also use VS to create video highlights for events. Moreover, video search engines can use VS for video indexing, browsing, retrieval, and recommendation (Emon et al. 2020). In addition, medical video analysis can use VS for complex diagnostics.

Further, VS is also employed to remove frame redundancy, reducing storage requirements and computational time (Xiao et al. 2020b). By choosing the most informative segments of the video, video summarization technologies attempt to provide a brief yet complete description. This speeds up video processing, storage, management, and retrieval, making it easier to interpret and analyze particular situations or events in long videos. A video summary can be static or dynamic. A static summary is a collection of frames, referred to as key-framing or a video storyboard. A static summary is often insufficient for users to understand the video, especially in the case of long videos (Emon et al. 2020; Xiao et al. 2020b). However, static techniques can be used to view and index videos and to present videos as thumbnails.

The other type, video skimming, consists of related shots, i.e., a collection of video segments with corresponding audio information, which improves the semantics of the summary. Further, watching a skim rather than a slide show of frames is generally more entertaining and engaging for the audience (Emon et al. 2020; Xiao et al. 2020b); however, it is more time-consuming. Storyboards, in contrast, are not bound by timing or synchronization issues, so they allow greater flexibility in organizing data for browsing and navigation (Binol et al. 2021; Emon et al. 2020; Xiao et al. 2020b). Some examples of VS are generated highlights and video synopsis (Tiwari and Bhatnagar 2021; Ajmal et al. 2012; Sridevi and Kharde 2020). Thumbnail generation is also considered very close to VS. Conventional thumbnail generation techniques cannot provide a meaningful synopsis to users.

A graph convolved video thumbnail pointer (GTP) can produce a semantically meaningful and coherent video thumbnail from an input video. It also generates a thumbnail that is semantically related to a natural sentence query (Yuan et al. 2019b). A sentence guided temporal modulation (SGTM) (Rochan et al. 2020) technique uses sentence embedding to control the normalised temporal activations of the video thumbnail generation network. VS techniques can be coarsely classified into unsupervised approaches (Mahmoud et al. 2013; Li et al. 2006; Ma et al. 2002; Barbieri et al. 2003) and supervised approaches (Sundaram et al. 2002; Agnihotri et al. 2001; Li et al. 2001). A detailed framework based on gesture, audio-visual, and object features (Hu et al. 2011) is presented for visual content-based video indexing and retrieval, including structural analysis, feature analysis, video data mining using extracted features, and feedback. Barbieri et al. (2003) divide video summaries into various levels, including local (scene level) (Sundaram et al. 2002), global (Agnihotri et al. 2001; Li et al. 2001), and meta-level (Hussain et al. 2019).

Based on different aspects, the literature on deep learning-based VS (Del Molino et al. 2016; Senthil Murugan et al. 2018; Sreeja and Kovoor 2019; Money and Agius 2008) splits VS techniques into three subtypes, internal (Khan et al. 2020a; Pereira et al. 2019), external (Sharghi et al. 2017b; Coppola et al. 2020; Lee et al. 2018), and hybrid (Zhu et al. 2016), according to the source of information. A visual surveillance system has been proposed that detects moving objects in order to summarize videos (Senthil Murugan et al. 2018). VS can also be classified based on the generated summary as generic, object-based, or event-based (Nair and Mohan 2021; Money and Agius 2008; Basavarajaiah and Sharma 2021). A VS technique (Pereira et al. 2019) considered various criteria such as the type of video source, summary or synopsis, preferences, genre, mechanism, and application domain. A categorization of VS techniques focused on compressed-domain summarization has been presented (Basavarajaiah and Sharma 2019). Moreover, some state-of-the-art techniques (Hussain et al. 2021) are presented for Multi-View VS (MVS), which poses different summarization challenges than mono-view videos.

In supervised approaches, training a deep network takes a long time. Graphics processing units (GPUs) reduce the training time and handle the computational load of deep learning. A large number of Convolutional Neural Networks (CNN) and Deep Convolutional Neural Networks (DCNN), including GoogLeNet, Inception V3, AlexNet, variations of ResNet, and variations of Very Deep Convolutional Networks (VGGNet) (Nair and Mohan 2021; Kumar and Shrimankar 2017; Ji et al. 2019; Muhammad et al. 2020; Hussain et al. 2019) have been demonstrated for several applications (Brezeale and Cook 2008). GoogLeNet seems to be the most widely used so far. Some key steps in video summarization techniques using deep learning are listed below:

Step 1: Analyze Information Sources. Each information source needs to be analyzed so that the primary information content can be recognized and used further.
Step 2: Measure of Relevance. The relevance of the information content, whether generic or specific to a certain issue, is measured using feature-based or semantic approaches.
Step 3: Synthesize Appropriate Output. The extracted data is structured in an understandable format and represented as accurately as feasible as the output of the model.
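The three steps above can be sketched as a small pipeline. This is a minimal illustration only, not a method from the surveyed literature: the function names, the hand-crafted features (mean intensity and intensity spread), and the relevance score (distance from the average frame) are all assumptions chosen for brevity. A deep model would replace the first two functions with learned features and learned importance scores.

```python
from math import sqrt

def analyze(frames):
    """Step 1: turn each frame (a flat list of pixel intensities) into a
    small feature vector: mean intensity and intensity spread."""
    features = []
    for pixels in frames:
        mean = sum(pixels) / len(pixels)
        var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
        features.append((mean, sqrt(var)))
    return features

def relevance(features):
    """Step 2: score each frame by how far its features lie from the
    average frame (an assumed, purely illustrative notion of relevance)."""
    n = len(features)
    centre = tuple(sum(f[i] for f in features) / n for i in range(2))
    return [sqrt(sum((f[i] - centre[i]) ** 2 for i in range(2)))
            for f in features]

def synthesize(scores, k):
    """Step 3: keep the k highest-scoring frames, returned in temporal order."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

def summarize(frames, k):
    """Run the three steps and return the indices of the selected frames."""
    return synthesize(relevance(analyze(frames)), k)
```

Calling `summarize(frames, k)` on a list of equally sized frames returns the indices of the `k` frames that deviate most from the average frame, which serves here as a stand-in for a static summary.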

The above literature reveals that deep learning techniques can be more beneficial in solving the video summarization problem. Therefore, this work analyzes and investigates recent developments in State-Of-The-Art (SOTA) deep-learning-based algorithms in the video summarization domain. The significant contributions of the work are as follows:

Various deep learning frameworks for video summarization are analyzed with their pros and cons, and category-wise video summarization techniques using deep learning approaches are presented.
A detailed exploration of the existing video summarization techniques is carried out. It covers the essential aspects, the video summarization process, feature-based video summaries, and genre-based summarization.
An application-based analysis of several video summarization techniques is also presented, along with their limitations and solutions compared with the other existing approaches.
The details of the existing datasets in the literature are provided, along with the challenges and directions for future research and video summarization applications.

The remainder of the article is organized as follows: Sect. 2 presents a detailed comparison of the existing video summarization techniques with their contributions and limitations. Section 3 elaborates on deep learning-based video summarization models and analyzes their properties across supervised, unsupervised, and weakly supervised techniques. Section 4 presents a detailed and comprehensive overview of several deep learning-based applications of video summarization. Section 5 provides the details of recent contributions to video summarization with the help of deep learning. Section 6 provides the details of the various datasets and their performance. Section 7 discusses the different evaluation methods for video summarization. Section 8 highlights video summarization challenges. Section 9 introduces the future directions, and the work is concluded in Sect. 10.

2 Video summarization techniques and their contributions

The video summarization classifications based on their characteristics and properties are shown in Fig. 1.

2.1 Feature based VS techniques

Li et al. (2006) discuss three distinct feature parts of films: exposition at the beginning, conflict in the middle, and resolution at the end. In feature-based techniques, the focus is mainly on video features such as motion, color, gesture, audio-visual cues, speech, and objects. Low-level features such as color and texture are most commonly used to extract information from video content because they are easy to compute, although they are not very accurate (Brezeale and Cook 2008; Ajmal et al. 2012).

2.2 Clustering based VS techniques

In Kumar et al. (2016), an equal-partition-based clustering technique is proposed to detect key-frames based on pixel intensity. Research (De Avila et al. 2011; Peker and Bashir 2007) has revealed that many clustering techniques, including k-means, partitioned, and spectral clustering, have been used for VS. Kumar et al. (2018) suggested an Eratosthenes sieve based key-frame extraction clustering technique. The summary length is determined by which content is included under the chosen criteria and by the evaluation technique used.
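To make the clustering idea concrete, the following is a minimal sketch of key-frame selection with plain k-means. It is not the equal-partition or Eratosthenes sieve technique cited above. The one-dimensional feature per frame (for example, mean pixel intensity), the evenly spaced initialisation, and the function name `kmeans_keyframes` are assumptions made for illustration.

```python
def kmeans_keyframes(features, k, iters=20):
    """Cluster 1-D frame features with k-means and return, for each
    non-empty cluster, the index of the frame closest to its centroid."""
    n = len(features)
    # Deterministic initialisation: take k frames evenly spaced in time.
    centroids = [features[(i * n) // k] for i in range(k)]
    assign = [0] * n
    for _ in range(iters):
        # Assignment step: nearest centroid for every frame.
        assign = [min(range(k), key=lambda c: abs(x - centroids[c]))
                  for x in features]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [x for x, a in zip(features, assign) if a == c]
            new_centroids.append(sum(members) / len(members)
                                 if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    keyframes = []
    for c in range(k):
        idx = [i for i, a in enumerate(assign) if a == c]
        if idx:
            keyframes.append(
                min(idx, key=lambda i: abs(features[i] - centroids[c])))
    return sorted(keyframes)
```

With the feature sequence `[10, 11, 12, 50, 51, 52, 90, 91, 92]` and `k = 3`, the three intensity groups form three clusters and the middle frame of each group is returned as its key-frame. The number of clusters directly fixes the summary length in this sketch.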

2.3 Shot selection based VS techniques

Generic video summaries (Money and Agius 2008; Basavarajaiah and Sharma 2021) are not personalized to a specific user’s command or interest but are produced by keyframe extraction, shot boundary detection, scene change methods, and redundancy reduction (Tiwari and Bhatnagar 2021). In VS, shots are also detected by measuring the transition between successive frames. VS (Hu et al. 2011) is also classified into static video abstracts, dynamic skims, and hierarchical summarization, where video skimming is achieved by removing redundancy, detecting objects or events, and multimodal integration. Function-based VS methods (Ma et al. 2002) use the attention mechanism to determine the important parts of the video, whereas the structure-based VS strategy exploits the hierarchical story structure in the form of frames and shots.
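Detecting shots by measuring the transition between successive frames can be illustrated with a simple histogram comparison. The sketch below is one common, generic way to do this and is not taken from any of the cited works. The bin count, the threshold of 0.5, and the function names are assumed values, and frames are assumed to be non-empty flat lists of 8-bit intensities.

```python
def histogram(pixels, bins=8, max_value=256):
    """Normalised intensity histogram of one frame (flat list of 0-255 ints)."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    total = len(pixels)
    return [c / total for c in counts]

def shot_boundaries(frames, threshold=0.5):
    """Return indices i where a new shot is assumed to start, i.e. where the
    histogram of frame i differs strongly from that of frame i - 1."""
    boundaries = []
    prev = histogram(frames[0])
    for i in range(1, len(frames)):
        cur = histogram(frames[i])
        # Half the L1 distance between normalised histograms lies in [0, 1].
        diff = 0.5 * sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            boundaries.append(i)
        prev = cur
    return boundaries
```

A fixed threshold catches abrupt cuts but tends to miss gradual transitions such as fades and dissolves, which is one reason learned features are attractive for this step.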

2.4 Event based VS techniques

Agius et al. (Dimitrova et al. 2003) present the different types of generated video summaries based on object, event, perception, and feature. High-level features such as events, specific faces, motion, and gestures are highly reliable indicators of important video content (Xu et al. 2016a; Wei et al. 2021; Shingrakhia and Patel 2022). In Kumar et al. (2018), events are reconstructed from the extracted key-frames by fixing the minimum and maximum frame numbers for the event boundaries. Video events are extracted using graph theory (Kumar 2019) and scale-free networks (Kumar and Shrimankar 2018a) in mono-view videos, and using the basic local alignment searching technique (Kumar 2021) and collections of weak ensembles (Kumar and Shrimankar 2018b) in multi-view videos. Some SOTA techniques have been proposed for creating video event summaries of soccer, cricket, tennis, and basketball games (Vasudevan and Sellappa Gounder 2021). Deep learning is based on artificial neural networks, in which the word “deep” reflects the use of multiple hidden layers to extract high-level features and to learn from vast amounts of data.

2.5 Trajectory based VS techniques

Most researchers initially worked on static VS. A dynamic video summary is generated through a trajectory with stationary backgrounds, which requires a lot of computing resources. Deep learning may be the best solution for detecting the important content in a video.

3 Deep learning based video summarization

Deep learning (DL) is a dominant branch of machine learning that has been extended with different network structures (Chai et al. 2021). It has been successfully used in various domains, including cybersecurity, natural language processing, bioinformatics, robotics and control, medical information processing, and many more (Alzubaidi et al. 2021). DL has also achieved superior results in video processing, in which VS plays a critical role. DL methods for VS can be supervised, weakly supervised, unsupervised, or based on reinforcement learning (RL), as shown in Fig. 2.

3.1 Supervised learning based VS

Supervised techniques learn from labeled data to predict future outcomes. However, the biggest challenge in supervised learning is labeling the data. Creating well-defined datasets is costly because it requires domain knowledge, and supervised models do not work well with the wide variety of content on the internet. Supervised models are categorized as classification and regression models. Classification models predict categories, for example an output of “pass” or “fail”, whereas regression models are used where the output is a real value such as sales revenue or weight. Linear classifiers, K-Nearest Neighbors (K-NN), support vector machines, decision trees, and random forests are all standard classification algorithms. Linear and polynomial regression are common regression algorithms; logistic regression, despite its name, is typically used for classification. Table 1 shows a comparison of supervised learning based DL techniques for VS. Some of the DL techniques are elaborated as follows:

Table 1 Comparison of supervised learning based DL techniques for VS
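The distinction between classification and regression drawn in this section can be illustrated with K-NN, which supports both. The sketch below is a toy example under stated assumptions: each frame is described by a single hypothetical feature, the labels “key” and “skip” and the importance values are invented for illustration, and the function names are not from any library.

```python
from collections import Counter

def nearest(train_x, query, k):
    """Indices of the k training points closest to the query (1-D features)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return order[:k]

def knn_classify(train_x, train_labels, query, k=3):
    """Classification: predict a category by majority vote of the neighbours."""
    votes = Counter(train_labels[i] for i in nearest(train_x, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train_x, train_values, query, k=3):
    """Regression: predict a real value as the mean of the neighbours' values."""
    idx = nearest(train_x, query, k)
    return sum(train_values[i] for i in idx) / len(idx)
```

In a supervised VS setting, the classifier would decide whether a frame belongs in the summary, while the regressor would predict a frame-level importance score that is later thresholded or ranked.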
