1 Introduction

Video technologies currently face several challenges, mainly related to the real-time extraction of information from large volumes of video data. The extracted information can be used to identify and detect events of interest, such as abnormal events and people's behavior, and to predict events that usually occur in the observed scenes. Recently, a number of researchers have focused on finding effective techniques to summarize the useful information contained in videos. This research field is essential for improving video surveillance systems, which require large storage space and complex data analysis, given that data is captured 24 hours a day, 7 days a week. Summarization of video data is therefore required in such systems to simplify data analysis, facilitate storage, and improve access to any time instant of the video. The summarization process is also related to the type of scene (private or public), since the analysis depends on whether the scene is dynamic or static and whether it is crowded or uncrowded. Because summarization should consume as little processing time and storage space as possible, it may require a pre-processing step, applied before feature extraction, that enhances the video without losing any information [1,2,3,4,5,6,7].

Video summarization methods are generally classified into two main categories: scene-based (static or dynamic) and content-based. Given a video containing changing scenes, static methods select keyframes, while dynamic methods select short video clips. Since the scenes can change and the cameras can move, the summarization in this case is carried out by determining the video sequences (shots) that represent the same scene [8,9,10,11,12,13,14,15]. Keyframes are then selected from these shots using extracted features and appropriate clustering methods. The selection can produce redundant frames, called meaningless frames, which must be removed. Content-based methods, on the other hand, summarize the video using its semantics and content. Several types of summarization fall into this category, including motion-based, event-based, and action-based methods. Figure 1 illustrates this classification of video summarization, including the subcategories of each method. Content-based summarization relies on the video content and requires pre-processing; for example, motion-based methods use the results of motion detection or the trajectories of objects to summarize the video. The summary produced by this family of techniques can again be static (keyframes) or dynamic (short clips) [15,16,17,18,19,20,21,22].

The main challenge in video summarization lies in understanding the video and classifying its important sequences, where importance depends on the types of actions or objects to be summarized. The complexity and variety of scenes in a video make the design of generic methods impossible [40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56].

Fig. 1 Video-summarization-based methods

This paper proposes an action-based video summarization approach that recognizes the actions performed by each person in the scene and then summarizes these actions at the end of the video. It is thus a combined action recognition and summarization approach, in which human body actions are first detected using a proposed background-subtraction-based method and then recognized. Unlike other methods that capture the action of only one person in the scene, the proposed method allows many human body actions to be detected and recognized. For that purpose, two methods are proposed: the first uses the cosine similarity between the HOGs of the Temporal Difference Map (TDMap), and the second uses a CNN to classify actions from TDMap images. The rest of the paper is organized as follows. Section 2 provides an overview of research studies conducted in the area of video summarization. Section 3 details our proposed approach, including action detection and recognition. The results of the implemented work are discussed and analyzed in Section 4. Finally, conclusions and future work are discussed in Section 5.

2 Related works

The growth of video technologies has led to the creation of efficient tools to manipulate this type of data. Summarization aims to generate a short version of a video as its representation, using keyframes of important subsequences. Such a summary provides a rapid view of the information contained in a long video, gives users a good overview, and conveys the topic and the most important content of the video. Considering the information contained in each video, many methods have been developed using several techniques. Each technique summarizes the video using a specific feature, such as trajectories, moving objects, abnormal event detection, and many others. These techniques can be grouped into two general categories: scene-based (static [1,2,3,4,5,6,7, 9, 17, 21, 22] or dynamic [8, 12, 13, 16, 18,19,20]) and content-based approaches. Content-based approaches can be further decomposed into three types related to the content of the video: motion-based [10,11,12,13,14,15,16, 20], action-based [21, 22] and event-based [11, 15, 17,18,19], as shown in Fig. 1.

A video summary is a short version of a longer video sequence. A static video summary is a collection of frames (keyframes) selected from the original video, and the proposed approaches extract these keyframes using many features [1, 2]. In general, a video contains many parts, called shots, which represent different sequences; each sequence represents a scene captured by a fixed or moving camera. The general idea of these methods is to classify the shots using clustering techniques [2], after which keyframes are extracted from them and similar, meaningless frames are removed [3]. In the same context, [4] proposed a keyframe extraction method using the Jensen-Rényi divergence (JRD), the Jensen-Shannon divergence (JSD), and the Jensen-Tsallis divergence (JTD) to measure the difference between neighboring video frames, segmenting a video clip into shots and possibly into sub-shots, and choosing keyframes in each shot; the method is computationally inexpensive yet effective. In [5], the authors used sparse dictionary selection to extract keyframes directly, developed an online version to summarize the video in real time, and provided a guide for users to obtain a summary of appropriate length. Video summarization is thus a reduced representation for fast video retrieval. In another work [6], a temporal- and spatial-driven approach was proposed, in which Optimum-Path Forest (OPF) clustering automatically determines the number of keyframes and extracts them to compose the final summary. To generate a video summary, [7] used a graph-based hierarchical clustering method. Called HSUMM, this approach adopts hierarchical clustering to generate a weight map from the frame similarity graph, from which the clusters can easily be inferred. In the same context, and to generate an efficient summary, the authors in [8] proposed a divide-and-conquer-based framework in which the original video is divided into shots and an attention model, based on multiple sensory perceptions, is computed from each shot in parallel. Surveillance cameras produce a large amount of data, and intelligent systems can extract several types of information from these videos.
A monitoring system can analyze videos and extract information about the content of the covered areas, including information about objects (motion, action, and trajectory) and the events that happen in the scene. This information can help any system understand the content of the video [9]. Depending on the purpose of the system and the tasks to be handled, the system only needs to learn and extract the required information; thus, video summarization is a good solution for abstracting the content of any video [10, 11]. In the following, we describe each category of content-based summarization methods, including motion-based (object-based and trajectory-based), action-based, and abnormal-event-based methods. The detection of moving objects provides a good understanding of the content of each scene covered by the cameras of a video surveillance system, and motion information also represents an effective feature for video summarization. In some methods, the motion of objects is used to summarize the video content: based on the extraction of moving objects in video sequences, [12] combined adaptive fast-forwarding and content truncation to summarize the content of videos.

In another work [10], the authors used background subtraction, clustering techniques, and a noise-removal algorithm to summarize the content of videos. In [13], the surveillance video was converted into a temporal-domain image (temporal profile), a technique that makes it easy for human operators to search within a long video. Most video summarization methods use a single view captured by a single camera, but some researchers exploit the multiple views of a scene covered by several cameras. For example, Panda et al. [14] exploited multi-view videos of a scene for video summarization, using sparse representative selection to choose shots. Trajectories can also be a good basis for recognizing and summarizing the activity of objects during their presence in the scene, and a good number of methods use the object trajectories obtained by a tracking operation [15]. In [15], the authors used trajectories for abnormal event detection, which in turn was exploited to generate video summaries. Similarly, a framework has been developed for multiple-scene understanding and scene activity summarization [16]: the authors proposed a motion-flow representation of shared areas of interest in scenes covered by multiple cameras to understand the activity and behaviors in each scene. Object trajectories can thus be an efficient solution for many video surveillance situations and a good technique for activity understanding and video summarization tasks.

Video surveillance systems play an important role in ensuring people's safety, and the detection of abnormal events and unusual activities can be useful for these systems. Summarizing these events and activities provides good support for a system that must learn and understand the content of the covered area. Consequently, many video summarization methods based on activity and event detection have been proposed [17]. An overview of video summarization methods based on abnormal event detection can be found in [18]. The main steps of these methods are the detection of unusual events followed by their summarization. For example, in [19], the authors proposed a visual surveillance briefing system (VSB) that retrieves abnormal events using object appearances and motion patterns and adopts a video summarization algorithm. Some authors have proposed patch-based methods to model the key regions of the scene and learn its normal activity patterns [20]; unusual activities are then detected based on these learned features, and finally all abnormal activities are summarized to create a short summary of a long video. In the same context, [15] proposed a novel approach for large-scale surveillance video summarization based on event detection: the trajectories of vehicles and pedestrians are detected, abnormal events are identified from these features, and the video summarization step exploits the event detection results to summarize the short periods that contain unusual activities in the scene.

Human action recognition is an important task for many applications, including video surveillance systems, video indexing and retrieval, sports applications, and multimedia. Action detection, recognition, and summarization can be exploited to support many other tasks. For example, in sports applications, recognizing and understanding player poses allows the judges to make good decisions in the case of player fouls, especially in football games, which require precise decisions in many situations. Several methodologies have been proposed for the summarization of actions. The authors of [21] proposed a sports pose summarization method for self-recorded RGB-D videos, choosing games for their tests because they contain a succession of complex actions. The extended version of this approach uses deep neural networks to extract two types of action-related features and classify video segments into interesting or uninteresting parts [22]. The authors proposed a method to recognize actions, which leads to a good selection of meaningful, informative summaries; in addition, there is a reciprocal task that recognizes the actions of the generated video summary. For that purpose, the authors used the latent structural SVM framework combined with an algorithm for inferring the action.

The video summarization methods in [1,2,3,4,5,6,7,8,9] select keyframes or short clips without analyzing the content of the video and can therefore lose information if the purpose of the summarization is not specified. Moreover, these methods summarize videos whose scenes change over time, such as movies. For that reason, summarization based only on scene variation remains insufficient when the videos contain important information.

For content-based approaches, video summarization is performed for specific goals, such as summarizing abnormal events that may happen in a monitored scene [11, 15, 17, 18], but finding a generic method that can handle all event categories remains a difficult task. Regarding action-based summarization, for which only a few research papers exist [21, 22], most methods are limited to the summarization of sports actions; moreover, the summarization of multiple simultaneous actions is not addressed.

2.1 Action recognition

In the literature, many methods have been proposed to classify human actions [3, 23,24,25,26,27,28,29,30,31, 33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61]. These methods can be split into three categories: motion-based, appearance-based, and space-time-based methods. Motion-based methods compute parametric and generic optical flows and compare the results with motion templates. Appearance-based methods extract the motion history of images and compare it with active shape models. Space-time approaches use space-time features, together with the training results, in the space-time domain. In the same context, the authors of [23, 24] used the concept of Compact Descriptors for Visual Search (CDVS); the local features and global data structure provided by CDVS are useful for real-time feature extraction, especially when computing optimizations are used. Dasari et al. [25] classified human actions by tracking CDVS feature trajectories of the human body. The authors of [26] start by selecting regions (patches) of the video that can be described as actions; they then generate boxes containing the detected motions and assign a discrimination score to each box. For action recognition, a clustering technique is applied to each box to identify the different actions.

El-Henawy et al. [27] proposed a technique for human action recognition using fast HOG3D and Smith-Waterman partial shape matching of each frame. First, the foreground of video subsequences is extracted from the input stream. Then, the keyframes of the current subsequence are blended before the contour of the resulting frame is extracted. To classify the HOG3D features, the authors use a non-linear SVM decision tree.

Using human motion for action recognition, Xu et al. [28] exploited wearable sensors to extract human motion based on natural physical properties; the extracted features are then used to classify the related actions. Human action recognition from video surveillance data can be viewed from different angles. In particular, 2D analysis of action recognition for human-computer interaction requires good exposure of the human body during video acquisition. Zhang et al. [29] proposed a new algorithm that starts with a pre-training phase based on synthetic data to extract view invariance between 3D and 2D videos; to encode the trajectories extracted from 3D videos, they introduced a new feature called 3D dense trajectories.

In video surveillance systems, the human body is only partially visible in some cases, which represents a challenge for human action recognition. The proposed methods suffer from limitations, especially in the case of occlusions and highly crowded scenes, where it is rather hard to detect and recognize multiple human bodies. One solution to overcome these limitations is to apply pre-processing and a learning process that can cope with the various occlusion and crowd scenarios. Furthermore, it is worth mentioning that, to the best of our knowledge, there are no publicly available datasets for person detection in highly crowded scenes that could be used for such a learning process.

There are other alternatives for handling this problem, such as the use of a Kinect camera [57], which can be helpful for recognizing an action as well as summarizing it. However, occlusion and action recognition in crowded scenes still represent a challenge [58, 62,63,64].

3 Proposed method

In this paper, we propose a new approach that combines multiple human action recognition and summarization. Our method starts by detecting human bodies using a proposed background-subtraction-based approach. Each detected person is then tracked separately to generate a corresponding video sequence covering his or her presence in the scene. The training part is designed to represent every category of human action by a set of Histograms of Oriented Gradients (HOG) of the Temporal Difference Map (TDMap), which captures the motion history of the target. For the action recognition step, we extract from the scene a sequence for each moving person. Shots representing homogeneous parts of the sequence are then selected using a histogram of similarity between frames: the peaks of this histogram mark the transitions in the sequence, and a shot is defined as the subsequence between each pair of peaks. Next, each shot is used to identify the corresponding action based on a training set consisting of the HOGs of each action. The action classification is performed using two methods: (1) a cosine similarity measure comparing the HOG of the current action with those of the training set, and (2) a convolutional neural network (CNN) model that classifies actions using TDMap images as input.

After the recognition stage, the summarization of actions in the scene is performed by representing the timeline of the actions of each person, built from the shots (each shot representing one action). The flowchart of the proposed method for human action detection is depicted in Fig. 2, and the recognition and summarization steps based on the detected human body sequences are detailed in Fig. 3.

Fig. 2 Flowchart of the proposed method

Fig. 3 Summarization steps based on recognized actions

3.1 Pre-processing

Existing methods for action recognition are designed to detect human actions in a scene involving a single person. In this paper, the detection and recognition of multiple human bodies is proposed. The idea is to extract the silhouette of each person during his or her presence in the scene. We first apply motion detection, using a background-subtraction-based method, to identify all the persons present in the scene and to generate detection masks of the detected objects.

For the background subtraction method, we start by initializing the background model using the first N frames of the video, based on the decomposition of each frame into blocks of 16×16 pixels. Once the background model has been generated, the persons are detected by applying background subtraction and object segmentation to each frame of the video.

Based on the detected masks, each person is tracked through a bounding box. In this work, the Kalman-based tracker [44] is used; it applies the Kalman filter to predict the centroid of each track in the current frame and updates its bounding box accordingly. While most methods track only one object at a time, the extended version of [44] is tailored to track multiple moving objects. During their presence in the scene, the silhouette of each detected person is extracted to form a new sub-sequence for that silhouette.
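As an illustration of this tracking step, the sketch below shows a constant-velocity Kalman filter predicting the centroid of one tracked person with OpenCV. It is a minimal example of the idea, not the exact tracker of [44]; the noise covariances and the association strategy are assumptions.

```python
import numpy as np
import cv2

def make_centroid_kf(cx, cy):
    """Constant-velocity Kalman filter for one track; state [cx, cy, vx, vy]."""
    kf = cv2.KalmanFilter(4, 2)                      # 4 state vars, 2 measured (centroid)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = 1e-2 * np.eye(4, dtype=np.float32)      # assumed value
    kf.measurementNoiseCov = 1e-1 * np.eye(2, dtype=np.float32)  # assumed value
    kf.statePost = np.array([[cx], [cy], [0], [0]], np.float32)
    return kf

def track_step(kf, measured_centroid=None):
    """Predict the centroid for the current frame; correct it if a detection was matched."""
    predicted = kf.predict()[:2].ravel()
    if measured_centroid is not None:
        kf.correct(np.array(measured_centroid, np.float32).reshape(2, 1))
    return predicted
```

In practice, one such filter is created per detected person, and each new detection mask is associated with the nearest predicted centroid before the corresponding filter is corrected.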

Figure 4 illustrates the proposed pipeline, from motion detection and tracking up to the extraction of the oriented gradient images.

Fig. 4 Pre-processing steps for extracting each human silhouette

3.1.1 Motion detection

Before starting the recognition process, and to ensure proper detection of the silhouette of each human body moving in the scene, a background-subtraction-based method is proposed. Background subtraction is the most widely used technique for motion detection. Its main operation is background modeling, which consists of extracting the pixels and regions that remain unchanged throughout the video. To this end, we propose a method that models the background by computing the similarity between blocks over a short period of the video, represented here by its first 100 frames.

The modeling starts by dividing each frame into w × w blocks and then computing the Sum of Similarity (SS) between consecutive blocks over T frames of the video. The SS values are computed using the following expression:

$$ \begin{array}{@{}rcl@{}} SS_{b(i,j)}=\sum\limits_{t=1}^{T-1}\operatorname{cosine}\left(I_{t}^{(i,j)},I_{t+1}^{(i,j)}\right) \end{array} $$
(1)

where b(i,j) denotes the background block at coordinates (i,j). The cosine similarity, defined in [30], measures the similarity between two vectors; for non-negative values the result lies in the interval [0,1]. The cosine similarity between two vectors a and b is computed by the following expression:

$$ \begin{array}{@{}rcl@{}} \operatorname{cosine}(a,b)=\frac{{\displaystyle\sum\limits_{i=1}^{n}}a_{i}b_{i}} {\sqrt{{\displaystyle\sum\limits_{i=1}^{n}}a_{i}^{2}}\,\sqrt{{\displaystyle\sum\limits_{i=1}^{n}}b_{i}^{2}}} \end{array} $$
(2)

Where a and b here represent blocks of two consecutive frames.
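For reference, Eq. (2) can be implemented directly; the sketch below is a straightforward NumPy version (the small eps guard is our addition to avoid division by zero).

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-12):
    """Eq. (2): cosine similarity of two blocks or histograms flattened into vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```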

The background model is generated from the SS value of each block by collecting, for each block position (i,j), the block with the maximum sum of similarity. Blocks belonging to regions that do not change much during the 100 frames have the largest values, because the cosine similarity equals 1 when two blocks are identical. The background model is therefore defined from the SS values by the following expression:

$$ \begin{array}{@{}rcl@{}} B^{(i,j)}=Argmax\{SS_{(i,j)}\} \end{array} $$
(3)
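The block-wise modeling of Eqs. (1) and (3) can be sketched as follows. This reflects our reading of the procedure (grayscale frames, 16×16 blocks, and, at each block position, keeping the block of the frame that is most similar to its successor); it is an illustration, not a verbatim implementation.

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    a, b = a.ravel().astype(np.float64), b.ravel().astype(np.float64)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def build_background(frames, w=16):
    """frames: (T, H, W) grayscale array; returns an estimated background image."""
    T, H, W = frames.shape
    background = np.zeros((H, W), dtype=frames.dtype)
    for i in range(0, H - H % w, w):
        for j in range(0, W - W % w, w):
            blocks = frames[:, i:i + w, j:j + w]
            # Eq. (1): similarity of the block at time t with the block at time t+1
            sims = np.array([cosine(blocks[t], blocks[t + 1]) for t in range(T - 1)])
            # Eq. (3): keep the block from the most stable frame at this position
            background[i:i + w, j:j + w] = blocks[int(np.argmax(sims))]
    return background
```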

After the background model has been generated, background subtraction is performed by computing the absolute difference between the background and each current frame of the video. Based on the subtraction results, a segmentation operation then classifies the pixels into those belonging to the background and those belonging to the foreground, i.e., to the moving objects. This operation relies on a threshold; most methods test a set of thresholds and choose the one that gives the best results. In this paper, we propose a segmentation method that selects this threshold adaptively using an exponential function of the absolute difference between the current frame and the background frame:

$$ \begin{array}{@{}rcl@{}} T(i,j)=1-e^{-\left|I(i,j)-B(i,j)\right|} \end{array} $$
(4)

where the values of T lie in the range [0,1], I denotes the current frame, and B denotes the background image.

The threshold value converges to 0 when the background subtraction result goes to 0, and the threshold values tend to 1 when the background subtraction value is significant.

The moving objects at each time instant of the video, represented by binary images, are computed using the selected threshold. The binary frame at time t is computed using the following expression:

$$ \begin{array}{@{}rcl@{}} D(i,j)=\left\{\begin{array}{lc}255&if\ T(i,j)\simeq1\\ 0&otherwise \end{array}\right. \end{array} $$
(5)
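Eqs. (4) and (5) translate into a few per-pixel operations, sketched below; the cut-off tau standing in for "T(i,j) ≃ 1" is an assumption of this illustration.

```python
import numpy as np

def foreground_mask(current, background, tau=0.5):
    """Eqs. (4)-(5): adaptive threshold from |I - B|, then binarization."""
    diff = np.abs(current.astype(np.float64) - background.astype(np.float64))
    T = 1.0 - np.exp(-diff)                              # Eq. (4): values in [0, 1)
    return np.where(T >= tau, 255, 0).astype(np.uint8)   # Eq. (5): tau approximates T ~ 1
```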

3.2 Data preparation

The training phase aims to select a number of features from the training videos. First, pre-processing is applied to the training videos to extract the silhouettes of the human bodies. For feature selection, we apply the Histogram of Oriented Gradients (HOG) and compute the Temporal Difference Map (TDMap) between each pair of consecutive frames of every subsequence. The TDMap efficiently captures the motion history of an object's movement in all video regions.

Existing methods use the entire video region to extract the MHI of a person's action, which is not feasible when the scene contains many acting people. We therefore use only the region where there is motion: our data representation corresponds to the sequence of human body regions of each person during his or her motion in the video. The new data is thus an extraction of the regions containing the human body rather than the whole scene. Accordingly, we cannot extract the trajectories of the body or the MHI, because we do not use the entire video region. An example of the generated data and the corresponding TDMap is shown in Fig. 5. The motion history of a human action can have a different structure from one activity to another, so we use HOG to extract information related to the structure of each action from its oriented gradient representation. The Histogram of Oriented Gradients is then computed and the corresponding set of histograms is collected for each action; the same process is repeated for all actions.
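One plausible way to compute the TDMap and its HOG descriptor for a silhouette sub-sequence is sketched below; the accumulation of absolute temporal differences and the HOG parameters are assumptions of this illustration (using scikit-image).

```python
import numpy as np
from skimage.feature import hog

def tdmap(silhouette_frames):
    """Accumulate absolute temporal differences over a (N, H, W) silhouette sub-sequence."""
    frames = np.asarray(silhouette_frames, dtype=np.float64)
    return np.abs(np.diff(frames, axis=0)).sum(axis=0)

def tdmap_hog(silhouette_frames, orientations=9, cell=(8, 8), block=(2, 2)):
    """HOG descriptor of the normalized TDMap (parameter values are assumptions)."""
    m = tdmap(silhouette_frames)
    m = m / (m.max() + 1e-12)                 # scale to [0, 1] before HOG
    return hog(m, orientations=orientations,
               pixels_per_cell=cell, cells_per_block=block)
```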

Fig. 5 Training process with the generation of HOGs of each action. a Histogram of oriented gradients of one training video of an action. b HOGs of each action video

3.3 Shot detection

After extracting the human silhouettes from the original video, we obtain a sequence of silhouettes for each person in the scene. In such a sequence, a person can perform several actions: for example, the person can be walking before starting to run or wave a hand. Transitions between actions change the appearance of the sequence. To detect these changes, we trace the histogram of similarity values between each pair of consecutive frames in the sequence, using the cosine similarity measure defined in [30] and in (2).

The peaks in this histogram mark the change from one action to another. Thus, using a smoothing operation and local-maxima selection through a threshold, we extract the subsequences between these peaks, which represent the shots. The threshold value is set to 0.15 after an extensive experimental evaluation. Each shot of the sequence is then used to recognize the action it contains.
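The shot detection step can be sketched as follows. The smoothing window and the use of peak prominence are assumptions of this illustration, and transitions are treated as dips in the similarity curve; the 0.15 value mirrors the threshold reported above.

```python
import numpy as np
from scipy.signal import find_peaks

def detect_shots(similarities, threshold=0.15, window=5):
    """Split a silhouette sequence into shots from frame-to-frame similarities."""
    s = np.asarray(similarities, dtype=np.float64)
    s_smooth = np.convolve(s, np.ones(window) / window, mode="same")  # smoothing
    # appearance changes show up as dips in similarity: locate them on the negated curve
    peaks, _ = find_peaks(-s_smooth, prominence=threshold)
    bounds = [0] + peaks.tolist() + [len(s_smooth)]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]
```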

3.4 HOG-based recognition

For the training data, we used the Weizmann, KTH, and UCF-ARG datasets, since they cover several human actions, namely walking, running, hand waving, jumping jack, jumping, bending and side movement. For human interactions, we used videos from the UT-Interaction and INRIA XMAS (IXMAS) datasets for training. Figure 5 represents the flowchart of our training analysis: Fig. 5a shows the computation of the HOGs of one subsequence, so that for each action we compute a set of HOGs representing it, and Fig. 5b illustrates the HOGs of four categories of videos, each video representing one action. After detecting the silhouette of each human body and extracting the shots of each detected person, the histogram of oriented gradients of the temporal difference map between each pair of consecutive images is computed. The HOG of each shot is then compared with all the HOGs formed in the training phase. The comparison is made by computing the distance between histograms, and the smallest distance determines the action of each shot. The recognition can be formulated as follows:

$$ \begin{array}{@{}rcl@{}} \mathit{Action}\_\mathit{index} = \mathit{Argmin}\{\mathit{dist}(\mathit{HOGs}_{\mathit{training}},\mathit{HOG}_{\mathit{current}})\} \end{array} $$
(6)

The distance between histograms used in this paper is the same as in (2) and is defined in [30]. Here, a and b are two vectors representing histograms, and a_i and b_i are their bin values.

The recognition of an action is thus made by computing the distance between the HOG of the current action and all the HOGs of the training phase; the minimum distance indicates the actual action.
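Eq. (6) amounts to a nearest-neighbour search over the training HOGs. A minimal sketch is given below; the cosine-based distance and the dictionary mapping each action name to its training HOGs are our own conventions.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-12):
    a, b = np.asarray(a, float).ravel(), np.asarray(b, float).ravel()
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

def recognize_action(current_hog, training_hogs):
    """Eq. (6): return the action whose closest training HOG has the smallest distance."""
    best_action, best_dist = None, np.inf
    for action, hog_list in training_hogs.items():
        d = min(cosine_distance(current_hog, h) for h in hog_list)
        if d < best_dist:
            best_action, best_dist = action, d
    return best_action, best_dist
```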

3.5 CNN-based recognition

Depending on the application, selecting the optimal CNN architecture is challenging. The proposed deep-learning-based approach preprocesses the action videos before feeding them to the convolutional neural network. The preprocessing consists of extracting the target regions that contain the human bodies in action, followed by the extraction of the TDMap and the resizing of the data before creating the NumPy arrays. A Convolutional Neural Network (CNN), a supervised multistage deep learning network, is implemented. A CNN can learn multiple stages of invariant features from input images; convolution and pooling are its main layers, and any complex CNN can be constructed from convolution-pooling combinations.

The architecture of our model, illustrated in Fig. 6, is composed of two convolution-pooling units with six convolutional layers and four MaxPooling layers, one flatten layer, and two fully connected layers. The output layer comprises ten neurons, corresponding to the number of actions. We describe the convolutional neural network with the following notation: I(x,y) is an input image of size x × y with temporal depth d; Conv(x,y,f) is a convolutional layer and Mpool(x,y,k) a pooling layer, where x and y are the image dimensions, f is the number of channels, and k the number of kernels; PReLU denotes the Parametric Rectified Linear Unit; FC(n) is a fully connected layer with n neurons; and D(r) is a dropout layer with dropout ratio r. Using this notation, the proposed CNN model can be described as follows:

Fig. 6 CNN model trained on TDMap images

I(120,120,1), conv(119,119,32), conv(118,118,32), Mpool(59,59,32), conv(58,58,64), conv(57,57,64), Mpool(28,28,64), conv(27,27,128), Mpool(13,13,128), conv(12,12,128), Mpool(6,6,128), flatten(2304), FC(128), D(0.65), FC(number of actions).

The input of our system is a TDMap image with a resolution of 120×120 pixels. For training and testing we used the preprocessed data from the KTH, Weizmann, and UCF-ARG datasets. The model is trained with a cross-entropy loss, a batch size of 128 examples, and the Adam optimizer with a learning rate of 1e-3.
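A Keras sketch of this architecture is given below. The 2×2 kernels and 2×2 pooling are inferred from the spatial sizes listed above (120→119→118→59, ...), while the padding, weight initialization, and the use of a sparse cross-entropy loss are assumptions of this illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_tdmap_cnn(num_actions=10):
    """Sketch of the CNN of Fig. 6: six conv and four max-pool layers, PReLU activations."""
    def conv(x, filters):
        x = layers.Conv2D(filters, (2, 2))(x)      # 2x2 valid kernels give 120->119->118...
        return layers.PReLU(shared_axes=[1, 2])(x)

    inp = layers.Input(shape=(120, 120, 1))        # one TDMap image
    x = conv(inp, 32); x = conv(x, 32)
    x = layers.MaxPooling2D((2, 2))(x)
    x = conv(x, 64); x = conv(x, 64)
    x = layers.MaxPooling2D((2, 2))(x)
    x = conv(x, 128)
    x = layers.MaxPooling2D((2, 2))(x)
    x = conv(x, 128)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128)(x)
    x = layers.PReLU()(x)
    x = layers.Dropout(0.65)(x)
    out = layers.Dense(num_actions, activation="softmax")(x)

    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy",   # integer action labels assumed
                  metrics=["accuracy"])
    return model
```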

As activation function we use the Parametric Rectified Linear Unit (PReLU), a generalized parametric formulation of ReLU. With this activation function, the parameters of the rectifiers are learned adaptively, which improves accuracy at a negligible extra computational cost [53]. ReLU passes only positive values and sets all negative values to zero, whereas PReLU assumes that negative values should incur a penalty that is itself parametric. The PReLU function is defined as:

$$ \begin{array}{@{}rcl@{}} f(y_{i})=\left\{\begin{array}{ll}y_{i} & \text{if } y_{i}>0\\ a_{i}y_{i} & \text{if } y_{i}\leq 0 \end{array}\right. \end{array} $$
(7)

where ai controls the slope of the negative part. When ai = 0, the function operates as a ReLU; when ai is a learnable parameter, it is referred to as a Parametric ReLU (PReLU); and if ai is a small fixed value (ai = 0.01), PReLU becomes LReLU. As shown in [53], PReLU can be trained using backpropagation.
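Numerically, Eq. (7) reduces to an element-wise selection; the tiny sketch below illustrates it (the default slope value is only an example).

```python
import numpy as np

def prelu(y, a=0.25):
    """Eq. (7): identity for positive inputs, slope a for non-positive inputs."""
    y = np.asarray(y, dtype=np.float64)
    return np.where(y > 0, y, a * y)
```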

3.6 Summarization of actions

Once the detection and recognition of actions is completed, the summarization is carried out by recording each action performed by each person. Shot detection splits the frames of each action out of the succession of actions performed by a person, so the summarized actions of each person are represented by a timeline built from the shots. The summary can also be represented by selecting one frame for each action; here, we use both representations.

Before choosing a frame from a shot, we perform the shot detection operation, which groups the similar frames of each shot so that every frame in a shot represents the same action. A frame is then selected at random, since all the frames of a shot represent the same action.

In this work, a video summary is defined as a concatenation of action labels and keyframes of the video where the actors perform a specific action.
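The resulting summary structure can be sketched as a list of labelled shots with one keyframe each; the field names below are our own and only illustrate the idea.

```python
import random

def summarize_person(shots, labels, frames):
    """shots: list of (start, end) frame indices; labels: recognized action per shot;
    frames: the person's silhouette sub-sequence. Assumes end > start for every shot."""
    timeline = []
    for (start, end), action in zip(shots, labels):
        keyframe = frames[random.randrange(start, end)]  # any frame of a shot is representative
        timeline.append({"action": action, "frames": (start, end), "keyframe": keyframe})
    return timeline
```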

3.7 Illumination change detection and video enhancement

Low lighting, uneven illumination, or any change of illumination in the observed scene are among the sources of degradation that strongly affect video quality and, consequently, the scene analysis and understanding process, particularly object detection and visual tracking performance [65, 66]. It is therefore useful to detect illumination changes and apply the appropriate pre-processing before performing high-level vision tasks such as moving object detection and tracking. In this work, we propose an illumination change detection technique, after which the video quality is enhanced using a Retinex-based perceptual Contrast Enhancement method using Luminance Adaptation (RCELA) [31]. This method is adapted to our problem to make it appropriate for real-time processing.

Illumination change detection is an active research topic in computer vision [67, 68]. In this work, we use a simple and efficient method for illumination change detection based on the entropy associated with the gray-level histogram of the pixels. Indeed, any change in luminance significantly affects the gray-level histogram of an image, and this variation is even more pronounced in the entropy associated with the distribution of pixel gray levels.

The entropy is defined as follows:

$$ \begin{array}{@{}rcl@{}} E_{t}=-\sum\limits_{k}P_{k}\log P_{k} \end{array} $$
(8)

where Pk is the probability of the gray level k (0 ≤ k ≤ K) in the input frame Ft.

Based on the characteristic described above, and to detect illumination changes at any time during the video, we use the entropy of each frame. The current image of the video is enhanced when the absolute difference \(\left|E_{t}-E_{t-1}\right|\) between the entropy of the current image Et and the entropy of the previous background model Et−1 is greater than a threshold T, which is set to 0.7 after experimentation.
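The detection rule can be sketched as follows; the logarithm base and the 256-bin histogram are assumptions of this illustration, while the 0.7 threshold is the value reported above.

```python
import numpy as np

def gray_entropy(frame, bins=256):
    """Eq. (8): entropy of the gray-level distribution of a frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256), density=True)
    p = hist[hist > 0]
    return float(-np.sum(p * np.log2(p)))      # log base is an assumption

def illumination_changed(frame_t, frame_prev, threshold=0.7):
    """Flag an illumination change when the entropy jump exceeds the threshold."""
    return abs(gray_entropy(frame_t) - gray_entropy(frame_prev)) > threshold
```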

The proposed illumination change detection is used to detect the occurrence of illumination changes at any time in the video, for example when the light is switched off during a period of the video. If the video is badly illuminated from the beginning, or an illumination degradation is detected, we can apply the enhancement before starting the recognition process.

4 Results and discussion

In this section, we evaluate the performance of the proposed method. The proposed background modeling technique is evaluated on the SBI dataset and compared with two background-subtraction-based approaches. The multiple human action dataset that we built is also presented and described.

To evaluate the performance of the multiple action recognition approach, the Weizmann and UCF-ARG human action recognition datasets are used to train the proposed approach, while the proposed dataset as well as the PET09 dataset are used for testing. The performance of the proposed action recognition methods is also presented. The recognition of detected actions is performed using two methods: (1) action classification, named CS-HOG-TDMap, using the cosine similarity between the HOGs of the TDMaps of the training set and of the current action to recognize; and (2) action classification, named CNN-TDMap, using the proposed CNN model trained on TDMap images.

The summarization of actions is tested on the generated videos, given the lack of videos containing many persons performing many actions in the same scene. The proposed method is compared with state-of-the-art methods using the same datasets.

4.1 Experimental setup

In the pre-processing phase, which is a main contribution of our approach, we used a simple motion detection method based on the background subtraction technique. Based on the object detection results, moving objects are tracked for further analysis of events; in other words, object tracking aims to trace a moving object (i.e., a person) so that its actions can be recognized. Human activity recognition methods can barely recognize actions when more than one person is present in a scene, and our proposed method aims to overcome this obstacle. Herein, moving objects are tracked using a bounding box enclosing each tracked object, and each bounding box is associated with a label designating a person present in the scene. Next, we extract the moving person and record all movements at an image resolution of 320x240 pixels. The absolute differences between each extracted box are then computed to generate the motion mask; binary frames are used to compute the motion zone in each box by computing the absolute temporal difference between frames. The number of frames (N) used in the training phase is set to 40, and HOG is then computed for each action. In the testing phase, this histogram is computed and compared with all the action histograms using the local soft cosine measure of [30], as described in the proposed method section.

To summarize the actions, the change of appearance between frames is computed using a similarity measure to extract the class of each video by selecting the shots. From the histogram of similarities, the local maxima are computed to detect the shots, and the action within each shot is recognized. The video is then summarized by recording one frame from each shot. The proposed method is implemented using MATLAB R2018a on a computer with the following configuration: an Intel Core i5 processor running at 3.4 GHz and 8 GB of RAM.

The proposed method is an action-based video summarization approach: after recognizing the multiple actions of the people in the scene, the actions of each person are summarized along a timeline. The datasets used are suitable for summarizing human actions after each action has been recognized, whereas other datasets such as YouTube, UCF50 or Hollywood are not. The datasets used are therefore the only ones in the literature suitable for summarizing human actions in a private or public scene monitored by a surveillance camera.

Datasets such as HMDB51, UCF101, YouTube, and Hollywood cannot be used for recognizing actions with the proposed algorithm owing to the complexity of their videos, which are collected from movies (e.g., YouTube) or captured by moving or jittering cameras, with variations of viewpoint and illumination. Many approaches use the entire video to classify the action in it without analyzing its content; for that reason, we did not use this kind of dataset. Moreover, our method targets the recognition and summarization of multiple human actions in surveillance videos.

With the presented data representation, each video is divided into short clips of 1 second, considering that some videos last more than 10 seconds. About 1500 clips are used in the training and testing parts. The number of clips for each action is summarized in Table 1.

Table 1 Number of videos used for each action from each dataset

4.2 Datasets

The datasets used in the experiments, namely Weizmann, KTH, UCF-ARG, UT-Interaction, IXMAS and MHAD (our dataset), are briefly reviewed in this section. The other types of videos, collected from movies (e.g., YouTube), such as HMDB51, UCF101, YouTube, and Hollywood, cannot be used for recognizing actions with the proposed algorithm owing to the complexity of their videos [55]; they are captured by moving or jittering cameras and exhibit variations of viewpoint.

The Weizmann Actions as Space-Time Shapes dataset was recorded in 2005 to test new algorithms for human action recognition [32]. The Weizmann dataset targets space-time-based algorithms, and each sequence contains only one acting person. The background is known, which makes it easy to remove. The dataset covers the main human actions, namely walking, running, jumping, galloping sideways, bending, one-hand waving, two-hands waving, jumping in place, jumping jack, and skipping. It contains nine actions, and each action has nine videos representing different situations of the action performed by nine different actors.

Similarly, the KTH dataset contains a set of human actions including walking, running, boxing, hand waving, hand clapping, and jogging [33]. The videos cover four different scenarios representing various states of objects and scenes, including outdoor and indoor videos, different scales of the human body, and clothes of different colors. This dataset contains 2391 sequences with a resolution of 160x120 pixels captured by a static camera.

The multi-view human action dataset UCF-ARG is a set of videos recorded from different viewpoints and classified into three categories: a ground camera, a rooftop camera at a height of 100 feet, and an aerial camera mounted on a payload platform. Each of these subsets contains 10 actions performed by 12 actors, representing most possible situations of each action, including 4 repetitions by each actor in different directions.

Because of the similarity between the KTH and UCF-ARG videos and the UIUC dataset [3] videos, we use the UIUC dataset only for testing. This dataset consists of 532 high-resolution video sequences of 14 human action classes, each action performed by eight persons. All the video sequences are recorded in indoor scenes.

For human-human interaction, the UT-Interaction [46] and IXMAS [47] datasets are used. The UT-Interaction dataset contains 6 classes, including shake-hands, point, hug, push, kick and punch, with a total of 20 video sequences whose lengths are around 1 minute. From the IXMAS dataset, we chose the Material class, which represents sequences of human-human interaction.

For multiple human action recognition, we built our own dataset, named the Multiple Human Action Dataset (MHAD). MHAD, as the name reflects, provides a new dataset containing many actions performed by many actors in the same video. On the one hand, in relation to video surveillance needs, each actor can perform many actions during his or her presence in the scene. This represents rich data for many computer vision tasks, including video summarization based on human actions, motion detection and tracking, people detection and recognition, and people counting.

On the other hand, many persons can be found acting in the same video. Compared with existing video surveillance datasets (which contain moving objects in the scene but only a single action such as walking), our dataset provides many persons performing different actions in the same video.

The proposed dataset can help computer vision researchers, especially those working on video summarization, motion detection and tracking, real-time human action recognition, and related tasks. In the following, we present the characteristics of the proposed dataset in detail.

Dataset characteristics:

The proposed dataset includes a set of human actions representing usual human activities. MHAD is composed of 10 actions: boxing, walking, running, hand waving, hand clapping, jogging, carrying, standing, backpack carrying, and two persons fighting.

The generated videos contain from 3 to 5 persons acting in the scene. The duration of each video is 2-3 minutes, as is the duration of each action. Three of the videos are outdoor and one is indoor. The background is provided for each video, and annotations of each moving actor are provided.

In the current work, we only used datasets captured by a fixed camera, because our approach relies on modeling the background and detecting moving human bodies before tracking each of them. For the same purpose, we built our dataset containing three videos. In each video, several persons perform different actions, and consecutive actions are performed by each person during his or her presence in the scene.

The accuracy of the summarization is related to the recognition accuracy: if the actions are well recognized, the summarization is simply a representation of these actions by one image for each person's action.

The ground truth of the actions in our dataset depends on the succession of actions in each video. Unlike other summarization methods, in which the scene changes over time, our dataset is intended for multiple action recognition, and the summarization is based on detection. Figure 7 represents the succession of actions of each person in two videos from our dataset.

Fig. 7 Succession of actions for each person in each video

4.3 Action recognition and summarization results

To evaluate the proposed background modeling method, the SBI dataset is used. Figure 8 shows the backgrounds generated by the proposed approach. The obtained results are convincing: with our method, the background is built without ghost artifacts for all videos. For the Foliage and People&Foliage sequences, the proposed method succeeds in estimating the background with good results even though these sequences are full of moving objects throughout the video.

Fig. 8 Background results on the SBI dataset using the proposed approach

To consolidate the visual results, we used different metrics, including the Average Gray-level Error (AGE), the total number of Error Pixels (EPs), the Percentage of Error Pixels (pEPs), the total number of Clustered Error Pixels (CEPs), the Peak Signal-to-Noise Ratio (PSNR), the Multi-Scale Structural Similarity Index (MS-SSIM), and the Color image Quality Measure (CQM).

These metrics are presented in Table 2, which compares the obtained results with two background modeling methods, IMBS-MT [54] and [43]. As shown, the proposed method succeeds in modeling the background with good results compared with the other methods for most of the dataset videos, including HighwayI, Hall&Monitor, Snellen, and Foliage.

Table 2 Performance results of the compared methods on the SBI dataset

To evaluate the proposed method, the obtained results are compared with state-of-the-art methods. Most action recognition methods perform the recognition on original data containing one person in action, and the presence of more than one person in a scene can reduce the performance of the recognition process. In addition, most methods use the silhouette of the moving object to recognize the action, so the recognition accuracy can be affected by a bad binarization or segmentation. To overcome these problems, our method recognizes multiple actions of multiple actors. The mask of each moving human body in action is computed using the background model, and each extracted silhouette is used to compute the histogram of oriented gradients (HOG), which is compared with all the histograms formed in the training phase.

The recognition of human actions is achieved by computing the distances between the HOGs of each extracted target and the training-phase results. In the training phase, the histogram of oriented gradients of the temporal difference map of each video is computed, so that for each action we obtain a set of HOGs, each representing one situation of the action. In the testing phase, we compute the HOG of the detected sequence and compare it with all the HOGs of the training phase using the distance defined above; the minimum distance indicates the action. To evaluate the effectiveness of the proposed technique, the KTH, Weizmann, and UCF-ARG datasets are tested against each other, and we also perform tests within each dataset.

The silhouette extracted for a human body during a video sequence might contain more than one action. The proposed technique for detecting shots, i.e., groups of similar images containing the same action, is illustrated in Fig. 9. The histogram of similarity between each two consecutive images is computed, and a filter is then applied to extract the shots using the difference between each two consecutive values. The histogram presented in Fig. 9b shows the transitions between actions; we can observe the transition caused by the change from one action to another.

Fig. 9 Action sequence and the histogram of similarity between each two consecutive frames. a Sequence of actions. b Histogram of similarity values

Tables 3, 4, 5, 6 and 7 illustrate the similarity distances between the HOGs of the actions of the three datasets: within the KTH dataset, within the Weizmann dataset, within the UCF-ARG dataset, and between the KTH and UCF-ARG datasets. The distance is computed using the cosine similarity measure of (2). The results in Table 3, which represent the similarity distances within the actions of the KTH dataset, reveal that the distance between similar actions, such as walking and jogging, is smaller than that between very different actions, such as hand waving and running. It can be clearly seen in Table 4 that the distance between running and walking is smaller than that between other actions, as is the case for the jumping action. In addition, Hand wave 1 is close to Hand wave 2, which is also a two-hands-waving action. The evaluations of the actions of the Rooftop and Ground subsets of the UCF-ARG dataset, which represent categories of videos captured by a rooftop camera and a ground camera respectively, are shown in Tables 5 and 6. From the tabulated results, it can be clearly seen that similar or close actions such as walking, jogging, carrying, and running have a minimum distance between them because of the similarity of their appearance in terms of the direction and moving parts of the human body; the same applies to hand waving and clapping, whose distance is small in many situations. Similarly, for the recognition of KTH actions within the Ground (UCF-ARG) actions, represented in Table 7, we observe that the proposed method fails to recognize the jogging actions. In addition, the boxing action of the KTH dataset is recognized as a waving action in the Ground subset; this is due to the similarity in appearance between boxing and waving in many situations.

Table 3 HOG distances between actions in the KTH dataset
Table 4 HOG distances between actions in the Weizmann dataset
Table 5 HOG distances between actions in the Rooftop dataset (UCF_ARG)
Table 6 HOG distances between actions in the Ground dataset (UCF_ARG)
Table 7 HOG distances between actions of the KTH dataset and the Ground of the UCF_ARG dataset

The error rate in Table 7 represents 2% of the tested data and occurs between the closest actions, such as jogging and running. Moreover, since actions can look similar in some cases, the summarization using images can be useful to spot the differences.

To compare the proposed method with some state-of-the-art methods, the accuracy of each approach is presented in Table 8. The two proposed action recognition methods are named, respectively: (1) CS-HOG-TDMap, the classification using the cosine similarity between the HOGs of the training set and the HOG of the current action, and (2) CNN-TDMap, the recognition using the CNN model. The accuracies of the state-of-the-art methods are the values reported in their papers, while the accuracy of the proposed method is the ratio between the number of recognized actions and the total number of actions. The KTH, Weizmann and UCF-ARG datasets have been used by many methods in the literature over the last three years. The results obtained with the proposed approach on all the datasets are convincing and robust. The proposed method recognizes over 98% of the actions in the KTH, Weizmann, and UCF-ARG datasets using CS-HOG-TDMap and 99% using CNN-TDMap, owing to the simplicity of the data, which contains simple backgrounds and clear actions in normal situations. On the UT-Interaction dataset, the proposed method reaches 87% and 98% recognition rates, and on the IXMAS dataset, a large dataset with many situations for each action, the proposed method achieves a recognition rate of 99%.

Table 8 Recognition rate comparison using single action

Compared with state-of-the-art methods for multiple human action recognition that use the same category of datasets, Table 9 reports the accuracy of each method; it can be observed that the results of the proposed method are better and more effective. These results stem from the use of the new data representation.

Table 9 Accuracy comparison of multiple human action recognition with state-of-the-art-methods

The detection and recognition of multiple human actions using the proposed method can be implemented in real time via an extended version. The proposed approach is tested on our dataset by extracting the sequence of each person in the scene and applying the proposed algorithm. The obtained results are shown in Fig. 10, which illustrates the detected persons and their actions on the MHAD and PET09 datasets. The visualized results represent one example from the PET09 dataset and three videos from our dataset.

Fig. 10 Action recognition results tested on three MHAD videos and on a video from PET09

The proposed approach is validated in three major steps: human detection, subsequence extraction, and recognition and summarization of each person's actions. The UIUC dataset is also used for testing the trained actions; Fig. 11a shows some of the obtained results. These videos are not included in the training phase because they contain the same categories of actions as the datasets used. For the UIUC dataset videos, the recognition rate reaches 98%.

Fig. 11 Recognition results on the UIUC, IXMAS and UT-Interaction datasets. a UIUC dataset. b IXMAS dataset. c UT-Interaction dataset

For human-human actions, we use two datasets, IXMAS and UT-Interaction; Fig. 11b and c show some of the obtained detection and recognition results. For example, for the IXMAS dataset we use some videos where two persons are fighting in the training phase. The results shown in Fig. 11b illustrate the recognition in four videos captured from different fields of view (FOV), and Fig. 11c illustrates some recognized actions from the UT-Interaction dataset.

The accuracy of the proposed algorithm is related to the detection, tracking, and segmentation of the human body in the scene. Additionally, a detected person may appear in varying positions, which requires a large number of training videos to represent all of the actions in several positions.

In this paper, we defined a video summary as a concatenation of action labels and keyframes of the video where the actors perform the corresponding action. The summarization based on human actions presented in this paper consists of splitting the actions of each person present in the scene. Based on the recognition results and the extraction of each action, the summarization using the entire image is generated. Figure 12 presents the results of applying the proposed method to our dataset, which includes two persons and several actions for each of them. The body silhouette of a person is extracted before detecting the shots, each of which represents one action; for each shot, our algorithm selects frames containing the corresponding body silhouette, and the same process is applied to each person in the scene. If a person enters the monitored zone and performs one action, for example walking through the scene, its summary is a single image. The proposed method can help video surveillance operators view only the most important moments of the video, noting that most areas are empty most of the time.

Fig. 12 Action-based summarization of each person in the scene. a Summary of person 1 using extracted shots of the silhouette detection. b Summary of person 2 during his presence in the scene

As presented in Fig. 13, in the testing part a sequence is generated for each detected person, and the action in each sequence is recognized using the proposed model. The succession of analyses (motion detection, motion tracking, and the proposed architecture) provides multiple human action recognition, after which each action can be reported on the original video. In order to summarize the detected and recognized actions over the entire duration of a video, Fig. 13 also represents a summarization, using graphs, of the recognized actions of each person during his presence in the scene.

Fig. 13 Summarization of the actions made by each person during his presence in the scene. First row: action recognition and summarization for video 1. Second row: action recognition and summarization for video 2

The action recognition results can be influenced by illumination changes. Detecting any illumination change in the scene can be useful for enhancing the captured video before recognizing the actions. The proposed illumination change detection method applies the video quality enhancement only when a change occurs. Figure 14 illustrates the enhancement results for two videos (LightSwitch and Lobby) from the Star dataset: after an illumination change is detected, the enhancement of the subsequent frames takes place.

Fig. 14 Sub-sequence quality enhancement after detection of illumination changes

The enhancement is an additional part of the work that allows us to improve the video if any illumination change occurs during the video. The novelty of the proposed method, and its difference from existing methods, is the consideration of multiple human action recognition, together with the combination of recognition and summarization of actions, which is not used by the action recognition methods in the literature.

5 Conclusions

In this work, a novel approach for multiple human action detection, recognition and summarization was developed, in which the actions of each person present in a scene are summarized. For a given scene, the motion of each person is detected and tracked, and a sequence of each human body silhouette is generated. To recognize and summarize each action within a shot (each shot being a sub-sequence that represents one action), a shot detection operation was developed to determine the set of actions in the generated sequence: shots representing the homogeneous parts of the sequence are selected using the cosine similarity of consecutive frames. The recognition of each action is based on two methods. The first uses a training set of HOGs of the TDMaps generated for each sub-sequence; these HOGs, which represent different situations of each action, are computed and then used in the testing phase, and the recognition is made by selecting the minimum distance between the HOG of the current action and the HOGs of the training set. In addition, the computed TDMap images are used in a CNN model to classify the actions. The summarization is made by representing all shots with one image for each detected person during his presence in the scene. Using the proposed algorithm, multiple human actions can be detected and recognized; the algorithm can also be used in real time owing to its simplicity and the small number of features it uses.