1 Introduction

Visual saliency detection identifies the regions of an image that are most significant and thus facilitates image and video processing algorithms. Video Saliency Detection (VSD) is the category of visual detection that identifies important regions and objects in videos. Video frames usually have lower resolution and more noise than typical still images and must be processed continuously; therefore, video processing faces more challenges than image processing. The human eye focuses mainly on the moving regions of a video when recognizing relevant features; therefore, motion detection is a critical component of VSD [1]. Processing only the regions of interest (ROIs) of a video, detected by a saliency map or by Video Salient Object Detection (VSOD), can significantly reduce the computational cost and increase the efficiency of video processing algorithms, such as Human Action Recognition (HAR) systems.

A review of video processing algorithms revealed that most of them use random region or frame selections throughout the input video frames [2,3,4], which decreases their efficiency. In particular, video processing systems can be trained better if only the ROIs of the video frame sequences are used as input. Therefore, this article proposes a hybrid time-spatial video saliency detection method to enhance the accuracy of HAR systems. Existing VSOD methods can be categorized into supervised, unsupervised, and semi-supervised methods [5]. Some studies have divided VSOD methods into Deep Learning (DL) methods and other methods [6]. DL-based methods, which are mainly supervised, have been widely developed in recent years but have shortcomings, such as long training times, the need for a large amount of training data, and powerful hardware for implementation. Furthermore, general DL systems must be retrained for each new application to achieve maximum efficiency. Wang et al. [7] presented a saliency detection method based on multiple instance learning, which extracts low-, mid-, and high-level features and evaluates the possibility of each region being salient. Li et al. [8] suggested a DL-based approach called Flow Guided Recurrent Neural Encoder for VSOD. Sun et al. [9] proposed a Step-gained Fully Convolutional Network (SGF) model for VSOD, which uses temporal and spatial information simultaneously. Lee et al. [10] presented a saliency map prediction method based on a Convolutional Neural Network (CNN) for 360-degree video streaming, and Li et al. [11] proposed a Deep Neural Network (DNN) for VSOD that employs spatial and optical flow features. Yan et al. [12] introduced a semi-supervised convolutional neural network for VSOD. Fan et al. [13] analyzed different VSOD methods on datasets commonly used in this field and compared their results; they also proposed an improved deep learning method for VSOD.

Yang et al. [14] suggested superpixels, a Fully Convolutional Network (FCN), and a random walk method on a graph for image salient object detection. Liu et al. [15] proposed an approach based on motion saliency and mid-level patches for action recognition, which uses a threshold-based motion region segmentation algorithm to extract motion saliency maps. Gu et al. [16] proposed a Pyramid Constrained Self-Attention Network for VSOD. Kousik et al. [17] combined a CNN with a Recurrent Neural Network (RNN) to improve salient object detection accuracy. Ji et al. [18] reviewed CNN-based encoder-decoder networks used in salient object detection. Zong et al. [19] proposed a motion saliency-based multi-stream Multiplier ResNets (MSM-ResNets) structure for action recognition, which can also extract motion saliency maps. Ji et al. [20] proposed a DNN structure for VSOD, called CASNet, which includes an encoder-decoder backbone network, self- and cross-attention modules, and a multiscale feature fusion operation.

Zhang et al. [21] proposed a VSOD method termed the dynamic context-sensitive filtering network, which includes convolutional layers, optical flow, and long short-term memory (LSTM). Wang et al. [22] suggested a hybrid feature-aligned network for semantic salient object detection in images, which mitigates the disturbance of a complex background. Liu et al. [23] proposed a dual-branch network based on multitask learning, which extracts the localization information of salient objects from scene classification and then employs dynamic guided learning to enhance saliency detection. The same authors suggested another salient object detection method using a universal super-resolution-assisted learning approach [24]. Alavigharahbagh et al. [25] proposed a time saliency map based on motion features, which takes into account the disturbance caused by camera movement in the used motion features. As an alternative to VSOD methods, multiscale object detection methods, such as the one proposed by Liu et al. [26], can also extract salient objects from image frames.

DL networks with different layer topologies, combinations, and temporal memory characteristics have been proposed for DL-based VSOD systems. The efficiency of the adopted DL network depends significantly on the application, mainly on the type of the input video. The most used layers in DL-based VSOD systems are convolutional layers, LSTM, and rectified linear units (ReLU). The shortcomings of DL methods also exist in supervised non-DL-based VSOD systems; however, non-DL-based systems require fewer training samples and less powerful hardware. In this category of research, one can find the work of Vijayan and Ramasundaram [27], which combined the PSO framework with the saliency map technique for more accurate object detection, and the work of Huang and McKenna [28], which considered a combination of superpixels and optical flow for VSOD. Generally, non-DL-based VSOD systems, which have attracted much less research attention, demonstrate a lower level of performance than DL-based systems.

Unsupervised and semi-supervised systems are less efficient than supervised systems in real-world applications, but their advantages are high generalizability, independence from training samples, and the ability to work under different conditions. All unsupervised VSOD methods combine the information within video frames, that is, the spatial information, with the information between frames, i.e., the motion, to identify the ROIs of the input video. The use of object recognition algorithms as part of, or in combination with, VSOD is common in applications such as transportation and security. Mahapatra [29] showed that the human visual system is more sensitive to moving objects than to static objects in a scene.

Therefore, motion is an essential part of the video saliency map. Lee et al. [30] performed motion analysis based on a dynamic visual saliency map model using the optical flow and spatial characteristics of the frames. Jeong et al. [31] employed a combination of motion, color, edge, and fuzzy logic to create video saliency maps. Cui et al. [32] relied on the temporal spectral residual to separate moving objects from the static scene background and considered spectral-temporal features as a fast descriptor. Woo et al. [33] used general and local features for image obstacle categorization, which can be combined with motion characteristics and applied to videos. Belardinelli et al. [34] exploited spatiotemporal filtering to extract motion saliency maps. Morita [35] used only motion information to detect saliency maps in videos and later modified his method by combining color and motion information to increase the efficiency of the obtained saliency maps [36]. Hu et al. [37] examined three ways to obtain motion saliency maps in videos: optical flow, motion contrast, and spatiotemporal filtering.

Mejía-Ocaña et al. [38] removed the motion of the used camera from the acquired video and estimated the motion vector in different parts of the video hierarchically. The result was then reprocessed to eliminate possible errors in the motion vector calculation. Gkamas and Nikou [39] tried to increase the optical flow accuracy using superpixels. Li et al. [40] applied visual and motion saliency information for object extraction in videos. Their method considered the spatiotemporal consistency of structural information, color information, and optical flow. Chang and Wang [41] used superpixels to increase the optical flow performance in low-resolution or noisy videos. Huang et al. removed scene motion from image frames and used spatial, entropy, and temporal features to detect moving objects in videos and video saliency maps [42]. Dong et al. [43] proposed a HAR method that combines optical flow and superpixel residual motion characteristics to detect actions. Giosan et al. [44] used a combination of superpixel, optical flow, and image depth information for obstacle segmentation in videos.

Dellaert and Roberts [45] used optical flow templates to tag motion superpixels in video frames. Xu et al. [46] applied the discrete cosine transform to create an initial saliency map and then modified it using the global motion estimation method. Srivatsa and Babu [47] proposed a method based on foreground connectivity in superpixels for salient object detection. Donné et al. [48] introduced a dynamic optical flow calculation method using superpixels that increased the optical flow calculation speed in high-quality videos. Li et al. [49] used superpixels, color information, and spatial features for VSOD, and Hu et al. [50] proposed an enhanced optical flow algorithm using superpixels. Guo et al. [51] also applied a combination of superpixels and optical flow to extract the initial saliency map and corrected it using cross-frame cellular automata. Tu et al. [52] extracted motion objects from the input video, removed the noise, and obtained the final saliency map using motion continuity.

Hu et al. [53] proposed a video segmentation method using motion information and spatial properties of video frames, which used optical flow and edge information to create a neighborhood graph. Ling et al. [54] corrected camera motion using a feedback-based robust video stabilization method. Video stabilization enhances the efficiency of motion features such as those based on optical flow. Wang et al. [55] extracted motion features with kernelized correlation filters using superpixels, color, and spatial features for visual object tracking. Chen et al. [56] employed a combination of principal component pursuit and motion saliency to distinguish the scene foreground in video images. Temporal and spectral residuals were considered to calculate the motion saliency.

Maczyta et al. [57] relied on optical flow for motion saliency map estimation. Zhu et al. [58] proposed an optical flow estimation method using light field superpixels. Kim et al. [59] considered a set of superpixels for object tracking, assuming significant motion in the videos. Ngo et al. [60] applied a dynamic mode decomposition for motion saliency detection and a difference-of-Gaussian filter in the frequency domain to improve the detection. Qiu et al. [61] considered superpixels to enhance motion detection efficiency. Tian et al. [62] used Lucas-Kanade optical flow and a Kalman filter to model the regular movement in videos, together with a template update scheme to ensure timely adaptation to feature changes. A concise analysis of the aforementioned studies reveals the following:

  • Motion is the most common feature in unsupervised methods, along with spatial, spectral, color, and edge features;

  • The principle of motion continuity is essential when extracting motion saliency maps;

  • Optical flow and frame difference are the most commonly used motion features;

  • Superpixels and image segmentation have also been used for motion detection;

  • The final output is commonly obtained after some postprocessing steps that improve accuracy.

Considering the above, mainly the advantages of unsupervised systems, this study aimed to propose an unsupervised method for VSOD. The method is independent of the objects and uses only temporal and spatial features, including features based on color, motion, contrast, frequency, and edge information. A bank of filters is used to extract the spatial saliency map. Additionally, different features are used to detect and remove the motion of the scene to improve the VSOD performance. The combination of the used features considers both spatial and motion information. In summary, the innovations achieved with the current study are as follows:

  • Several spatial saliency maps containing spatial, color, edge, and frequency information are used to obtain a comprehensive expression of the ROIs in each image frame;

  • Extraction of motion information using optical flow and segmented images;

  • Removal of the scene and camera motion to enhance the accuracy;

  • Integration of spatial and temporal saliency maps using a nonlinear function.

The remainder of this article is organized as follows. Section 2 presents the flowchart of the proposed method and an explanation of its steps. The results of the proposed method and their comparison with related methods are presented in Section 3, and Section 4 presents the conclusions and perspectives of future work.

2 Proposed method

The flowchart of the proposed method is given in Fig. 1. Its first step consists of retrieving several image frames from the input videos. The main blocks of the method are related to image registration, spatial and temporal saliency maps, and the final saliency map. Three extra blocks, namely region selection, frame cropping, and HAR, were used to evaluate the performance of the proposed method.

Fig. 1 Flowchart of the proposed method

Fig. 2 Two consecutive image frames before and after the image registration step

2.1 Image registration

In video processing and VSOD, it is vital to separate the motion of the used camera from that of the objects of interest. The motion of the camera can strongly affect motion detection algorithms and, therefore, interfere with detecting moving objects. Many algorithms have been proposed for image registration, and one of the most common and effective is Speeded Up Robust Features (SURF), which is similar to the Scale-Invariant Feature Transform (SIFT) but faster and computationally less complex [63].

SURF uses a Gaussian filter, the Hessian matrix, blob detection, and feature matching to select the key points of an image and detect the rotation and shift of these points. Due to the robustness of the SURF algorithm against rotation and translation, various optimizations have been performed, mainly focusing on execution speed. Because of the aforementioned advantages, the SURF operator was implemented in the first step of the proposed method to detect the motion of the scene. Assuming the scene moves because of the motion of the used camera, the motion detection algorithm can distinguish between the motion of the scene and that of the moving objects after the image registration step, and the regions of the image frames that do not overlap can be discarded from further processing. It is assumed that the camera moves with the objects of interest, so eliminating these regions does not lead to any performance loss in saliency detection. Figure 2 shows the effect of this step on two image frames of a video containing scene motion. Note that the left border of frame t and the right border of frame \(t+1\), indicated in the frames by black rectangles, were cropped after the registration step.
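As a concrete illustration of this step, the following sketch, assuming greyscale uint8 frames and OpenCV, registers two consecutive frames and masks out the non-overlapping borders. ORB is used here as a freely available stand-in for SURF (SURF requires the non-free opencv-contrib build), and the transform estimation is a generic similarity fit rather than the exact configuration of the original implementation.

```python
import cv2
import numpy as np

def register_frames(prev_gray, curr_gray):
    """Align frame t+1 (curr) to frame t (prev) and mask out the regions of
    the two frames that do not overlap after registration."""
    # ORB as a stand-in for SURF (cv2.xfeatures2d.SURF_create is only
    # available in the non-free opencv-contrib build).
    detector = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = detector.detectAndCompute(prev_gray, None)
    kp2, des2 = detector.detectAndCompute(curr_gray, None)

    # Brute-force Hamming matching suits ORB's binary descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des2, des1), key=lambda m: m.distance)[:200]
    src = np.float32([kp2[m.queryIdx].pt for m in matches])
    dst = np.float32([kp1[m.trainIdx].pt for m in matches])

    # Similarity transform (shift/scale/rotation) estimated with RANSAC; the
    # dominant motion is assumed to come from the scene/camera, not from the
    # salient objects.
    M, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC)

    h, w = prev_gray.shape
    curr_aligned = cv2.warpAffine(curr_gray, M, (w, h))

    # Pixels that fall outside the warped frame belong to the non-overlapping
    # border regions and are discarded from further processing.
    overlap = cv2.warpAffine(np.full((h, w), 255, np.uint8), M, (w, h)) > 0
    return prev_gray * overlap, curr_aligned * overlap
```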

Table 1 Features and their lengths used in the spatial saliency detection step of the proposed method

2.2 Spatial and time saliency maps

2.2.1 Spatial saliency

In this step, the spatial saliency map of the image frames, regardless of their sequence, is extracted. For this purpose, a modified version of salient region detection via high-dimensional color transform [64] is used. The features employed in the proposed method are classified into spatial, color, texture, and shape features (Table 1).

Fig. 3 Spatial saliency of the two consecutive video frames of Fig. 2 obtained by the proposed method

The modifications of the proposed method relative to [64] are the following. First, the proposed method uses parallel processing to improve speed. Second, the Simple Linear Iterative Clustering (SLIC) algorithm used to find superpixels in the original method is replaced by the SLIC0 algorithm [65], which refines the compactness adaptively. In SLIC, the user chooses the compactness parameter, which is constant for all superpixels; if the video contains regions with different smoothness, the output of SLIC includes irregular superpixels in the non-smooth regions. SLIC0 is an adaptive method that generates regularly shaped superpixels in both smooth and non-smooth regions. Third, the number of Histogram of Oriented Gradients (HOG) bins is 31 in the original method [64] but 36 in the proposed method. Figure 3 shows the spatial saliency of the two image frames of Fig. 2 after the image registration step. In this case, the cyclist and bicycle are among the highlighted regions of the saliency map as ROIs, which demonstrates the ability of the suggested spatial saliency algorithm to detect ROIs.
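The SLIC0 variant is available, for instance, in scikit-image through the slic_zero flag; the sketch below illustrates the adaptive-compactness superpixel extraction, with the number of superpixels and the initial compactness being illustrative values rather than those of the proposed method.

```python
from skimage.segmentation import slic
from skimage.util import img_as_float

def superpixels_slic0(frame_rgb, n_segments=400):
    """SLIC0 (zero-parameter SLIC): the compactness is adapted for each
    superpixel, yielding regularly shaped superpixels in both smooth and
    textured regions."""
    return slic(img_as_float(frame_rgb),
                n_segments=n_segments,   # illustrative value
                compactness=10.0,        # only an initial value; SLIC0 adapts it
                slic_zero=True,
                start_label=1)
```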

2.2.2 Time saliency

This study uses optical flow to extract the temporal changes in a video [66, 67]. Optical flow captures the motion pattern of a visual scene that results from the relative movement between the scene objects and the observer, i.e., the camera. Advantages of optical flow include the availability of extraction methods with distinct hypotheses, algorithms optimized for speed, and relative robustness to noise. Its disadvantages include a relatively high sensitivity to the motion of the scene and of the used camera and, in most algorithms, the extraction of only the borders of the moving objects rather than the whole objects [68]. In a two-dimensional (2D) grey-scale video, if the motion in the x and y directions is \(\Delta x\) and \(\Delta y\), respectively, during a time interval \(\Delta t\), and assuming that the motion between two consecutive image frames is small, this change can be approximated as [69]:

$$\begin{aligned} I(x + \Delta x, y + \Delta y, t + \Delta t) = I(x,y,t) + \frac{\partial I}{\partial x}\Delta x + \frac{\partial I}{\partial y}\Delta y + \frac{\partial I}{\partial t}\Delta t + \text {higher-order terms}, \end{aligned}$$
(1)

where I is the brightness, x and y the coordinates in the first frame, \(\Delta x\) and \(\Delta y\) the changes in the horizontal and vertical directions, i.e., the motion in the following frame, \(\Delta t\) the duration of the motion, and \(\partial\) the differential operator, respectively. Since after \(\Delta t\) the brightness of an object is almost the same, the brightness constancy constraint (2) follows:

$$\begin{aligned} I(x,y,t) = I(x + \Delta x,y + \Delta y,t + \Delta t), \end{aligned}$$
(2)

then, by discarding the higher-order terms, one has:

$$\begin{aligned} \frac{{\partial I}}{{\partial x}}\Delta x + \frac{{\partial I}}{{\partial y}}\Delta y + \frac{{\partial I}}{{\partial t}}\Delta t = 0, \end{aligned}$$
(3)

and, by dividing (3) by \(\Delta t\) and performing some simplifications, it is possible to obtain [69]:

$$\begin{aligned} \frac{{\partial I}}{{\partial x}}{V_x} + \frac{{\partial I}}{{\partial y}}{V_y} + \frac{{\partial I}}{{\partial t}} = 0, \end{aligned}$$
(4)

therefore, (4) can be rewritten as [69]:

$$\begin{aligned} {I_x}{V_x} + {I_y}{V_y} = - {I_t}, \quad \text {or} \quad \nabla I \cdot \textbf{V} = - {I_t}, \end{aligned}$$
(5)

where \(V_x\) and \(V_y\) are the components of the motion vector, i.e., the optical flow, and \(\frac{\partial I}{\partial x}\), \(\frac{\partial I}{\partial y}\), and \(\frac{\partial I}{\partial t}\) the derivatives, or gradients, of the image in the x and y directions and in time [69], respectively. Since the number of unknowns in (5) is larger than the number of equations, the different optical flow methods balance the numbers of equations and unknowns by adding a constraint or hypothesis. After some simulations, the Horn–Schunck method [70] was selected from the current optical flow algorithms; it assumes that the pattern of brightness changes across the video is smooth and uses an optimal iterative scheme to solve (5). Figure 4 shows the output of the Horn–Schunck method for two consecutive video frames. As can be seen in Fig. 4, the Horn–Schunck method [71] only extracts the boundaries of the moving object; therefore, to obtain the entire moving object, optical flow is combined with color-based image segmentation. Thus, the input image is first simplified according to the color-based image segmentation method proposed by Bensaci et al. [72] using quantization based on variance [73]. In this quantization, the color variance within the segmented objects is minimized. After quantization, a dithering technique reduces the quantization error in the neighboring pixels; several dithering methods [74] were used in the proposed method. After dithering, the quantized image is corrected using J-image calculations [75].

This correction minimizes the variance in the quantized blocks using region growing. Figure 5 shows the segmented image frames. Finally, the temporal saliency map is built from the detected segments and their optical flow energies (Fig. 6).
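To make this step concrete, the sketch below implements a minimal Horn–Schunck solver and assigns each segment the mean optical-flow energy of its pixels. The segmentation labels are assumed to come from any color-based labeling (the method of [72] is not reproduced here), and the smoothness weight and iteration count are illustrative rather than the values of the original implementation.

```python
import numpy as np
from scipy.ndimage import convolve

def horn_schunck(prev, curr, alpha=1.0, n_iter=100):
    """Minimal Horn-Schunck optical flow between two greyscale uint8 frames."""
    prev = prev.astype(np.float64) / 255.0
    curr = curr.astype(np.float64) / 255.0

    # Spatial and temporal image derivatives (2x2 Horn-Schunck stencils).
    kx = 0.25 * np.array([[-1.0, 1.0], [-1.0, 1.0]])
    ky = 0.25 * np.array([[-1.0, -1.0], [1.0, 1.0]])
    kt = 0.25 * np.ones((2, 2))
    Ix = convolve(prev, kx) + convolve(curr, kx)
    Iy = convolve(prev, ky) + convolve(curr, ky)
    It = convolve(curr, kt) - convolve(prev, kt)

    # Neighbourhood-averaging kernel used by the smoothness constraint.
    avg = np.array([[1.0, 2.0, 1.0], [2.0, 0.0, 2.0], [1.0, 2.0, 1.0]]) / 12.0

    u = np.zeros_like(prev)
    v = np.zeros_like(prev)
    for _ in range(n_iter):
        u_bar = convolve(u, avg)
        v_bar = convolve(v, avg)
        common = (Ix * u_bar + Iy * v_bar + It) / (alpha ** 2 + Ix ** 2 + Iy ** 2)
        u = u_bar - Ix * common
        v = v_bar - Iy * common
    return u, v

def temporal_saliency(prev, curr, segments):
    """Temporal saliency map: each segment receives the mean flow energy of
    its pixels; `segments` is an integer label image from the segmentation."""
    u, v = horn_schunck(prev, curr)
    energy = u ** 2 + v ** 2
    t_map = np.zeros_like(energy)
    for label in np.unique(segments):
        region = segments == label
        t_map[region] = energy[region].mean()
    # Normalise to [0, 1] so the map can be fused with the spatial saliency.
    return t_map / (t_map.max() + 1e-12)
```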

Fig. 4 Output obtained by the Horn–Schunck method for the two consecutive video frames of Fig. 2

Fig. 5 Segmented image frames obtained for the two consecutive frames of Fig. 2

Fig. 6 Output of the time saliency step of the proposed method for the two consecutive frames of Fig. 2

2.3 Final saliency map

The most important part of the proposed method is the combination of the temporal and spatial saliency maps. Various studies have suggested different combination schemes: some used multiplication or a weighted linear function of the temporal and spatial saliency maps to obtain the final saliency [76,77,78], and, in most VSOD research, the weight of the temporal saliency is assumed to be greater than that of the spatial saliency [78, 79]. Here, a different scheme is suggested: a nonlinear function containing seven terms combines the temporal and spatial saliency maps. The terms include weighted linear, quadratic, multiplicative, and exponential expressions of the temporal and spatial saliency maps. The suggested function that combines the temporal and spatial saliency maps is:

$$\begin{aligned} VSM = {x_1}{S_s} + {x_2}{T_s} + {x_3}S_s^2 + {x_4}T_s^2 + {x_5}{S_s}{T_s} + {x_6}{e^{{S_s}}} + {x_7}{e^{{T_s}}}, \end{aligned}$$
(6)

where the \(x_i\) are coefficients, VSM the video saliency map, \(S_s\) the spatial saliency map, and \(T_s\) the temporal saliency map. The coefficients of (6), which model the weights of the temporal and spatial saliency parts, were obtained using a multi-objective optimization scheme to satisfy two objective functions designed to maximize the efficiency of the saliency map. Objective function 1 is the ratio of the total energy of the 10th (weakest) decile of the VSM values of the necessary image pixels to that of the 1st (strongest) decile of the VSM values of the unimportant image pixels, which must be maximized (Algorithm 1). In other words, the weakest VSM values assigned to necessary image pixels must be greater than the highest VSM values assigned to non-important image pixels. The second objective function is the percentage of unimportant pixels of the input image with VSM values higher than the 1st decile of the essential pixels (Algorithm 2), which must be minimized.
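Assuming both saliency maps are normalised arrays of the same size, (6) can be evaluated directly, as in the following sketch; the coefficient vector is the quantity optimised in Section 3.2.

```python
import numpy as np

def combine_saliency(S_s, T_s, x):
    """Nonlinear fusion of the spatial (S_s) and temporal (T_s) saliency maps
    following Eq. (6); x is the vector of the seven coefficients x1..x7."""
    return (x[0] * S_s + x[1] * T_s
            + x[2] * S_s ** 2 + x[3] * T_s ** 2
            + x[4] * S_s * T_s
            + x[5] * np.exp(S_s) + x[6] * np.exp(T_s))
```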

Algorithm 1 Objective function 1.

Algorithm 2 Objective function 2.

3 Results

The efficiency of the proposed method was evaluated in two steps. In the first step, the unknown parameters of the method and the coefficients of (6) were determined by implementing the proposed VSOD. The effectiveness of the method in increasing the accuracy of HAR systems was addressed in the second step; in this case, the response of each HAR system was analyzed before and after integrating the proposed method. All simulations were performed in MATLAB R2020b and PyCharm Professional Edition 2020 on a personal computer with an Intel Core i7-9700 CPU (8 cores, 12 MB cache, up to 4.70 GHz), 16 GB of RAM, and a GeForce RTX 2070 SUPER 8 GB GPU.

3.1 Datasets

Four datasets were used in this study. The first two were used to determine the unknown parameters of (6), and the remaining two were used to investigate the effect of adding the proposed method to current HAR systems. Thus, to determine the parameters of the proposed VSOD method and obtain the unknown coefficients of (6), the DAVIS 2016 and DAVIS 2017 datasets were used [75, 87].

The DAVIS 2016 and DAVIS 2017 datasets contain collections of 50 and 150 video sequences, respectively, together with their moving object masks. All videos are in color, and the number of frames and image quality differ from video to video. After identifying the unknown parameters of (6), the proposed method was added to the input of current HAR systems as a preprocessing step, and the performance of each HAR system with and without the proposed method was evaluated. Two datasets, UCF101 and HMDB51, were used in this step; both provide videos and a separate file listing the human actions performed in them. The UCF101 dataset contains 13,320 videos of 101 human actions, and the HMDB51 dataset contains 6766 videos of 51 actions.

3.2 Proposed method’s parameters

The parameters of the proposed method are used in its second phase (Fig. 1), which includes the image registration step and the building of the spatial, temporal, and final saliency maps. A set of different features and feature-matching algorithms was examined for the image registration step, as shown in Table 2. Among the available features, the best response was obtained with the SURF algorithm, which is fast and can be used in online implementations due to its feature extraction speed. Regarding the matching of key points, the rotation was assumed to be 0 (zero) because of the type of the studied videos, and only shift and scale transforms were considered. However, any error in this step can significantly reduce the efficiency of the temporal saliency map. The algorithms studied for the optical flow calculation were the Farneback [88], Horn-Schunck, Lucas-Kanade [89], and Lucas-Kanade derivative of Gaussian [90] algorithms. The Horn-Schunck algorithm was selected based on its robustness against errors in the image registration step. In the color image segmentation step [72], the scale was set to 2, \({\mu _j}\) was assumed equal to 0.4, and \({\sigma _j}\) equal to 0.03. A Genetic Algorithm (GA) was used to calculate the coefficients of (6).

Table 2 Features studied in the image registration step of the proposed method

In the optimization step, all coefficients were assumed to be between 0 (zero) and 5, and the initial population was set to 200. Two different optimizations were performed: the first used objective function 1 (Section 2.3), and the second combined objective functions 1 and 2 (Section 2.3) with equal weights. Because the two objective functions have opposite goals, i.e., one should be maximized and the other minimized, the difference between them was used as the objective function of the second optimization:

$$\begin{aligned} F_{\text {cost}} = 1 + \text {objective function 2} - \text {objective function 1}. \end{aligned}$$
(7)

Figure 7 shows the changes in (7) when minimized by the GA, as well as the changes in objective functions 1 and 2 during the iterative process. As can be seen in Fig. 7, the changes in objective function 1 are much larger, and its effect on the output is more important. This may be due to the large area of the non-relevant regions of the video frames, which makes the changes in objective function 2 less important. The function obtained in the first optimization process was:

$$\begin{aligned} VSM = 0.9{e^{{S_s}}} + 5{e^{{T_s}}}, \end{aligned}$$
(8)

and the function obtained in the second optimization was:

$$\begin{aligned} VSM = 0.02{S_s} + 0.01{T_s} + 0.8{e^{{S_s}}} + 4.9{e^{{T_s}}}. \end{aligned}$$
(9)

The output of (8) or (9) is then used as the final saliency map.
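As an illustration of how the coefficients can be fitted, the sketch below minimises the cost of (7) over labelled frames, reusing combine_saliency, objective_1, and objective_2 from the earlier sketches; SciPy's differential evolution is used as a stand-in for the genetic algorithm of the original study, and the population and iteration settings are illustrative.

```python
import numpy as np
from scipy.optimize import differential_evolution

def fit_coefficients(samples):
    """Fit the seven coefficients of Eq. (6) on labelled frames.

    `samples` is a list of (S_s, T_s, mask) tuples, where mask marks the
    ground-truth salient pixels of the frame."""
    def cost(x):
        values = []
        for S_s, T_s, mask in samples:
            vsm = combine_saliency(S_s, T_s, x)
            # Combined cost of Eq. (7): 1 + objective 2 - objective 1.
            values.append(1.0 + objective_2(vsm, mask) - objective_1(vsm, mask))
        return float(np.mean(values))

    bounds = [(0.0, 5.0)] * 7      # all coefficients constrained to [0, 5]
    result = differential_evolution(cost, bounds, popsize=30, maxiter=200, seed=0)
    return result.x
```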

Fig. 7 Behaviour of (a) (7), (b) objective function 1, and (c) objective function 2 during the optimization process

3.3 Comparison with other methods

By adding the proposed method to some current HAR methods, these can be trained using the salient objects of the video frames detected by the proposed method. The results were compared with those obtained when the same methods were trained using random or static region selection. Five different HAR methods were used for comparison: four use arbitrary region and frame selection [91,92,93,94], and the fifth uses a temporal saliency map for frame and region selection [95]. All studied HAR models were trained under similar conditions on the UCF101 and HMDB51 datasets and then evaluated. In all methods, the number of video frames used for training was 20, and the selected frames were cropped to \(111 \times 111\) pixels according to the saliency map. Data augmentation algorithms were also used to generate more data for training; the remaining conditions were similar to those suggested in [95]. Table 3 presents the results when the proposed method was added to each studied HAR system using (8) and (9) and applied to the UCF101 and HMDB51 datasets. According to the results, the accuracy of the studied HAR methods improved when the proposed method was added and the input frames were selected based on VSOD. The performance improvement obtained with (9), derived from the combination of the two objective functions, was better than that obtained with (8), which was optimized based only on objective function 1.
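One plausible way to crop each frame around its saliency map, before feeding the HAR networks, is sketched below; the exact frame-selection protocol follows [95] and is not reproduced here, so centring the crop on the saliency maximum is an illustrative choice rather than the original procedure.

```python
import numpy as np

def crop_salient_region(frame, vsm, size=111):
    """Crop a size x size window centred on the maximum of the saliency map,
    clamped to the image borders."""
    h, w = vsm.shape
    cy, cx = np.unravel_index(np.argmax(vsm), vsm.shape)
    y0 = int(np.clip(cy - size // 2, 0, h - size))
    x0 = int(np.clip(cx - size // 2, 0, w - size))
    return frame[y0:y0 + size, x0:x0 + size]
```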

Table 3 Accuracy obtained by the studied current HAR systems with and without (Baseline) the integration of the proposed method through (8) and (9) (best values in bold)

After identifying (9) as the best choice for producing the VSOD, the results of the proposed method were compared with those obtained by the method proposed in [95], which has also been used to improve the HAR input. According to Table 4, the proposed method yields better results than the method of [95], which shows that the proposed VSOD is better than the one based on the saliency criterion of that method. The increase in accuracy achieved by the studied HAR systems with the proposed method is more evident on the HMDB51 dataset. Two reasons can be given for the lower performance improvement achieved on the UCF101 dataset relative to the HMDB51 dataset. First, HAR methods already show high performance on the UCF101 dataset, which is difficult to improve further. Second, the ease of action detection in the UCF101 dataset decreases the effectiveness of the proposed method. The complexity of the HMDB51 dataset led to a larger performance improvement when the preprocessing step was added. Table 5 shows the 4% improvement achieved by the method suggested in [95] relative to the original method and shows that the proposed method is approximately 1.5 times better than the one proposed in [95].

Table 4 Comparison of the original HAR method (Baseline) and the proposed method, along with the one suggested in [95], in terms of efficiency (best values in bold)
Table 5 Percentage improvement of the proposed method, along with the one suggested in [95], compared to the original HAR method (Baseline) in terms of efficiency (best values in bold)

The total percentage improvement achieved by the proposed method over all studied HAR methods was equal to 97.7%, which is approximately 42.4% more than the 55.3% improvement achieved by the method proposed in [95], confirming the superior efficiency of the proposed method. The main reason for the better performance of the proposed method is its spatial-temporal saliency map: in [95], the saliency map uses only a simple gradient operator, whereas in the proposed method a combination of features is used to produce the saliency map. Finally, the proposed method can be added to current HAR systems that do not select the salient regions of their input in order to improve their accuracy.

4 Conclusion

This study proposed a VSOD method based on a time-spatial saliency map. A weighted nonlinear combination of two saliency maps is used in the proposed method, and a different scheme is used to obtain each map. With an image registration step, the motion of the used camera is detected and the non-overlapping regions of each pair of consecutive video frames are removed. Spatial, color, texture, and shape-based features are used to extract the spatial saliency map. An optical flow algorithm, one of the most common approaches to motion detection, is employed to extract the temporal saliency map after the image registration step. Because the selected optical flow method only detects the boundaries of the moving objects in the video frames, a color-based image segmentation method is also used, and the optical flow energy of each detected segment is taken as the temporal saliency of that segment. To maximize the efficiency of the proposed method, the function that combines the temporal and spatial saliency maps, with seven degrees of freedom, was optimized on two labeled datasets using two different objective functions. In summary, the main contributions of the proposed method are, first, the use of several spatial saliency maps containing spatial, color, edge, and frequency based features to obtain a comprehensive expression of the ROIs of each image frame, together with the integration of the spatial and temporal saliency maps by a nonlinear function, and second, the removal of the motion of the scene and camera to enhance the detection accuracy. The final VSOD method was added to four different current HAR systems as a preprocessing step, and the performance of each HAR system was evaluated before and after using the proposed method. The results showed that the proposed method increased the accuracy of the studied HAR systems. Future research can focus on speeding up the preprocessing step or on proposing a saliency map for grey-scale videos. Another future work could be to compare the impact of state-of-the-art VSOD methods on the accuracy of various HAR systems and to study saliency maps in videos with multiple motions.