Strong target constrained video saliency detection
Aiming at the problem of video saliency detection, a strong target constrained video saliency detection method is proposed in this paper. To detect the salient region quickly and effectively, strong target constraints enforced by location, scale and color models are introduced into video saliency detection. First, a target locating strategy for obtaining the location and scale information is proposed by correcting the video tracking result with the optical flow result and the segmentation result of the last frame. Second, the color model of the target is estimated from the obtained segmentation results. Finally, the strong target constraints are integrated into the saliency model by extending the saliency hypotheses, and a high-quality saliency map is obtained, where segmentation is employed to update the constraint parameters. Specifically, Densecut is initialized by the obtained saliency map to calculate the segmentation result of the last frame. Compared with several state-of-the-art saliency detection methods, the proposed method performs outstandingly, and its results on the DAVIS dataset are significantly improved in terms of accuracy and robustness.
Keywords: Saliency target detection · Target tracking · Video saliency · Video segmentation
1 Introduction

As the most common channel connecting human beings with the outside world, vision supports many image-based tasks, such as face recognition, scene segmentation and target tracking, while receiving a large amount of information [1, 2, 3, 4, 5]. The human visual system focuses on the "saliency" area, so the burden on the brain is greatly reduced because only the information in this small "saliency" area needs to be processed and stored. There is no doubt that this way of focusing on the "saliency" area offers a unique advantage in visual information processing.
In addition to extensive research in neurobiology [6, 7], cognitive psychology [8, 9] and other fields, saliency detection [10, 11, 12] is also of great research value in computer vision. For example, using the salient region as prior information in image segmentation can automatically locate foreground targets and thus turn an interactive segmentation algorithm into an automatic one [13]. Moreover, extracting the salient region reduces the amount of data to be processed, which improves the efficiency of the algorithm. Color contrast is the main feature of a salient region, alongside other features such as texture, shape and spatial location. By comparing these features, an algorithm determines the saliency of each region. For feature contrast, there are two main approaches: local contrast and global contrast.
Saliency detection algorithms based on local contrast rely mainly on comparisons between local regions: when contrasting features, the current region is compared only with some adjacent regions to find the difference. Such algorithms therefore tend to assign high saliency to areas such as the edges of the foreground object and noise. This kind of method has the advantage of high execution efficiency, but owing to the locality of the compared areas it is susceptible to noise and cannot form a salient region with stable connectivity. To detect salient regions more accurately, Liu et al. [14] proposed a series of new features, including multiscale contrast, center-surround histogram and color spatial distribution, and detected salient regions by combining these features efficiently with a conditional random field.
In contrast, global contrast detection algorithms compute the global contrast relationships of the image, whose time complexity is obviously very high. In applications, saliency results are usually used as a preprocessing step, so high time complexity greatly reduces the practicability of an algorithm even if its detection quality is excellent. Cheng et al. proposed a global contrast based saliency detection algorithm that simplifies the color feature: each of the three color channels is quantized from 256 values to 12, and colors with low probability in the image are ignored. With this simplification, the color contrast in the image can be computed quickly, and this global contrast based method has achieved quite excellent results.
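As a hedged illustration of that quantization step, the following C++/OpenCV sketch maps each 8-bit channel to 12 representative values; the bin-center choice is ours, and the pruning of low-probability colors mentioned above is omitted:

```cpp
#include <opencv2/opencv.hpp>

// Quantize each 8-bit channel from 256 levels to `levels` representative
// values, so the image holds at most levels^3 distinct colors before the
// global color contrast is computed.
cv::Mat quantizeColors(const cv::Mat& bgr, int levels = 12)
{
    CV_Assert(bgr.type() == CV_8UC3);
    cv::Mat q = bgr.clone();
    for (int y = 0; y < q.rows; ++y) {
        cv::Vec3b* row = q.ptr<cv::Vec3b>(y);
        for (int x = 0; x < q.cols; ++x) {
            for (int c = 0; c < 3; ++c) {
                int bin = row[x][c] * levels / 256;                  // bin index in [0, levels)
                row[x][c] = static_cast<uchar>(bin * 256 / levels    // bin center (our choice)
                                               + 128 / levels);
            }
        }
    }
    return q;
}
```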
As an extension of image saliency to the video field, video saliency [16, 17] has also received extensive attention. According to the features used, current video saliency detection methods can be divided into three categories: methods based on the spatial domain [18], methods based on the temporal domain [19, 20] and methods based on the spatio-temporal domain [21]. It is generally believed that image saliency represents the most salient region in the current image, which is also the region most likely to be labeled as foreground. For images, the salient region usually satisfies two hypotheses: its color difference from any other region in the image is large, and it lies closer to the image center than the other regions. However, the features of videos are richer than those of images. An image can be regarded as a single frame of a video, and the target in a video usually has motion characteristics. With multiple frames, we can obtain not only the appearance information of the target but also its temporal context between frames.
2 Calculation of location and scale information
2.1 Optical flow information with contour feature
The optical flow algorithm is used to obtain the optical flow field of each frame. According to the optical flow field, two facts are evident in the motion region. On the one hand, the flow of all pixels in the motion region is consistent and the region has an obvious contour, whereas the flow of pixels in non-moving regions is chaotic and shows no visible object contour. On the other hand, the motion direction of the edge pixels in the motion region differs greatly from that of the neighboring non-moving regions. Therefore, the optical flow is pre-processed according to these facts to obtain more effective optical flow regions.
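Since the paper does not name a specific optical flow algorithm, the following minimal sketch uses OpenCV's Farneback dense flow, keeps high-magnitude pixels as the candidate motion region and extracts its contours; the Otsu binarization and morphological cleanup are illustrative choices, not the paper's exact pre-processing:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Build a binary motion mask from dense optical flow and extract the
// contours of the coherent motion region.
cv::Mat motionMask(const cv::Mat& prevGray, const cv::Mat& currGray,
                   std::vector<std::vector<cv::Point>>& contours)
{
    cv::Mat flow;
    cv::calcOpticalFlowFarneback(prevGray, currGray, flow,
                                 0.5, 3, 15, 3, 5, 1.2, 0);

    std::vector<cv::Mat> uv(2);
    cv::split(flow, uv);                         // u = uv[0], v = uv[1]
    cv::Mat mag, ang;
    cv::cartToPolar(uv[0], uv[1], mag, ang);     // per-pixel flow magnitude/direction

    // Binarize the magnitude (Otsu, an illustrative choice) and close small
    // gaps so the motion region has a clear, connected contour.
    cv::Mat mag8u, mask;
    cv::normalize(mag, mag, 0, 255, cv::NORM_MINMAX);
    mag.convertTo(mag8u, CV_8U);
    cv::threshold(mag8u, mask, 0, 255, cv::THRESH_BINARY | cv::THRESH_OTSU);
    cv::morphologyEx(mask, mask, cv::MORPH_CLOSE,
                     cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5)));

    cv::findContours(mask.clone(), contours, cv::RETR_EXTERNAL,
                     cv::CHAIN_APPROX_SIMPLE);   // clone: findContours may modify input
    return mask;
}
```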
2.2 Scale-variable KCF with APCE
2.3 Target correction
3 Estimation of target color model
In addition to the location and scale information of the target, the color model of the target is estimated based on the existing segmentation results. First, an interactive segmentation is performed on the key frame to obtain an accurate target segmentation result. The color model of the target in the key frame largely represents the color model of the target throughout the video. At the same time, the color model of the target in the current frame is closest to that in the last frame. Therefore, the color model of the key-frame foreground object is used as the basic color model, and the color model of the last frame is weighted into it to estimate the color model of the current frame.
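A minimal sketch of this weighted estimate, assuming normalized BGR histograms as the color model and a mixing weight alpha of our own choosing (the paper only states that the two models are weighted):

```cpp
#include <opencv2/opencv.hpp>

// Normalized 3D BGR histogram of the foreground pixels: a simple stand-in
// for the target color model.
cv::Mat fgHistogram(const cv::Mat& bgr, const cv::Mat& fgMask, int bins = 16)
{
    int channels[]  = {0, 1, 2};
    int histSize[]  = {bins, bins, bins};
    float range[]   = {0, 256};
    const float* ranges[] = {range, range, range};
    cv::Mat hist;
    cv::calcHist(&bgr, 1, channels, fgMask, hist, 3, histSize, ranges);
    cv::normalize(hist, hist, 1.0, 0.0, cv::NORM_L1);   // turn counts into probabilities
    return hist;
}

// Current-frame model: the key-frame model as the base, corrected by the
// last frame's model. alpha = 0.7 is a hypothetical weight.
cv::Mat estimateColorModel(const cv::Mat& keyHist, const cv::Mat& lastHist,
                           double alpha = 0.7)
{
    return alpha * keyHist + (1.0 - alpha) * lastHist;
}
```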
4 Computation of strong target constrained video saliency
5 Constrained parameters updating using segmentation
In videos, the extractable feature information is richer than in a single image because of the relationship between the last frame and the current frame. In this paper, the segmentation results of the target are extracted frame by frame, so the segmented target information provides further support for the calculation of video saliency. Because the spatio-temporal context between frames of a video scene is closely related, the segmentation result of the last frame can provide effective information for determining the target location and scale by correcting the tracking result of the current frame. Meanwhile, saliency results lack connectivity and clear boundaries, so the existing segmentation results can be used to calculate the color-model constraints of the target in the current frame.
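As an illustration of this correction, the sketch below fuses three location hypotheses (the tracking box, the bounding box of the optical-flow motion region and the bounding box of the last frame's segmentation) by simple coordinate-wise averaging; the paper's actual rule in Sect. 2.3 may combine them differently:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Bounding box of the non-zero pixels of a binary mask.
cv::Rect maskBox(const cv::Mat& mask)
{
    std::vector<cv::Point> pts;
    cv::findNonZero(mask, pts);
    return pts.empty() ? cv::Rect() : cv::boundingRect(pts);
}

// Correct the tracker's box with the optical-flow and segmentation boxes.
// Plain averaging is a hypothetical fusion rule for illustration only;
// degenerate cases (empty masks, lost track) are not handled here.
cv::Rect correctedBox(const cv::Rect& trackBox,    // scale-variable KCF output
                      const cv::Mat& flowMask,     // motion region (Sect. 2.1)
                      const cv::Mat& lastSegMask)  // last frame's segmentation
{
    cv::Rect flowBox = maskBox(flowMask);
    cv::Rect segBox  = maskBox(lastSegMask);
    auto avg = [](int a, int b, int c) { return (a + b + c) / 3; };
    return cv::Rect(avg(trackBox.x,      flowBox.x,      segBox.x),
                    avg(trackBox.y,      flowBox.y,      segBox.y),
                    avg(trackBox.width,  flowBox.width,  segBox.width),
                    avg(trackBox.height, flowBox.height, segBox.height));
}
```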
Traditional video segmentation methods usually decompose the problem into two parts: extraction of prior information and target segmentation. Common methods compute the prior information from the target color model, contour constraints, motion information or simple saliency, and Graph-cut is usually selected to perform the segmentation. However, fusing only a particular feature or a simple feature cannot quickly yield effective prior information, and building graph models over all pixels of every video frame is inefficient.
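The "prior information + graph cut" pipeline itself is easy to sketch. The example below uses OpenCV's grabCut as a stand-in graph-cut solver (the paper's own pipeline initializes Densecut, which is not an OpenCV function), with a thresholded saliency map seeding the foreground; both thresholds are illustrative:

```cpp
#include <opencv2/opencv.hpp>

// Segment an image given a saliency map used as the prior: salient pixels
// seed the (probable) foreground, everything else starts as probable
// background, and graph cut refines the labeling.
cv::Mat segmentWithSaliencyPrior(const cv::Mat& bgr, const cv::Mat& saliency)
{
    CV_Assert(saliency.type() == CV_8UC1 && saliency.size() == bgr.size());

    cv::Mat mask(bgr.size(), CV_8UC1, cv::Scalar(cv::GC_PR_BGD));
    mask.setTo(cv::GC_PR_FGD, saliency > 128);   // salient: probably foreground
    mask.setTo(cv::GC_FGD,    saliency > 220);   // very salient: hard foreground

    cv::Mat bgModel, fgModel;
    cv::grabCut(bgr, mask, cv::Rect(), bgModel, fgModel,
                5, cv::GC_INIT_WITH_MASK);

    // Collapse the four GrabCut labels into a binary segmentation.
    cv::Mat fg  = (mask == cv::GC_FGD);
    cv::Mat pfg = (mask == cv::GC_PR_FGD);
    return fg | pfg;
}
```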
6 Experimental analysis
6.1 Environment and dataset
For the experimental analysis, Visual Studio 2013 and the OpenCV image library are selected as the development tools, and the experiment program is written in C++. The program runs on an Intel(R) Xeon(R) CPU E5-2699 v3 with 128 GB RAM.
For the dataset, DAVIS [31] is selected as the test case. DAVIS contains 50 test videos covering a variety of challenges for video segmentation, such as occlusion, motion blur and appearance change, and provides manually annotated ground-truth segmentations for all 50 videos. On this dataset, saliency detection experiments are carried out with STCVSD (our method), RC, PISA and CA, and the experimental results are compared and analyzed qualitatively. In addition, the segmentation results produced during the saliency detection process are compared quantitatively with state-of-the-art video segmentation methods, including BVS [34], CVOS [35], FCP [36] and FST [24]. Therefore, DAVIS is chosen as the dataset in the following experiments.
6.2 Results and analysis
6.2.1 Verification of scale-variable KCF with APCE
Experiments are carried out on the original KCF algorithm and the scale-variable KCF with APCE proposed in Sect. 2.2. The improved KCF achieves better tracking results than the original KCF on videos whose target moves from far to near or from near to far. Fig. 3 shows the tracking results of several frames in car-shadow, drift-straight and motocross-bumps. The yellow rectangles are the ground truth, the green rectangles are the tracking results of the original KCF, and the black rectangles are the tracking results of our proposed scale-variable KCF with APCE.
The target in car-shadow (Fig. 3a) moves from near to far. The original KCF cannot reduce the scale of the tracking box in time as the target recedes, so the green boxes are too large in the second and third images of Fig. 3a. Since our improved method increases the possible forms of scale change, the scale of the tracking box is reduced in time and the target is located more accurately. The targets in drift-straight (Fig. 3b) and motocross-bumps (Fig. 3c) move from far to near. With the original KCF, the small tracking box located in the first frame is not adjusted in time, so the target cannot be tracked or only a small part of it is covered, whereas our improved KCF fully envelops the targets (black rectangles) by adjusting height and width separately. The original KCF result (green rectangle) in the third image of Fig. 3b is far from the real target; in our improved KCF, APCE is used to exclude such target-loss situations, so our result (black rectangle) remains accurate. This demonstrates that the proposed scale-variable KCF with APCE has clear advantages in determining the location and scale of the target.
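For reference, APCE is the confidence criterion introduced with the LMCF tracker [29]: the squared peak-to-minimum gap of the response map divided by the mean squared deviation from the minimum. A minimal sketch on a KCF response map, with a reliability threshold of our own choosing, might look like this:

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>

// APCE = |F_max - F_min|^2 / mean((F_{w,h} - F_min)^2). A sharp single peak
// yields a high value; a flat or multi-peak map (likely target loss) a low one.
double apce(const cv::Mat& response)             // CV_32F correlation response map
{
    double fmin, fmax;
    cv::minMaxLoc(response, &fmin, &fmax);
    cv::Mat diff = response - fmin;              // F_{w,h} - F_min, element-wise
    double meanSq = cv::mean(diff.mul(diff))[0];
    return (fmax - fmin) * (fmax - fmin) / std::max(meanSq, 1e-12);
}

// Declare the track lost when APCE falls below a tuned threshold.
// The value 20.0 is hypothetical; the paper does not report its setting.
bool trackingReliable(const cv::Mat& response, double threshold = 20.0)
{
    return apce(response) > threshold;
}
```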
6.2.2 Target location and scale correction
Fig. 4 shows the process and results of the location and scale correction, including the optical flow binary images, the segmentation results of the last frame, the improved KCF tracking boxes and the correction results. For the improved KCF tracking results, the yellow rectangles represent the ground truth and the black rectangles are our tracking results. In the correction process, the red box is the target box obtained directly from the optical flow result, which may differ considerably from the real target. The blue box is the bounding rectangle of the last frame's segmentation result, and the white box is the final result after correction by the method of Sect. 2.3. As the optical flow results of Fig. 4a, c show, the optical flow is easily disturbed by motion interference in the background, so the red box is far larger than the real target. As for the segmentation masks of the last frame in Fig. 4b, c, if the segmentation result of the last frame is incomplete, the blue box cannot locate the target accurately in the current frame, even before the motion offset is considered. According to the improved KCF results of Fig. 4b, c, even the improved tracking algorithm is still not sensitive enough to drastic scale changes, so the black box in Fig. 4b is larger than the target and the black box in Fig. 4c is smaller. Finally, the improved KCF tracking results are corrected by combining the optical flow result and the last frame's segmentation result, and appropriate target location and scale corrections (white boxes) are obtained. The accurate target location and scale not only provide reliable feature information for saliency detection but also greatly shrink the region to be detected: most of the background is eliminated, which further improves the accuracy and efficiency of the proposed video saliency detection algorithm by reducing redundant information.
6.2.3 Video saliency detection with strong target constraints
6.2.3.1 Qualitative analysis
Accurate foreground color models are obtained through the computation of video saliency, so accurate segmentation results can also be acquired. Fig. 5 compares the segmentation results (green lines) produced during saliency detection with the results (blue lines) of global Densecut. According to the segmentation results of the bear sequence (Fig. 5a), for videos with small differences between foreground and background, our method achieves high-quality segmentation and avoids jitter in the segmentation results. Moreover, for video scenes with target occlusion (such as bus in Fig. 5b and lucia in Fig. 5c) and fast motion (such as car-roundabout in Fig. 5d and paragliding-launch in Fig. 5e), global Densecut without saliency maps easily segments background into foreground (blue lines), because it lacks more accurate target constraint information such as location, scale, color model and shape.
6.2.3.2 Quantitative analysis
Because ground truth for video saliency is lacking, direct quantitative analysis of the saliency maps is not feasible. Instead, quantitative results of the segmentations on DAVIS are computed and compared with state-of-the-art video segmentation methods, namely BVS [34], CVOS [35], FCP [36] and FST [24], so that the validity of STCVSD can be verified.
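The region measure behind such comparisons on DAVIS is the intersection over union (Jaccard index) between each produced mask and the ground-truth mask, averaged over frames; a minimal sketch:

```cpp
#include <opencv2/opencv.hpp>

// Intersection over union of two binary CV_8UC1 masks.
double iou(const cv::Mat& segMask, const cv::Mat& gtMask)
{
    cv::Mat inter, uni;
    cv::bitwise_and(segMask, gtMask, inter);
    cv::bitwise_or(segMask, gtMask, uni);
    int unionArea = cv::countNonZero(uni);
    return unionArea > 0
        ? static_cast<double>(cv::countNonZero(inter)) / unionArea
        : 1.0;                                   // both masks empty: perfect match
}
```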
For different test videos, the experiments are carried out with the middle frame and the first frame selected as the key frame, respectively. For most videos, selecting the middle frame as the key frame improves the accuracy of the tracking algorithm and reduces the influence of distant frames on the current frame, yielding a certain improvement in saliency extraction and segmentation.
Table 1 Averages of quantitative comparisons (the middle frame or the first frame as the key frame)
7 Conclusion

In this paper, a video saliency detection method based on strong target constraints is proposed by fusing location, scale and color information. The traditional optical flow algorithm is used to extract contour features; the KCF tracker is improved to be scale-variable, with APCE employed to enhance accuracy; and these improved algorithms correct the location and scale of the target with the help of previous segmentation results. The color model is calculated from the segmentation results of the key frame and the last frame. Finally, the location and scale information and the color model are fused to constrain the saliency calculation. Experimental results show that the proposed method STCVSD effectively extracts the real salient region in video sequences. Qualitatively, our method yields better connectivity than other saliency detection methods, and the intermediate segmentation results are superior to those of Densecut without saliency. Quantitatively, the average IoU of our segmentation results on the DAVIS dataset is higher than that of other state-of-the-art video segmentation methods, including a deep learning based method. These results verify the effectiveness of the proposed STCVSD method.
Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant Nos. 61703317 and 61105006; the Shenzhen Strategic Emerging Industry Development Special Fund under Grant No. JCYJ20170307172130906; the Aerospace Science and Technology Foundation under Grant No. 2018-HT-HZ; the Open Fund of Key Laboratory of Image Processing and Intelligent Control (Huazhong University of Science and Technology), Ministry of Education under Grant No. IPIC2019-01; the Fundamental Research Funds for the Central Universities under Grant Nos. WUT:2018IVB072, 2018IVA110.
Compliance with ethical standards
Conflict of interest
The authors declare that they have no conflict of interest.
References

- 1. Latif A, Rasheed A, Sajid U et al (2019) Content-based image retrieval and feature extraction: a comprehensive review. Math Probl Eng 2019, Article ID 9658350
- 3. Ratyal N, Taj IA, Sajid M et al (2019) Deeply learned pose invariant image analysis with applications in 3D face recognition. Math Probl Eng 2019, Article ID 3547416
- 4. Sajid M, Iqbal Ratyal N, Ali N et al (2019) The impact of asymmetric left and asymmetric right face images on accurate age estimation. Math Probl Eng 2019, Article ID 8041413
- 5. Sajid M, Ali N, Dar SH et al (2018) Data augmentation-assisted makeup-invariant face recognition. Math Probl Eng 2018, Article ID 2850632
- 7. Ayoub N, Gao Z, Chen D, Tobji R, Yao N (2018) Visual saliency detection based on color frequency features under Bayesian framework. KSII Trans Internet Inf Syst 12:676
- 9. Teuber HL (1965) Physiological psychology. McGraw-Hill, New York
- 10. Wang X, Zhong Y, Xu Y, Zhang L, Xu Y (2017) Saliency-based endmember detection for hyperspectral imagery. In: 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), p 984
- 11. Yamazaki T, Hasebe N, Shimizu S (2017) Considerations about saliency map from wide angle fovea image. In: 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE), p 1330
- 12. Zhang J, Li B, Dai Y, Porikli F, He M (2018) Integrated deep and shallow networks for salient object detection. In: IEEE International Conference on Image Processing, p 1537
- 13. Fu Y, Cheng J, Li Z et al (2008) Saliency cuts: an automatic approach to object segmentation. In: International Conference on Pattern Recognition, pp 1–4
- 14. Liu T, Sun J, Zheng NN, Tang X, Shum HY (2007) Learning to detect a salient object. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, p 1
- 16. Du B, Ma L, Zhuang Y, Chen H, Soomro NQ (2017) Moving target detection via hierarchical spatiotemporal saliency analysis. In: 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), p 1840
- 17. Jian M, Qi Q, Dong J, Sun X, Sun Y, Lam KM (2016) Saliency detection using quaternionic distance based Weber descriptor and object cues. In: 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), p 1
- 18. Achanta R, Hemami S, Estrada F, Susstrunk S (2009) Frequency-tuned salient region detection. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, p 1597
- 19. Li S, Lee MC (2007) Fast visual tracking using motion saliency in video. In: 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), p 1073
- 20. Kulshreshtha A, Deshpande AV, Meher SK (2013) Time-frequency-tuned salient region detection and segmentation. In: 2013 3rd IEEE International Advance Computing Conference (IACC), p 1080
- 21. Xue K, Wang X, Ma G, Wang H, Nam D (2015) A video saliency detection method based on spatial and motion information. In: 2015 IEEE International Conference on Image Processing (ICIP), p 412
- 22. Borji A, Cheng MM, Hou Q, Jiang H, Li J (2017) Salient object detection: a survey. arXiv preprint
- 24. Papazoglou A, Ferrari V (2013) Fast object segmentation in unconstrained video. In: 2013 IEEE International Conference on Computer Vision, p 1777
- 25. Wu YQ, Wu WY, Pan Z (2009) A fast iterative algorithm of the Otsu threshold based on two-dimensional histogram oblique segmentation. J Eng Graph 30:89
- 26. Delaye A, Anquetil E (2010) Learning spatial relationships in hand-drawn patterns using fuzzy mathematical morphology. In: 2010 International Conference of Soft Computing and Pattern Recognition, p 162
- 27. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, p 511
- 29. Wang M, Liu Y, Huang Z (2017) Large margin object tracking with circulant feature maps. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 30. Tsai YH, Yang MH, Black MJ (2016) Video segmentation via object flow. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3899–3908
- 31. Perazzi F, Pont-Tuset J, McWilliams B, Gool LV, Gross M, Sorkine-Hornung A (2016) A benchmark dataset and evaluation methodology for video object segmentation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p 724
- 34. Märki N, Perazzi F, Wang O, Sorkine-Hornung A (2016) Bilateral space video segmentation. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p 743
- 35. Taylor B, Karasev V, Soatto S (2015) Causal video object segmentation from persistence of occlusions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), p 4268
- 36. Perazzi F, Wang O, Gross M, Sorkine-Hornung A (2015) Fully connected object proposals for video segmentation. In: 2015 IEEE International Conference on Computer Vision (ICCV), p 3227
- 37. Valipour S, Siam M, Jagersand M, Ray N (2017) Recurrent fully convolutional networks for video segmentation. In: Applications of Computer Vision, p 29