Abstract
Object tracking is a well-studied problem in computer vision, while identifying salient spots of objects in a video is a less explored direction in the literature. Video eye-gaze estimation methods aim to tackle a related task, but the salient spots in those methods are not bounded by objects and tend to produce very scattered, unstable predictions due to noisy ground-truth data. We reformulate the problem of detecting and tracking salient object spots as a new task called object hotspot tracking. In this paper, we propose to tackle this task jointly with unsupervised video object segmentation, in real time, within a unified framework that exploits the synergy between the two. Specifically, we propose a Weighted Correlation Siamese Network (WCS-Net), which employs a Weighted Correlation Block (WCB) to encode the pixel-wise correspondence between a template frame and the search frame. In addition, the WCB takes the initial mask/hotspot as guidance to enhance the influence of salient regions for robust tracking. Our system can operate online during inference and jointly produces the object mask and hotspot tracklets at 33 FPS. Experimental results validate the effectiveness of our network design and show the benefits of jointly solving the hotspot tracking and object segmentation problems. In particular, our method performs favorably against state-of-the-art video eye-gaze models in object hotspot tracking, and outperforms existing methods on three benchmark datasets for unsupervised video object segmentation.
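The abstract only names the Weighted Correlation Block; the paper's actual architecture is not reproduced here. As a rough, hypothetical sketch of the underlying idea — correlating template-frame features against search-frame features, with each template pixel's contribution weighted by the initial mask/hotspot — one might write (all shapes and names are our assumptions, not the authors' implementation):

```python
import numpy as np

def weighted_correlation(template_feat, search_feat, template_mask):
    """Illustrative weighted correlation between two frames' features.

    template_feat: (C, Ht, Wt) features of the template frame
    search_feat:   (C, Hs, Ws) features of the search frame
    template_mask: (Ht, Wt) soft mask/hotspot weights in [0, 1]

    Returns a (Ht*Wt, Hs, Ws) correlation volume in which each
    template pixel's similarity map over the search frame is scaled
    by that pixel's mask weight, so salient regions dominate.
    """
    C, Ht, Wt = template_feat.shape
    _, Hs, Ws = search_feat.shape
    t = template_feat.reshape(C, Ht * Wt)   # (C, Nt) template pixels as columns
    s = search_feat.reshape(C, Hs * Ws)     # (C, Ns) search pixels as columns
    corr = t.T @ s                          # (Nt, Ns) dot-product similarities
    w = template_mask.reshape(Ht * Wt, 1)   # per-template-pixel saliency weight
    return (w * corr).reshape(Ht * Wt, Hs, Ws)
```

With a weight of 1 on a salient template pixel and 0 elsewhere, only that pixel's similarity map survives — a toy version of using the mask/hotspot to "enhance the influence of salient regions."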
Notes
- 1. Project Page: https://github.com/luzhangada/code-for-WCS-Net.
Acknowledgements
This work is supported in part by the National Key R&D Program of China under Grant No. 2018AAA0102001; the National Natural Science Foundation of China under Grant Nos. 61725202, U1903215, 61829102, 91538201, 61771088 and 61751212; the Fundamental Research Funds for the Central Universities under Grant No. DUT19GJ201; and the Dalian Innovation Leader's Support Plan under Grant No. 2018RD07.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, L., Zhang, J., Lin, Z., Měch, R., Lu, H., He, Y. (2020). Unsupervised Video Object Segmentation with Joint Hotspot Tracking. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, JM. (eds) Computer Vision – ECCV 2020. ECCV 2020. Lecture Notes in Computer Science(), vol 12359. Springer, Cham. https://doi.org/10.1007/978-3-030-58568-6_29
DOI: https://doi.org/10.1007/978-3-030-58568-6_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58567-9
Online ISBN: 978-3-030-58568-6