Video object segmentation based on temporal frame context information fusion and feature enhancement

Hou, Zhiqiang; Li, Fucheng; Wang, Shuiyuan; Dai, Nan; Ma, Sugang; Fan, Jiulun

doi:10.1007/s10489-022-03693-z

Published: 09 July 2022

Volume 53, pages 6496–6510, (2023)

Applied Intelligence

Zhiqiang Hou^1,2,
Fucheng Li^1,2,
Shuiyuan Wang^1,2,
Nan Dai^1,2,
Sugang Ma^1,2 &
…
Jiulun Fan³

Abstract

At present, many video object segmentation algorithms use only a small amount of frame information to guide the segmentation of the current frame and fail to fully exploit the information in historical frames. This makes it difficult for the network model to adapt to complex environmental changes and causes object drift. At the same time, their mask refinement methods are coarse, resulting in blurred edges in the generated masks. To solve these problems, this paper proposes a video object segmentation algorithm based on temporal frame context information fusion and feature enhancement. First, to make full use of historical frame information, a temporal frame residual fusion module is proposed to adaptively fuse historical frame information. Second, a spatial cascade mask refinement module is established to enhance the spatial information of the shallow features of the backbone network and refine the edge information of the fused features. Experimental results show that our algorithm achieves J&F scores of 87.4% and 76.6% on DAVIS2016 and DAVIS2017, respectively, and that its segmentation speed meets real-time requirements, reaching 26 FPS on the DAVIS2016 validation set. Compared with many mainstream algorithms of recent years, it has clear performance advantages.
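
For readers unfamiliar with the J&F score reported above, the following is a minimal sketch of how it is commonly formed on the DAVIS benchmarks: J is the region similarity (intersection over union between predicted and ground-truth masks), F is a boundary accuracy measure, and J&F is their mean. The function names, the toy masks, and the F value of 0.85 are illustrative assumptions and are not taken from the paper; the boundary measure F itself is not implemented here.

```python
def region_similarity(pred, gt):
    """Jaccard index J (intersection over union) of two binary masks.

    Masks are equally sized nested lists of 0/1 values.
    """
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_mean, f_mean):
    """Combine mean region similarity J and mean boundary accuracy F."""
    return (j_mean + f_mean) / 2.0


# Toy 3x3 masks (hypothetical data): 3 shared pixels, 4 pixels in the union.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
gt = [[0, 1, 1],
      [0, 1, 0],
      [0, 0, 0]]

j = region_similarity(pred, gt)
# The F value below is a placeholder, not a computed boundary score.
score = j_and_f(j, 0.85)
print(f"J = {j:.2f}, J&F = {score:.2f}")
```

In an actual evaluation, J and F are averaged over all objects and frames of the validation set before being combined, so the per-mask call above only illustrates the arithmetic.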

Data Availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgements

This work is supported by the National Natural Science Foundation of China under grant no. 62072370.

Author information

Authors and Affiliations

School of Computer Science & Technology, Xi’an University of Posts & Telecommunications, Xi’an, 710121, China
Zhiqiang Hou, Fucheng Li, Shuiyuan Wang, Nan Dai & Sugang Ma
Key Laboratory of Network Data Analysis and Intelligent Processing of Shaanxi Province, Xi’an, 710121, China
Zhiqiang Hou, Fucheng Li, Shuiyuan Wang, Nan Dai & Sugang Ma
School of Communications and Information Engineering, Xi’an University of Posts & Telecommunications, Xi’an, 710121, China
Jiulun Fan

Corresponding author

Correspondence to Fucheng Li.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Hou, Z., Li, F., Wang, S. et al. Video object segmentation based on temporal frame context information fusion and feature enhancement. Appl Intell 53, 6496–6510 (2023). https://doi.org/10.1007/s10489-022-03693-z

Accepted: 29 April 2022
Published: 09 July 2022
Issue Date: March 2023
DOI: https://doi.org/10.1007/s10489-022-03693-z
