Abstract
Previous video matting methods suffer from two problems: they require additional auxiliary information, and they lack temporal consistency. To address these problems, we propose STMI-Net, a novel video matting framework based on temporal-spatial information mining and aggregation. The framework requires no auxiliary information and adopts a dual-decoder structure: one decoder is a recurrent network that fully exploits the temporal information across video frames to ensure temporal coherence in the results, while the other is a convolutional network that deeply restores frame-by-frame spatial features to achieve spatial continuity. By aggregating these two streams of information at the global level, our model achieves 0.0066 MSE on the VideoMatte240K dataset, surpassing the RVM baseline by 13%, and 0.0047 MSE on the PPM-100 portrait matting dataset, surpassing the MG baseline by 26.5%. We also conduct an ablation study to demonstrate the respective contributions of the temporal decoder and the spatial decoder.
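For context on the MSE figures quoted above, alpha-matte MSE is the per-pixel squared error between the predicted and ground-truth alpha values, averaged over the frame. A minimal sketch in plain Python (the function name and toy values are illustrative, not taken from the paper):

```python
def matte_mse(pred, gt):
    """Mean squared error between a predicted alpha matte and ground truth.

    pred, gt: flat sequences of per-pixel alpha values in [0, 1].
    """
    if len(pred) != len(gt):
        raise ValueError("mattes must have the same number of pixels")
    # Average of squared per-pixel differences.
    return sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)

# Toy example with four pixels: a perfect prediction gives MSE 0.0.
print(matte_mse([0.0, 0.5, 1.0, 0.2], [0.0, 0.5, 1.0, 0.2]))  # -> 0.0
```

In practice the metric is computed over full-resolution mattes (and often averaged across all frames of a clip), but the per-pixel formula is the same.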
Data availability
The data supporting the findings of this study are available from the cited references (cited in the introduction of each dataset, with the permission of the relevant authors), but restrictions apply: these data were used under license for the current study and are not publicly available. Data are, however, available from the authors upon reasonable request and with the permission of the authors of the relevant references.
Funding
This work is supported by the Youth Innovation Talent Support Program of Harbin University of Commerce (No. 2020CX39).
Ethics declarations
Consent to participate
All authors have been involved in this work.
Consent for publication
All authors approved the manuscript and agreed to its submission.
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ma, Z., Yao, G. Temporal-spatial information mining and aggregation for video matting. Multimed Tools Appl 83, 29221–29237 (2024). https://doi.org/10.1007/s11042-023-16747-2