
Temporal-spatial information mining and aggregation for video matting


Abstract

Previous video matting methods suffer from two problems: they require additional auxiliary inputs, and their results lack temporal consistency. To address these problems, we propose a novel video matting framework, STMI-Net, based on temporal-spatial information mining and aggregation. The framework requires no auxiliary information and adopts a dual-decoder network structure. One decoder is a recurrent network that exploits the temporal information across video frames to ensure temporal coherence in the results; the other is a convolutional network that deeply restores frame-by-frame spatial features to achieve spatial continuity. By aggregating these two streams of information at the global level, our model achieves 0.0066 MSE on the VideoMatte240K dataset, surpassing the RVM baseline by 13%, and 0.0047 MSE on the PPM-100 portrait matting dataset, surpassing the MG baseline by 26.5%. We also conduct an ablation study to demonstrate the specific contributions of the temporal decoder and the spatial decoder in our model.
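To make the dual-decoder idea concrete, here is a minimal PyTorch sketch of such an architecture. It illustrates only the structure described in the abstract, not the authors' actual STMI-Net implementation: the ConvGRU-style recurrent cell, the single-convolution stand-in encoder, the channel width, and the 1x1 fusion head are all our assumptions. Only the overall layout (a recurrent temporal decoder plus a per-frame convolutional spatial decoder, aggregated into an alpha matte) follows the abstract.

```python
import torch
import torch.nn as nn

class TemporalDecoder(nn.Module):
    """Recurrent branch: a ConvGRU-style cell carries a hidden state across
    frames so predictions stay temporally coherent (assumed design)."""
    def __init__(self, ch: int):
        super().__init__()
        self.gates = nn.Conv2d(ch * 2, ch * 2, 3, padding=1)  # update + reset gates
        self.cand = nn.Conv2d(ch * 2, ch, 3, padding=1)       # candidate state

    def forward(self, feat, state):
        if state is None:
            state = torch.zeros_like(feat)
        z, r = torch.sigmoid(self.gates(torch.cat([feat, state], 1))).chunk(2, 1)
        h = torch.tanh(self.cand(torch.cat([feat, r * state], 1)))
        return (1 - z) * state + z * h

class SpatialDecoder(nn.Module):
    """Convolutional branch: refines each frame independently to recover
    per-frame spatial detail."""
    def __init__(self, ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, feat):
        return self.net(feat)

class DualDecoderMatting(nn.Module):
    """Shared encoder -> temporal + spatial decoders -> global aggregation."""
    def __init__(self, ch: int = 32):
        super().__init__()
        self.encoder = nn.Conv2d(3, ch, 3, padding=1)  # stand-in for a real backbone
        self.temporal = TemporalDecoder(ch)
        self.spatial = SpatialDecoder(ch)
        self.fuse = nn.Conv2d(ch * 2, 1, 1)            # aggregation head

    def forward(self, video):                          # video: (B, T, 3, H, W)
        state, alphas = None, []
        for t in range(video.shape[1]):
            feat = self.encoder(video[:, t])
            state = self.temporal(feat, state)         # temporal information
            spat = self.spatial(feat)                  # spatial information
            alphas.append(torch.sigmoid(self.fuse(torch.cat([state, spat], 1))))
        return torch.stack(alphas, dim=1)              # (B, T, 1, H, W)

# Usage: predict alpha mattes for a 4-frame RGB clip.
model = DualDecoderMatting()
alpha = model(torch.rand(1, 4, 3, 64, 64))             # -> (1, 4, 1, 64, 64)
```

The reported metric is then the mean squared error between predicted and ground-truth mattes, e.g. `mse = ((alpha - alpha_gt) ** 2).mean()` averaged over all frames.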


Data availability

The data that support the findings of this study are available from the cited references (the relevant references are cited where the datasets are introduced, and permission has been obtained from their authors), but restrictions apply to the availability of these data, which were used under license for the current study and so are not publicly available. The data are, however, available from the authors upon reasonable request and with the permission of the authors of the relevant references.

References

  1. Wang J, Cohen MF et al (2008) Image and video matting: a survey. Foundations and Trends® in Computer Graphics and Vision 3(2):97–175

  2. Mahmoud M, Baltrušaitis T, Robinson P, Riek L (2011) 3D corpus of spontaneous complex mental states. In: Conference on Affective Computing and Intelligent Interaction (ACII 2011). Lecture Notes in Computer Science, vol 6974

  3. Ke Z, Li K, Zhou Y et al (2020) Is a green screen really necessary for real-time portrait matting? IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:2011.11961

  4. Lin S, Yang L, Saleemi I et al (2022) Robust high-resolution video matting with temporal guidance. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. arXiv:2108.11515

  5. Seong H, Oh SW, Price B, Kim E, Lee JY (2022) One-trimap video matting. In: ECCV. https://doi.org/10.1007/978-3-031-19818-2_25

  6. Chen LC, Papandreou G, Schroff F et al (2017) Rethinking atrous convolution for semantic image segmentation. IEEE Conference on Computer Vision and Pattern Recognition. arXiv:1706.05587

  7. Howard A, Sandler M, Chu G et al (2019) Searching for MobileNetV3. IEEE/CVF international conference on computer vision (ICCV). https://doi.org/10.48550/arXiv.1905.02244

  8. Liu Y, Li Q, Yuan Y, Du Q, Wang Q (2021) ABNet: adaptive balanced network for multi-scale object detection in remote sensing imagery. IEEE Trans Geosci Remote Sens 60:1–14

  9. Wang Q, Liu Y, Xiong Z, Yuan Y (2022) Hybrid feature aligned network for salient object detection in optical remote sensing imagery. IEEE Trans Geosci Remote Sens 60:1–15

  10. Lu X, Wang W, Ma C, Shen J, Shao L, Porikli F (2019) See more, know more: unsupervised video object segmentation with co-attention Siamese networks. In: CVPR. arXiv:2001.06810

  11. Ge W, Lu X, Shen J (2021) Video object segmentation using global and instance embedding learning. In: Computer Vision and Pattern Recognition. IEEE, pp 16831–16840. https://doi.org/10.1109/CVPR46437.2021.01656

  12. Lu X, Wang W, Shen J et al (2021) Segmenting objects from relational visual data. IEEE Trans Pattern Anal Mach Intell 44:7885–7897

  13. Wang W, Lu X, Shen J et al (2020) Zero-shot video object segmentation via attentive graph neural networks. In: International Conference on Computer Vision. IEEE, pp 9235–9244. https://doi.org/10.1109/ICCV.2019.00933

  14. Lu X, Wang W, Shen J et al (2020) Zero-shot video object segmentation with co-attention Siamese networks. IEEE Trans Pattern Anal Mach Intell 44(4):2228–2242. https://doi.org/10.1109/TPAMI.2020.3040258

  15. Wang J, Cohen M (2007) Optimized color sampling for robust matting. In: IEEE Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, pp 1–8. https://doi.org/10.1109/CVPR.2007.383006

  16. Gastal E, Oliveira M (2010) Shared sampling for real-time alpha matting. Comput Graph Forum 29(2):575–584 (Proceedings of Eurographics)

  17. He K, Rhemann C, Rother C et al (2011) A global sampling method for alpha matting. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2049–2056. https://doi.org/10.1109/CVPR.2011.5995495

  18. Sun J, Jia J, Tang C et al (2004) Poisson matting. ACM Trans Graph 23(3):315–321. https://doi.org/10.1145/1015706.1015721

  19. Levin A, Lischinski D, Weiss Y (2006) A closed form solution to natural image matting. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, pp 61–68. https://doi.org/10.1109/CVPR.2006.18

  20. Chen Q, Li D, Tang C (2013) KNN matting. IEEE Trans Pattern Anal Mach Intell 35(9):2175–2188. https://doi.org/10.1109/TPAMI.2013.18

  21. Xu N, Price B, Cohen S, Huang T (2017) Deep image matting. In: IEEE Conference on Computer Vision and Pattern Recognition. arXiv:1703.03872

  22. Lutz S, Amplianitis K, Smolic A (2018) AlphaGAN: generative adversarial networks for natural image matting. In: British Machine Vision Conference. arXiv:1807.10088

  23. Chen Q, Ge T, Xu Y, Zhang Z, Yang X, Gai K (2018) Semantic human matting. In: ACM Multimedia. arXiv:1809.01354

  24. Sengupta S, Jayaram V, Curless B et al (2020) Background matting: the world is your green screen. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 2288–2297. https://doi.org/10.1109/CVPR42600.2020.00236

  25. Lin S, Ryabtsev A, Sengupta S et al (2020) Real-time high-resolution background matting. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 8758–8767. https://doi.org/10.1109/CVPR46437.2021.00865

  26. Sun Y, Wang G, Gu Q et al (2021) Deep video matting via spatio-temporal alignment and aggregation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6971–6980. https://doi.org/10.1109/CVPR46437.2021.00690

  27. Shi X, Chen Z, Wang H et al (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems (NIPS). MIT Press, pp 802–810

  28. Dai J, Qi H, Xiong Y et al (2017) Deformable convolutional networks. In: IEEE International Conference on Computer Vision (ICCV), pp 764–773. https://doi.org/10.1109/ICCV.2017.89

  29. Chao P, Zhang X, Gang Y et al (2017) Large Kernel matters — improve semantic segmentation by global convolutional network. IEEE conference on computer vision and pattern recognition (CVPR), pp 1743–1751. https://doi.org/10.1109/CVPR.2017.189

  30. Erofeev M, Gitman Y, Vatolin D, Fedorov A, Wang J (2015) Perceptually motivated benchmark for video matting. In: BMVC. https://doi.org/10.5244/C.29.99

  31. Wang T et al (2021) Video matting via consistency-regularized graph neural networks. IEEE/CVF International Conference on Computer Vision (ICCV), pp 4882–4891. https://doi.org/10.1109/ICCV48922.2021.00486

  32. Yu Q, Zhang J, Zhang H et al (2020) Mask guided matting via progressive refinement network. Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, pp 1154–1163. https://doi.org/10.1109/CVPR46437.2021.00121

Funding

This work was supported by the Youth Innovation Talent Support Program of Harbin University of Commerce (No. 2020CX39).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Zhiwei Ma.

Ethics declarations

Consent to participate

We confirm that all authors have been involved in this work.

Consent for publication

We confirm that all authors approved the manuscript and agreed to its submission.

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ma, Z., Yao, G. Temporal-spatial information mining and aggregation for video matting. Multimed Tools Appl 83, 29221–29237 (2024). https://doi.org/10.1007/s11042-023-16747-2
