Instance as Identity: A Generic Online Paradigm for Video Instance Segmentation

Zhu, Feng; Yang, Zongxin; Yu, Xin; Yang, Yi; Wei, Yunchao

doi:10.1007/978-3-031-19818-2_30

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13689))

Included in the following conference series:

European Conference on Computer Vision

1843 Accesses
3 Citations

Abstract

Modeling temporal information for both detection and tracking in a unified framework has been proved a promising solution to video instance segmentation (VIS). However, how to effectively incorporate the temporal information into an online model remains an open problem. In this work, we propose a new online VIS paradigm named Instance As Identity (IAI), which models temporal information for both detection and tracking in an efficient way. In detail, IAI employs a novel identification module to predict identification number for tracking instances explicitly. For passing temporal information cross frame, IAI utilizes an association module which combines current features and past embeddings. Notably, IAI can be integrated with different image models. We conduct extensive experiments on three VIS benchmarks. IAI outperforms all the online competitors on YouTube-VIS-2019 (ResNet-101 41.9 mAP) and YouTube-VIS-2021 (ResNet-50 37.7 mAP). Surprisingly, on the more challenging OVIS, IAI achieves SOTA performance (20.3 mAP). Code is available at https://github.com/zfonemore/IAI.

F. Zhu—Work done during an internship at Baidu.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Athar, A., Mahadevan, S., Os̆ep, A., Leal-Taixé, L., Leibe, B.: STEm-Seg: Spatio-temporal embeddings for instance segmentation in videos. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 158–177. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8_10
Chapter Google Scholar
Bertasius, G., Torresani, L.: Classifying, segmenting, and tracking object instances in video with mask propagation. In: CVPR (2020)
Google Scholar
Bertasius, G., Torresani, L., Shi, J.: Object detection in video with spatiotemporal sampling networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11216, pp. 342–357. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01258-8_21
Chapter Google Scholar
Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT: real-time instance segmentation. In: ICCV (2019)
Google Scholar
Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT++: better real-time instance segmentation. TPAMI (2020)
Google Scholar
Cao, J., Anwer, R.M., Cholakkal, H., Khan, F.S., Pang, Y., Shao, L.: SipMask: spatial information preservation for fast image and video instance segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 1–18. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_1
Chapter Google Scholar
Cao, J., Anwer, R.M., Cholakkal, H., Khan, F.S., Pang, Y., Shao, L.: SipMask: spatial information preservation for fast image and video instance segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 1–18. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_1
Chapter Google Scholar
Chen, H., Sun, K., Tian, Z., Shen, C., Huang, Y., Yan, Y.: BlendMask: top-down meets bottom-up for instance segmentation. In: CVPR (2020)
Google Scholar
Chen, K., et al.: Hybrid task cascade for instance segmentation. In: CVPR (2019)
Google Scholar
Chen, K., et al.: MMDetection: Open MMLAB detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
Chen, Y., Pont-Tuset, J., Montes, A., Van Gool, L.: Blazingly fast video object segmentation with pixel-wise metric learning. In: CVPR (2018)
Google Scholar
Cheng, T., Wang, X., Huang, L., Liu, W.: Boundary-preserving mask R-CNN. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 660–676. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_39
Chapter Google Scholar
Dai, J., et al.: Deformable convolutional networks. In: ICCV (2017)
Google Scholar
Fang, Y., et al.: Instances as queries. In: ICCV (2021)
Google Scholar
Fu, Y., Yang, L., Liu, D., Huang, T.S., Shi, H.: CompFeat: comprehensive feature aggregation for video instance segmentation. In: AAAI (2021)
Google Scholar
He, K., Gkioxari, G., Dollar, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
Google Scholar
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Google Scholar
Hwang, S., Heo, M., Oh, S.W., Kim, S.J.: Video instance segmentation using inter-frame communication transformers. arXiv preprint arXiv:2106.03299 (2021)
Kuhn, H.W.: The Hungarian method for the assignment problem. Naval Res. Logist. Q. 2(1–2), 83–97 (1955)
Article MathSciNet Google Scholar
Kuznetsova, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. IJCV. 128, 1956–1981 (2020)
Article Google Scholar
Li, M., Li, S., Li, L., Zhang, L.: Spatial feature calibration and temporal fusion for effective one-stage video instance segmentation. In: CVPR (2021)
Google Scholar
Lin, H., Wu, R., Liu, S., Lu, J., Jia, J.: Video instance segmentation with a propose-reduce paradigm (2021)
Google Scholar
Lin, J., Gan, C., Han, S.: TSM: temporal shift module for efficient video understanding. In: ICCV (2019)
Google Scholar
Lin, T.Y., Dollar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017)
Google Scholar
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: ICCV (2017)
Google Scholar
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Chapter Google Scholar
Liu, D., Cui, Y., Tan, W., Chen, Y.: Sg-net: Spatial granularity network for one-stage video instance segmentation. In: CVPR (2021)
Google Scholar
Oh, S.W., Lee, J.Y., Xu, N., Kim, S.J.: Video object segmentation using space-time memory networks. In: ICCV (2019)
Google Scholar
Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., Sorkine-Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. In: CVPR (2016)
Google Scholar
Pont-Tuset, J., et al.: The 2017 Davis challenge on video object segmentation. arXiv:1704.00675 (2017)
Qi, J., et al.: Occluded video instance segmentation. arXiv preprint arXiv:2102.01558 (2021)
Tian, Z., Shen, C., Chen, H.: Conditional convolutions for instance segmentation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12346, pp. 282–298. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58452-8_17
Chapter Google Scholar
Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: ICCV (2019)
Google Scholar
Voigtlaender, P., Chai, Y., Schroff, F., Adam, H., Leibe, B., Chen, L.C.: FEELVOS: fast end-to-end embedding learning for video object segmentation. In: CVPR (2019)
Google Scholar
Wang, Y., et al.: End-to-end video instance segmentation with transformers. In: CVPR (2021)
Google Scholar
Xu, N., et al.: YouTube-vis dataset 2021 version (2021). https://youtube-vos.org/dataset/vis
Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: ICCV (2019)
Google Scholar
Yang, L., Wang, Y., Xiong, X., Yang, J., Katsaggelos, A.K.: Efficient video object segmentation via network modulation. In: CVPR (2018)
Google Scholar
Yang, S., et al.: Crossover learning for fast online video instance segmentation. In: ICCV (2021)
Google Scholar
Yang, S., et al.: Tracking instances as queries. arXiv preprint arXiv:2106.11963 (2021)
Yang, Z., Wei, Y., Yang, Y.: Collaborative video object segmentation by foreground-background integration. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 332–348. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7_20
Chapter Google Scholar
Yang, Z., Wei, Y., Yang, Y.: Associating objects with transformers for video object segmentation. In: NeurIPS (2021)
Google Scholar
Yang, Z., Wei, Y., Yang, Y.: Collaborative video object segmentation by multi-scale foreground-background integration. TPAMI (2021)
Google Scholar
Zhang, R., Tian, Z., Shen, C., You, M., Yan, Y.: Mask encoding for single shot instance segmentation. In: CVPR (2020)
Google Scholar

Download references

Acknowledgment

This work was supported in part by the National NSF of China (No. 62120106009), the Fundamental Research Funds for the Central Universities (No. K22RC00010).

Author information

Authors and Affiliations

Baidu Research, Beijing, China
Feng Zhu
ReLER, Centre for Artificial Intelligence, University of Technology Sydney, Sydney, Australia
Feng Zhu
Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney, Australia
Feng Zhu & Xin Yu
CCAI, College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Zongxin Yang & Yi Yang
Institute of Information Science, Beijing Jiaotong University, Beijing, China
Yunchao Wei
Beijing Key Laboratory of Advanced Information Science and Network, Beijing, China
Yunchao Wei

Authors

Feng Zhu
View author publications
You can also search for this author in PubMed Google Scholar
Zongxin Yang
View author publications
You can also search for this author in PubMed Google Scholar
Xin Yu
View author publications
You can also search for this author in PubMed Google Scholar
Yi Yang
View author publications
You can also search for this author in PubMed Google Scholar
Yunchao Wei
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Yunchao Wei .

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel
Shai Avidan
University College London, London, UK
Gabriel Brostow
Google AI, Accra, Ghana
Moustapha Cissé
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
Tal Hassner

1 Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 17973 KB)

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Zhu, F., Yang, Z., Yu, X., Yang, Y., Wei, Y. (2022). Instance as Identity: A Generic Online Paradigm for Video Instance Segmentation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13689. Springer, Cham. https://doi.org/10.1007/978-3-031-19818-2_30

Download citation

DOI: https://doi.org/10.1007/978-3-031-19818-2_30
Published: 22 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19817-5
Online ISBN: 978-3-031-19818-2
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Instance as Identity: A Generic Online Paradigm for Video Instance Segmentation