Abstract
State-of-the-art transformer-based video instance segmentation (VIS) frameworks typically rely on attention-based encoders to compute multi-scale spatio-temporal features that capture target appearance deformations. However, this attention computation is expensive and hampers inference speed. In this work, we introduce a VIS framework that employs a light-weight recurrent-CNN encoder, trained to mimic the multi-scale spatio-temporal features of a standard attention encoder through knowledge distillation. The light-weight recurrent encoder effectively learns multi-scale spatio-temporal features, reducing over-fitting and increasing inference speed while improving VIS performance. Extensive experiments on the popular YouTube-VIS 2019 benchmark demonstrate the merits of the proposed framework over the baseline. Compared to the recent SeqFormer, our Recurrent SeqFormer doubles the inference speed while improving the overall average precision from 45.1% to 45.8%. Our code and models are available at https://github.com/OmkarThawakar/Recurrent-Seqformer.
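To make the idea concrete, the following is a minimal PyTorch sketch of feature-level knowledge distillation from a frozen attention-based teacher encoder to a light-weight recurrent-CNN student encoder over multi-scale, per-frame features. The ConvGRUCell and RecurrentEncoder modules and the plain L2 feature-matching loss are illustrative assumptions, not the authors' released implementation (see the repository linked above for the actual code).

import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvGRUCell(nn.Module):
    """Convolutional GRU cell that propagates features across frames."""

    def __init__(self, channels: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)  # update/reset gates
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)       # candidate state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde


class RecurrentEncoder(nn.Module):
    """Light-weight recurrent-CNN encoder: one ConvGRU cell per feature scale."""

    def __init__(self, channels: int = 256, num_scales: int = 3):
        super().__init__()
        self.cells = nn.ModuleList(ConvGRUCell(channels) for _ in range(num_scales))

    def forward(self, multiscale_frames):
        # multiscale_frames: list over scales of tensors shaped (T, C, H_s, W_s)
        outputs = []
        for cell, frames in zip(self.cells, multiscale_frames):
            h = torch.zeros_like(frames[0:1])
            feats = []
            for t in range(frames.shape[0]):
                h = cell(frames[t:t + 1], h)   # carry state frame to frame
                feats.append(h)
            outputs.append(torch.cat(feats, dim=0))
        return outputs


def distillation_loss(student_feats, teacher_feats):
    """L2 feature-matching loss against the detached teacher features."""
    return sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))


if __name__ == "__main__":
    T, C = 4, 256
    # Backbone features of a short clip at three scales.
    clip = [torch.randn(T, C, 32, 32), torch.randn(T, C, 16, 16), torch.randn(T, C, 8, 8)]
    # Stand-in for the frozen attention-encoder outputs at the same scales.
    teacher = [torch.randn_like(f) for f in clip]
    student = RecurrentEncoder(C, num_scales=3)
    loss = distillation_loss(student(clip), teacher)
    loss.backward()
    print(float(loss))

At inference time only the recurrent student encoder is kept, which is why the distilled model can run roughly twice as fast as the attention encoder it replaces while matching its multi-scale features.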
References
Athar, A., Mahadevan, S., Osep, A., Leal-Taixé, L., Leibe, B.: STEm-Seg: spatio-temporal embeddings for instance segmentation in videos. In: ECCV (2020)
Bertasius, G., Torresani, L.: Classifying, segmenting, and tracking object instances in video with mask propagation. In: CVPR (2020)
Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT: real-time instance segmentation. In: ICCV (2019)
Cao, J., Anwer, R.M., Cholakkal, H., Khan, F.S., Pang, Y., Shao, L.: SipMask: spatial information preservation for fast image and video instance segmentation. In: ECCV (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (2020)
Chen, G., Choi, W., Yu, X., Han, T., Chandraker, M.: Learning efficient object detection models with knowledge distillation. In: NeurIPS, vol. 30 (2017)
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)
Fu, Y., Yang, L., Liu, D., Huang, T.S., Shi, H.: CompFeat: comprehensive feature aggregation for video instance segmentation. In: AAAI (2021)
He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: ICCV (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Heo, M., Hwang, S., Oh, S.W., Lee, J.Y., Kim, S.J.: VITA: video instance segmentation via object token association. In: NeurIPS (2022)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Hwang, S., Heo, M., Oh, S.W., Kim, S.J.: Video instance segmentation using inter-frame communication transformers. NeurIPS 34, 13352–13363 (2021)
Kang, Z., Zhang, P., Zhang, X., Sun, J., Zheng, N.: Instance-conditional knowledge distillation for object detection. In: NeurIPS, vol. 34, pp. 16468–16480 (2021)
Ke, L., Li, X., Danelljan, M., Tai, Y.W., Tang, C.K., Yu, F.: Prototypical cross-attention networks for multiple object tracking and segmentation. In: NeurIPS (2021)
Koner, R., et al.: InstanceFormer: an online video instance segmentation framework. In: ECCV (2022)
Li, M., Li, S., Li, L., Zhang, L.: Spatial feature calibration and temporal fusion for effective one-stage video instance segmentation. In: CVPR (2021)
Lin, T., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)
Liu, D., Cui, Y., Tan, W., Chen, Y.: SG-Net: spatial granularity network for one-stage video instance segmentation. In: CVPR (2021)
Liu, Y., Chen, K., Liu, C., Qin, Z., Luo, Z., Wang, J.: Structured knowledge distillation for semantic segmentation. In: CVPR, pp. 2604–2613 (2019)
Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. IEEE (2016)
Rivkind, A., Ram, O., Assa, E., Kreiserman, M., Ahissar, E.: Visual hyperacuity with moving sensor and recurrent neural computations. In: ICLR (2021)
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014)
Thawakar, O., et al.: Video instance segmentation via multi-scale spatio-temporal split attention transformer. In: ECCV (2022)
Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: ICCV (2019)
Wang, Y., et al.: End-to-end video instance segmentation with transformers. In: CVPR (2021)
Wu, J., Jiang, Y., Bai, S., Zhang, W., Bai, X.: SeqFormer: sequential transformer for video instance segmentation. In: ECCV, pp. 553–569. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19815-1_32
Wu, J., Liu, Q., Jiang, Y., Bai, S., Yuille, A., Bai, X.: In defense of online models for video instance segmentation. In: ECCV (2022)
Yang, C., Zhou, H., An, Z., Jiang, X., Xu, Y., Zhang, Q.: Cross-image relational knowledge distillation for semantic segmentation. In: CVPR (2022)
Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: ICCV (2019)
Yang, S., Fang, Y., Wang, X., Li, Y., Fang, C., Shan, Y., Feng, B., Liu, W.: Crossover learning for fast online video instance segmentation. In: ICCV (2021)
Yang, S., et al.: Temporally efficient vision transformer for video instance segmentation. In: CVPR (2022)
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 (2016)
Zhang, L., Ma, K.: Improve object detection with feature-based knowledge distillation: towards accurate and efficient detectors. In: ICLR (2021)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: ICLR (2021)