Fast Video Instance Segmentation via Recurrent Encoder-Based Transformers

  • Conference paper
Computer Analysis of Images and Patterns (CAIP 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14184)


Abstract

State-of-the-art transformer-based video instance segmentation (VIS) frameworks typically utilize attention-based encoders to compute multi-scale spatio-temporal features that capture target appearance deformations. However, this attention computation is expensive, hampering inference speed. In this work, we introduce a VIS framework that employs a lightweight recurrent-CNN encoder, which learns multi-scale spatio-temporal features from a standard attention encoder through knowledge distillation. The lightweight recurrent encoder learns these features effectively and achieves improved VIS performance by reducing over-fitting while also increasing inference speed. Our extensive experiments on the popular YouTube-VIS 2019 benchmark reveal the merits of the proposed framework over the baseline. Compared to the recent SeqFormer, our Recurrent SeqFormer doubles the inference speed while improving overall average precision from 45.1% to 45.8%. Our code and models are available at https://github.com/OmkarThawakar/Recurrent-Seqformer.
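
To make the two components above concrete, the following is a minimal PyTorch sketch of (1) a lightweight recurrent-CNN encoder that processes each feature scale across the frames of a clip, and (2) a feature-level distillation loss against a frozen attention-based teacher encoder. Everything in it is an illustrative assumption rather than the paper's implementation: the choice of a convolutional GRU cell, the names ConvGRUCell and RecurrentEncoder, and the plain mean-squared-error distillation objective are all hypothetical; the authors' actual code is at the repository linked above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ConvGRUCell(nn.Module):
        """Convolutional GRU cell: recurrence over frames, convolution over space."""

        def __init__(self, channels: int, kernel_size: int = 3):
            super().__init__()
            padding = kernel_size // 2
            # Update and reset gates, computed jointly from the input and hidden state.
            self.gates = nn.Conv2d(2 * channels, 2 * channels, kernel_size, padding=padding)
            self.candidate = nn.Conv2d(2 * channels, channels, kernel_size, padding=padding)

        def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
            z, r = torch.chunk(torch.sigmoid(self.gates(torch.cat([x, h], dim=1))), 2, dim=1)
            h_new = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
            return (1 - z) * h + z * h_new

    class RecurrentEncoder(nn.Module):
        """Runs one ConvGRU per feature scale across the frames of a clip."""

        def __init__(self, channels: int = 256, num_scales: int = 4):
            super().__init__()
            self.cells = nn.ModuleList(ConvGRUCell(channels) for _ in range(num_scales))

        def forward(self, multi_scale_feats):
            # multi_scale_feats: one tensor per scale, each of shape [T, C, H_s, W_s].
            outputs = []
            for cell, feats in zip(self.cells, multi_scale_feats):
                h = torch.zeros_like(feats[0:1])
                per_frame = []
                for t in range(feats.shape[0]):  # recurrence over the T frames
                    h = cell(feats[t:t + 1], h)
                    per_frame.append(h)
                outputs.append(torch.cat(per_frame, dim=0))
            return outputs

    def distillation_loss(student_feats, teacher_feats):
        """Mean-squared error between student and (detached) teacher encoder features."""
        return sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))

    if __name__ == "__main__":
        T, C = 5, 256  # frames per clip, feature channels
        scales = [(64, 64), (32, 32), (16, 16), (8, 8)]
        feats = [torch.randn(T, C, h, w) for h, w in scales]

        student = RecurrentEncoder(C, num_scales=len(scales))
        student_out = student(feats)
        # Stand-in for the multi-scale output of a frozen attention-based teacher.
        teacher_out = [torch.randn_like(f) for f in feats]
        loss = distillation_loss(student_out, teacher_out)
        loss.backward()
        print(f"distillation loss: {loss.item():.4f}")

The design point the sketch illustrates is that per-scale recurrence does a fixed amount of work per frame, so encoding a clip scales linearly with its length, whereas full spatio-temporal attention scales quadratically in the number of tokens; this is a plausible source of the inference-speed advantage reported in the abstract.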

References

  1. Athar, A., Mahadevan, S., Osep, A., Leal-Taixé, L., Leibe, B.: STEm-Seg: spatio-temporal embeddings for instance segmentation in videos. In: ECCV (2020)

  2. Bertasius, G., Torresani, L.: Classifying, segmenting, and tracking object instances in video with mask propagation. In: CVPR (2020)

  3. Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT: real-time instance segmentation. In: ICCV (2019)

  4. Cao, J., Anwer, R.M., Cholakkal, H., Khan, F.S., Pang, Y., Shao, L.: SipMask: spatial information preservation for fast image and video instance segmentation. In: ECCV (2020)

  5. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (2020)

  6. Chen, G., Choi, W., Yu, X., Han, T., Chandraker, M.: Learning efficient object detection models with knowledge distillation. In: NeurIPS, vol. 30 (2017)

  7. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)

  8. Fu, Y., Yang, L., Liu, D., Huang, T.S., Shi, H.: CompFeat: comprehensive feature aggregation for video instance segmentation. In: AAAI (2021)

  9. He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: ICCV (2017)

  10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)

  11. Heo, M., Hwang, S., Oh, S.W., Lee, J.Y., Kim, S.J.: VITA: video instance segmentation via object token association. In: NeurIPS (2022)

  12. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

  13. Hwang, S., Heo, M., Oh, S.W., Kim, S.J.: Video instance segmentation using inter-frame communication transformers. In: NeurIPS, vol. 34, pp. 13352–13363 (2021)

  14. Kang, Z., Zhang, P., Zhang, X., Sun, J., Zheng, N.: Instance-conditional knowledge distillation for object detection. In: NeurIPS, vol. 34, pp. 16468–16480 (2021)

  15. Ke, L., Li, X., Danelljan, M., Tai, Y.W., Tang, C.K., Yu, F.: Prototypical cross-attention networks for multiple object tracking and segmentation. In: NeurIPS (2021)

  16. Koner, R., et al.: InstanceFormer: an online video instance segmentation framework. In: ECCV (2022)

  17. Li, M., Li, S., Li, L., Zhang, L.: Spatial feature calibration and temporal fusion for effective one-stage video instance segmentation. In: CVPR (2021)

  18. Lin, T., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)

  19. Liu, D., Cui, Y., Tan, W., Chen, Y.: SG-Net: spatial granularity network for one-stage video instance segmentation. In: CVPR (2021)

  20. Liu, Y., Chen, K., Liu, C., Qin, Z., Luo, Z., Wang, J.: Structured knowledge distillation for semantic segmentation. In: CVPR, pp. 2604–2613 (2019)

  21. Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. IEEE (2016)

  22. Rivkind, A., Ram, O., Assa, E., Kreiserman, M., Ahissar, E.: Visual hyperacuity with moving sensor and recurrent neural computations. In: ICLR (2021)

  23. Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014)

  24. Thawakar, O., et al.: Video instance segmentation via multi-scale spatio-temporal split attention transformer. In: ECCV (2022)

  25. Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: ICCV (2019)

  26. Wang, Y., et al.: End-to-end video instance segmentation with transformers. In: CVPR (2021)

  27. Wu, J., Jiang, Y., Bai, S., Zhang, W., Bai, X.: SeqFormer: sequential transformer for video instance segmentation. In: ECCV, pp. 553–569. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19815-1_32

  28. Wu, J., Liu, Q., Jiang, Y., Bai, S., Yuille, A., Bai, X.: In defense of online models for video instance segmentation. In: ECCV (2022)

  29. Yang, C., Zhou, H., An, Z., Jiang, X., Xu, Y., Zhang, Q.: Cross-image relational knowledge distillation for semantic segmentation. In: CVPR (2022)

  30. Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: ICCV (2019)

  31. Yang, S., et al.: Crossover learning for fast online video instance segmentation. In: ICCV (2021)

  32. Yang, S., et al.: Temporally efficient vision transformer for video instance segmentation. In: CVPR (2022)

  33. Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 (2016)

  34. Zhang, L., Ma, K.: Improve object detection with feature-based knowledge distillation: towards accurate and efficient detectors. In: ICLR (2021)

  35. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: ICLR (2021)

Author information

Correspondence to Omkar Thawakar.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Thawakar, O., Rivkind, A., Ahissar, E., Khan, F.S. (2023). Fast Video Instance Segmentation via Recurrent Encoder-Based Transformers. In: Tsapatsoulis, N., et al. Computer Analysis of Images and Patterns. CAIP 2023. Lecture Notes in Computer Science, vol 14184. Springer, Cham. https://doi.org/10.1007/978-3-031-44237-7_25

  • DOI: https://doi.org/10.1007/978-3-031-44237-7_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44236-0

  • Online ISBN: 978-3-031-44237-7

  • eBook Packages: Computer Science (R0)
