Abstract
State-of-the-art transformer-based video instance segmentation (VIS) frameworks typically rely on attention-based encoders to compute multi-scale spatio-temporal features that capture target appearance deformations. However, this attention computation is expensive and hampers inference speed. In this work, we introduce a VIS framework that employs a light-weight recurrent-CNN encoder, trained to mimic the multi-scale spatio-temporal features of a standard attention encoder through knowledge distillation. The light-weight recurrent encoder effectively learns multi-scale spatio-temporal features, reducing over-fitting and increasing inference speed while improving VIS performance. Extensive experiments on the popular YouTube-VIS 2019 benchmark demonstrate the merits of the proposed framework over the baseline. Compared to the recent SeqFormer, our Recurrent SeqFormer doubles the inference speed while improving the overall average precision from 45.1% to 45.8%. Our code and models are available at https://github.com/OmkarThawakar/Recurrent-Seqformer.
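To make the idea concrete, the following is a minimal PyTorch sketch of feature-level knowledge distillation from a frozen attention-based teacher encoder to a light-weight recurrent-CNN student encoder over multi-scale, per-frame features. The ConvGRUCell and RecurrentEncoder modules and the plain L2 feature-matching loss are illustrative assumptions, not the authors' released implementation (see the repository linked above for the actual code).

import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvGRUCell(nn.Module):
    """Convolutional GRU cell that propagates features across frames."""

    def __init__(self, channels: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)  # update/reset gates
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)       # candidate state

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde


class RecurrentEncoder(nn.Module):
    """Light-weight recurrent-CNN encoder: one ConvGRU cell per feature scale."""

    def __init__(self, channels: int = 256, num_scales: int = 3):
        super().__init__()
        self.cells = nn.ModuleList(ConvGRUCell(channels) for _ in range(num_scales))

    def forward(self, multiscale_frames):
        # multiscale_frames: list over scales of tensors shaped (T, C, H_s, W_s)
        outputs = []
        for cell, frames in zip(self.cells, multiscale_frames):
            h = torch.zeros_like(frames[0:1])
            feats = []
            for t in range(frames.shape[0]):
                h = cell(frames[t:t + 1], h)   # carry state frame to frame
                feats.append(h)
            outputs.append(torch.cat(feats, dim=0))
        return outputs


def distillation_loss(student_feats, teacher_feats):
    """L2 feature-matching loss against the detached teacher features."""
    return sum(F.mse_loss(s, t.detach()) for s, t in zip(student_feats, teacher_feats))


if __name__ == "__main__":
    T, C = 4, 256
    # Backbone features of a short clip at three scales.
    clip = [torch.randn(T, C, 32, 32), torch.randn(T, C, 16, 16), torch.randn(T, C, 8, 8)]
    # Stand-in for the frozen attention-encoder outputs at the same scales.
    teacher = [torch.randn_like(f) for f in clip]
    student = RecurrentEncoder(C, num_scales=3)
    loss = distillation_loss(student(clip), teacher)
    loss.backward()
    print(float(loss))

At inference time only the recurrent student encoder is kept, which is why the distilled model can run roughly twice as fast as the attention encoder it replaces while matching its multi-scale features.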
References
Athar, A., Mahadevan, S., Osep, A., Leal-Taixé, L., Leibe, B.: STEm-Seg: spatio-temporal embeddings for instance segmentation in videos. In: ECCV (2020)
Bertasius, G., Torresani, L.: Classifying, segmenting, and tracking object instances in video with mask propagation. In: CVPR (2020)
Bolya, D., Zhou, C., Xiao, F., Lee, Y.J.: YOLACT: real-time instance segmentation. In: ICCV (2019)
Cao, J., Anwer, R.M., Cholakkal, H., Khan, F.S., Pang, Y., Shao, L.: SipMask: spatial information preservation for fast image and video instance segmentation. In: ECCV (2020)
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: ECCV (2020)
Chen, G., Choi, W., Yu, X., Han, T., Chandraker, M.: Learning efficient object detection models with knowledge distillation. In: NeurIPS, vol. 30 (2017)
Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: CVPR (2022)
Fu, Y., Yang, L., Liu, D., Huang, T.S., Shi, H.: CompFeat: comprehensive feature aggregation for video instance segmentation. In: AAAI (2021)
He, K., Gkioxari, G., Dollár, P., Girshick, R.B.: Mask R-CNN. In: ICCV (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Heo, M., Hwang, S., Oh, S.W., Lee, J.Y., Kim, S.J.: VITA: video instance segmentation via object token association. In: NeurIPS (2022)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Hwang, S., Heo, M., Oh, S.W., Kim, S.J.: Video instance segmentation using inter-frame communication transformers. NeurIPS 34, 13352–13363 (2021)
Kang, Z., Zhang, P., Zhang, X., Sun, J., Zheng, N.: Instance-conditional knowledge distillation for object detection. In: NeurIPS, vol. 34, pp. 16468–16480 (2021)
Ke, L., Li, X., Danelljan, M., Tai, Y.W., Tang, C.K., Yu, F.: Prototypical cross-attention networks for multiple object tracking and segmentation. In: NeurIPS (2021)
Koner, R., et al.: InstanceFormer: an online video instance segmentation framework. In: ECCV (2022)
Li, M., Li, S., Li, L., Zhang, L.: Spatial feature calibration and temporal fusion for effective one-stage video instance segmentation. In: CVPR (2021)
Lin, T., et al.: Microsoft COCO: common objects in context. In: ECCV (2014)
Liu, D., Cui, Y., Tan, W., Chen, Y.: SG-Net: spatial granularity network for one-stage video instance segmentation. In: CVPR (2021)
Liu, Y., Chen, K., Liu, C., Qin, Z., Luo, Z., Wang, J.: Structured knowledge distillation for semantic segmentation. In: CVPR, pp. 2604–2613 (2019)
Papernot, N., McDaniel, P., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. IEEE (2016)
Rivkind, A., Ram, O., Assa, E., Kreiserman, M., Ahissar, E.: Visual hyperacuity with moving sensor and recurrent neural computations. In: ICLR (2021)
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: FitNets: hints for thin deep nets. arXiv preprint arXiv:1412.6550 (2014)
Thawakar, O., et al.: Video instance segmentation via multi-scale spatio-temporal split attention transformer. In: ECCV (2022)
Tian, Z., Shen, C., Chen, H., He, T.: FCOS: fully convolutional one-stage object detection. In: ICCV (2019)
Wang, Y., et al.: End-to-end video instance segmentation with transformers. In: CVPR (2021)
Wu, J., Jiang, Y., Bai, S., Zhang, W., Bai, X.: SeqFormer: sequential transformer for video instance segmentation. In: ECCV, pp. 553–569. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19815-1_32
Wu, J., Liu, Q., Jiang, Y., Bai, S., Yuille, A., Bai, X.: In defense of online models for video instance segmentation. In: ECCV (2022)
Yang, C., Zhou, H., An, Z., Jiang, X., Xu, Y., Zhang, Q.: Cross-image relational knowledge distillation for semantic segmentation. In: CVPR (2022)
Yang, L., Fan, Y., Xu, N.: Video instance segmentation. In: ICCV (2019)
Yang, S., Fang, Y., Wang, X., Li, Y., Fang, C., Shan, Y., Feng, B., Liu, W.: Crossover learning for fast online video instance segmentation. In: ICCV (2021)
Yang, S., et al.: Temporally efficient vision transformer for video instance segmentation. In: CVPR (2022)
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928 (2016)
Zhang, L., Ma, K.: Improve object detection with feature-based knowledge distillation: towards accurate and efficient detectors. In: ICLR (2021)
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable DETR: deformable transformers for end-to-end object detection. In: ICLR (2021)