Abstract
In image segmentation for real-world applications, the number of semantic categories can be very large, and the number of objects within them can vary greatly. In such cases, a multi-channel representation of the output mask for the segmentation model is inefficient. In this paper, we explore approaches that overcome this problem by using a single-channel output mask together with additional input information about the desired class for segmentation. We call this information a task embedding and learn it during training of the neural network model; in our case, the number of tasks equals the number of segmentation categories. This approach allows us to build universal models that can be conveniently extended to an arbitrary number of categories without changing the architecture of the neural network. To investigate this idea, we developed a transformer-based segmentation model named TASFormer. We demonstrate that the highest-quality results for task-aware segmentation are obtained when adapter technology is used as part of the model. To evaluate segmentation quality, we introduce a binary intersection over union (bIoU) metric, an adaptation of the standard mIoU for models with a single-channel output. We analyze its distinguishing properties and use it to compare modern neural network methods. Experiments were carried out on the universal ADE20K dataset, on which the proposed TASFormer-based approach demonstrates state-of-the-art segmentation quality. The software implementation of the TASFormer method and the bIoU metric is publicly available at www.github.com/subake/TASFormer.
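The exact definition of bIoU is given in the paper body; as a rough illustration of the idea, the sketch below assumes bIoU averages the binary IoU over (prediction, ground-truth) mask pairs, one pair per requested (image, category) query, with IoU defined as 1.0 when both masks are empty. The function names `binary_iou` and `biou` are illustrative, not from the released code.

```python
import numpy as np

def binary_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks; defined as 1.0 when both are empty."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    inter = np.logical_and(pred, gt).sum()
    return float(inter) / float(union)

def biou(pairs):
    """Average binary IoU over (prediction, ground-truth) mask pairs,
    one pair per (image, requested category) query."""
    return float(np.mean([binary_iou(p, g) for p, g in pairs]))
```

Unlike multi-channel mIoU, this formulation scores each single-channel query mask independently, so adding new categories does not change the output shape of the model being evaluated.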
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Yudin, D., Khorin, A., Zemskova, T., Ovchinnikova, D. (2024). TASFormer: Task-Aware Image Segmentation Transformer. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14451. Springer, Singapore. https://doi.org/10.1007/978-981-99-8073-4_24
DOI: https://doi.org/10.1007/978-981-99-8073-4_24
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8072-7
Online ISBN: 978-981-99-8073-4