
TASFormer: Task-Aware Image Segmentation Transformer

  • Conference paper
  • Published in: Neural Information Processing (ICONIP 2023)

Abstract

In image segmentation for real-world applications, the number of semantic categories can be very large, and the number of objects within them can vary greatly. In such cases, a multi-channel representation of the output mask is inefficient for the segmentation model. In this paper, we explore approaches that overcome this problem by using a single-channel output mask together with additional input information about the desired class to segment. We call this information a task embedding and learn it during training of the neural network model; in our setting, the number of tasks equals the number of segmentation categories. This approach allows us to build universal models that can be conveniently extended to an arbitrary number of categories without changing the neural network architecture. To investigate this idea, we developed a transformer-based segmentation model named TASFormer. We demonstrate that the highest-quality results for task-aware segmentation are obtained when adapters are used as part of the model. To evaluate segmentation quality, we introduce a binary intersection over union (bIoU) metric, an adaptation of the standard mIoU for models with a single-channel output. We analyze its distinguishing properties and use it to compare modern neural network methods. Experiments were carried out on the universal ADE20K dataset, on which the proposed TASFormer-based approach demonstrates state-of-the-art segmentation quality. The software implementation of the TASFormer method and the bIoU metric is publicly available at www.github.com/subake/TASFormer.
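For readers who want a concrete picture of the evaluation, the sketch below shows one way a per-class binary IoU could be computed for single-channel outputs and then averaged over classes. The function names (`binary_iou`, `mean_binary_iou`), the convention for empty masks, and the 0.5 threshold mentioned in the comments are illustrative assumptions, not the reference definition of bIoU, which is given in the paper and in the linked repository.

```python
import numpy as np

def binary_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of shape (H, W) with values in {0, 1}."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both prediction and ground truth are empty for this class;
        # one common convention is to count this as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum()) / float(union)

def mean_binary_iou(pred_masks: dict, gt_masks: dict) -> float:
    """Average binary IoU over the evaluated classes.

    pred_masks / gt_masks map a class id to its single-channel binary mask,
    e.g. obtained by running the model once per task embedding and
    thresholding the single-channel probability output at 0.5.
    """
    ious = [binary_iou(pred_masks[c], gt_masks[c]) for c in gt_masks]
    return float(np.mean(ious))
```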



Author information


Correspondence to Dmitry Yudin.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Yudin, D., Khorin, A., Zemskova, T., Ovchinnikova, D. (2024). TASFormer: Task-Aware Image Segmentation Transformer. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14451. Springer, Singapore. https://doi.org/10.1007/978-981-99-8073-4_24

Download citation

  • DOI: https://doi.org/10.1007/978-981-99-8073-4_24

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8072-7

  • Online ISBN: 978-981-99-8073-4

  • eBook Packages: Computer Science, Computer Science (R0)
