Abstract
In image segmentation for real-world applications, the number of semantic categories can be very large, and the number of objects within them can vary greatly. In such cases, a multi-channel representation of the output mask for the segmentation model is inefficient. In this paper, we explore approaches that overcome this problem by using a single-channel output mask together with additional input information about the desired class for segmentation. We call this information a task embedding and learn it during training of the neural network model; in our case, the number of tasks equals the number of segmentation categories. This approach allows us to build universal models that can be conveniently extended to an arbitrary number of categories without changing the architecture of the neural network. To investigate this idea, we developed a transformer-based segmentation model named TASFormer. We demonstrate that the highest-quality results for task-aware segmentation are obtained when adapter technology is used as part of the model. To evaluate segmentation quality, we introduce a binary intersection over union (bIoU) metric, an adaptation of the standard mIoU for models with a single-channel output. We analyze its distinguishing properties and use it to compare modern neural network methods. Experiments were carried out on the universal ADE20K dataset, on which the proposed TASFormer-based approach demonstrates state-of-the-art segmentation quality. The software implementation of the TASFormer method and the bIoU metric is publicly available at www.github.com/subake/TASFormer.
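The exact definition of bIoU is given in the paper body; as a rough illustration of the idea, the sketch below assumes bIoU averages the binary IoU over (prediction, ground-truth) mask pairs, one pair per requested (image, category) query, with IoU defined as 1.0 when both masks are empty. The function names `binary_iou` and `biou` are illustrative, not from the released code.

```python
import numpy as np

def binary_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks; defined as 1.0 when both are empty."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0
    inter = np.logical_and(pred, gt).sum()
    return float(inter) / float(union)

def biou(pairs):
    """Average binary IoU over (prediction, ground-truth) mask pairs,
    one pair per (image, requested category) query."""
    return float(np.mean([binary_iou(p, g) for p, g in pairs]))
```

Unlike multi-channel mIoU, this formulation scores each single-channel query mask independently, so adding new categories does not change the output shape of the model being evaluated.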
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Yudin, D., Khorin, A., Zemskova, T., Ovchinnikova, D. (2024). TASFormer: Task-Aware Image Segmentation Transformer. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Lecture Notes in Computer Science, vol 14451. Springer, Singapore. https://doi.org/10.1007/978-981-99-8073-4_24
DOI: https://doi.org/10.1007/978-981-99-8073-4_24
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8072-7
Online ISBN: 978-981-99-8073-4