
CAE-GReaT: Convolutional-Auxiliary Efficient Graph Reasoning Transformer for Dense Image Predictions

Published in: International Journal of Computer Vision

Abstract

Convolutional Neural Networks (CNNs) and the Vision Transformer (ViT) are the two primary frameworks for semantic image recognition in the computer vision community. The general consensus is that each has its own strengths and weaknesses: CNNs excel at extracting local features but struggle to aggregate long-range feature dependencies, while ViT aggregates long-range dependencies well but represents local features poorly. In this paper, we propose an auxiliary, integrated network architecture, named Convolutional-Auxiliary Efficient Graph Reasoning Transformer (CAE-GReaT), which joins the strengths of both CNNs and ViT in a unified framework. CAE-GReaT stands on the shoulders of the advanced graph reasoning transformer and employs an internal auxiliary convolutional branch to enrich local feature representations. In addition, to reduce the computational cost of graph reasoning, we also propose an efficient information diffusion strategy. Compared to existing ViT models, CAE-GReaT not only offers a purposeful interaction pattern (via the graph reasoning branch) but also captures fine-grained heterogeneous feature representations (via the auxiliary convolutional branch). Extensive experiments are conducted on three challenging dense image prediction tasks: semantic segmentation, instance segmentation, and panoptic segmentation. Results demonstrate that CAE-GReaT achieves consistent performance gains over state-of-the-art baselines at a slight additional computational cost.
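To make the dual-branch design described above concrete, the following is a minimal, hypothetical PyTorch sketch of one block that pairs a long-range (attention-style) branch, standing in for the paper's graph reasoning, with an auxiliary depthwise-convolutional branch for local features. The class name DualBranchBlock, the substitution of standard multi-head self-attention for graph reasoning, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DualBranchBlock(nn.Module):
    """Hypothetical sketch: long-range branch + auxiliary local branch."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Long-range branch: multi-head self-attention stands in for the
        # paper's graph reasoning over token nodes (an assumption here).
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Auxiliary local branch: a depthwise 3x3 convolution enriches
        # fine-grained local feature representations.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, channels)
        b, h, w, c = x.shape
        tokens = self.norm(x).reshape(b, h * w, c)
        global_feat, _ = self.attn(tokens, tokens, tokens)
        global_feat = global_feat.reshape(b, h, w, c)
        local_feat = self.local(x.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        # Fuse the two heterogeneous representations, then add a residual.
        fused = self.proj(torch.cat([global_feat, local_feat], dim=-1))
        return x + fused


# Usage: one block over an 8x8 feature map with 64 channels.
block = DualBranchBlock(dim=64)
out = block(torch.randn(2, 8, 8, 64))
print(out.shape)  # torch.Size([2, 8, 8, 64])
```

The point of the sketch is the fusion step: the two heterogeneous representations are concatenated and projected back to the embedding dimension, so the auxiliary convolutional branch complements, rather than replaces, the long-range reasoning branch.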


Data Availability

The datasets generated and/or analysed during the current study are available in the Cityscapes (https://www.cityscapes-dataset.com/), ADE20K (https://groups.csail.mit.edu/vision/datasets/ADE20K/), and COCO (https://cocodataset.org) repositories.

Notes

  1. https://github.com/open-mmlab.


Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant 2018AAA0102002, the National Natural Science Foundation of China under Grant 61925204, and the National Natural Science Foundation of China/HKSAR Research Grants Council Joint Research Scheme under Grant N_HKUST627/20.

Author information


Corresponding author

Correspondence to Jinhui Tang.

Ethics declarations

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence this work. There is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

Additional information

Communicated by Oliver Zendel.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, D., Lin, Y., Tang, J. et al. CAE-GReaT: Convolutional-Auxiliary Efficient Graph Reasoning Transformer for Dense Image Predictions. Int J Comput Vis 132, 1502–1520 (2024). https://doi.org/10.1007/s11263-023-01928-1

