Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12349)


Convolution exploits locality for efficiency at the cost of missing long-range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works have shown that self-attention layers can be stacked into a fully attentional network, provided the attention is restricted to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computational complexity and allows attention to be performed within a larger or even global region. In addition, we propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that can be stacked to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves PQ by 2.8% over the bottom-up state of the art on COCO test-dev. Our small variant already attains this previous state-of-the-art result while being \(3.8\times\) more parameter-efficient and \(27\times\) more computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
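
To make the factorization concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes identity query/key/value projections, a single head, the height axis attended before the width axis, and it omits the position-sensitive terms entirely. The names `attend_1d`, `axial_attention`, and `pairwise_scores` are hypothetical and chosen only for this illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend_1d(seq):
    """Dot-product self-attention over a list of feature vectors.

    Assumption: queries, keys and values are the inputs themselves
    (no learned projections, no positional terms).
    """
    out = []
    for q in seq:
        # One score per key in the same 1D sequence.
        scores = [sum(a * b for a, b in zip(q, k)) for k in seq]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, seq))
                    for c in range(len(q))])
    return out


def axial_attention(fmap):
    """Apply 1D attention along height, then along width.

    fmap[h][w] is the feature vector at row h, column w.
    """
    H, W = len(fmap), len(fmap[0])
    # Height axis: every column is an independent length-H sequence.
    cols = [attend_1d([fmap[h][w] for h in range(H)]) for w in range(W)]
    fmap = [[cols[w][h] for w in range(W)] for h in range(H)]
    # Width axis: every row is an independent length-W sequence.
    return [attend_1d(row) for row in fmap]


def pairwise_scores(H, W):
    """Query-key score counts: full 2D attention vs. two global 1D passes."""
    full_2d = (H * W) ** 2      # every pixel attends to every pixel
    axial = H * W * (H + W)     # every pixel attends to its column, then its row
    return full_2d, axial
```

Under these assumptions, `pairwise_scores(64, 64)` returns `(16777216, 524288)`, so on a 64×64 feature map the two 1D passes compute 32 times fewer query-key scores than full 2D attention, while each output position can still receive information from the whole map after both passes.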


Keywords: Bottom-up panoptic segmentation, Self-attention



We thank Niki Parmar for discussion and support; Ashish Vaswani, Xuhui Jia, Raviteja Vemulapalli, and Zhuoran Shen for their insightful comments and suggestions; and Maxwell Collins and Blake Hechtman for technical support. This work is supported by a Google Faculty Research Award and NSF 1763705.

Supplementary material

Supplementary material 1: 504439_1_En_7_MOESM1_ESM.pdf (PDF, 27664 KB)



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Johns Hopkins University, Baltimore, USA
  2. Google Research, Seattle, USA
  3. Google Research, Los Angeles, USA
