Hybrid first and second order attention Unet for building segmentation in remote sensing images

Science China Information Sciences


Recently, building segmentation (BS) has drawn significant attention in remote sensing applications. Convolutional neural networks (CNNs) have become the mainstream approach in this field owing to their powerful representational ability. However, owing to the variation in building appearance, designing an effective CNN architecture for BS remains a challenging task. Most CNN-based BS methods focus on deep or wide network architectures and neglect the correlation among intermediate features. To address this problem, we propose a hybrid first and second order attention network (HFSA) that explores both the global mean and the inner product among different channels to adaptively rescale intermediate features. As a result, the HFSA makes full use of first order feature statistics and also incorporates second order feature statistics, which leads to more representative features. We conduct comprehensive experiments on three widely used aerial building segmentation data sets and one satellite building segmentation data set. The experimental results show that the proposed model achieves better segmentation performance than state-of-the-art models, both quantitatively and qualitatively.
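
The channel rescaling idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how the two statistics are combined, so the sketch assumes a squeeze-and-excitation style sigmoid gate, takes the per-channel global mean as the first order statistic, and summarises the channel-by-channel inner-product matrix by its row means as the second order statistic. The function name `hybrid_channel_attention` and the projection matrices `w_first` and `w_second` are hypothetical and stand in for learned parameters.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, maps real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def hybrid_channel_attention(features, w_first, w_second):
    """Rescale a (C, H, W) feature map with per-channel gates.

    w_first and w_second are hypothetical (C, C) projections that a real
    network would learn; here they are supplied by the caller.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)  # one row per channel

    # First order statistic: global mean of each channel, shape (C,).
    first = flat.mean(axis=1)

    # Second order statistic: inner products between every pair of
    # channels, normalised by the number of pixels, shape (C, C).
    gram = flat @ flat.T / (h * w)
    # Assumption: summarise each channel's row of inner products by its mean.
    second = gram.mean(axis=1)

    # Combine both statistics into one gate per channel, in (0, 1).
    gates = sigmoid(w_first @ first + w_second @ second)

    # Broadcast the gates over the spatial dimensions.
    return features * gates[:, None, None], gates


# Usage with random data and random (untrained) projections.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))
w1 = rng.standard_normal((8, 8)) * 0.1
w2 = rng.standard_normal((8, 8)) * 0.1
y, g = hybrid_channel_attention(x, w1, w2)
print(y.shape, g.shape)
```

In this sketch each channel is multiplied by a single scalar gate, so the spatial layout of the features is unchanged and only the relative weight of the channels is adjusted.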



This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 61922029, 61771192), the National Natural Science Foundation of China for International Cooperation and Exchanges (Grant No. 61520106001), and the Huxiang Young Talents Plan Project of Hunan Province (Grant No. 2019RS2016).

Author information


Corresponding author

Correspondence to Leyuan Fang.

About this article

Cite this article

He, N., Fang, L. & Plaza, A. Hybrid first and second order attention Unet for building segmentation in remote sensing images. Sci. China Inf. Sci. 63, 140305 (2020).
