
Semantic Segmentation of Aerial Images Using Binary Space Partitioning

  • Conference paper
KI 2021: Advances in Artificial Intelligence (KI 2021)


The semantic segmentation of aerial images enables many useful applications such as tracking city growth, tracking deforestation, or automatically creating and updating maps. However, gathering enough training data to train a proper model for the automated analysis of aerial images is usually too labor-intensive and thus too expensive. Therefore, domain adaptation techniques are often necessary to adapt existing models or to transfer knowledge from existing datasets to new unlabeled aerial images. Modern adaptation approaches make use of complex architectures involving many model components, losses, and loss weights. These approaches are hard to apply in practice since their hyperparameters are hard to optimize for a given adaptation problem. This complexity is the result of trying to separate domain-invariant elements, e.g., structures and shapes, from domain-specific elements, e.g., textures. In this paper, we present a novel model for semantic segmentation, which not only achieves state-of-the-art performance on aerial images, but also inherently learns separate feature representations for shapes and textures. Our goal is to provide a model which can serve as the basis for future domain adaptation approaches which are simpler but still effective. Through end-to-end training, our deep learning model learns to map aerial images to feature representations which can be decoded into binary space partitioning trees, a resolution-independent representation of the semantic segmentation, which can then be rendered into a pixelwise semantic segmentation in a differentiable way.




Author information



Corresponding author

Correspondence to Daniel Gritzner.




A Dataset Class Distributions

The class distributions of the datasets are shown in Fig. 6. While Vaihingen is closer in size to Buxtehude and Nienburg, its class distribution is more similar to that of Potsdam. Hannover, the largest city, has a larger share of buildings than the other datasets. Low vegetation is most prevalent in Buxtehude and Nienburg. Potsdam has a surprising amount of clutter. Cars and clutter are rare across all datasets. While there are many individual cars, only a few pixels show cars since the area of each car in these images is small.

Fig. 6. Class distribution of the datasets.

B Hyperparameters and Model Details

The hyperparameters used for training BSPSegNet and additional model details can be found in Table 5. The hyperparameters used for training the other models can be found in Table 6. The random search ranges for the shape and texture features of BSPSegNet were from 2 to \(n-1\), where n is the total number of parameters of the BSP tree's inner nodes and leaf nodes, respectively. The search range for \(\lambda _C\) was from 0.25 to 16 and the search range for \(\lambda _R\) was from 1 to 8. The random search for the learning rates was performed by randomly choosing an exponent \(k \in [-5, -1]\) and then setting the learning rate to \(10^k\). There was a \(50\%\) chance to set the minimum learning rate to 0.
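The learning-rate sampling described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `sample_learning_rate` is hypothetical, and the exponent is assumed to be drawn uniformly from the stated interval. The text does not say how a non-zero minimum learning rate was chosen, so the sketch only returns a flag for that decision.

```python
import random


def sample_learning_rate(rng):
    """Draw one random-search sample for the learning rate settings."""
    # Assumption: the exponent k is drawn uniformly from [-5, -1].
    k = rng.uniform(-5.0, -1.0)
    learning_rate = 10.0 ** k
    # With 50% probability the minimum learning rate is set to 0.
    # The value used otherwise is not stated, so only a flag is returned.
    min_lr_is_zero = rng.random() < 0.5
    return learning_rate, min_lr_is_zero


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
samples = [sample_learning_rate(rng) for _ in range(1000)]
```

Sampling the exponent instead of the learning rate itself spreads the samples evenly across orders of magnitude, so values near \(10^{-5}\) are as likely to be tried as values near \(10^{-1}\).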

The random translations for augmentation were chosen such that the patches could be sampled from the entire image. The random shearings were sampled uniformly at random from \([-16^\circ , 16^\circ ]\) (individually for both axes) while the random rotations were sampled in the same way from \([-45^\circ , 45^\circ ]\). When the augmentation caused sampling outside the image, reflection padding was used. We used bilinear filtering for sampling. For the ground truth patches, when interpolating between pixels, we used the coefficients of the bilinear interpolation as weights for a vote to find the discrete class of each output pixel.
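The weighted vote for ground truth patches can be illustrated with a small sketch. It is an interpretation of the description above, not the authors' implementation: the function name `vote_label`, the nested-list label layout, and the tie-breaking rule (smallest class id wins) are assumptions, and reflection padding is omitted by assuming the sample position lies inside the image.

```python
import math
from collections import defaultdict


def vote_label(labels, x, y):
    """Return the class at continuous position (x, y) by a bilinear-weighted vote.

    labels is a 2D list indexed as labels[row][col]; x is the column
    coordinate and y is the row coordinate. Assumes 0 <= x <= width - 1
    and 0 <= y <= height - 1 (no padding is handled here).
    """
    height, width = len(labels), len(labels[0])
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    # Each of the four neighbors votes for its class with its bilinear weight.
    votes = defaultdict(float)
    votes[labels[y0][x0]] += (1 - fx) * (1 - fy)
    votes[labels[y0][x1]] += fx * (1 - fy)
    votes[labels[y1][x0]] += (1 - fx) * fy
    votes[labels[y1][x1]] += fx * fy
    # Assumption: ties are broken in favor of the smallest class id.
    return max(sorted(votes), key=lambda c: votes[c])


labels = [[0, 1],
          [2, 1]]
label_mid = vote_label(labels, 0.6, 0.5)     # class 1 collects weight 0.6
label_corner = vote_label(labels, 0.1, 0.1)  # dominated by the top-left pixel
```

Voting keeps the output a valid class index, whereas interpolating the class indices directly would produce meaningless intermediate values.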

Table 5. Hyperparameters used for training BSPSegNet.
Table 6. Hyperparameters used for training DeepLabv3+, FCN, and U-Net.

C Metrics

We use two metrics to evaluate the performance of the models we trained. First, we use the pixel-level accuracy. Given a ground truth segmentation \(y: L \rightarrow C\) mapping pixel locations \(l \in L\) to classes \(c \in C\) and a prediction \(\hat{y}: L \times C \rightarrow \mathbb {R}\) of the same segmentation, we first compute \(\hat{y}^*: L \rightarrow C\) using the equation

$$\begin{aligned} \hat{y}^*(l) = \mathop {\text {arg max}}\limits _{c\in C} \hat{y}(l,c). \end{aligned}$$

The accuracy is then defined as

$$\begin{aligned} Acc(y, \hat{y}^*) = \frac{|\{l\in L ~|~ y(l) = \hat{y}^*(l)\}|}{|L|}, \end{aligned}$$

i.e., the fraction of pixel positions whose class has been predicted correctly.
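As a concrete illustration of the two equations above, the following sketch computes \(\hat{y}^*\) and the accuracy in plain Python. The function names and the flattened data layout (one list entry per pixel location) are assumptions made for this example.

```python
def argmax_prediction(scores):
    """Compute y_hat_star: scores[l][c] is the score of class c at pixel l."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def pixel_accuracy(y_true, y_pred):
    """Fraction of pixel positions whose class was predicted correctly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Four pixels, two classes; each row holds the per-class scores of one pixel.
scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
y_true = [1, 0, 0, 0]
y_pred = argmax_prediction(scores)      # [1, 0, 1, 0]
accuracy = pixel_accuracy(y_true, y_pred)  # 3 of 4 pixels match
```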

As a second metric we compute the mean intersection-over-union (mIoU). The intersection-over-union (IoU) for a given class \(c\in C\) is defined as

$$\begin{aligned} IoU(y, \hat{y}^*, c) = \frac{|\{l\in L ~|~ y(l)=c \wedge \hat{y}^*(l)=c\}|}{|\{l\in L ~|~ y(l)=c \vee \hat{y}^*(l)=c\}|}, \end{aligned}$$

i.e., it is the number of pixel positions \(l\in L\) that both segmentations assign to class c (intersection) over the number of pixel positions that at least one segmentation assigns to class c (union). The mIoU is then defined as

$$\begin{aligned} mIoU(y, \hat{y}^*) = \frac{1}{|C|} \sum _{c\in C} IoU(y, \hat{y}^*, c), \end{aligned}$$

i.e., the mean IoU over all classes \(c\in C\).
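The IoU and mIoU definitions translate directly into code. The sketch below uses flattened label lists and hypothetical function names. The formulas do not say what happens when a class occurs in neither segmentation (the union is empty), so this sketch raises an error in that case instead of guessing a convention.

```python
def iou(y_true, y_pred, c):
    """Intersection-over-union of class c for two flattened segmentations."""
    pairs = list(zip(y_true, y_pred))
    intersection = sum(1 for t, p in pairs if t == c and p == c)
    union = sum(1 for t, p in pairs if t == c or p == c)
    if union == 0:
        # Assumption: the undefined 0/0 case is reported instead of patched.
        raise ValueError(f"IoU undefined: class {c} is absent from both inputs")
    return intersection / union


def mean_iou(y_true, y_pred, classes):
    """Mean of the per-class IoU values over the given class set C."""
    return sum(iou(y_true, y_pred, c) for c in classes) / len(classes)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
per_class = [iou(y_true, y_pred, c) for c in (0, 1, 2)]  # 1/3, 2/3, 1/2
miou = mean_iou(y_true, y_pred, (0, 1, 2))
```

Because every class contributes equally to the mean, rare classes such as cars influence the mIoU far more than they influence the pixel-level accuracy.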

Both metrics assign values in [0, 1] to every pair of segmentations, with higher values meaning that the two segmentations are more alike. Values close to 1 are therefore desirable, since they indicate that the predicted segmentation is close to the ground truth segmentation.

Note: We do not use eroded boundaries when computing these metrics, in contrast to some other researchers, e.g., in the benchmark results of the ISPRS Vaihingen 2D Semantic Labeling Test, where eroded boundaries are (partially) used.

D Additional Sample Images

Figure 7 shows additional prediction samples, similar to Fig. 4.

Fig. 7. Extended version of Fig. 4.

E Confidence

Figure 8 shows only the prediction confidence at boundaries, without also encoding the accuracy, i.e., it shows the absolute difference between the highest two predicted class probabilities. Again, BSPSegNet produces sharper, less blurry boundaries. This is due to BSPSegNet not including any kind of upsampling. Instead, the resolution-independent BSP trees are merely rendered at the appropriate resolution.

The boundary is defined as those pixels which have at least one direct neighbor with a different class.
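The boundary definition and the confidence measure can be sketched as follows. Two points are assumptions of this example: "direct neighbor" is read as the 4-neighborhood (up, down, left, right), and the function names are hypothetical. The text does not state whether the boundary is taken from the ground truth or from the prediction, so the sketch works on any label map.

```python
def boundary_mask(labels):
    """Mark pixels that have at least one direct neighbor of another class.

    labels is a 2D list indexed as labels[row][col].
    Assumption: direct neighbors are the 4-connected ones.
    """
    height, width = len(labels), len(labels[0])
    mask = [[False] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < height and 0 <= nc < width
                if inside and labels[nr][nc] != labels[r][c]:
                    mask[r][c] = True
                    break
    return mask


def confidence(probs):
    """Absolute difference between the two highest class probabilities."""
    top, second = sorted(probs, reverse=True)[:2]
    return abs(top - second)


labels = [[0, 0, 1],
          [0, 0, 1],
          [0, 0, 1]]
mask = boundary_mask(labels)          # the two right-hand columns are boundary
conf = confidence([0.7, 0.2, 0.1])    # 0.7 - 0.2
```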

When computing \(1 - A\), with A being the area below the curve in Fig. 8, for all ten runs for all models, we found values between \(69.7\%\) and \(87\%\) for BSPSegNet, while we found values between \(10.7\%\) and \(47\%\) for all the other models. Even the worst BSPSegNet run had significantly sharper boundaries between segments than the best run of any other model.

Fig. 8. Similar to the right-hand side of Fig. 5 but the x-axis shows the prediction confidence (highest probability vs. second highest probability).


Copyright information

© 2021 Springer Nature Switzerland AG


Cite this paper

Gritzner, D., Ostermann, J. (2021). Semantic Segmentation of Aerial Images Using Binary Space Partitioning. In: Edelkamp, S., Möller, R., Rueckert, E. (eds) KI 2021: Advances in Artificial Intelligence. KI 2021. Lecture Notes in Computer Science, vol 12873. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87625-8

  • Online ISBN: 978-3-030-87626-5

  • eBook Packages: Computer Science (R0)
