Automatic Detection of Specific Constructions on a Large Scale Using Deep Learning in Very High Resolution Airborne Imagery

In the High Modernism period, from around 1914 to 1970, many system halls in steel construction were manufactured to meet the increasing demand in industry, commerce, and agriculture, among other areas. However, these types of buildings have not been the focus of any research in the field of construction history, generating a lack of knowledge regarding their construction types, distribution, and related context to enable statements on the ability and worthiness of historical monument listings. This paper proposes a methodology for the automatic detection of these buildings using aerial imagery. For this purpose, Deep Learning techniques for two tasks are evaluated: semantic segmentation and object detection. Different state-of-the-art architectures are extensively reviewed and assessed through a series of experiments to determine which features and hyper-parameters produce the best results. Based on our experiments, the height information from the nDSM improved the results by refining the detections and reducing the number of false negatives and false positives. Moreover, the Focal Loss helped boost the detections by tuning its hyper-parameter γ, to which the object detection algorithms showed high sensitivity.
Semantic segmentation models outperformed their object detection counterparts, with U-Net using EfficientNet B3 as the backbone achieving the best results, with a Detection Rate of up to 93%.


Introduction
Hall buildings for industry, commerce, agriculture, and transport infrastructures are characteristic of the High Modernism period from around 1914-1970, as they offered new spatial solutions for the various requirements of industrial production conditions and logistics. Hall constructions with span widths between 10 m and 25 m dominate the existing structures and cover a wide range of the functions outlined above. From around 1914 onwards, more and more types and system halls in steel construction were industrially manufactured to meet this broad demand, but their construction history and variety of types have not been the focus of research.
This frequent type of building has received little attention in the field of construction history. Furthermore, it eludes evaluation in terms of construction culture by traditional art studies criteria. Thus, in the discourse of historical monument protection, there is also a lack of fundamental knowledge about the plurality of construction types, their distribution, and site-related context. This is essential for enabling statements on the ability and worthiness of historical monument listings. Beyond the positioning of system halls in steel construction in the field of design and construction history, the question regarding their spread and context localization is still open. This can be overcome by the development of methodologies for the automated aerial photograph-based recording and classification of these buildings. These building types are hardly quantifiable using classical methods because of the large amount of data (e.g. Google Earth aerial images and 3D views) from different locations that a specialist would have to analyse, resulting in a very time-consuming task. However, based on the manufacturer and catalogue information, these buildings had production-related standardisation in terms of construction types, materials, and dimensions. This intrinsic property makes them suitable for deriving specific attributes of system halls for automatic identification using aerial photographs.
The detection of specific types of buildings is usually performed manually by visual inspection of aerial photos or field visits to different locations, which is a time-consuming process, requires deep knowledge about these buildings and is not scalable for bigger areas. On the other hand, automatic approaches have the main advantage that they can cover larger areas without too much supervision. These approaches are based on Machine learning (ML) (Roschlaub et al. 2020;Bandam et al. 2022) or Deep Learning (DL) (Solovyev 2020;Sun et al. 2021) algorithms trained upon aerial imagery to learn how to detect specific buildings.
The main difference between ML and DL algorithms for building detection is the need for a feature extraction stage prior to ML model training, which is usually performed by a specialist who provides the most suitable set of features based on their knowledge of these buildings (e.g. geometric dimensions, appearance, texture, colour, and shape). Their DL counterparts perform end-to-end learning during training, where the algorithm learns the task-specific features. However, a large set of samples (dataset) is necessary to achieve good performance and to generalise well to areas unseen during training. DL building detection can be performed by semantic segmentation, object detection, and instance segmentation architectures. While semantic segmentation refers to the task of assigning a class label to each pixel in an image (i.e. "background" and "building" in our case), object detection aims to find the building in an image by providing the coordinates of a rectangle enclosing it. Instance segmentation can be viewed as a mixture of semantic segmentation and object detection, where objects are detected and delineated, and instances of the same class are differentiated, giving the advantage of counting the objects present in the image and being robust against occlusions between objects of the same class.
Many different methods have been proposed in the literature, which consist of two main steps: building footprint extraction and roof type classification. For instance, Roschlaub et al. (2020) detect buildings based on manually defined threshold values for building colours and heights, from RGB images and Normalised Digital Surface Models (nDSM). The detected buildings are then compared with cadastral maps to find undocumented buildings in Germany. Taha and Ibrahim (2022) used Random Forest (RF) (Breiman 2001) for building detection in Greater Cairo, trained on the fusion of spectral, height, and textural features. Different feature fusion techniques were assessed, and wavelet-PCA obtained the best results with an overall accuracy (OA) of 97%. On the other hand, Norman et al. (2021) applied an object-based image analysis (OBIA) approach based on multi-resolution segmentation and Support Vector Machines (SVM) (Cristianini and Shawe-Taylor 2000) for urban building detection in Malaysia. The best configuration of parameters and features achieved an OA of 93%. Bandam et al. (2022) proposed a methodology for the classification of building types (e.g. residential and industrial) in Germany based on an RF trained using OpenStreetMap data (e.g. building footprints) and other sources (e.g. building height and land cover data), obtaining low error rates of up to 3.6% for residential buildings compared with official statistics. These approaches rely on manual tuning of parameters or ML models, which usually requires a specialist with deep knowledge of each application and manual tuning/selection of its parameters.
On the other hand, DL approaches learn to extract similar knowledge from the data. He et al. (2021) performed automatic extraction of buildings using a modified U-Net (Ronneberger et al. 2015) for semantic segmentation and Conditional Random Fields (CRF) (Lafferty et al. 2001) as post-processing. The U-Net model was modified by adding Batch Normalisation layers and replacing ReLU with ELU, producing an F1-score of 80% with a gain of up to 4% over the base U-Net. In Solovyev (2020), an ensemble of pre-trained neural networks [EfficientNet (Tan and Le 2019), ResNet (He et al. 2016), DenseNet (Huang et al. 2017), and Inception ResNet v2 (Szegedy et al. 2017)] is used for roof material classification from aerial imagery and roof polygons. The output of these networks is combined with contextual data about neighbouring buildings to refine the results. Sun et al. (2021) employed Generative Adversarial Networks (GANs) (Goodfellow et al. 2020) to improve building extraction based on DL architectures for semantic segmentation [DeepLab (Chen et al. 2018), U-Net, and Fully Convolutional Networks (FCN) (Maggiori et al. 2016)] using GF-2 imagery from the southern Qinghai-Tibet plateau. In GANs, two networks, the generator (G) and discriminator (D), are trained simultaneously in an adversarial manner. During training, G learns to produce images (fake images) similar to a given reference (real image) to fool D, whereas D learns to distinguish between the real and fake images generated by G. The new feature vectors extracted by the discriminator managed to increase the accuracy of each method for high intra-class variability scenarios (e.g. buildings in forests, wastelands, or urban areas), resulting in mIoU (mean Intersection over Union) values of up to 76%. Zhou et al. (2019) used Mask R-CNN for building detection and segmentation from airborne very high-resolution (VHR) images. Mask R-CNN managed to detect buildings at different scales, outperforming FCN by 15% and producing very accurate building edges.
In this context, we identified three main methods for building detection based on DL architectures: semantic segmentation, object detection, and instance segmentation. Semantic segmentation architectures typically rely on an encoder-decoder structure with mainly convolutional layers. The encoder extracts representative attributes while reducing the input image, whereas the decoder reconstructs the input image, preserving the spatial localization of the extracted attributes. Among the SOTA architectures for semantic segmentation, U-Net and DeepLab variants have been successfully employed for building segmentation and can accurately delineate building edges. U-Net uses a lower upsampling factor than DeepLab for reconstruction, preserving finer details. On the other hand, DeepLab extracts features using atrous convolutions and spatial pyramids, capturing more information at different scales.
Object detection methods can be categorised into two main groups: one- and two-stage. One-stage detectors are based on a single feed-forward FCN trained to directly provide bounding boxes and object classifications. YOLO (Redmon et al. 2016) and its variants were among the first architectures proposed, focusing on a trade-off between fast inference and high detection accuracy. RetinaNet (Lin et al. 2020) introduced the Focal Loss, a loss function designed to overcome the extreme foreground-background imbalance in object detection, while using a lightweight backbone such as MobileNets (Howard et al. 2017) to improve processing speed. In contrast, two-stage methods split the detection process into region proposal and classification stages. First, several object candidates, known as regions of interest (RoIs), are proposed following heuristics such as selective search or region proposal models [e.g. RPN (Ren et al. 2017)]. These candidates are then classified, and the RoIs are refined using a Convolutional Neural Network (CNN) trained for classification and regression [e.g. R-CNN, Faster R-CNN (Ren et al. 2017)].
Instance segmentation is based on four main approaches: the classification of mask proposals (Hariharan et al. 2014), detection plus segmentation, segmentation plus clustering (Kirillov et al. 2017), and dense sliding windows (Pinheiro et al. 2015). In the first approach, mask candidates are proposed, and features are extracted from these masks; the masks are then classified and refined to improve the object edges. In the second approach, architectures similar to two-stage detectors are modified to generate an object mask. For instance, Mask R-CNN extends Faster R-CNN by adding an object mask prediction branch to the object bounding box recognition branch. In the third approach, semantic segmentation is used in conjunction with clustering methods to distinguish between object instances and background. In Kirillov et al. (2017), two independent branches produce per-pixel class and edge scores; the edge scores are used to extract superpixels, which are merged based on the class scores to generate the object instances. Finally, in the fourth approach, a dense sliding window is applied over the entire image, where each window passes through a model with heads for segmentation and object detection. In Pinheiro et al. (2015), these heads are parallel branches predicting a segmentation mask for an object located at the centre and an object score determining whether an object exists in the window.
All the aforementioned methods focus on extracting buildings for later classification. However, to the best of our knowledge, there are no related works on the direct detection of a specific type of building. In this context, this work aims to contribute to the understanding of how state-of-the-art (SOTA) DL methods perform on this specific task, employing VHR remote sensing imagery. Specifically, our analysis focuses on DL semantic segmentation and object detection methods for building detection. Instance segmentation methods will be explored in the future, as their main advantages, such as counting instances of the same class and being robust against occlusions, are not critical requirements in our application (system halls do not occlude each other in aerial images).
We thus propose a methodology for the automatic detection of system halls of the High Modernism period from aerial imagery based on DL architectures. The aerial imagery is composed of Digital Orthophotos (DOP) and nDSM from different states in Germany. DL SOTA semantic segmentation and object detection architectures are extensively assessed in terms of accuracy metrics and visual predictions to determine the best model for automatic detection. Later, these models are employed to increase the training dataset with new detections in an active loop to further improve their performance.
To this end, the research questions posed are:
• Is it possible to detect a specific building type from high-resolution DOP?
• Is height information from nDSM important for this specific building detection task?
• Which method is more suitable for building detection: semantic segmentation or object detection methods?
• How are building detection methods influenced by the spatial resolution of the images used for training?

Dataset
The dataset employed in this work is composed of three main parts: locations, images, and labels. First, the locations (Heinrich et al. 2022) of many system halls of a manufacturer from the former German Democratic Republic (GDR) were found by visual inspection using aerial photos and 3D views from Google Earth across different states in East Germany (see Fig. 1). Then, image tiles were collected to cover all these locations (green dots in Fig. 1). A total of 60 image tiles were used from different states in Germany: Berlin (5), Brandenburg (8), Mecklenburg-Vorpommern (25), Sachsen (13), Sachsen-Anhalt (7), and Thüringen (2). These tiles have sizes of 5 K × 5 K or 10 K × 10 K pixels, with 20 cm spatial resolution. Each image tile contains two types of information: spectral and height (see Fig. 2). The spectral information is obtained from DOP with four channels: Red, Green, Blue, and Infrared. The height information comes from 20 cm spatial resolution nDSM, created by subtracting Digital Elevation Models (DEM) from Digital Surface Models (DSM), which can be image-based or LiDAR-based depending on the state. DOP, DEM, and DSM were obtained from the geoportals of each state. Differences were observed between the LiDAR-based nDSM collected from the different states. The point density of the LiDAR data varied from 4 to 12 points/m², and regular grids were used for resampling to generate the DSM and DEM. The grid size also varies depending on the state: in Thüringen, grids of 1 m × 1 m were used, whereas in Sachsen-Anhalt, the grid was 20 cm × 20 cm. The labels to train the DL algorithms for each task (semantic segmentation and object detection) were created manually using the Quantum GIS (QGIS) software to produce vector (shapefiles with polygons) and raster (TIF images) files. For semantic segmentation, polygons were delineated considering occlusion by objects such as trees or shadows.
Then, the polygons were rasterized to produce a label image with the same size as its corresponding DOP and nDSM. For object detection, OBBs were created automatically from the aforementioned polygons, recording the four vertices for precise localization. Among the different types of system halls, we focussed on two specific types: KT 60 L and Ruhland. Figure 3 presents an illustration (first row) and 20 cm spatial resolution DOP (second and third rows) of each type. Our objective is to know where a system hall is located, not to differentiate between specific types. Thus, our dataset comprises only two classes: "system hall" and "background".
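Since the height channel is central to the dataset, the nDSM derivation described above (DSM minus DEM) can be illustrated with a minimal stdlib sketch; the grids and height values below are invented for illustration:

```python
# Minimal sketch: an nDSM is the per-cell difference between a Digital
# Surface Model (terrain plus objects) and a Digital Elevation Model
# (bare terrain), leaving only above-ground heights such as hall roofs.
# Grids are plain nested lists here; real pipelines use raster arrays.

def compute_ndsm(dsm, dem):
    """Subtract the DEM from the DSM cell by cell."""
    return [[s - e for s, e in zip(srow, erow)]
            for srow, erow in zip(dsm, dem)]

dsm = [[105.0, 112.5], [104.0, 104.2]]  # surface heights in metres
dem = [[100.0, 100.5], [101.0, 101.2]]  # terrain heights in metres
ndsm = compute_ndsm(dsm, dem)
# A hall roof ~12 m above ground stands out clearly from bare ground (~0 m).
```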

Building Detection Using Deep Learning
Building detection using DL architectures can be performed by semantic segmentation, object detection, or instance segmentation algorithms. Semantic segmentation aims to find each pixel that belongs to the building of interest; the localization of the building is then obtained from the contour of these pixels (see Fig. 4b). An FCN is usually employed for this purpose, where only convolutional layers are used to extract representative features while the feature map size is reduced. The final feature map is then resampled to the original size to produce the segmentation outcome.
On the other hand, object detection algorithms directly provide the coordinates of a building in an image. These coordinates can describe a bounding box (BB) or an oriented bounding box (OBB) (see Fig. 4c). While a BB can be represented by the x and y coordinates of a vertex together with its width and height, an OBB requires one more parameter, a rotation angle, which indicates the orientation of the BB with respect to one of its vertices. Alternatively, an OBB can be defined by its four vertices, a notation widely employed to train different object detection methods. The instance segmentation outcome provides both the pixel-wise object mask and the object bounding box. This richer and more structured outcome makes it possible to differentiate between object instances in order to count and monitor them, even in scenarios with occlusions between them. In the following sections, a brief description of each DL architecture employed in this paper is provided. These architectures were selected based on the SOTA approaches as well as their suitability for these tasks.
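As an illustration of the two OBB notations mentioned above, the following sketch converts a centre/size/angle parameterisation into the four-vertex notation; the function name and the counter-clockwise angle convention are assumptions made for this example:

```python
import math

def obb_to_vertices(cx, cy, w, h, angle_deg):
    """Convert a centre/size/angle OBB into its four corner vertices.

    The four-vertex notation is the one commonly used to train oriented
    object detectors; the angle is measured counter-clockwise here.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    # corners relative to the box centre, before rotation
    corners = [(-w / 2, -h / 2), (w / 2, -h / 2),
               (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * cos_a - y * sin_a, cy + x * sin_a + y * cos_a)
            for x, y in corners]

# An axis-aligned BB is just the special case angle = 0.
verts = obb_to_vertices(50.0, 30.0, 20.0, 10.0, 0.0)
# verts[0] is the top-left corner (40.0, 25.0)
```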

U-Net
Originally proposed by Ronneberger et al. (2015) for biomedical image segmentation, U-Net is an FCN with two main paths: contraction and expansion. During the contraction path (Fig. 5, left side), the spatial resolution is reduced while the number of features is increased. This is performed by a series of convolutional blocks, composed of convolutional layers with ReLU (Rectified Linear Unit) as the activation function, followed by a max pooling layer. Then, in the expansion path (Fig. 5, right side), the feature maps are gradually up-sampled to the original input size by a series of transposed convolutions, where the feature maps are concatenated with high-resolution features from the contracting path. In this way, U-Net extracts representative features while preserving their precise location.
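The spatial bookkeeping of the two paths can be illustrated with a small sketch; the depth of four pooling steps follows the original architecture, while the helper function itself is only illustrative:

```python
# Sketch of the spatial sizes in U-Net's two paths, assuming four 2x2
# max-pooling steps (the depth of the original architecture). Each
# contraction step halves the resolution; each expansion step doubles it
# back, which is where the skip connections from the contracting path
# are concatenated with the up-sampled feature maps.

def unet_sizes(input_size=512, depth=4):
    contraction = [input_size // (2 ** i) for i in range(depth + 1)]
    expansion = list(reversed(contraction))
    return contraction, expansion

down, up = unet_sizes(512)
# down == [512, 256, 128, 64, 32]; up mirrors it back to 512
```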
The main idea behind U-Net can be extrapolated to other architectures by using any CNN in the contraction path. Thus, pre-trained CNNs, referred to as backbones, can be used as robust feature extractors in this path. In this work, we use three different backbones chosen for their good performance as feature extractors (Elharrouss et al. 2022): VGG16 (Simonyan and Zisserman 2015), ResNet50 (He et al. 2016), and EfficientNet (Tan and Le 2019).

DeepLabV3+
DeepLabV3+ (Chen et al. 2018) is the latest version of the DeepLab series developed by Google. It achieved SOTA results on the PASCAL VOC 2012 semantic segmentation benchmark. DeepLabV3+ introduced an encoder-decoder structure into the model (see Fig. 6). The encoder is composed of the DeepLabV3 architecture with an Atrous Spatial Pyramid Pooling (ASPP) module, which reduces the feature map size by a factor of 16. To return to the original input size, the decoder applies two bilinear up-samplings by a factor of 4 each. After the first upsampling, the feature maps are concatenated with their corresponding low-level features. The upsampling in two steps produced better results than directly applying a bilinear upsampling by a factor of 16.
In the original paper, two versions are proposed depending on the network backbone: ResNet and Xception (Chollet 2017). In this work, we use ResNet50 as the backbone.

YOLOv5
YOLOv5 ("You Only Look Once"), developed by Ultralytics (Jocher et al. 2022), is one of the most popular object detection algorithms due to its speed and accuracy. YOLOv5 divides each image into a grid system, where each cell is responsible for detecting objects within itself. As a one-stage object detector, it has three main parts: Backbone, Neck, and Head. The Backbone is in charge of extracting important features from the given input image. In YOLOv5, CSPNet (Cross Stage Partial Networks) (Wang et al. 2020) is used as the backbone. The Neck is mainly used to generate feature pyramids, which help the model generalise well across object scales. PANet, proposed by Liu et al. (2018), is employed as the Neck. Finally, the Head performs the final detection: it applies anchor boxes to the features and generates the final output vectors with class probabilities, objectness scores, and bounding boxes (see Fig. 7).
The Circular Smooth Label (CSL) (Yang and Yan 2020) is a classification-based rotation detection technique for arbitrary-oriented object detection. It treats the prediction of the object angle as a classification problem to better constrain the prediction results. CSL can be used with different object detection algorithms as the backbone. In our experiments, we use YOLOv5 as the backbone with CSL for OBB detection.
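The idea behind CSL can be sketched as follows: instead of a one-hot angle class, neighbouring angle bins receive smoothly decaying weights that wrap around the circle, since angles are periodic. The bin count and window radius below are illustrative choices, not the values used in our experiments:

```python
import math

# Hedged sketch of a Circular Smooth Label target: the true angle bin
# gets weight 1, nearby bins get Gaussian-decaying weights, and the
# window wraps around because an angle of 179 degrees is adjacent to 0.

def csl_target(angle_deg, num_bins=180, radius=6.0):
    centre = int(round(angle_deg)) % num_bins
    label = []
    for b in range(num_bins):
        # circular distance between bin b and the true angle bin
        d = min(abs(b - centre), num_bins - abs(b - centre))
        label.append(math.exp(-(d ** 2) / (2 * radius ** 2)) if d <= radius else 0.0)
    return label

t = csl_target(179.0)
# the peak sits at bin 179, and bin 0 gets the same weight as bin 178
# thanks to the circular wrap-around
```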

Methodology
The methodology employed for the automatic detection of system halls is divided into two main stages (see Fig. 8): Development and Search. In the Development stage, two mutually exclusive sets, training and testing, are created from a dataset containing images with their respective labels for supervised training of the algorithms. The DL detection algorithm is trained on the training set, varying different hyper-parameters to search for the configuration that provides the best results. Once the algorithm is trained, the testing set is used to assess its generalisation quantitatively (e.g. accuracy metrics) and qualitatively (e.g. visual inspection). Then, in the Search stage, images from unknown locations, hereafter referred to as the Blind Test (i.e. without known labels), are used to search for system halls. All newly detected buildings are assessed manually to discard false positives, while true positives are added to the dataset to further improve the performance of the detection algorithm. However, some false negatives (i.e. undetected buildings) are possible in this process. For that reason, this stage can be repeated several times to diminish the number of false negatives. The main focus of this work is to report the results of the Development stage by assessing different models. A model is then selected and applied to the Blind Test set, where new detections can be found; these will be added to the dataset to improve the trained models in the future. For the Blind Test set, we collected 55 image tiles from different states: Berlin (23), Brandenburg (26), and others (6) (orange dots in Fig. 1).
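Schematically, the Development and Search stages form a loop that grows the dataset over time. The sketch below uses hypothetical placeholder functions (`train`, `detect`, `review`) in place of the actual model training, inference on Blind Test tiles, and manual verification:

```python
# Schematic sketch of the two-stage methodology; train/detect/review are
# hypothetical stand-ins, not the actual implementation.

def run_search_loop(dataset, blind_tiles, rounds=2):
    for _ in range(rounds):
        model = train(dataset)                            # Development stage
        candidates = detect(model, blind_tiles)           # Search stage
        confirmed = [c for c in candidates if review(c)]  # manual check
        dataset = dataset + confirmed                     # grow the training set
    return dataset

# Stub implementations so the sketch runs end to end.
def train(dataset):
    return {"n_train": len(dataset)}

def detect(model, tiles):
    return [f"hall@{t}" for t in tiles]

def review(candidate):
    return True  # in practice: manual inspection discards false positives

grown = run_search_loop(["hall_a", "hall_b"], ["tile_1"])
```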

Experimental Protocol
The experimental protocol was designed to evaluate the performance of semantic segmentation and object detection DL methods at different spatial resolutions (20, 40, 60, 80, and 100 cm), as well as to measure the effect of including height information when training the models (DOP vs. DOP + nDSM). The 60 image tiles in the dataset were randomly split into three mutually exclusive sets: train (26), validation (11), and test (18). Then, each image tile was resampled to each spatial resolution, and patches of 512 × 512 pixels were extracted with a stride of 256, generating a 50% overlap between patches. For testing, the prediction for the entire image tile was reconstructed from its patches; however, the assessment of the results was performed at the patch level.
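The patch extraction with overlap can be sketched as follows, assuming a 5 K tile of 5000 px per axis:

```python
# Sketch of the sliding-window patch extraction: 512x512 patches with a
# stride of 256 give a 50% overlap between neighbouring patches.

def patch_offsets(tile_size, patch=512, stride=256):
    """Top-left offsets of all full patches along one tile axis."""
    return list(range(0, tile_size - patch + 1, stride))

offsets = patch_offsets(5000)
# 18 positions per axis for a 5K tile; consecutive patches share 256 px
```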
As the baseline method, RF was selected for pixel-wise semantic segmentation. Four RF models were trained on DOP only: RF_pixel, RF_spatial, RF_texture, and RF_hybrid. RF_pixel was trained on feature vectors containing only each pixel's values (Red, Green, Blue, and Infrared). RF_spatial used feature vectors composed of all pixel values in a 5 × 5 neighbourhood. RF_texture employed feature vectors formed by 13 Haralick texture features (Haralick et al. 1973) computed in windows of 13 × 13, averaged across four directions (0°, 45°, 90°, and 135°) for rotational invariance. The neighbourhood sizes for RF_spatial and RF_texture were selected after a set of initial experiments seeking a trade-off between performance and computational cost. The larger window for RF_texture was selected to increase the spatial context and capture more pixel intensity variations for the grey-level co-occurrence matrix (GLCM). Finally, RF_hybrid concatenates the features of the RF_spatial and RF_texture models. All RF models were trained with 300 trees, a maximum depth of 10, and weights computed from the class distribution to deal with the unbalanced dataset. These weights are extremely important, especially for coarser resolutions, where the number of samples of the class "system hall" diminishes rapidly with the spatial resolution.
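The class weights derived from the class distribution can be sketched with the common inverse-frequency formula; the exact weighting scheme used in our experiments may differ:

```python
# Sketch of class weights computed from the class distribution, used to
# counter the imbalance between "background" and "system hall" pixels.
# The inverse-frequency form total / (n_classes * count) shown here is
# one common choice (an assumption for this illustration).

def class_weights(counts):
    total = sum(counts.values())
    n_classes = len(counts)
    return {c: total / (n_classes * n) for c, n in counts.items()}

w = class_weights({"background": 990_000, "system hall": 10_000})
# the rare class gets a weight 99x larger than the frequent one
```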
For semantic segmentation, DeepLabV3+ and U-Net with VGG16, ResNet50, and EfficientNet B3 as backbones, pre-trained on the ImageNet dataset (Deng et al. 2009), were selected. The models were trained on patches of 512 × 512 pixels with DOP and DOP + nDSM, where the ratio between the number of negative patches (i.e. without the class of interest) and positive patches (i.e. with the class of interest) was 2.0. The implementation of these models was based on the segmentation_models package for Python. The training configuration of all models was the following: Adam optimizer (Kingma and Ba 2015) with a learning rate of 10⁻⁴, reduced by 10% on a plateau after 10 epochs, and 200 epochs with early stopping after 40 epochs without improvement in the mean Intersection over Union (mIoU) on the validation set. The early stopping criterion was selected empirically from a set of initial experiments analysing the evolution of the metrics during training without early stopping. In addition, data augmentation techniques such as horizontal and vertical flips, random rotations (±5°, 90°), and random shifts (1%) were applied to increase model robustness. These transformations were selected to create new versions of each patch without modifying the scale and appearance of the buildings, because each system hall type has specific geometric dimensions and roof materials. Due to the highly imbalanced dataset, the Focal Loss (FL) was used to make the model pay more attention to hard-to-classify samples. The FL is computed as FL(p_t) = −(1 − p_t)^γ log(p_t), where p_t is the probability predicted by the model and γ is a parameter that has to be tuned (see Fig. 9). We varied γ from 1 to 5 in our experiments.
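For reference, the Focal Loss for a single prediction can be written out directly; the snippet below uses the unweighted form without the class-balancing factor α:

```python
import math

# The Focal Loss for one prediction: FL(p_t) = -(1 - p_t)**gamma * log(p_t),
# where p_t is the predicted probability of the true class. With gamma = 0
# it reduces to cross-entropy; larger gamma down-weights well-classified
# (easy) samples so training focuses on the hard ones.

def focal_loss(p_t, gamma):
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy, hard = 0.95, 0.30  # an easy and a hard sample
ratios = {g: focal_loss(hard, g) / focal_loss(easy, g) for g in (0, 2, 5)}
# the hard/easy loss ratio grows with gamma: the easy sample's
# contribution shrinks much faster than the hard sample's
```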
For object detection, three versions of YOLOv5 were selected: nano (n), small (s), and medium (m), trained to produce OBBs with CSL. The implementation of each model is based on the YOLOv5OBB repository. All hyper-parameters for the models are those recommended in the original paper. Each model was pre-trained on the COCO dataset (Lin et al. 2014). The same data augmentation techniques and loss function as for semantic segmentation were employed, where the parameter γ varied from 1 to 5. The ratio between negative and positive patches was 0.2. As these models only accept three-channel images, false RGB colour compositions were created (see Fig. 10) with the following configuration: Infrared (R), Grey (G), and nDSM (B), where Grey is the mean of the Red, Green, and Blue channels of the DOP, with equal weights for each channel.
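The false-colour composition can be sketched per pixel as follows; channel scaling and normalisation are omitted for clarity:

```python
# Sketch of the false RGB colour composition used to feed four-channel
# DOP plus nDSM into three-channel detectors: R = infrared, G = grey
# (the equal-weight mean of red, green, and blue), B = the nDSM channel.

def false_rgb(red, green, blue, infrared, ndsm):
    grey = (red + green + blue) / 3.0  # equal weights for each channel
    return (infrared, grey, ndsm)

pixel = false_rgb(red=90.0, green=120.0, blue=90.0, infrared=200.0, ndsm=40.0)
# -> (200.0, 100.0, 40.0)
```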
The assessment of the results obtained by each algorithm is performed quantitatively and qualitatively. For the quantitative evaluation, two metrics are used: mIoU and Detection Rate, computed at the patch level before post-processing. The mIoU is computed as the intersection divided by the union of the prediction and the ground truth (see Fig. 11). It is dimensionless and varies from 0 to 1, where 1 means complete overlap and 0 represents no overlap. We report this metric as a percentage, i.e. the mIoU is multiplied by 100%.
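The IoU computation on binary masks can be sketched as follows; the flattened-list representation of the masks is only for illustration:

```python
# Sketch of IoU on binary masks: shared positive pixels (intersection)
# divided by pixels positive in either mask (union), as a percentage.

def iou_percent(pred, truth):
    inter = sum(p and t for p, t in zip(pred, truth))
    union = sum(p or t for p, t in zip(pred, truth))
    # convention: two empty masks overlap perfectly
    return 100.0 * inter / union if union else 100.0

pred  = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
score = iou_percent(pred, truth)  # 2 shared pixels / 4 covered = 50.0
```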
The Detection Rate is computed following Eq. 1, where N_detected_buildings are those buildings found by the algorithm with an IoU ≥ 0.5 (covering at least 50% of the building) per patch, and N_total_buildings represents all buildings present in the patches. To avoid split buildings caused by the stride used to extract the patches, only buildings with a minimum area of 2500 × r² are considered, where 2500 is slightly less than a quarter of the area of a building in a 20 cm spatial resolution DOP, and r is a scaling factor depending on the spatial resolution.
All experiments were performed using Google Colab Pro (Bisong 2019), which provided Tesla P100-PCIE-16GB GPUs and an Intel Xeon CPU at 2.20 GHz with 8 cores and 26 GB of RAM. The whole methodology was implemented in Python using TensorFlow (TensorFlow Developers 2022), Keras (Chollet et al. 2015), and PyTorch (Paszke et al. 2019) as frameworks. The processing time to train each model was approximately 2-3 h per experiment for semantic segmentation and 1-2 h for object detection, depending on the model, backbone, and early stopping. For testing, inference was very fast: less than 5 min for all patches in both cases.
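The Detection Rate with the minimum-area filter can be sketched as follows; here r is taken as a given parameter (its exact definition is resolution-dependent), and the matching of predictions to ground-truth buildings is assumed to have already produced per-building IoUs:

```python
# Sketch of the Detection Rate: the share of ground-truth buildings
# matched by a prediction with IoU >= 0.5, after filtering out buildings
# below the minimum-area threshold 2500 * r**2.

def detection_rate(ious, areas, r, min_area_base=2500, iou_thr=0.5):
    kept = [iou for iou, area in zip(ious, areas)
            if area >= min_area_base * r ** 2]
    if not kept:
        return 0.0
    detected = sum(iou >= iou_thr for iou in kept)
    return 100.0 * detected / len(kept)

# three buildings large enough to count, one of them detected poorly
rate = detection_rate(ious=[0.9, 0.6, 0.3], areas=[4000, 5000, 6000], r=1.0)
# -> ~66.7% (2 of 3 buildings detected)
```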

Results
In the following subsections, the results obtained using the aforementioned experimental protocol are reported. These experiments were conducted during the Development stage of the proposed methodology (see Fig. 8). At the end of this section, the results of applying the best model to the Blind Test set are provided; these detections will be added to the dataset to improve the training of the models in the future. In all cases, the worst results were achieved at the lowest spatial resolution (100 cm), and the mIoU decreases as the spatial resolution coarsens. This is plausible given the reduction in the number of samples of the class "system hall" with coarser spatial resolution.

Random Forest
Regarding the influence of each feature employed, from Table 1 we can see that the inclusion of spatial context (pixel neighbourhood) in RF spatial resulted in mIoU gains at every spatial resolution compared with pixel-wise classification (RF pixel). Similarly, the texture features used by RF texture achieved better results, demonstrating that texture and spectral appearance are both important for the segmentation of these buildings. This is confirmed by combining both features, as in RF hybrid, which obtained the best results at all spatial resolutions.
Concerning the performance across different spatial resolutions, we observe from Table 1 a high mIoU drop when the spatial resolution changes from 20 to 40 cm. This might be related to the 75% decrease in the number of training samples for the pixel-wise models. The mIoU then decreases gradually with the spatial resolution, where the number of training samples is again reduced but by a smaller factor. The lowest mIoU was obtained at the coarsest spatial resolution (100 cm), with values below 20% for all methods. Figure 12 illustrates the predictions obtained using each RF algorithm (rows) for different spatial resolutions (columns). Patches of 512 × 512 from the test set are presented with their prediction (blue) and ground truth (orange). We can see how RF spatial diminishes the salt-and-pepper noise effect typical of pixel-wise classifications (RF pixel) (Fig. 12, first and second rows). However, neither method properly differentiated a system hall from other buildings, as shown by the many false positives (e.g. Fig. 12, first column, long buildings with square patterns on the roofs).

Fig. 11 Representation of intersection and union (grey areas) of the prediction (blue) and its corresponding ground truth (orange) for semantic segmentation and object detection
As they rely only on spectral features (pixel intensity values), these models seem to have learned merely to detect bright pixels, the common appearance of many system hall roofs. It is worth mentioning that both methods were more accurate in terms of edge detection, as they delineated each building with high precision at the edges. On the other hand, the texture features in RF texture produced fewer false positives at the cost of less precise segmentation edges (e.g. Fig. 12, third row, first and second columns). The fusion of both features in RF hybrid, spectral and texture, represents a trade-off between the other methods, with fewer false positives than RF pixel and RF spatial and more accurate building edges than RF texture (e.g. Fig. 12, first and second columns). Nevertheless, there are still a few false-positive detections, which can be reduced by using morphological operations (e.g. opening and closing), and the building edges are less accurate at coarser resolutions (e.g. Fig. 12, last two columns), which might be related to the reduction in the number of training samples as the spatial resolution decreases.

Semantic Segmentation
In general terms, the inclusion of height information from the nDSM brought improvements for all models (e.g. at 20 cm, U-Net ResNet50 gained 12.6% and 19% in terms of mIoU and Detection Rate, respectively). The results worsen as the spatial resolution decreases, which is understandable, as it becomes more difficult for all models to capture fine-grained patterns with less information. U-Net ResNet50 achieved the best mIoU at almost all spatial resolutions, followed by U-Net EfficientNet B3 and then DeepLabV3+, while U-Net VGG16 obtained the worst results at almost all spatial resolutions. Moreover, in terms of stability against changes in spatial resolution, almost all models presented drops in mIoU at coarser resolutions.
In particular, U-Net EfficientNet B3 was the most stable between 40 and 80 cm spatial resolution, with mIoU drops only from 20 to 40 cm and from 80 to 100 cm. Furthermore, all DL semantic segmentation models outperformed their RF counterparts at all spatial resolutions (see Table 1).
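The morphological post-processing suggested earlier for reducing RF false positives (opening and closing) could look like the following sketch, using SciPy's binary morphology; the structuring-element size is an illustrative choice, not a value taken from the paper.

```python
import numpy as np
from scipy.ndimage import binary_closing, binary_opening

def clean_mask(mask, size=3):
    """Remove small isolated false positives (opening) and fill small
    holes inside detections (closing) with a square structuring element."""
    selem = np.ones((size, size), dtype=bool)
    opened = binary_opening(np.asarray(mask, dtype=bool), structure=selem)
    return binary_closing(opened, structure=selem)
```

Larger structuring elements remove larger spurious blobs but also erode thin parts of true detections, so the size should be tuned per spatial resolution.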

Influence of height information
By comparing DeepLabV3+ and U-Net with the same backbone (ResNet50), we observed that U-Net achieved better mIoU at almost all spatial resolutions. This might be related to how the upsampling is performed in each model. DeepLabV3+ up-samples the feature maps by a factor of four twice, concatenating feature maps of the same scale between up-samplings. U-Net performs a similar process but progressively, up-sampling the feature maps by factors of two and concatenating the corresponding feature maps after every upsampling. In this way, U-Net can reconstruct the input with higher spatial accuracy than DeepLabV3+, bringing gains in mIoU, which is more sensitive at the pixel level.
Notice that a high mIoU does not always imply a high Detection Rate. The main reason is that a model may detect only part of a building, which contributes to the mIoU but may not be sufficient for the building to count as detected by the Detection Rate. U-Net EfficientNet B3 obtained the highest Detection Rate at almost all spatial resolutions, with only one drop from 20 to 40 cm spatial resolution. Further, it was the only model with a Detection Rate above 50% at 100 cm spatial resolution, far ahead of the others (e.g. DeepLabV3+ ResNet50 reached a Detection Rate of 20%). This might be due to the better balancing of the network components (e.g. network width, depth, and resolution) by the EfficientNet scaling coefficients.
Based on the results presented in Table 2, we conclude that the nDSM is an important feature to consider when training building detection algorithms. To select a model, it is necessary to examine which metric better reflects the requirements: mIoU is more sensitive to incorrectly detected buildings (false positives), whereas the Detection Rate focuses on correctly detected buildings (true positives), discarding split buildings and being agnostic to incorrect detections. Thus, U-Net ResNet50 may be the best option based on mIoU, representing a trade-off between true positives and false positives, while U-Net EfficientNet B3 can be used to detect more buildings based on the Detection Rate, although it might require post-processing to discard false-positive detections. Figure 13 displays the predictions obtained by U-Net ResNet50 and U-Net EfficientNet B3 using only DOP and DOP + nDSM, the DL semantic segmentation models that achieved the best results in terms of mIoU and Detection Rate, for different spatial resolutions. Patches of 512 × 512 from the test set are presented with their prediction (blue) and ground truth (orange).
Generally, we noticed that the addition of the nDSM to train the models improved the results by reducing the number of false positives and increasing the number of true positives. In U-Net ResNet50 (Fig. 13, first and second rows), at a 20 cm spatial resolution, the inclusion of the nDSM improved the results by detecting two more buildings, one of which was split. For 40 cm, the nDSM refined the segmentation of the building located at the top-right of the patch by discarding the area to the left of the building, which might correspond to an extension of the main building. For 60 and 80 cm, the nDSM brought similar improvements by refining the detection edges, although some false negatives remained. At 100 cm, there were only slight improvements in the building edges, demonstrating that this is the most challenging scenario. From Figs. 12 and 13, we can see that the DL semantic segmentation models did not produce as many false positives as the RF variants, and the building edges they generated remained more consistent and accurate as the spatial resolution decreased. These scenarios are more challenging for RF because its variants are pixel-wise and the number of samples diminishes at coarser resolutions.
For U-Net EfficientNet B3 (Fig. 13, third and fourth rows), similar behaviour was observed. For instance, at a spatial resolution of 40 cm, the nDSM had a high impact on the detection results by correcting the building edges. This can be seen in the group of three buildings located next to each other: they were initially segmented as a single building because of their proximity, but with the nDSM the three buildings were segmented separately. At coarser resolutions (80 and 100 cm), this effect was even stronger, making it more difficult to separate these buildings; in these scenarios, the methods detect only part of each building or group the buildings into a single one.
Based on these results, both quantitative and qualitative, we concluded that it is important to include the nDSM when training a DL semantic segmentation algorithm for building detection. For that reason, the experiments in the following sections were performed using DOP + nDSM for training. Table 3 summarises the results obtained by the DL methods for semantic segmentation at different spatial resolutions, trained upon DOP + nDSM and varying the γ parameter of the FL function. The mIoU and Detection Rate (in parentheses) are reported, with the best results in bold. A similar colour code to that in Sect. 6.2.1 has been employed for a better analysis of the results: the top three for each spatial resolution (each column in the table) are highlighted with greenish (first), yellowish (second), and reddish (third) background colours, where more saturated colours (green, yellow and red) refer to mIoU and less saturated colours (light red, mustard and light green) to Detection Rate.

Fig. 13 Patches of 512 × 512 from the test set with the prediction (blue) and its corresponding ground truth (orange). The first two rows correspond to U-Net ResNet50 and the last two rows to U-Net EfficientNet B3, where each pair of rows shows results obtained using only DOP and DOP + nDSM. From left to right, the spatial resolutions are 20, 40, 60, 80 and 100 cm
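One plausible way to feed DOP + nDSM to a segmentation network is to stack the height as an additional input channel. The paper does not spell out the exact fusion, so the sketch below is an assumption; in particular, min-max normalisation of the nDSM is an illustrative choice.

```python
import numpy as np

def stack_inputs(dop, ndsm):
    """Append the normalised nDSM as a fourth channel to the DOP patch.
    dop: (H, W, 3) uint8 orthophoto; ndsm: (H, W) height above ground."""
    ndsm = ndsm.astype(np.float32)
    rng = ndsm.max() - ndsm.min()
    ndsm_norm = (ndsm - ndsm.min()) / rng if rng > 0 else np.zeros_like(ndsm)
    # Scale the DOP to [0, 1] and concatenate along the channel axis
    return np.concatenate([dop.astype(np.float32) / 255.0,
                           ndsm_norm[..., None]], axis=-1)
```

With this design, only the first convolutional layer of a pretrained backbone needs to be adapted to accept four channels instead of three.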

Influence of the Focal Loss Function
In general terms, it is worth tuning the γ parameter, as important improvements can be achieved in terms of mIoU (e.g. a gain in mIoU of 8.4% for U-Net EfficientNet B3 at 60 cm, γ: 1 → 3). Increasing γ shifts the model's attention to hard-to-classify samples (see Fig. 9). However, excessively high values do not always lead to better results (e.g. a decrease in mIoU of 19% for U-Net EfficientNet B3 at 60 cm, γ: 3 → 5).
Note that, even when tuning the γ parameter, the mIoU decreases at coarser spatial resolutions. However, the greatest improvements were obtained at the lowest spatial resolutions (e.g. a gain in mIoU of 24.2% for DeepLabV3+ ResNet50 at 100 cm, γ: 1 → 5). This indicates the importance of this parameter in improving the performance of the models, particularly in the most challenging scenarios. Moreover, at a 100 cm spatial resolution, the best Detection Rate was higher than the best Detection Rate in Table 2, highlighting the benefits of tuning γ.
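The effect of γ discussed here can be reproduced with a few lines of NumPy. This is the standard binary Focal Loss of Lin et al. (2017), with γ down-weighting easy samples and α balancing the classes; it is a didactic sketch, not the framework implementation used for training.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: mean of -alpha_t * (1 - p_t)**gamma * log(p_t),
    where p_t is the predicted probability of the true class."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1 - eps)
    y = np.asarray(y)
    pt = np.where(y == 1, p, 1 - p)          # probability of the true class
    a = np.where(y == 1, alpha, 1 - alpha)   # class-balancing weight
    return float(np.mean(-a * (1 - pt) ** gamma * np.log(pt)))
```

With γ = 0 the loss reduces to α-weighted cross-entropy; raising γ shrinks the contribution of well-classified samples, which is why overly large values can starve the model of gradient signal, matching the decays reported above.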
U-Net ResNet50 achieved the best results in terms of mIoU at almost all spatial resolutions, followed by U-Net EfficientNet B3, which was in the top three in terms of both mIoU and Detection Rate. In terms of Detection Rate, no single model achieved the best results at all spatial resolutions. For instance, U-Net VGG16 got the highest Detection Rate at 40 and 100 cm, contrary to the results in Table 2, where U-Net ResNet50 was better; U-Net VGG16 improved from 72% to 90% at 40 cm and from 12% to 73% at 100 cm, in both cases for γ: 2 → 1. This leads us to suggest U-Net ResNet50 as the first choice when training a semantic segmentation algorithm for building detection, as it has the highest mIoU (84.4% at 20 cm), with U-Net EfficientNet B3 as the next option, as it obtains the highest Detection Rate (93% at 20 cm). Figure 14 illustrates the predictions obtained by U-Net EfficientNet B3 using DOP + nDSM for different spatial resolutions and different values of the γ parameter of the FL function. Patches of 512 × 512 from the test set are presented with their prediction (blue) and ground truth (orange). It can be observed that tuning the γ parameter improves the detection by reducing the number of false negatives and increasing the number of true positives. For instance, at a spatial resolution of 20 cm, for γ: 3 → 5, one more building was detected correctly; at 40 cm, for γ: 2 → 1, two adjacent buildings are added. Moreover, at coarser spatial resolutions, different γ values help to produce more precise detections and to avoid grouping nearby buildings into a single one (e.g. 60 cm for γ: 5 → 3 and 80 cm for γ: 2 → 3).
Table 4 summarises the results obtained by the DL methods for object detection at different spatial resolutions, trained upon False Colour RGB compositions and varying the γ parameter of the Focal Loss function. The mIoU and Detection Rate (in parentheses) are reported, with the best results in bold. The same colour code as in Sect. 6.2.2 has been employed: the top three for each spatial resolution (each column in the table) are highlighted with greenish (first), yellowish (second), and reddish (third) background colours, where more saturated colours (green, yellow and red) refer to mIoU and less saturated colours (light red, mustard and light green) to Detection Rate.

Object Detection
In general, the DL object detection algorithms did not achieve better results than the semantic segmentation models in terms of mIoU and Detection Rate at any spatial resolution. No single model achieved the best results: for instance, YOLOv5s had the highest mIoU at 20 and 60 cm spatial resolutions, and YOLOv5m at 40, 80, and 100 cm, while YOLOv5n got the lowest mIoU values. Analogous to what was seen in the previous section, the γ parameter plays a very important role in the performance of each model. Moreover, all the models were very sensitive to higher values of γ. For example, at a spatial resolution of 20 cm, YOLOv5n improved in mIoU from 33.4% to 42.9% for γ: 1 → 2 but decayed to 18.1% for γ: 2 → 5. YOLOv5s presented mIoU drops from 57.6% to 0.5% for γ: 2 → 5 at a spatial resolution of 40 cm and from 49.0% to 2.8% for γ: 1 → 5 at 60 cm. YOLOv5m exhibited similar mIoU gains and losses: 33.5% → 51.4% → 17.2% for γ: 1 → 2 → 5 at 80 cm, and 36.4% → 48.6% → 11.8% for γ: 1 → 2 → 4 at 100 cm. From Tables 1 and 4, we notice that at 20 cm spatial resolution, RF hybrid had a higher mIoU than the best YOLO model (YOLOv5s). This may be related to the split buildings generated during patch extraction. As RF hybrid is a pixel-wise approach, it detects any pixel of a building even if the building is split, which contributes to the mIoU. In contrast, YOLOv5 models use variable object sizes (anchors) to create bounding boxes for each object; split buildings might be too small to be detected by these models, thereby decreasing the mIoU. Moreover, this effect erroneously suggests that buildings can occur at very different scales, which confuses YOLOv5 learning. However, the YOLOv5 variants are more robust than RF hybrid against a decrease in spatial resolution.
For instance, in terms of mIoU, YOLOv5s decreased progressively from 51.5% to 25.2% between 20 and 100 cm, whereas RF hybrid decreased faster and to a greater extent: 58.6% → 16.3%.
Note that in terms of mIoU and Detection Rate, the best results were obtained at a spatial resolution of 40 cm (YOLOv5m, mIoU = 63.4%, Detection Rate = 79%) rather than at 20 cm (YOLOv5s, mIoU = 51.5%, Detection Rate = 71%), contrary to what might be expected from the previous results, in which the best results were obtained using images with the highest spatial resolution. This might be related to how YOLOv5 computes the initial anchors before training and to the influence of split buildings. YOLOv5 automatically precomputes the anchors by clustering the ground-truth sizes, assuming that objects of the same class have similar sizes without high variance in scale; these anchors are then adjusted during training to fit each object. During patch extraction, as buildings appear larger at higher resolutions, split buildings are more likely to be generated at these resolutions. This can affect the computation of the initial anchors, as split buildings are very small compared with entire buildings, suggesting that objects of this class have high scale variability. Therefore, because more split buildings were obtained at a spatial resolution of 20 cm than at 40 cm, YOLOv5 learning is more difficult at 20 cm. This issue can be overcome by increasing the patch size at higher spatial resolutions (e.g. 1024 × 1024) or by extracting patches centred on each building to avoid split buildings.
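The patch extraction that produces these split buildings can be sketched as a plain sliding window; increasing `patch` (e.g. to 1024) at high resolutions, as suggested above, reduces the fraction of buildings cut by the patch borders. This is a generic sketch, not the authors' extraction code.

```python
import numpy as np

def extract_patches(tile, patch=512, stride=256):
    """Sliding-window patch extraction.  An overlapping stride (here half
    the patch size) ensures each building appears whole in at least one
    patch more often, but buildings near borders can still be split."""
    h, w = tile.shape[:2]
    patches = []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(tile[y:y + patch, x:x + patch])
    return patches
```
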
Regarding the Detection Rate, the best results were obtained at 20 and 40 cm spatial resolutions. On the other hand, there was a drop in Detection Rate at 60, 80, and 100 cm, the coarsest spatial resolution being the most difficult scenario for building detection. Similarly to what happened with the mIoU, high γ values brought extreme decays in Detection Rate, down to 0% even at high spatial resolutions (e.g. YOLOv5s, 40 cm, γ = 5). Figure 15 illustrates the predictions obtained by the best object detection model for each spatial resolution in terms of Detection Rate and for different values of the γ parameter of the FL function. Patches of 512 × 512 from the test set are presented with their prediction (blue) and ground truth
(orange). Notice that the training of the DL object detection algorithms was performed on False Colour RGB compositions: Infrared (R), Grey (G) and nDSM (B); purely for visualisation, the predictions are presented over True Colour RGB compositions. Moreover, all predicted OBBs were joined to show the area detected by each algorithm. The sensitivity of the YOLOv5 models to changes in the γ parameter can be observed in the increasing number of false positives and the reduction of true positives. For instance, at a spatial resolution of 20 cm, for γ: 2 → 3, one false positive is added while the algorithm misses the previously detected buildings, and more false positives appear at higher values of γ. At 40 cm, a similar behaviour occurs, with more false positives for γ: 2 → 4. The number of false positives can be controlled by tuning the inference hyper-parameters of the model, such as the confidence (conf_thres) and IoU (iou_thres) thresholds. The confidence threshold keeps only the predictions with a confidence score (i.e. how certain the model is about that prediction) above the given threshold; the IoU threshold works similarly by discarding predictions with an IoU lower than the selected value. We used the default values of these thresholds in all experiments (conf_thres = 0.25 and iou_thres = 0.45). Thus, these values can be increased for each specific configuration (i.e. γ value, model, spatial resolution) to be stricter and filter out false positives.
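The confidence and IoU filtering described above can be illustrated with a minimal greedy non-maximum suppression over axis-aligned boxes. YOLOv5's actual inference operates on (oriented) boxes and is more involved; the `conf` and `xyxy` field names below are illustrative assumptions, not the library's API.

```python
def filter_predictions(boxes, conf_thres=0.25, iou_thres=0.45):
    """Drop low-confidence boxes, then greedily suppress any box that
    overlaps an already-kept, higher-confidence box (classic NMS)."""
    def iou(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        iw = max(0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union else 0.0

    boxes = [b for b in boxes if b["conf"] >= conf_thres]
    boxes.sort(key=lambda b: b["conf"], reverse=True)
    kept = []
    for b in boxes:
        if all(iou(b["xyxy"], k["xyxy"]) < iou_thres for k in kept):
            kept.append(b)
    return kept
```

Raising `conf_thres` trades recall for precision, which is the lever suggested above for suppressing false positives at high γ values.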
Concerning the orientation of the bounding boxes, there are cases where the angle prediction is not accurate. For instance, at a spatial resolution of 20 cm, for γ = 1, the bounding box was not well aligned with the corresponding building. Moreover, there appear to be three bounding boxes: one delineating each building and one delineating both as a single building. This happened in many cases when the buildings were very close to each other or when the background was too similar to the building. The weight of the angle classification term in the loss function can be increased to diminish this effect and improve angle prediction. Note that the experiments were performed with equal weights for all terms in the loss function: bounding box regression, object class classification, objectness (object score), and angle classification. Thus, the hyper-parameters controlling the contribution of each term to the loss function can be tuned to further improve the detection results.
From Figs. 13 and 14, we notice that almost all models produced the same false negative (i.e. missed building): the building located in the middle-right part of the patch at a spatial resolution of 40 cm (second column in both figures). This building is a Ruhland type, similar to the neighbouring buildings in the patch; however, its roof pattern is slightly different, with dark lines over the bright roof that seem to be related to deterioration of the roof or dust. The dataset can be extended to cover these cases and many other variations, such as shadows, occlusions, roof materials (producing different roof patterns), corrosion, and deterioration of the roof, among other effects related to changes in roofs over time.

Blind Testing
After the assessment of the DL methods for building detection in the Development stage, we selected U-Net EfficientNet B3 to search for new detections in the blind test set owing to its good performance. Figure 16 illustrates the results obtained for three image tiles from the blind test set with detected buildings in Berlin. Each image tile has a size of 5K × 5K pixels (1 km × 1 km) with a spatial resolution of 20 cm. 3D views taken from Google Earth at every detected location are presented to verify whether the detected building is a system hall. As in training, the patch size and stride were 512 × 512 and 256, respectively. To handle the overlapping areas when reconstructing the original image tile from the patches, the sum of the predictions was calculated. The detection algorithm found two new system halls (Fig. 16a, b) and four false positives (Fig. 16c). Figure 16a shows a detected system hall of the Ruhland type, close to the Spree river and Müggelsee lake. This building is located in a different environment than those seen before, as system halls are mainly present in urban or industrial areas; it is thus important to consider different environment types, such as rural areas, to search for these buildings effectively. Figure 16b displays another detected system hall, of the KT 60 L type, in an urban area near the Ostkreuz train station. Figure 16c shows a detected building similar to the Ruhland type adjacent to the Spree river in the south-east part of the city. This is a false positive, as can be seen from the Google Earth 3D view: it might look like a Ruhland type based on roof shape and appearance, but the building seems to have been reformed. Therefore, it is necessary to include other types of data to reduce the number of false positives. For instance, oblique imagery might help to obtain side views of a building and analyse its façade. Moreover, a set of attributes (e.g.
geometric dimensions, roof shape, and building footprint) can be extracted from the detection outcomes to train a classifier to distinguish between system halls and lookalikes.
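The tile reconstruction used for the blind test, summing overlapping patch predictions, can be sketched as follows. Dividing by the per-pixel patch count to obtain a mean probability is an added normalisation step, not stated in the paper; the paper reports only that the sum was calculated.

```python
import numpy as np

def reconstruct(pred_patches, tile_shape, patch=512, stride=256):
    """Accumulate overlapping patch predictions into the full tile and
    normalise by a count map so each pixel holds a mean probability."""
    h, w = tile_shape
    acc = np.zeros(tile_shape, dtype=np.float32)
    count = np.zeros(tile_shape, dtype=np.float32)
    i = 0
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            acc[y:y + patch, x:x + patch] += pred_patches[i]
            count[y:y + patch, x:x + patch] += 1
            i += 1
    return acc / np.maximum(count, 1)   # avoid division by zero at gaps
```

The patch ordering must match the extraction order (row-major sliding window) for the accumulation to line up.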

Discussion
Based on the computed metrics and the visual predictions provided by each algorithm, we can say that the semantic segmentation models obtained better results than the object detection models, making them more suitable for the detection of specific buildings. The RF variants used as a baseline did not perform well in terms of mIoU, especially at coarser spatial resolutions; their results can be improved by applying post-processing (e.g. morphological operations) to discard false positives. DL semantic segmentation models achieved mIoU values of up to 84.4% (U-Net ResNet50) and a Detection Rate of up to 93% (U-Net EfficientNet B3) at the highest spatial resolution tested (20 cm). For coarser spatial resolutions, U-Net ResNet50 (mIoU = 56.6%) and U-Net VGG16 (Detection Rate = 73%) obtained the best results at 100 cm. Among the main factors that influenced the results of the detection algorithms, the following deserve special attention: the loss function, the spatial resolution, and the dataset. The loss function for the detection of specific buildings must be robust against the high imbalance between the number of positive samples (i.e. images containing the building of interest) and negative samples (i.e. images without it), especially if the buildings are uncommon. We used the Focal Loss, which is designed for this type of scenario. Its main hyper-parameter, γ, makes the model pay more attention to hard-to-classify samples. However, as seen in Tables 3 and 4, this parameter must be tuned carefully because it can have a high impact on the model's training and results. Additionally, when the loss function is computed from different sources to achieve multiple targets, it is necessary to tune the weights of each term contributing to the loss.
For instance, in object detection algorithms, the loss function comprises bounding box regression, object class classification, objectness, and angle classification.
Spatial resolution also plays an important role, with higher spatial resolutions leading to better results. Coarser spatial resolutions offer the advantage of faster processing times for larger areas but are not sufficient to capture building details such as roof patterns and shapes. Thus, it is worth trying different spatial resolutions to find a trade-off between performance and processing time. In this context, the object detection algorithms were the most stable against changes in spatial resolution. In addition, it is worth mentioning the need to tune inference hyper-parameters, as in the case of the YOLOv5 versions, where thresholds can help decrease the number of false positives.

Fig. 16 Results obtained by U-Net EfficientNet B3 using the blind test set. From left to right: image tiles with the detected building area highlighted (green square), detailed view of the detected building (blue), and Google Earth 3D view of the same area. a, b correspond to new detections, and c shows a false positive
Finally, the dataset used to train the algorithms must cover the many variations in the appearance of a building and the changes it can suffer over time. Image transformations can be applied to simulate these variations and serve as data augmentation techniques that extend the variability of the buildings. Moreover, special attention should be paid to the generation of patches from image tiles, modifying the patch size and stride to avoid split buildings, which can confuse the detection model by teaching it erroneous representations of buildings. In addition, the dataset can be extended with oblique imagery to discard false positives, as seen in Sect. 6.4, where the detected building looks very similar to a Ruhland type from an aerial view.

Conclusions and Outlook
This paper proposes a methodology for the automatic detection of system halls from the High-Modernism period. There are more than 80 types of these specific buildings in Germany, which are employed for different purposes, such as industry, commerce, agriculture, and transport. We focussed on two specific types, KT 60 L and Ruhland, which are the most common buildings found. Automatic detection was performed using DL algorithms for semantic segmentation and object detection. Both methods were extensively assessed based on a series of experiments to obtain the best results for building detection.
DL semantic segmentation models achieved the best results compared with traditional machine learning techniques (Random Forest) and DL object detection algorithms (YOLOv5 versions). U-Net ResNet50 achieved the highest mIoU (84.4%), and U-Net EfficientNet B3 achieved the best Detection Rate (i.e. percentage of correctly detected buildings) of 93% for high spatial resolution imagery (20 cm). The performance of all methods decayed as the spatial resolution decreased and the buildings became smaller, making detection more difficult; the object detection methods were the most stable against these changes. At the coarsest spatial resolution (100 cm), U-Net ResNet50 got the highest mIoU (56.6%) and U-Net VGG16 performed better in terms of Detection Rate, with 73%.
From our experiments, we conclude that it is important to include height information when training each model, as it helps to discriminate lookalikes and decreases the number of false negatives and false positives. Moreover, fine-tuning of hyper-parameters is extremely important for obtaining more robust and reliable models, especially the γ parameter of the Focal Loss function used for training, which improves the results by providing more precise detections. Furthermore, the spatial resolution plays a critical role in this task, with results worsening as the resolution decreases; however, it is possible to find a trade-off between performance and resolution to achieve lower processing times and scale the method to larger areas. Finally, it is important to tune the object detection hyper-parameters for inference, such as the confidence and IoU thresholds, to further refine the detection results and discard false positives.
In the following steps, the dataset will be extended with more building types and new detections found in different search areas as part of the blind testing set. Digital 3D parametrized models will be created from the specifications found in the manufacturers' catalogues to address the scarcity of samples for uncommon types and those that are currently unknown. To increase the variability of our dataset, many different textures and materials will be employed in these parametrized models. Aerial photos will be taken from these models to train DL algorithms for building detection by fine-tuning the pre-trained models with real samples, as shown in this study. Moreover, as the next step after the detection of a new building is to classify it as one of the known types, VHR oblique imagery will be employed to extract the details from each building that are not visible from the DOP's perspective, which will contribute to decreasing the number of false positives.
Regarding the building detection methods, other object detection algorithms such as RetinaNet and Faster R-CNN will be included in our analysis to compare one- and two-stage detectors. In addition, instance segmentation approaches such as Mask R-CNN will be explored and assessed to determine their suitability for our application and to analyse whether it is possible to directly classify building types using only orthophotos.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.