1 Introduction

The availability of accurate and up-to-date building and road information plays a crucial role in shaping spatial decisions and interventions. These include settlement planning, population mapping, construction management, disaster management, and everyday navigation. However, this information is often of limited availability, especially in developing regions, owing to the inherent inefficiency of traditional field surveys and manual data generation methods.

In this regard, the ability of machine learning approaches to automatically generate building and road data from remote sensing images has attracted increasing attention (Luo et al. 2021; Abdollahi et al. 2020). Most prominently, the breakthroughs in Deep Learning (DL) and its potential synergy with remote sensing for supporting sustainable development (Persello et al. 2022) have driven the surge in experiments with DL-based methods for building and road extraction. Leveraging remote sensing data for large geographic coverage and the capability of DL algorithms to handle big data, recent experiments have extended beyond small-scale research to reach continental and global scales. This has resulted in the production of global datasets, such as Google's Building Dataset and Microsoft's Building and Road Datasets.

Although DL-based methods potentially offer a way to enrich local base maps, there is increased scepticism surrounding the direct integration of large remote sensing machine learning-generated (RSML) datasets into conventionally accepted datasets (Li et al. 2022; Burke et al. 2021). Here, major concerns relate to potential data inaccuracies, given that predictor properties of geographical phenomena are not necessarily identical across space (Meyer and Pebesma 2021). As a result, methodologies built on existing popular benchmark datasets might struggle in real-world applications, especially in unknown geographic areas (Roscher et al. 2023). Particular concerns regarding the usability of RSML data arise from datasets generated by global models (Meyer and Pebesma 2022). Given the difficulty of representing various types of localised variation within the training data of these models, they often exhibit inherent biases. Global models frequently rely on training data concentrated in well-developed areas (Meyer and Pebesma 2022); consequently, their accuracy and utility might be limited in less-developed regions. This limitation can be even more pronounced for anthropogenic structures like buildings, which reflect diverse socioeconomic factors. Construction in least-developed regions is often dominated by unregulated, self-designed and self-built structures, leading to significant variation in building appearance. Despite this, a number of global models lack transparent and rigorously derived local accuracy measures that clearly delineate the area of applicability for the generated data (Meyer and Pebesma 2022).

In this paper, we compared the local accuracy of existing global building and road models with globally trained but locally fine-tuned models. Thereby, we aimed to shed light on the most efficient approach for generating more accurate RSML datasets with local applicability. The remainder of this paper is structured as follows: Sect. 2 presents state-of-the-art building and road extraction methods. Section 3 reviews commonly applied metrics for evaluating RSML building and road datasets. Following this, Sect. 4 synthesises the results of benchmark implementations to expose practical challenges that require continued attention. Next, we describe a novel framework for generating local building and road datasets in Sect. 5. Subsequently, the experimental results are presented and discussed in Sect. 6. Lastly, we summarise our findings and conclude the study in Sect. 7.

2 State-of-the-art

2.1 DL Models for Building and Road Extraction

The field of building and road extraction using deep learning is flourishing, with researchers constantly developing new and improved methods. Traditionally, convolutional neural networks (CNNs), which are designed to process data with grid-like structures, have been the most popular DL algorithms in remote sensing (Ma et al. 2019) and thus in building and road extraction implementations. However, recent advances have seen transformers, originally designed for sequential data (Strudel et al. 2021), applied to building extraction (Chen et al. 2021) as well as road extraction (Tao et al. 2023; Jiang et al. 2022). A comprehensive review of DL-based building extraction methods is given by Luo et al. (2021). In the same vein, Chen et al. (2022), Abdollahi et al. (2020), and Lian et al. (2020) reviewed DL-based road extraction methods.

DL-based building and road extraction methods mainly involve two types of segmentation: semantic segmentation and instance segmentation. Semantic segmentation underpins the bottom-up detection process, where a post-processing step is required to find instances (i.e., each individual building or connected road). In an example of this approach, Sirko et al. (2021) detected buildings through a two-step process. Step one was binarisation, where ResUNet was applied for semantic segmentation to differentiate between building pixels and background (non-building) pixels. Step two was connected component analysis, which identifies and groups pixels belonging to the same building instance. By contrast, instance segmentation tackles detection in an end-to-end manner, i.e., models detect and delineate each distinct building instance in an image. A typical implementation of this approach used Mask R‑CNN (Zhao et al. 2018) to produce separated building polygons, requiring post-processing only for minor geometric refinement. The two approaches show varying degrees of success in accurately demarcating buildings, owing to variations in roof type, size, appearance, and configuration. Bottom-up approaches may perform well in classifying building pixels, but issues arise when forming instances; Sirko et al. (2021) observed that this approach tends to fragment large and complex roofs. In contrast, the end-to-end approach, while effective in isolating individual objects, may fall short in capturing smaller buildings (Sirko et al. 2021).
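
To make the bottom-up post-processing step concrete, the following minimal Python sketch groups the foreground pixels of a binary segmentation mask into building instances via connected component analysis. It illustrates the general technique rather than the exact implementation of Sirko et al. (2021); the `min_pixels` noise threshold is an assumption.

```python
import numpy as np
from skimage import measure

def mask_to_instances(mask: np.ndarray, min_pixels: int = 20):
    """Group foreground pixels of a binary mask into building instances."""
    labelled = measure.label(mask, connectivity=2)  # 8-connected components
    instances = []
    for region in measure.regionprops(labelled):
        if region.area >= min_pixels:           # drop tiny noise blobs (assumed threshold)
            instances.append(region.coords)     # pixel coordinates of one instance
    return instances
```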

In road extraction, pipelines that use a bottom-up approach typically require an initial semantic segmentation, followed by post-processing operations such as thinning and graph construction (Zhang et al. 2018). In contrast, when end-to-end segmentation is used with a graph-oriented model, road graphs are created directly (He et al. 2020; Lian and Huang 2020).
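
The thinning step can be sketched as follows, assuming `road_mask` is a binary segmentation output; the graph-construction step (not shown) would then link skeleton pixels into a network.

```python
import numpy as np
from skimage.morphology import skeletonize

def thin_roads(road_mask: np.ndarray) -> np.ndarray:
    """Reduce a binary road mask to one-pixel-wide centrelines."""
    return skeletonize(road_mask.astype(bool))

# A road graph can then be built by linking 8-connected skeleton pixels,
# with junction pixels (more than two neighbours) becoming graph vertices.
```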

2.2 Data Sources for Large-Scale Models

Building and road extraction require good-quality input data. In this regard, images with sub-metre resolution for building extraction and around one-metre resolution for road extraction are suitable inputs (Luo et al. 2021). Many implementations have traditionally relied on RGB imagery. However, in several instances, the utility of multispectral images (Pasquali et al. 2019) and of RGB images fused with normalised digital surface models (nDSM) has been explored (Achanccaray et al. 2023; Bittner et al. 2018). In particular, implementations using nDSM have been reported to significantly improve building detection accuracy. In practice, however, this method is challenging to implement, given the limitations in acquiring high-resolution DSMs on a large scale and the accompanying high computational costs.

Experiments at large scales require robust models and very large datasets. Owing to difficulties in acquiring datasets of this size from a single source, relying on data from multiple sources has been crucial. For instance, both the Microsoft and Google global RSML building and road datasets have been derived from images originating from a variety of sources. Importantly, these datasets not only represent typical examples of multisourced datasets but also exemplify the contribution of DL and remote sensing to facilitating open access to spatial data. Yet, the local accuracy of these datasets is currently unknown. In addition, research would benefit from the availability of the annotated training data and trained base models that generated the datasets, as this would facilitate additional independent experiments and fine-tuning to increase local accuracy.

In principle, however, this weakness could be overcome by leveraging publicly accessible, multisource, and diversely annotated datasets (Luo et al. 2023). Examples of such sources include SpaceNet (Van Etten et al. 2018), the WHU building dataset (Ji et al. 2019), the INRIA aerial image labelling dataset (Maggiori et al. 2017), the Massachusetts dataset (Mnih 2013), the crowdAI dataset (Mohanty et al. 2020) and the Open Cities AI Challenge Dataset (https://mlhub.earth/10.34911/rdnt.f94cxb). Some of these datasets contain high-quality images, such as aerial and UAV images, with sufficiently detailed spatial information to aid the delineation of objects like buildings. Thereby, these datasets can be used to develop robust baseline models that can be optimised regionally and locally through transfer learning. Experiments by Ji et al. (2019) and Abriha and Szabó (2023) showed that using satellite imagery, a common source of images for many downstream applications, to fine-tune baseline models pre-trained on high-quality aerial VHR images improved results compared to training directly on satellite imagery.
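
In a PyTorch setting, this transfer-learning strategy might look like the following sketch: a baseline model pre-trained on aerial imagery is fine-tuned on a smaller satellite dataset. The network choice, checkpoint path, and frozen-encoder decision are illustrative assumptions, not the cited authors' code.

```python
import torch
from torchvision.models.segmentation import fcn_resnet50

# Stand-in baseline: any encoder-decoder segmentation network works here.
model = fcn_resnet50(num_classes=1)
model.load_state_dict(torch.load("aerial_pretrained.pth"))  # assumed checkpoint

# Freeze the backbone (encoder) so only the classifier head adapts to satellite imagery.
for p in model.backbone.parameters():
    p.requires_grad = False

optimiser = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ...then run the usual training loop on the (smaller) satellite dataset.
```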

2.3 Training and Inference Strategies for Large-scale Models

Large-scale models require training and inference strategies that are essential for overcoming the challenges of limited data availability and the inherent variability in feature properties. For instance, pre-training and self-training (Zoph et al. 2020), ensembling (Ganaie et al. 2022), and mix-up techniques (Zhang et al. 2017) are among the strategies that have demonstrated accuracy improvements for continental-scale building detection (Sirko et al. 2021).

Pre-training (Zoph et al. 2020) is a technique for initialising weights, serving as a powerful foundation for transfer learning. The approach, which combines pre-training with fine-tuning, is particularly effective for achieving high-accuracy models, especially when dealing with large geographical areas and limited data for the specific task. One approach to pre-training involves leveraging millions of remote sensing images and a pretext task focused on learning spatial representations. In a recent example, the Google research team (Sirko et al. 2021) implemented pre-training on one million images, with predicting the location an image comes from and its nighttime luminance class as pretext tasks. Afterwards, the model was trained to extract buildings, and results showed that it outperformed models that did not undergo pre-training.

While pre-training deals with weight initialisation and domain adaptation, self-training (Zoph et al. 2020) iteratively refines models by generating pseudo-labels from unlabelled data, addressing the scarcity of annotated data in large-scale remote sensing analysis. Self-training leverages a two-step process. First, the trained model, acting as the “teacher”, generates pseudo-labels for a large corpus of unlabelled data. These pseudo-labels are then combined with the existing labelled data to retrain a new, improved “student” model. This iterative process continues until the model's performance stabilises. In Google's example mentioned above, ten million unlabelled images were used for self-training, resulting in the best-performing model compared to models that did not involve self-training.
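
The teacher-student loop can be summarised in a short sketch; `train_fn` and `predict_fn` are placeholders for the training and inference routines, and the confidence threshold is an assumed filtering heuristic.

```python
def self_train(train_fn, predict_fn, labelled, unlabelled,
               rounds=3, threshold=0.9):
    """Iteratively retrain on labelled data plus confident pseudo-labels."""
    teacher = train_fn(labelled)
    for _ in range(rounds):
        # 1. The teacher pseudo-labels the unlabelled corpus.
        pseudo = [(x, y) for x, y, conf in predict_fn(teacher, unlabelled)
                  if conf >= threshold]       # keep confident predictions only
        # 2. A new "student" is retrained on the union; it becomes the next teacher.
        teacher = train_fn(labelled + pseudo)
    return teacher
```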

Furthermore, when training large-scale models, approaches such as mix-up (Zhang et al. 2017) have been shown to mitigate potential regional biases arising from factors such as lighting variations by introducing controlled data augmentation during training, thereby facilitating model generalisation (Sirko et al. 2021).

To improve the accuracy and robustness of predictions, the ensembling strategy (Ganaie et al. 2022) combines predictions from diversely trained models during the prediction phase. It thereby makes use of the models' combined strengths to address challenges posed by the inherent complexity and variation of man-made features, including buildings (Cao et al. 2021).
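
A minimal form of prediction-time ensembling is soft voting, sketched below under the assumption that each model exposes a `predict` method returning a per-pixel probability map; averaging before thresholding is one common choice among several.

```python
import numpy as np

def ensemble_predict(models, image, threshold=0.5):
    """Average per-pixel probabilities of several models, then binarise."""
    probs = np.mean([m.predict(image) for m in models], axis=0)  # soft voting
    return probs >= threshold
```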

3 Methods for Evaluating RSML Datasets

3.1 Pixel- Versus Instance-Based Accuracy Metrics for Buildings

To assess RSML building datasets, two major types of metrics are used: pixel-based and object-based accuracy measures (Rutzinger et al. 2009).

The per-pixel evaluation is applied to assess the accuracy of semantic segmentation, whereby limited attention is paid to individual building outlines. In practice, analyses evaluate the true positive (TP), false negative (FN) and false positive (FP) rates in segmentation through the assessment of matched, missed, and committed pixels. Building on this, a compound measure of TP, FN, and FP is the pixel-wise intersection over union (IoU). This metric, used to evaluate segmentation similarity, disregards how pixels are grouped into bounded regions (Zhang et al. 2021). In some papers, IoU is derived from both the building and background classes, albeit with the risk of biasing the assessment towards the background class.

Beyond pixel-level metrics, object-based evaluation methods are utilised to account for under- and over-segmentation of building instances (Eqs. 1–6). Unlike per-pixel segmentation evaluation, a user-defined object-based IoU threshold (Eq. 1), typically above 50%, is used to determine the TP, FP, and FN (Padilla et al. 2020; Sirko et al. 2021; Rutzinger et al. 2009; Van Etten et al. 2018; Eq. 2). Here, the computation of the IoU involves the areas of instances' bounding boxes or masks (Ji et al. 2019) or boundary lines (Wang et al. 2023; Xia et al. 2021). In area comparison, both raster (pixel) and vector models can be applied. However, the rasterisation of building polygons can result in distorted boundaries and poor performance (Avbelj et al. 2015; Rutzinger et al. 2009).

The IoU, TP, FP and FN serve as the foundation for calculating precision (P in Eq. 3), recall (R in Eq. 4), and the F1-score (F1 in Eq. 5). Precision gives the number of correct instances relative to the total number of RSML building instances. In contrast, recall refers to the number of correct RSML building instances relative to the total number of ground truth (GT) building samples. Combining both metrics, the F1-score represents the harmonic mean of precision and recall. In addition, for object-based approaches, the average precision (AP) is commonly utilised to measure potential trade-offs between precision and recall (Everingham et al. 2010; Padilla et al. 2020). The AP score summarises the area under the precision-recall curve constructed using approximating rectangles. Thereby, the width of each rectangle is determined by the difference between two consecutive recall points \((R_{n}-R_{n-1})\), and its height by the maximum of the corresponding precision values \((\max (P_{n},P_{n-1}))\) (Eq. 6). Notably, interpolation of AP can lead to errors or distorted measures and should thus be applied with additional care (Zhang et al. 2022).

Following from the description above, given \(Y\) as the total number of GT buildings and \(\hat{Y}\) as the number of RSML buildings, for each RSML building polygon \(\widehat{y_{i}}\in \hat{Y}\) and its corresponding GT building polygon \(y_{i}\in Y\) it follows that:

$$IoU=\frac{\widehat{y_{i}}\cap y_{i}}{\widehat{y_{i}}\cup y_{i}}$$
(1)
$$\hat{Y}\left(IoU\right)=\begin{cases} TP\,\textit{instances} & \text{if } IoU\geq 0.5\\ FP\,\textit{instances} & \text{otherwise} \end{cases}$$
(2)
$$P=\frac{TP}{TP+FP}=\frac{TP}{\hat{Y}}$$
(3)
$$R=\frac{TP}{TP+FN}=\frac{TP}{Y}$$
(4)
$$F1=2\cdot \frac{P\cdot R}{P+R}$$
(5)
$$AP=\sum _{n}({R_{n}}-R_{n-1})\cdot \max (P_{n},P_{n-1})$$
(6)
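
As an illustration of Eqs. 1–5, the following sketch derives the object-based precision, recall, and F1-score, assuming that each predicted building has already been matched to its best-overlapping GT polygon and that `ious` holds those best IoU values (AP, Eq. 6, additionally requires per-instance confidence scores and is omitted here).

```python
import numpy as np

def object_metrics(ious: np.ndarray, n_gt: int, iou_thr: float = 0.5):
    """Object-based P, R, and F1 from per-prediction best IoUs (Eqs. 2-5)."""
    tp = int(np.sum(ious >= iou_thr))            # Eq. 2: matched instances
    fp = len(ious) - tp                          # unmatched predictions
    fn = n_gt - tp                               # missed GT buildings
    p = tp / (tp + fp) if tp + fp else 0.0       # Eq. 3
    r = tp / (tp + fn) if tp + fn else 0.0       # Eq. 4
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # Eq. 5
    return p, r, f1
```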

3.2 Pixel- Versus Length-Based Accuracy Metrics for Roads

Pixel-based metrics, such as IoU and F1, are simple to implement and commonly used to evaluate road segmentation. However, to assess the topological properties of roads, length-based P and R, quality (Q) (Wiedemann et al. 1998), and the detour factor (Df) (Wiedemann and Ebner 2000) were introduced. Precision is calculated as the ratio between the length of correctly detected road segments (matched RSML road lines) and the total length of all detected road segments in the RSML dataset. Recall is calculated as the ratio between the length of correctly detected road segments and the total length of all road segments in the ground truth (GT) dataset. The quality is given by the proportion of the length of the matched RSML road relative to the sum of the total length of the RSML road and the length of the unmatched GT road (Eq. 7). The detour factor describes the ratio of the length of RSML roads to the length of GT roads (Eq. 8). More recently, SpaceNet has introduced the Average Path Length Similarity (APLS), a graph- and routing-based metric (Van Etten et al. 2018). This measure aggregates the absolute relative differences in optimal path lengths across pairs of randomly sampled corresponding nodes in the predicted and ground truth networks. Thereby, any path missing in the predicted network contributes a value of 1 to the sum (Eq. 9). However, the randomness of node sampling represents a major weakness and challenge in the practical implementation of APLS.

$$Q=\frac{\text{length of matched RSML road}}{\text{length of RSML road} + \text{length of unmatched GT road}}$$
(7)
$$Df=\frac{\text{length of RSML road}}{\text{length of GT road}\,}$$
(8)
$$APLS=1-\frac{1}{N}\sum \min \left\{1,\left| \frac{L\left(a,b\right)-L\left(a',b'\right)}{L\left(a,b\right)}\right| \right\}$$
(9)

In Eq. 9, \(a'\) is the node in the predicted road network nearest to the location of node \(a\) in the ground truth road network. \(L(a,b)\) and \(L(a',b')\) denote the lengths of the optimal paths in the ground truth and predicted road networks, respectively, and \(N\) is the number of paths considered.
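
For the length-based metrics of Eqs. 7 and 8, the arithmetic is straightforward once matched and unmatched lengths have been derived (e.g., via buffer matching); the following sketch assumes all lengths share one unit and that matched GT length equals total GT length minus unmatched GT length.

```python
def road_length_metrics(matched_rsml: float, total_rsml: float,
                        unmatched_gt: float, total_gt: float):
    """Length-based road metrics: precision, recall, quality (Eq. 7), Df (Eq. 8)."""
    precision = matched_rsml / total_rsml            # correct length / all detected
    recall = (total_gt - unmatched_gt) / total_gt    # matched GT length / all GT
    q = matched_rsml / (total_rsml + unmatched_gt)   # Eq. 7
    df = total_rsml / total_gt                       # Eq. 8
    return precision, recall, q, df
```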

4 A Synthesis of Benchmark Implementations

To gain insights into the performance of various methods and the metrics used for evaluation, we present a synthesis of the results from benchmark works in Table 1. Figure 1 compares the trends in reported accuracy between submissions to benchmark challenges and peer-reviewed articles published in the Scopus database within the last five years (see supplementary information). Our analysis included only articles whose abstracts explicitly mentioned keywords such as deep learning (DL), building extraction/detection, or road extraction/detection, and that used benchmark datasets.

Table 1 Best performing models used in benchmark datasets
Fig. 1 Performance of models trained on benchmark datasets

Analysing Table 1, the DL architectures predominantly implemented in the benchmark challenges feature UNet and its variants, which use an encoder-decoder structure. In addition to UNet and its variants in the family of semantic segmentation models, another commonly used network for building extraction is Mask R‑CNN in the family of instance segmentation models. Similarly, for road extraction, networks that use an encoder-decoder structure, such as UNet and D‑LinkNet, are among the most highly explored. As depicted in Fig. 1, common performance metrics tend to be lower in benchmark challenges (on the left side) than in academic publications (on the right side). This is potentially attributable to different evaluation procedures. Notably, benchmark challenges involve standardised tests for distinct algorithms and individual submissions. As a result, these undertakings provide an externally controlled environment for assessing the accuracy of RSML datasets. In contrast, academic manuscripts are subjected to a rigorous peer-review process involving anonymous assessment of the presented work; however, this process does not necessarily involve external evaluation of models. Moreover, researchers have the flexibility to select performance measures that align with the underlying research questions. Both Table 1 and Fig. 1 highlight the noticeable influence of image resolution and data source on the resulting accuracy. As expected, aerial and unmanned aerial vehicle (UAV) image data yield better results than satellite imagery.

When examining accuracy variations across diverse geographical regions and datasets, several intriguing questions arose. Notably, Table 1 demonstrates a very low accuracy in Khartoum, indicating potential issues for the application of global models in less-developed regions with poor data quality. Furthermore, Fig. 1 illustrates a surprising observation: models trained on the CrowdAI satellite dataset outperformed models trained on high-quality aerial and UAV image datasets. Following up on this finding, Adimoolam et al. (2023) suggested that this is potentially caused by duplication and leakage within the CrowdAI dataset, likely decreasing the overall credibility of open RSML datasets.

Collectively, these observations underscore the rationale of the presented study, which aimed to perform a comprehensive evaluation and comparison of globally trained models with their locally fine-tuned counterparts.

5 Experimental Set-Up

We conducted an experiment to extract buildings and roads by pre-training our models on samples from around the world and fine-tuning them with a few local samples from Rwanda. We then compared the accuracy of the generated dataset to that of open global datasets. Figure 2 illustrates the process undertaken, which can be summarised in three stages. Stage one was data collection. Stage two focused on model development, encompassing four sub-steps: (i) training data preparation, (ii) baseline model training, (iii) fine-tuning of the baseline models on locally sourced samples, and (iv) inference. Finally, stage three evaluated the generated building and road datasets, alongside publicly accessible global RSML datasets, against a local test dataset.

Fig. 2 Experimental flowchart

5.1 Datasets

Our datasets include the publicly available RSML building footprint and road datasets by Microsoft and Google. Moreover, we collected training datasets, detailed below, which were used to generate a novel dataset. Samples from benchmark datasets, including the WHU aerial dataset, the Zanzibar OpenAI building mapping dataset, and the Massachusetts road dataset, were used. The local samples were drawn from a variety of image sources in Rwanda.

The WHU Aerial Dataset

(Ji et al. 2019) was obtained from Wuhan University (http://gpcv.whu.edu.cn/data/). It contains a shapefile of 220,000 building instances manually digitised from an aerial image captured at 7.5 cm resolution, which covers an area of 450 km² in Christchurch, New Zealand.

The Zanzibar Dataset

is available in Kaggle’s open dataset repository (https://www.kaggle.com/datasets), containing pre-processed tiles (512 × 512 pixels, 20 cm resolution).

The Massachusetts Road Dataset

(Mnih 2013) was similarly downloaded from Kaggle’s open dataset repository (https://www.kaggle.com/datasets). It includes 1 m resolution aerial images of Massachusetts as well as rasterised OpenStreetMap road centrelines with a thickness of 7 pixels.

The Rwanda Imagery Dataset,

containing multisource images, was obtained from the National Land Authority (upon request). This dataset comprises nationwide aerial images (2009, 25 cm resolution), UAV images (2019, 30 cm resolution) of the major cities shown in Fig. 3, and a nationwide satellite image (2020, 50 cm resolution).

Fig. 3 Major cities in Rwanda and sample locations

5.2 Generating Training Data

The preparation of training data for DL models involved manual digitisation of local sample images in order to create building and road network shapefiles. Following this, images were resampled and cropped into 512 × 512-pixel patches. Next, all shapefiles were clipped to the dimensions of the patches. Finally, label shapefiles were rasterised into binary and instance masks (Table 2 summarises the phases).

Table 2 Data sources and pre-processing methods
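
As an illustration of the rasterisation step, the sketch below burns clipped building polygons into a binary mask aligned with one 512 × 512 image patch, using GeoPandas and rasterio; the file names are placeholders, and the exact tooling in our pipeline may differ.

```python
import geopandas as gpd
import rasterio
from rasterio import features

with rasterio.open("patch_512.tif") as src:          # one 512 x 512 image patch
    buildings = gpd.read_file("buildings_clipped.shp").to_crs(src.crs)
    mask = features.rasterize(
        ((geom, 1) for geom in buildings.geometry),  # burn value 1 for buildings
        out_shape=(src.height, src.width),
        transform=src.transform,
        fill=0,                                      # background
    )
```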

5.3 Training Process

The training involved two main phases: pre-training on a large training set containing samples from around the world and fine-tuning the pre-trained models on a specific dataset focused on Rwanda building and road types. The pre-training step aimed to equip the models with fundamental image recognition capabilities for buildings and roads, but not necessarily for specific local building and road types. Therefore, fine-tuning was needed to boost local accuracy.

5.3.1 Building Extraction Model

Two networks were trained. One leverages a ResUNet architecture (Zhang et al. 2018) for bottom-up building detection, while the other is a Mask R‑CNN (He et al. 2017) for end-to-end building detection. Because of the limitations of each individual approach, as discussed in Sect. 2.1, we devised a framework that seeks to combine their complementary strengths. In selecting these networks, our choice was informed by the literature analysis presented in Sects. 2.1 and 4. ResUNet (Zhang et al. 2018) combines the benefit of the U‑Net architecture, known to offer precise segmentation (Ronneberger et al. 2015), with residual connections from the ResNet structure to mitigate the vanishing gradient problem during training (He et al. 2016). Notably, this model has been used in recent global building and road extraction pipelines by Google (Sirko et al. 2021) and Microsoft. The second network, Mask R‑CNN, is popular for the detection and classification of instances. Both networks, along with a few other popular deep learning architectures, are making inroads into standard GIS software like ArcGIS Pro.

Apart from model architecture, the selection of an appropriate loss function is also critical for effective training and learning, because different loss functions can lead to significant variations in model performance. Building on previous research (Sirko et al. 2021), a weighted combination of Binary Cross-Entropy (BCE) and Focal Tversky Loss (FTL) (Eq. 10) was used in the ResUNet-based model and to replace the mask loss in the Mask R‑CNN-based model.

$$Loss\left(y,\hat{y}\right)=\omega _{1}\cdot BCE\left(y,\hat{y}\right)+\omega _{2}\cdot FTL\left(y,\hat{y}\right)$$
(10)

where ω1 and ω2 are the weights for BCE (Eq. 11) and FTL (Eq. 12), respectively.

$$BCE\left(y,\hat{y}\right)=-\frac{1}{N}\sum _{i=1}^{N}\left[y_{i}\log \hat{y}_{i}+\left(1-y_{i}\right)\log \left(1-\hat{y}_{i}\right)\right]$$
(11)
$$FTL=\left(1-\frac{TP}{TP+\beta \cdot FP+\left(1-\beta \right)\cdot FN}\right)^{\gamma }$$
(12)

BCE is a standard loss function for binary classification, measuring how well the model's predicted probabilities (\(\hat{y}\)) align with the true labels (y) across N pixels. FTL allows for a better trade-off between precision and recall and addresses the issue of data imbalance (Abraham and Khan 2019). Beta (β) is a factor to up- or down-weight FP or FN, whereas gamma, \(\gamma \in \left[1{,}3\right]\), controls the influence of easy-to-detect and hard-to-detect instances. If γ > 1, the loss function places increased focus on less accurate detections. Based on explorative experiments, we set ω1 = 1 and ω2 = 1.5. In addition, we set γ = 4/3 and β = 0.7, following the previous experiments described by Abraham and Khan (2019).
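
A PyTorch sketch of this combined loss, under the settings above (ω1 = 1, ω2 = 1.5, β = 0.7, γ = 4/3), might look as follows; it assumes `pred` contains per-pixel probabilities and is an illustration rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, w1=1.0, w2=1.5, beta=0.7, gamma=4/3, eps=1e-7):
    """Weighted BCE + Focal Tversky Loss (Eqs. 10-12) on probability maps."""
    bce = F.binary_cross_entropy(pred, target)                 # Eq. 11
    tp = (pred * target).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    tversky = tp / (tp + beta * fp + (1 - beta) * fn + eps)
    ftl = (1 - tversky) ** gamma                               # Eq. 12
    return w1 * bce + w2 * ftl                                 # Eq. 10
```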

As discussed earlier, one of the challenges in handling large-scale remote sensing imagery is variation in lighting conditions. This is the case with our local training dataset, which comprises images captured in different seasons and in distinct cities with varied landscapes. Consequently, we applied the mix-up technique, introduced in Sect. 2.3, which has been shown to effectively handle this issue and improve building detection (Sirko et al. 2021). Mathematically, mix-up constructs a weighted sample (\(\tilde{x}_{ij},\tilde{y}_{ij}\)) from two random samples (xi, yi) and (xj, yj) of the training dataset, as per Eqs. 13 and 14.

For semantic segmentation, as described by Sirko et al. (2021);

$$\tilde{x}_{i,j}=\lambda \cdot x_{i}+\left(1-\lambda \right)\cdot x_{j};\quad \tilde{y}_{i,j}=\lambda \cdot y_{i}+\left(1-\lambda \right)\cdot y_{j}$$
(13)

For instance segmentation as highlighted in Jiang et al. (2021),

$$\tilde{x}_{i,j}=\lambda \cdot x_{i}+\left(1-\lambda \right)\cdot x_{j};\quad \tilde{y}_{i,j}=y_{i}\cup y_{j}$$
(14)

where λ ∈ [0, 1]. By interpolating images, mix-up encourages the model to focus on prominent features common across images, potentially enhancing its generalisability.
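
The two variants can be sketched as follows; drawing λ from a Beta distribution follows Zhang et al. (2017), while the α parameter and the array/list representations are assumptions for illustration.

```python
import numpy as np

def mixup_semantic(x_i, y_i, x_j, y_j, alpha=0.2):
    """Mix-up for semantic segmentation (Eq. 13): interpolate images and masks."""
    lam = np.random.beta(alpha, alpha)              # lambda in [0, 1]
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

def mixup_instance(x_i, y_i, x_j, y_j, alpha=0.2):
    """Mix-up for instance segmentation (Eq. 14): interpolate images, union labels."""
    lam = np.random.beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, y_i + y_j   # list concat = union of instances
```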

5.3.2 Road Extraction Model

Our network for road extraction was based on ResUNet, the same architecture used in the Microsoft global road extraction pipeline. The model was pre-trained on around ten thousand image samples from the Massachusetts road dataset. To improve performance on Rwandan roads, the model was fine-tuned on a smaller dataset of around three thousand local samples, split into three subsets based on road type. In other words, we developed three distinct local models for different local road types. This was done after observing that a model fine-tuned on the combined local dataset exhibited a bias towards paved roads, failing to adequately detect other road types, including earthen and rural roads, which are common in the local development context. Notably, this bias originates from imbalances in the pre-training dataset, which lacked sufficient representation of the road types present in least-developed countries. During training, we implemented mix-up as described in the section above. For the loss, we set γ = 4/3 and β = 0.3. Table 3 summarises the described training parameters.

Table 3 Models’ setting

5.4 Inference

To ensure a fair comparison with the Google and Microsoft datasets, inferences were made from satellite imagery. For building detection, we explored alternative configurations: individual ResUNet and Mask R‑CNN models as well as their ensemble, each trained with and without mix-up. For road detection, we implemented an ensemble of the various models fine-tuned on different subsets.

The resulting predictions from ResUNet underwent post-processing to find and refine the geometry of instances. Morphological closing and opening operations were applied to fill holes in the predicted building masks. Subsequently, connected component analysis was used to obtain instances from the binary mask. Notably, impervious surfaces, including roads, can be easily confused with buildings. Therefore, road masking was performed in order to remove roads which were misclassified as buildings. Moreover, since buildings possess well-defined shapes and geometries, we analysed the predicted buildings using the area-to-circumference ratio and compactness. This enabled us to distinguish real roofs from potential noise. Lastly, building instance masks were vectorised. In parallel, to generate road vectors, predicted road masks were thinned, then vectorised and smoothed.
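
The building post-processing chain can be sketched as follows, assuming boolean prediction and road masks; the compactness threshold is an illustrative value, not our exact setting.

```python
import numpy as np
from skimage import measure, morphology

def postprocess_buildings(mask, road_mask, min_compactness=0.15):
    """Close/open the mask, remove road pixels, and keep compact instances."""
    mask = morphology.binary_closing(mask)       # fill small holes in roofs
    mask = morphology.binary_opening(mask)       # remove thin artefacts
    mask = mask & ~road_mask                     # drop roads misclassified as buildings
    kept = np.zeros_like(mask)
    for region in measure.regionprops(measure.label(mask)):
        # Compactness = 4*pi*area / perimeter^2; near 0 for noisy slivers.
        compactness = 4 * np.pi * region.area / max(region.perimeter, 1) ** 2
        if compactness >= min_compactness:
            kept[tuple(region.coords.T)] = True  # keep plausible roof instances
    return kept
```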

5.5 Evaluation

To evaluate the generated, Microsoft, and Google datasets, a reference dataset of building polygons and road lines was first prepared using heads-up digitisation. Second, for comparability of geometries, adjustments were made to align the global datasets with the local datasets. In fact, we noticed that the global datasets exhibited misalignments and fragmentation of roofs (Fig. 4). Such displacements become obvious if images have been orthorectified using a terrain model only; in such cases, features like buildings are projected away from the camera projection centre. Another issue seems to be that the extent of the study made it practically impossible to control the quality of the input images used, which may have resulted in the use of poor-quality images, including poorly georeferenced ones. The major challenge in adjusting the observed shifts was that they were not systematic across areas. Therefore, rubber sheeting was performed on a per-site basis to compensate for the sporadic shifts. Third, we applied a minimum area threshold of 50 m² (based on local expert knowledge) to remove negligible polygons and effectively exclude potential noise detections from the evaluation.

Fig. 4 Fragmentation of roofs. Pink and green polygons represent the Google and Microsoft datasets, respectively

After addressing the shifts and cleaning all datasets, a polygon-wise IoU threshold of 0.5 was applied to derive precision and recall measures for buildings (see the description of building metrics in Sect. 3.1). In the same way, after aligning the road vectors, length-based precision, recall, quality, and detour factor measures were derived to assess performance across the road datasets (see the description of road metrics in Sect. 3.2). Accuracy measures were derived from six sites, as illustrated in Fig. 5, representing local and complex real-world development scenarios.

Fig. 5 Testing areas (Sites 1–6, in order a, b, c, d, e, f) with detailed insets

Site‑1 shows a recently established settlement featuring well-organised, freestanding, and uniformly fashioned buildings, which appear straightforward to delineate. In this area, all buildings are connected to earth roads, with some road segments covered by grass, resembling rural footpaths. Site‑2 presents a planned neighbourhood comprising buildings with complex roof shapes, which could require additional attention for accurate delineation. This site is serviced by well-conditioned and clearly discernible roads. Site‑3 contains high-rise, highly complex apartment buildings, making delineation challenging. Here, the delineation of buildings may be affected by the aforementioned relief displacement problem. This settlement is serviced by tarmac roads, improved footpaths, and numerous paved surfaces, including playgrounds and parking areas; the abundance of these features can complicate the process of correctly identifying and delineating buildings. Site‑4 represents a mixed neighbourhood made of easy- and hard-to-detect buildings. The road network is clearly discernible; however, smaller footpaths pose a potential challenge for detection. The penultimate site, site‑5, represents high-rise building blocks with highly fragmented roofs. The scenery contains a number of artefacts and paved surfaces, which can be hard to discriminate from buildings. As in site‑3, relief displacement can be expected to be a challenge for this site. Regarding road extraction, the site is serviced by tarmac roads filled with cars, the latter causing difficulties for delineation. Lastly, site‑6 is characterised by tiny, touching buildings as well as a poor road network, challenging the delineation of both.

In general, our testing dataset was carefully designed to be diverse, incorporating both easy and challenging scenes. Sites 3, 5, and 6 were purposefully chosen for their high level of clutter, making it difficult to detect structures like buildings. This inclusion of complex scenarios alongside simpler ones (like sites‑1, -2, and -4) allows us to thoroughly assess the robustness of DL-based detection models in handling diverse real-world conditions.

6 Results and Discussion

The primary objective of this study was to evaluate large-scale RSML datasets, including a novel dataset generated with models pre-trained on numerous samples from across the world and then locally fine-tuned on a limited number of local samples. Hence, we initially present our findings on the locally fine-tuned models' performance, considering the impact of various training and inference strategies. Focusing on building extraction, Figs. 6, 7 and 8 present the results of the ablation analysis, examining the potential benefits of the mix-up and ensemble strategies, as well as the strengths and weaknesses of the bottom-up and end-to-end approaches to building detection.

Fig. 6 Visual comparison of various models revealing the effect of mix-up and ensembling strategies (building models)

Fig. 7 Performance of ResUNet, Mask R‑CNN and ensemble models compared

Fig. 8 Disaggregated evaluation of ResUNet, Mask R‑CNN and ensemble models

Figure 6 shows the predictions of Mask R‑CNN and ResUNet, each trained with and without mix-up, as well as the results of the ensemble model combining predictions from the two. As shown in the figure, mix-up improved the detectability of buildings. For instance, the Mask R‑CNN model with mix-up detected several instances that were missed by the model without mix-up. This improvement is further corroborated by the ResUNet model: after mix-up, ResUNet is able to detect previously fractured roofs in their entirety.

Comparing Mask R‑CNN, which implements end-to-end detection, with ResUNet, which implements bottom-up detection, reveals several trade-offs, as illustrated in Fig. 6. The Mask R‑CNN model outperforms the ResUNet model in producing instances with less noise; however, it tended to include non-building pixels near the edges. In contrast, ResUNet exhibits precision on roof boundaries. Even so, its emphasis on individual pixels led the model to classify pixels belonging to the same roof differently, owing to small variations in reflectance. Roof parts with slight variations in lighting, resulting from roof structures or installations, were interpreted as separate objects, causing roof fragmentation. This is observable where predicted roofs present many holes and fragments.

Consequently, the Mask R‑CNN model significantly outperformed ResUNet, achieving a precision of 54% compared to 44% and a recall of 56% compared to 52% across all testing sites (Fig. 7). In addition, an ensemble model combining the predictions of Mask R‑CNN and ResUNet could help achieve a compromise in performance. Results in Fig. 7 demonstrate that the ensemble model generally outperformed the individual models. It achieved 60% precision and 58% recall, which represents a 14% and 6% improvement in precision and a 6% and 2% improvement in recall over ResUNet and Mask R‑CNN, respectively. However, disaggregated results at the individual site level (Fig. 8) reveal that ensembling did not always lead to improvements in both precision and recall. In many areas, e.g. sites‑1, -2 and -3, Mask R‑CNN exhibited better precision, which degraded when its predictions were combined with those from ResUNet. In contrast, in these areas, ensembling led to a significant improvement in recall (Fig. 8). This suggests that a given model might be preferable depending on the trade-off between recall and precision at each site. This finding aligns with the recommendation by Meyer and Pebesma (2022) to assess the area of applicability for each model when building accurate large datasets.

Following the ablation analysis, we now turn to evaluating the effectiveness of local fine-tuning in comparison with global models, using the results presented in Figs. 9, 10 and 11. Starting with the RSML building datasets, the evaluation results in Fig. 9 demonstrate that our ensemble model, which achieved the highest score during the ablation analysis, outperforms the global models. Evaluating each site individually, Fig. 9 reveals significant spatial variations in the performance of all evaluated methods. The models exhibited performance degradation when applied to scenes of increasing difficulty, starting from sites‑1 and -2, followed by -4, -6, -3, and then -5. Sites‑1 and -2 exhibited the best results (P and R above 70%) due to the presence of visually well-defined building outlines. Site‑4, a typical mixed neighbourhood, exhibits moderate performance, with a precision near 70% and a recall close to 60%. In site‑6, DL-based methods show a high precision of around 70%, but suffer from a lower recall, with 60% of buildings missed owing to small and clustered buildings. Sites‑3 and -5 yielded the poorest results. In site‑5, dominated by high-rise buildings and building blocks, only around 50% of the buildings were detected correctly. Site‑3 proved to be the most challenging scene, where the DL methods mainly suffered from over-segmentation. Here, the model could only detect 35% of the buildings, and only about 20% of those detections were correct.

Fig. 9 Accuracy of RSML buildings. Notation: G, M, and L represent Google's building dataset, Microsoft's building dataset, and our generated building dataset, respectively. Asterisks (*) indicate datasets manually adjusted by rubber sheeting

Fig. 10 Accuracy of RSML roads. Notation: M and L represent Microsoft's road dataset and our generated road dataset, respectively. Asterisks (*) indicate data manually adjusted by rubber sheeting

Fig. 11 Visual comparison of various datasets

Next, Fig. 10 presents the evaluation results of our generated road dataset and the Microsoft road dataset. These results revealed marginally higher precision and recall for our model compared to the raw version of the Microsoft road data. However, this does not indicate poor detection capabilities of the Microsoft road extraction model; rather, it points to issues with location shifts inherent in the images from which the data were generated. In fact, the adjusted dataset (M*) showed better precision and recall than our generated dataset. This finding suggests that while global datasets offer value, their usefulness is limited by challenges in controlling the quality of the input data. Furthermore, it suggests that transparently sharing the underlying methods could be more beneficial than simply sharing generated datasets. In the context of the model used to generate a global road dataset, this would allow users to apply or fine-tune the method on their own high-quality data, potentially achieving even better local results.

Overall, and in contrast to building extraction, DL-based methods achieved precision and recall of more than 70% across all testing sites, which represents a noteworthy accomplishment in road extraction. In particular, the good precision and recall achieved by the Microsoft road dataset highlight the robustness of DL-based methods for road extraction, even at very large scales. The underperformance of our model compared to the one used to generate Microsoft's road dataset can be attributed to potential inaccuracies in the labels used for initial training: these labels were automatically generated from OpenStreetMap centrelines, which makes them prone to imprecision.

Figure 11 provides a visual comparison of our RSML building and road datasets with the open global datasets, enabling a more nuanced evaluation of both data sources. Examining the predicted building shapes, we noticed that all models perform well on sites‑1 and -2, moderately well on site‑4, and poorly on the remaining sites, consistent with the quantitative evaluation results. Figure 11 also highlights an interesting observation in site‑6: bottom-up methods may achieve good semantic segmentation but struggle to separate individual buildings. In this site, the Microsoft model performed better in segmentation despite showing low precision in the quantitative evaluation (Fig. 9). Lastly, visually comparing our generated road network with Microsoft's exposes a key challenge: inconsistencies in post-processing that lead to variations in data quality.

In a nutshell, our evaluation underscores the practical value of RSML datasets by demonstrating sufficient performance of deep learning models in extracting buildings and roads across various areas. According to Mayer et al. (2006), a recall of 60–70% and a precision of 75–85% are considered the minimum accuracy for the practical utility of RSML road datasets. Interestingly, the data generated by the locally fine-tuned road model appear usable in four out of six sites. Considering the limitations arising from imprecise labels in the initial training, improved data could lead to performance gains, potentially matching or surpassing the results achieved with the manually adjusted Microsoft dataset. Applying the same standards as for RSML roads to RSML buildings, the locally fine-tuned model's generated datasets for sites‑1, -2, and -4 would be suitable for some practical applications. Notably, these sites contain primarily single-detached houses, a building type that, according to a recent study (Dieye et al. 2023), represents nearly 85% of all buildings in Rwanda, which implies good local applicability of our model.

7 Conclusion

Within this study, we assessed the local accuracy of large-scale RSML building and road datasets. Beyond assessing the accuracy of open, global RSML datasets, we proposed an efficient framework to generate more accurate local datasets. Our framework combines bottom-up and end-to-end segmentation for building detection. For road detection, it combines various regionally optimised road models that capture different road types. It also capitalises on the huge number of publicly available samples from across the world together with a limited number of locally sourced samples. Our framework tackles several limitations of global models, including the lack of local context, difficulty in controlling input data quality, and inherent inaccuracies. Our evaluation demonstrated the effectiveness of the proposed framework. Notably, the generated building data achieved higher accuracy compared to open global datasets, while the road extraction model showed moderate performance. Our study emphasised the importance of disaggregated evaluation of RSML datasets to capture potential localised discrepancies and identify areas of application. Beyond this, it underscores aspects in which further exploration is needed.

Our study is not without limitations. Our evaluation was constrained by inconsistencies in the implementation of data generation pipelines, especially post-processing, owing to the lack of transparency and methodological details, particularly for the Microsoft datasets, potentially affecting the fairness of the comparison. Despite that, the evaluation remains valuable: it illuminated the potential and limitations of open global RSML datasets and, most importantly, enabled the exploration of an alternative, efficient pipeline for generating data with enhanced local accuracy.

Accurate building data can be utilised for various downstream applications, including wealth and population mapping. In future work, we intend to explore additional image sources, potentially incorporating LIDAR data into our building detection framework. We believe that this will improve the detection and classification of building instances into groups, ultimately allowing the underlying variation in household wealth status to be inferred.