1 Introduction

The availability of accurate and up-to-date building and road information plays a crucial role in shaping spatial decisions and interventions. These include settlement planning, population mapping, construction management, disaster management, and everyday navigation. However, this information is often of limited availability, especially in developing regions, owing to the inherent inefficiency of traditional field surveys and manual data generation methods.

In this regard, the ability of machine learning approaches to automatically generate building and road data from remote sensing images has attracted increasing attention (Luo et al. 2021; Abdollahi et al. 2020). Most prominently, the breakthroughs in Deep Learning (DL) and its potential synergy with remote sensing for supporting sustainable development (Persello et al. 2022) have driven the surge in experiments with DL-based methods for building and road extraction. Leveraging remote sensing data for large geographic coverage and the capability of DL algorithms to handle big data, recent experiments have extended beyond small-scale research to reach continental and global scales. This has resulted in the production of global datasets, such as Google's Building Dataset and Microsoft's Building and Road Datasets.

Although DL-based methods potentially offer a way to enrich local base maps, there is increased scepticism surrounding the direct integration of large remote sensing machine learning-generated (RSML) datasets into conventionally accepted datasets (Li et al. 2022; Burke et al. 2021). Here, major concerns relate to potential data inaccuracies, given that predictor properties of geographical phenomena are not necessarily identical across space (Meyer and Pebesma 2021). As a result, methodologies built on existing popular benchmark datasets might struggle in real-world applications, especially in unknown geographic areas (Roscher et al. 2023). Particular concerns regarding the usability of RSML data arise from datasets generated by global models (Meyer and Pebesma 2022). Given the difficulty of representing various types of localised variation within the training data of these models, they often exhibit inherent biases. Global models frequently rely on training data concentrated in well-developed areas (Meyer and Pebesma 2022); consequently, their accuracy and utility might be limited in less-developed regions. This limitation can be even more pronounced for anthropogenic structures like buildings, which reflect diverse socioeconomic factors. Construction in least-developed regions is often dominated by unregulated, self-designed and self-built structures, leading to significant variation in building appearance. Despite this, a number of global models lack transparent and rigorously derived local accuracy measures that clearly delineate the area of applicability for the generated data (Meyer and Pebesma 2022).

In this paper, we compared the local accuracy of existing global building and road models with globally trained but locally fine-tuned models. Thereby, we aimed to shed light on the most efficient approach for generating more accurate RSML datasets with local applicability. The remainder of this paper is structured as follows: Sect. 2 presents state-of-the-art building and road extraction methods. Section 3 reviews commonly applied metrics for evaluating RSML building and road datasets. Following this, Sect. 4 synthesises the results of benchmark implementations to expose practical challenges that require continued attention. Next, we describe a novel framework for generating local building and road datasets in Sect. 5. Subsequently, the experimental results are presented and discussed in Sect. 6. Lastly, we summarise our findings and conclude the study in Sect. 7.

2 State-of-the-art

2.1 DL Models for Building and Road Extraction

The field of building and road extraction using deep learning is flourishing, with researchers constantly developing new and improved methods. Traditionally, convolutional neural networks (CNNs), which are designed to process data with grid-like structures, have been the most popular DL algorithms in remote sensing (Ma et al. 2019) and thus in building and road extraction implementations. However, recent advances have seen transformers, originally designed for sequential data (Strudel et al. 2021), applied to building extraction (Chen et al. 2021) as well as road extraction (Tao et al. 2023; Jiang et al. 2022). A comprehensive review of DL-based building extraction methods is given by Luo et al. (2021). In the same vein, Chen et al. (2022), Abdollahi et al. (2020), and Lian et al. (2020) reviewed DL-based road extraction methods.

DL-based building and road extraction methods mainly involve two types of segmentation: semantic segmentation and instance segmentation. Semantic segmentation underpins the bottom-up detection process, where a post-processing step is required to find instances (i.e., each individual building or connected road). In an example of this approach, Sirko et al. (2021) detected buildings through a two-step process. Step one was binarisation, where ResUNet was applied for semantic segmentation to differentiate between building pixels and background (non-building) pixels. Step two was connected component analysis, which identifies and groups pixels belonging to the same building instance. By contrast, instance segmentation tackles detection in an end-to-end manner, i.e., models detect and delineate each distinct building instance in an image. A typical implementation of this approach used Mask R‑CNN (Zhao et al. 2018) to produce separated building polygons, requiring post-processing only for minor geometric refinement. The two approaches show varying degrees of success in accurately demarcating buildings, owing to variations in roof type, size, appearance, and configuration. Bottom-up approaches may perform well in classifying building pixels, but issues arise when forming instances; Sirko et al. (2021) observed that this approach tends to fragment large and complex roofs. In contrast, the end-to-end approach, while effective in isolating individual objects, may fall short in capturing smaller buildings (Sirko et al. 2021).
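
To make the bottom-up post-processing step concrete, the following minimal Python sketch groups the foreground pixels of a binary segmentation mask into building instances via connected component analysis. It illustrates the general technique rather than the exact implementation of Sirko et al. (2021); the `min_pixels` noise threshold is an assumption.

```python
import numpy as np
from skimage import measure

def mask_to_instances(mask: np.ndarray, min_pixels: int = 20):
    """Group foreground pixels of a binary mask into building instances."""
    labelled = measure.label(mask, connectivity=2)  # 8-connected components
    instances = []
    for region in measure.regionprops(labelled):
        if region.area >= min_pixels:           # drop tiny noise blobs (assumed threshold)
            instances.append(region.coords)     # pixel coordinates of one instance
    return instances
```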

In road extraction, pipelines that use a bottom-up approach typically require an initial semantic segmentation, followed by post-processing operations such as thinning and graph construction (Zhang et al. 2018). In contrast, when end-to-end segmentation is used with a graph-oriented model, road graphs are created directly (He et al. 2020; Lian and Huang 2020).
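
The thinning step can be sketched as follows, assuming `road_mask` is a binary segmentation output; the graph-construction step (not shown) would then link skeleton pixels into a network.

```python
import numpy as np
from skimage.morphology import skeletonize

def thin_roads(road_mask: np.ndarray) -> np.ndarray:
    """Reduce a binary road mask to one-pixel-wide centrelines."""
    return skeletonize(road_mask.astype(bool))

# A road graph can then be built by linking 8-connected skeleton pixels,
# with junction pixels (more than two neighbours) becoming graph vertices.
```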

2.2 Data Sources for Large-Scale Models

Building and road extraction require good-quality input data. In this regard, images with sub-metre resolution for building extraction and around one-metre resolution for road extraction are suitable inputs (Luo et al. 2021). Many implementations have traditionally relied on RGB imagery. However, in several instances, the utility of multispectral images (Pasquali et al. 2019) and of RGB images fused with normalised digital surface models (nDSM) has been explored (Achanccaray et al. 2023; Bittner et al. 2018). In particular, implementations using nDSM have been reported to significantly improve building detection accuracy. In practice, however, this method is challenging to implement, given the limitations in acquiring high-resolution DSMs on a large scale and the accompanying high computational costs.

Experiments at large scales require robust models and very large datasets. Owing to difficulties in acquiring datasets of this size from a single source, relying on data from multiple sources has been crucial. For instance, both the Microsoft and Google global RSML building and road datasets have been derived from images originating from a variety of sources. Importantly, these datasets not only represent typical examples of multisourced datasets but also exemplify the contribution of DL and remote sensing to facilitating open access to spatial data. Yet, the local accuracy of these datasets is currently unknown. In addition, research would benefit from the availability of the annotated training data and trained base models that generated the datasets, as this would facilitate additional independent experiments and fine-tuning to increase local accuracy.

In principle, however, this weakness could be overcome by leveraging publicly accessible, multisource, and diversely annotated datasets (Luo et al. 2023). Examples of such sources include SpaceNet (Van Etten et al. 2018), the WHU building dataset (Ji et al. 2019), the INRIA aerial image labelling dataset (Maggiori et al. 2017), the Massachusetts dataset (Mnih 2013), the crowdAI dataset (Mohanty et al. 2020) and the Open Cities AI Challenge Dataset (https://mlhub.earth/10.34911/rdnt.f94cxb). Some of these datasets contain high-quality images, such as aerial and UAV images, with sufficiently detailed spatial information to aid the delineation of objects like buildings. Thereby, these datasets can be used to develop robust baseline models that can be optimised regionally and locally through transfer learning. Experiments by Ji et al. (2019) and Abriha and Szabó (2023) showed that using satellite imagery, a common source of images for many downstream applications, to fine-tune baseline models pre-trained on high-quality aerial VHR images improved results compared to training directly on satellite imagery.
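
In a PyTorch setting, this transfer-learning strategy might look like the following sketch: a baseline model pre-trained on aerial imagery is fine-tuned on a smaller satellite dataset. The network choice, checkpoint path, and frozen-encoder decision are illustrative assumptions, not the cited authors' code.

```python
import torch
from torchvision.models.segmentation import fcn_resnet50

# Stand-in baseline: any encoder-decoder segmentation network works here.
model = fcn_resnet50(num_classes=1)
model.load_state_dict(torch.load("aerial_pretrained.pth"))  # assumed checkpoint

# Freeze the backbone (encoder) so only the classifier head adapts to satellite imagery.
for p in model.backbone.parameters():
    p.requires_grad = False

optimiser = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
# ...then run the usual training loop on the (smaller) satellite dataset.
```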

2.3 Training and Inference Strategies for Large-scale Models

Large-scale models require training and inference strategies that are essential for overcoming the challenges of limited data availability and the inherent variability in feature properties. For instance, pre-training and self-training (Zoph et al. 2020), ensembling (Ganaie et al. 2022), and mix-up techniques (Zhang et al. 2017) are among the strategies that have demonstrated accuracy improvements for continental-scale building detection (Sirko et al. 2021).

Pre-training (Zoph et al. 2020) is a technique for initialising weights, serving as a powerful foundation for transfer learning. The approach, which combines pre-training with fine-tuning, is particularly effective for achieving high-accuracy models, especially when dealing with large geographical areas and limited data for the specific task. One approach to pre-training involves leveraging millions of remote sensing images and a pretext task focused on learning spatial representations. In a recent example, the Google research team (Sirko et al. 2021) implemented pre-training on one million images, with predicting the location an image comes from and its nighttime luminance class as pretext tasks. Afterwards, the model was trained to extract buildings, and results showed that it outperformed models that did not undergo pre-training.

While pre-training deals with weight initialisation and domain adaptation, self-training (Zoph et al. 2020) iteratively refines models by generating pseudo-labels from unlabelled data, addressing the scarcity of annotated data in large-scale remote sensing analysis. Self-training leverages a two-step process. First, the trained model, acting as the “teacher”, generates pseudo-labels for a large corpus of unlabelled data. These pseudo-labels are then combined with the existing labelled data to retrain a new, improved “student” model. This iterative process continues until the model's performance stabilises. In Google's example mentioned above, ten million unlabelled images were used for self-training, resulting in the best-performing model compared to models that did not involve self-training.
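
The teacher-student loop can be summarised in a short sketch; `train_fn` and `predict_fn` are placeholders for the training and inference routines, and the confidence threshold is an assumed filtering heuristic.

```python
def self_train(train_fn, predict_fn, labelled, unlabelled,
               rounds=3, threshold=0.9):
    """Iteratively retrain on labelled data plus confident pseudo-labels."""
    teacher = train_fn(labelled)
    for _ in range(rounds):
        # 1. The teacher pseudo-labels the unlabelled corpus.
        pseudo = [(x, y) for x, y, conf in predict_fn(teacher, unlabelled)
                  if conf >= threshold]       # keep confident predictions only
        # 2. A new "student" is retrained on the union; it becomes the next teacher.
        teacher = train_fn(labelled + pseudo)
    return teacher
```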

Furthermore, when training large-scale models, approaches such as mix-up (Zhang et al. 2017) have been shown to mitigate potential regional biases arising from factors such as lighting variations by introducing controlled data augmentation during training, thereby facilitating model generalisation (Sirko et al. 2021).

To improve the accuracy and robustness of predictions, the ensembling strategy (Ganaie et al. 2022) combines predictions from diversely trained models during the prediction phase. It thereby makes use of the models' combined strengths to address challenges posed by the inherent complexity and variation of man-made features, including buildings (Cao et al. 2021).
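
A minimal form of prediction-time ensembling is soft voting, sketched below under the assumption that each model exposes a `predict` method returning a per-pixel probability map; averaging before thresholding is one common choice among several.

```python
import numpy as np

def ensemble_predict(models, image, threshold=0.5):
    """Average per-pixel probabilities of several models, then binarise."""
    probs = np.mean([m.predict(image) for m in models], axis=0)  # soft voting
    return probs >= threshold
```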

3 Methods for Evaluating RSML Datasets

3.1 Pixel- Versus Instance-Based Accuracy Metrics for Buildings

To assess RSML building datasets, two major types of metrics are used: pixel-based and object-based accuracy measures (Rutzinger et al. 2009).

The per-pixel evaluation is applied to assess the accuracy of semantic segmentation, whereby limited attention is paid to individual building outlines. In practice, analyses evaluate the true positive (TP), false negative (FN) and false positive (FP) rates in segmentation through the assessment of matched, missed, and committed pixels. Building on this, a compound measure of TP, FN, and FP is the pixel-wise intersection over union (IoU). This metric, used to evaluate segmentation similarity, disregards how pixels are grouped into bounded regions (Zhang et al. 2021). In some papers, IoU is derived from both the building and background classes, albeit with the risk of biasing the assessment towards the background class.

Beyond pixel-level metrics, object-based evaluation methods are utilised to account for under- and over-segmentation of building instances (Eqs. 1–6). Unlike per-pixel segmentation evaluation, a user-defined object-based IoU threshold (Eq. 1), typically above 50%, is used to determine the TP, FP, and FN (Padilla et al. 2020; Sirko et al. 2021; Rutzinger et al. 2009; Van Etten et al. 2018; Eq. 2). Here, the computation of the IoU involves the areas of instances' bounding boxes or masks (Ji et al. 2019) or boundary lines (Wang et al. 2023; Xia et al. 2021). In area comparison, both raster (pixel) and vector models can be applied. However, the rasterisation of building polygons can result in distorted boundaries and poor performance (Avbelj et al. 2015; Rutzinger et al. 2009).

The IoU, TP, FP and FN serve as the foundation for calculating precision (P in Eq. 3), recall (R in Eq. 4), and the F1-score (F1 in Eq. 5). Precision gives the number of correct instances relative to the total number of RSML building instances. In contrast, recall refers to the number of correct RSML building instances relative to the total number of ground truth (GT) building samples. Combining both metrics, the F1-score represents the harmonic mean of precision and recall. In addition, for object-based approaches, the average precision (AP) is commonly utilised to measure potential trade-offs between precision and recall (Everingham et al. 2010; Padilla et al. 2020). The AP score summarises the area under the precision-recall curve constructed using approximating rectangles. Thereby, the width of each rectangle is determined by the difference between two consecutive recall points \((R_{n}-R_{n-1})\), and its height by the maximum of the corresponding precision values \((\max (P_{n},P_{n-1}))\) (Eq. 6). Notably, interpolation of AP can lead to errors or distorted measures and should thus be applied with additional care (Zhang et al. 2022).

Following from the description above, given \(Y\) as the total number of GT buildings and \(\hat{Y}\) as the number of RSML buildings, for each RSML building polygon \(\widehat{y_{i}}\in \hat{Y}\) and its corresponding GT building polygon \(y_{i}\in Y\) it follows that:

$$IoU=\frac{\widehat{y_{i}}\cap y_{i}}{\widehat{y_{i}}\cup y_{i}}$$
(1)
$$\hat{Y}\left(IoU\right)=\begin{cases} TP\,\textit{instances} & \text{if } IoU\geq 0.5\\ FP\,\textit{instances} & \text{otherwise} \end{cases}$$
(2)
$$P=\frac{TP}{TP+FP}=\frac{TP}{\hat{Y}}$$
(3)
$$R=\frac{TP}{TP+FN}=\frac{TP}{Y}$$
(4)
$$F1=2\cdot \frac{P\cdot R}{P+R}$$
(5)
$$AP=\sum _{n}({R_{n}}-R_{n-1})\cdot \max (P_{n},P_{n-1})$$
(6)
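
As an illustration of Eqs. 1–5, the following sketch derives the object-based precision, recall, and F1-score, assuming that each predicted building has already been matched to its best-overlapping GT polygon and that `ious` holds those best IoU values (AP, Eq. 6, additionally requires per-instance confidence scores and is omitted here).

```python
import numpy as np

def object_metrics(ious: np.ndarray, n_gt: int, iou_thr: float = 0.5):
    """Object-based P, R, and F1 from per-prediction best IoUs (Eqs. 2-5)."""
    tp = int(np.sum(ious >= iou_thr))            # Eq. 2: matched instances
    fp = len(ious) - tp                          # unmatched predictions
    fn = n_gt - tp                               # missed GT buildings
    p = tp / (tp + fp) if tp + fp else 0.0       # Eq. 3
    r = tp / (tp + fn) if tp + fn else 0.0       # Eq. 4
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # Eq. 5
    return p, r, f1
```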

3.2 Pixel- Versus Length-Based Accuracy Metrics for Roads

Pixel-based metrics, such as IoU and F1, are simple to implement and commonly used to evaluate road segmentation. However, to assess the topological properties of roads, length-based P and R, quality (Q) (Wiedemann et al. 1998), and the detour factor (Df) (Wiedemann and Ebner 2000) were introduced. Precision is calculated as the ratio between the length of correctly detected road segments (matched RSML road lines) and the total length of all detected road segments in the RSML dataset. Recall is calculated as the ratio between the length of correctly detected road segments and the total length of all road segments in the ground truth (GT) dataset. The quality is given by the proportion of the length of the matched RSML road relative to the sum of the total length of the RSML road and the length of the unmatched GT road (Eq. 7). The detour factor describes the ratio of the length of RSML roads to the length of GT roads (Eq. 8). More recently, SpaceNet has introduced the Average Path Length Similarity (APLS), a graph- and routing-based metric (Van Etten et al. 2018). This measure aggregates the absolute relative differences in optimal path lengths across pairs of randomly sampled corresponding nodes in the predicted and ground truth networks. Thereby, any path missing in the predicted network contributes a value of 1 to the sum (Eq. 9). However, the randomness of node sampling represents a major weakness and challenge in the practical implementation of APLS.

$$Q=\frac{\text{length of matched RSML road}}{\text{length of RSML road} + \text{length of unmatched GT road}}$$
(7)
$$Df=\frac{\text{length of RSML road}}{\text{length of GT road}\,}$$
(8)
$$APLS=1-\frac{1}{N}\sum \min \left\{1,\left| \frac{L\left(a,b\right)-L\left(a',b'\right)}{L\left(a,b\right)}\right| \right\}$$
(9)

In Eq. 9, \(a'\) is the node in the predicted road network nearest to the location of node \(a\) in the ground truth road network. \(L(a,b)\) and \(L(a',b')\) denote the lengths of the optimal paths in the ground truth and predicted road networks, respectively, and \(N\) is the number of paths considered.
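
For the length-based metrics of Eqs. 7 and 8, the arithmetic is straightforward once matched and unmatched lengths have been derived (e.g., via buffer matching); the following sketch assumes all lengths share one unit and that matched GT length equals total GT length minus unmatched GT length.

```python
def road_length_metrics(matched_rsml: float, total_rsml: float,
                        unmatched_gt: float, total_gt: float):
    """Length-based road metrics: precision, recall, quality (Eq. 7), Df (Eq. 8)."""
    precision = matched_rsml / total_rsml            # correct length / all detected
    recall = (total_gt - unmatched_gt) / total_gt    # matched GT length / all GT
    q = matched_rsml / (total_rsml + unmatched_gt)   # Eq. 7
    df = total_rsml / total_gt                       # Eq. 8
    return precision, recall, q, df
```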

4 A Synthesis of Benchmark Implementations

To gain insights into the performance of various methods and the metrics used for evaluation, we present a synthesis of the results from benchmark works in Table 1. Figure 1 compares the trends in reported accuracy between submissions to benchmark challenges and peer-reviewed articles published in the Scopus database within the last five years (see supplementary information). Our analysis included only articles whose abstracts explicitly mentioned keywords such as deep learning (DL), building extraction/detection, or road extraction/detection, and that used benchmark datasets.

Table 1 Best performing models used in benchmark datasets
Fig. 1 Performance of models trained on benchmark datasets

Analysing Table 1, the DL architectures predominantly implemented in the benchmark challenges feature UNet and its variants, which use an encoder-decoder structure. In addition to UNet and its variants in the family of semantic segmentation models, another commonly used network for building extraction is Mask R‑CNN in the family of instance segmentation models. Similarly, for road extraction, networks that use an encoder-decoder structure, such as UNet and D‑LinkNet, are among the most highly explored. As depicted in Fig. 1, common performance metrics tend to be lower in benchmark challenges (on the left side) than in academic publications (on the right side). This is potentially attributable to different evaluation procedures. Notably, benchmark challenges involve standardised tests for distinct algorithms and individual submissions. As a result, these undertakings provide an externally controlled environment for assessing the accuracy of RSML datasets. In contrast, academic manuscripts are subjected to a rigorous peer-review process involving anonymous assessment of the presented work; however, this process does not necessarily involve external evaluation of models. Moreover, researchers have the flexibility to select performance measures that align with the underlying research questions. Both Table 1 and Fig. 1 highlight the noticeable influence of image resolution and data source on the resulting accuracy. As expected, aerial and unmanned aerial vehicle (UAV) image data yield better results than satellite imagery.

When examining accuracy variations across diverse geographical regions and datasets, several intriguing questions arose. Notably, Table 1 demonstrates a very low accuracy in Khartoum, indicating potential issues for the application of global models in less-developed regions with poor data quality. Furthermore, Fig. 1 illustrates a surprising observation: models trained on the CrowdAI satellite dataset outperformed models trained on high-quality aerial and UAV image datasets. Following up on this finding, Adimoolam et al. (2023) suggested that this is potentially caused by duplication and leakage within the CrowdAI dataset, likely decreasing the overall credibility of open RSML datasets.

Collectively, these observations underscore the rationale of the presented study, which aimed to perform a comprehensive evaluation and comparison of globally trained models with their locally fine-tuned counterparts.

5 Experimental Set-Up

We conducted an experiment to extract buildings and roads by pre-training our models on samples from around the world and fine-tuning them with a few local samples from Rwanda. We then compared the accuracy of the generated dataset to that of open global datasets. Figure 2 illustrates the process undertaken, which can be summarised in three stages. Stage one was data collection. Stage two focused on model development, encompassing four sub-steps: (i) training data preparation, (ii) baseline model training, (iii) fine-tuning of the baseline models on locally sourced samples, and (iv) inference. Finally, stage three evaluated the generated building and road datasets, alongside publicly accessible global RSML datasets, against a local test dataset.

Fig. 2 Experimental flowchart

5.1 Datasets

Our datasets include the publicly available RSML building footprint and road datasets by Microsoft and Google. Moreover, we collected training datasets, detailed below, which were used to generate a novel dataset. Samples from benchmark datasets, including the WHU aerial dataset, the Zanzibar OpenAI building mapping dataset, and the Massachusetts road dataset, were used. The local samples were drawn from a variety of image sources in Rwanda.

The WHU Aerial Dataset

(Ji et al. 2019) was obtained from Wuhan University (http://gpcv.whu.edu.cn/data/). It contains a shapefile of 220,000 building instances manually digitised from an aerial image captured at 7.5 cm resolution, which covers an area of 450 km² in Christchurch, New Zealand.

The Zanzibar Dataset

is available in Kaggle’s open dataset repository (https://www.kaggle.com/datasets), containing pre-processed tiles (512 × 512 pixels, 20 cm resolution).

The Massachusetts Road Dataset

(Mnih 2013) was similarly downloaded from Kaggle’s open dataset repository (https://www.kaggle.com/datasets). It includes 1 m resolution aerial images of Massachusetts as well as rasterised OpenStreetMap road centrelines with a thickness of 7 pixels.

The Rwanda Imagery Dataset,

containing multisource images, was obtained from the National Land Authority (upon request). This dataset comprises nationwide aerial images (2009, 25 cm resolution), UAV images (2019, 30 cm resolution) of the major cities shown in Fig. 3, and a nationwide satellite image (2020, 50 cm resolution).

Fig. 3 Major cities in Rwanda and sample locations

5.2 Generating Training Data

The preparation of training data for DL models involved manual digitisation of local sample images in order to create building and road network shapefiles. Following this, images were resampled and cropped into 512 × 512-pixel patches. Next, all shapefiles were clipped to the dimensions of the patches. Finally, label shapefiles were rasterised into binary and instance masks (Table 2 summarises the phases).

Table 2 Data sources and pre-processing methods
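
As an illustration of the rasterisation step, the sketch below burns clipped building polygons into a binary mask aligned with one 512 × 512 image patch, using GeoPandas and rasterio; the file names are placeholders, and the exact tooling in our pipeline may differ.

```python
import geopandas as gpd
import rasterio
from rasterio import features

with rasterio.open("patch_512.tif") as src:          # one 512 x 512 image patch
    buildings = gpd.read_file("buildings_clipped.shp").to_crs(src.crs)
    mask = features.rasterize(
        ((geom, 1) for geom in buildings.geometry),  # burn value 1 for buildings
        out_shape=(src.height, src.width),
        transform=src.transform,
        fill=0,                                      # background
    )
```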

5.3 Training Process

The training involved two main phases: pre-training on a large training set containing samples from around the world and fine-tuning the pre-trained models on a specific dataset focused on Rwanda building and road types. The pre-training step aimed to equip the models with fundamental image recognition capabilities for buildings and roads, but not necessarily for specific local building and road types. Therefore, fine-tuning was needed to boost local accuracy.

5.3.1 Building Extraction Model

Two networks were trained. One leverages a ResUNet architecture (Zhang et al. 2018) for bottom-up building detection, while the other is a Mask R‑CNN (He et al. 2017) for end-to-end building detection. Because of the limitations of each individual approach, as discussed in Sect. 2.1, we devised a framework that seeks to combine their complementary strengths. In selecting these networks, our choice was informed by the literature analysis presented in Sects. 2.1 and 4. ResUNet (Zhang et al. 2018) combines the benefit of the U‑Net architecture, known to offer precise segmentation (Ronneberger et al. 2015), with residual connections from the ResNet structure to mitigate the vanishing gradient problem during training (He et al. 2016). Notably, this model has been used in recent global building and road extraction pipelines by Google (Sirko et al. 2021) and Microsoft. The second network, Mask R‑CNN, is popular for the detection and classification of instances. Both networks, along with a few other popular deep learning architectures, are making inroads into standard GIS software like ArcGIS Pro.

Apart from model architecture, the selection of an appropriate loss function is also critical for effective training and learning, because different loss functions can lead to significant variations in model performance. Building on previous research (Sirko et al. 2021), a weighted combination of Binary Cross-Entropy (BCE) and Focal Tversky Loss (FTL) (Eq. 10) was used in the ResUNet-based model and to replace the mask loss in the Mask R‑CNN-based model.

$$Loss\left(y,\hat{y}\right)=\omega _{1}\cdot BCE\left(y,\hat{y}\right)+\omega _{2}\cdot FTL\left(y,\hat{y}\right)$$
(10)

where ω1 and ω2 are the weights for BCE (Eq. 11) and FTL (Eq. 12), respectively.

$$BCE\left(y,\hat{y}\right)=-\frac{1}{N}\sum _{i=1}^{N}\left[y_{i}\log \hat{y}_{i}+\left(1-y_{i}\right)\log \left(1-\hat{y}_{i}\right)\right]$$
(11)
$$FTL=\left(1-\frac{TP}{TP+\beta \cdot FP+\left(1-\beta \right)\cdot FN}\right)^{\gamma }$$
(12)

BCE is a standard loss function for binary classification, measuring how well the model's predicted probabilities (\(\hat{y}\)) align with the true labels (y) across N pixels. FTL allows for a better trade-off between precision and recall and addresses the issue of data imbalance (Abraham and Khan 2019). Beta (β) is a factor to up- or down-weight FP or FN, whereas gamma, \(\gamma \in \left[1{,}3\right]\), controls the influence of easy-to-detect and hard-to-detect instances. If γ > 1, the loss function places increased focus on less accurate detections. Based on explorative experiments, we set ω1 = 1 and ω2 = 1.5. In addition, we set γ = 4/3 and β = 0.7, following the previous experiments described by Abraham and Khan (2019).
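
A PyTorch sketch of this combined loss, under the settings above (ω1 = 1, ω2 = 1.5, β = 0.7, γ = 4/3), might look as follows; it assumes `pred` contains per-pixel probabilities and is an illustration rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, w1=1.0, w2=1.5, beta=0.7, gamma=4/3, eps=1e-7):
    """Weighted BCE + Focal Tversky Loss (Eqs. 10-12) on probability maps."""
    bce = F.binary_cross_entropy(pred, target)                 # Eq. 11
    tp = (pred * target).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    tversky = tp / (tp + beta * fp + (1 - beta) * fn + eps)
    ftl = (1 - tversky) ** gamma                               # Eq. 12
    return w1 * bce + w2 * ftl                                 # Eq. 10
```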

As discussed earlier, one of the challenges in handling large-scale remote sensing imagery is variation in lighting conditions. This is the case with our local training dataset, which comprises images captured in different seasons and in distinct cities with varied landscapes. Consequently, we applied the mix-up technique, introduced in Sect. 2.3, which has been shown to effectively handle this issue and improve building detection (Sirko et al. 2021). Mathematically, mix-up constructs a weighted sample (\(\tilde{x}_{ij},\tilde{y}_{ij}\)) from two random samples (xi, yi) and (xj, yj) of the training dataset, as per Eqs. 13 and 14.

For semantic segmentation, as described by Sirko et al. (2021);

$$\tilde{x}_{i,j}=\lambda \cdot x_{i}+\left(1-\lambda \right)\cdot x_{j};\quad \tilde{y}_{i,j}=\lambda \cdot y_{i}+\left(1-\lambda \right)\cdot y_{j}$$
(13)

For instance segmentation as highlighted in Jiang et al. (2021),

$$\tilde{x}_{i,j}=\lambda \cdot x_{i}+\left(1-\lambda \right)\cdot x_{j};\quad \tilde{y}_{i,j}=y_{i}\cup y_{j}$$
(14)

where λ ∈ [0, 1]. By interpolating images, mix-up encourages the model to focus on prominent features common across images, potentially enhancing its generalisability.
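
The two variants can be sketched as follows; drawing λ from a Beta distribution follows Zhang et al. (2017), while the α parameter and the array/list representations are assumptions for illustration.

```python
import numpy as np

def mixup_semantic(x_i, y_i, x_j, y_j, alpha=0.2):
    """Mix-up for semantic segmentation (Eq. 13): interpolate images and masks."""
    lam = np.random.beta(alpha, alpha)              # lambda in [0, 1]
    return lam * x_i + (1 - lam) * x_j, lam * y_i + (1 - lam) * y_j

def mixup_instance(x_i, y_i, x_j, y_j, alpha=0.2):
    """Mix-up for instance segmentation (Eq. 14): interpolate images, union labels."""
    lam = np.random.beta(alpha, alpha)
    return lam * x_i + (1 - lam) * x_j, y_i + y_j   # list concat = union of instances
```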

5.3.2 Road Extraction Model

Our network for road extraction was based on ResUNet, the same architecture used in the Microsoft global road extraction pipeline. The model was pre-trained on around ten thousand image samples from the Massachusetts road dataset. To improve performance on Rwandan roads, the model was fine-tuned on a smaller dataset of around three thousand local samples, split into three subsets based on road type. In other words, we developed three distinct local models for different local road types. This was done after observing that a model fine-tuned on the combined local dataset exhibited a bias towards paved roads, failing to adequately detect other road types, including earthen and rural roads, which are common in the local development context. Notably, this bias originates from imbalances in the pre-training dataset, which lacked sufficient representation of the road types present in least-developed countries. During training, we implemented mix-up as described in the section above. For the loss, we set γ = 4/3 and β = 0.3. Table 3 summarises the described training parameters.

Table 3 Models’ setting

5.4 Inference

To ensure a fair comparison with the Google and Microsoft datasets, inferences were made from satellite imagery. For building detection, we explored alternative configurations: individual ResUNet and Mask R‑CNN models as well as their ensemble, each trained with and without mix-up. For road detection, we implemented an ensemble of the various models fine-tuned on different subsets.

The resulting predictions from ResUNet underwent post-processing to find and refine the geometry of instances. Morphological closing and opening operations were applied to fill holes in the predicted building masks. Subsequently, connected component analysis was used to obtain instances from the binary mask. Notably, impervious surfaces, including roads, can be easily confused with buildings. Therefore, road masking was performed in order to remove roads which were misclassified as buildings. Moreover, since buildings possess well-defined shapes and geometries, we analysed the predicted buildings using the area-to-circumference ratio and compactness. This enabled us to distinguish real roofs from potential noise. Lastly, building instance masks were vectorised. In parallel, to generate road vectors, predicted road masks were thinned, then vectorised and smoothed.
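
The building post-processing chain can be sketched as follows, assuming boolean prediction and road masks; the compactness threshold is an illustrative value, not our exact setting.

```python
import numpy as np
from skimage import measure, morphology

def postprocess_buildings(mask, road_mask, min_compactness=0.15):
    """Close/open the mask, remove road pixels, and keep compact instances."""
    mask = morphology.binary_closing(mask)       # fill small holes in roofs
    mask = morphology.binary_opening(mask)       # remove thin artefacts
    mask = mask & ~road_mask                     # drop roads misclassified as buildings
    kept = np.zeros_like(mask)
    for region in measure.regionprops(measure.label(mask)):
        # Compactness = 4*pi*area / perimeter^2; near 0 for noisy slivers.
        compactness = 4 * np.pi * region.area / max(region.perimeter, 1) ** 2
        if compactness >= min_compactness:
            kept[tuple(region.coords.T)] = True  # keep plausible roof instances
    return kept
```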

5.5 Evaluation

To evaluate the generated, Microsoft, and Google datasets, a reference dataset of building polygons and road lines was first prepared using heads-up digitisation. Second, for comparability of geometries, adjustments were made to align the global datasets with the local datasets. In fact, we noticed that the global datasets exhibited misalignments and fragmentation of roofs (Fig. 4). Such displacements become obvious if images have been orthorectified using a terrain model only; in such cases, features like buildings are projected away from the camera projection centre. Another issue seems to be that the extent of the study made it practically impossible to control the quality of the input images used, which may have resulted in the use of poor-quality images, including poorly georeferenced ones. The major challenge in adjusting the observed shifts was that they were not systematic across areas. Therefore, rubber sheeting was performed on a per-site basis to compensate for the sporadic shifts. Third, we applied a minimum area threshold of 50 m² (based on local expert knowledge) to remove negligible polygons and effectively exclude potential noise detections from the evaluation.

Fig. 4 Fragmentation of roofs. Pink and green polygons represent the Google and Microsoft datasets, respectively

After addressing the shifts and cleaning all datasets, a polygon-wise IoU threshold of 0.5 was applied to derive precision and recall measures for buildings (see the description of building metrics in Sect. 3.1). In the same way, after aligning the road vectors, length-based precision, recall, quality, and detour factor measures were derived to assess performance across the road datasets (see the description of road metrics in Sect. 3.2). Accuracy measures were derived from six sites, as illustrated in Fig. 5, representing local and complex real-world development scenarios.

Fig. 5 Testing areas (Sites 1–6, in order a, b, c, d, e, f) with detailed insets

Site‑1 shows a recently established settlement featuring well-organised, freestanding, and uniformly fashioned buildings, which appear straightforward to delineate. In this area, all buildings are connected to earth roads, with some road segments covered by grass, resembling rural footpaths. Site‑2 presents a planned neighbourhood comprising buildings with complex roof shapes, which could require additional attention for accurate delineation. This site is serviced by well-conditioned and clearly discernible roads. Site‑3 contains high-rise, highly complex apartment buildings, making delineation challenging. Here, the delineation of buildings may be affected by the aforementioned relief displacement problem. This settlement is serviced by tarmac roads, improved footpaths, and numerous paved surfaces, including playgrounds and parking areas; the abundance of these features can complicate the process of correctly identifying and delineating buildings. Site‑4 represents a mixed neighbourhood made of easy- and hard-to-detect buildings. The road network is clearly discernible; however, smaller footpaths pose a potential challenge for detection. The penultimate site, site‑5, represents high-rise building blocks with highly fragmented roofs. The scenery contains a number of artefacts and paved surfaces, which can be hard to discriminate from buildings. As in site‑3, relief displacement can be expected to be a challenge for this site. Regarding road extraction, the site is serviced by tarmac roads filled with cars, the latter causing difficulties for delineation. Lastly, site‑6 is characterised by tiny, touching buildings as well as a poor road network, challenging the delineation of both.

In general, our testing dataset was carefully designed to be diverse, incorporating both easy and challenging scenes. Sites 3, 5, and 6 were purposefully chosen for their high level of clutter, making it difficult to detect structures like buildings. This inclusion of complex scenarios alongside simpler ones (like sites‑1, -2, and -4) allows us to thoroughly assess the robustness of DL-based detection models in handling diverse real-world conditions.

6 Results and Discussion

The primary objective of this study was to evaluate large-scale RSML datasets, including a novel dataset generated with models pre-trained on numerous samples from across the world and then locally fine-tuned on a limited number of local samples. Hence, we initially present our findings on the locally fine-tuned models' performance, considering the impact of various training and inference strategies. Focusing on building extraction, Figs. 6, 7 and 8 present the results of the ablation analysis, examining the potential benefits of the mix-up and ensemble strategies, as well as the strengths and weaknesses of the bottom-up and end-to-end approaches to building detection.

Fig. 6 Visual comparison of various models revealing the effect of mix-up and ensembling strategies (building models)

Fig. 7 Performance of ResUNet, Mask R‑CNN and ensemble models compared

Fig. 8 Disaggregated evaluation of ResUNet, Mask R‑CNN and ensemble models

Figure 6 shows the predictions of Mask R‑CNN and ResUNet, each trained with and without mix-up, as well as the results of the ensemble model combining predictions from the two. As shown in the figure, mix-up improved the detectability of buildings. For instance, the Mask R‑CNN model with mix-up detected several instances that were missed by the model without mix-up. This improvement is further corroborated by the ResUNet model: after mix-up, ResUNet is able to detect previously fractured roofs in their entirety.

Comparing Mask R‑CNN, which implements end-to-end detection, with ResUNet, which implements bottom-up detection, reveals several trade-offs, as illustrated in Fig. 6. The Mask R‑CNN model outperforms the ResUNet model in producing instances with less noise; however, it tended to include non-building pixels near the edges. In contrast, ResUNet exhibits precision on roof boundaries. Even so, its emphasis on individual pixels led the model to classify pixels belonging to the same roof differently, owing to small variations in reflectance. Roof parts with slight variations in lighting, resulting from roof structures or installations, were interpreted as separate objects, causing roof fragmentation. This is observable where predicted roofs present many holes and fragments.

Consequently, the Mask R‑CNN model significantly outperformed ResUNet, achieving a precision of 54% compared to 44% and a recall of 56% compared to 52% across all testing sites (Fig. 7). In addition, an ensemble model combining the predictions of Mask R‑CNN and ResUNet could help achieve a compromise in performance. Results in Fig. 7 demonstrate that the ensemble model generally outperformed the individual models. It achieved 60% precision and 58% recall, which represents a 14% and 6% improvement in precision and a 6% and 2% improvement in recall over ResUNet and Mask R‑CNN, respectively. However, disaggregated results at the individual site level (Fig. 8) reveal that ensembling did not always lead to improvements in both precision and recall. In many areas, e.g. sites‑1, -2 and -3, Mask R‑CNN exhibited better precision, which degraded when its predictions were combined with those from ResUNet. In contrast, in these areas, ensembling led to a significant improvement in recall (Fig. 8). This suggests that a given model might be preferable depending on the trade-off between recall and precision at each site. This finding aligns with the recommendation by Meyer and Pebesma (2022) to assess the area of applicability for each model when building accurate large datasets.

Following the ablation analysis, we now turn to evaluating the effectiveness of local fine-tuning in comparison with global models, using the results presented in Figs. 9, 10 and 11. Starting with the RSML building datasets, the evaluation results in Fig. 9 demonstrate that our ensemble model, which achieved the highest score during the ablation analysis, outperforms the global models. Evaluating each site individually, Fig. 9 reveals significant spatial variations in the performance of all evaluated methods. The models exhibited performance degradation when applied to scenes of increasing difficulty, starting from sites‑1 and -2, followed by -4, -6, -3, and then -5. Sites‑1 and -2 exhibited the best results (P and R above 70%) due to the presence of visually well-defined building outlines. Site‑4, a typical mixed neighbourhood, exhibits moderate performance, with a precision near 70% and a recall close to 60%. In site‑6, DL-based methods show a high precision of around 70%, but suffer from a lower recall, with 60% of buildings missed owing to small and clustered buildings. Sites‑3 and -5 yielded the poorest results. In site‑5, dominated by high-rise buildings and building blocks, only around 50% of the buildings were detected correctly. Site‑3 proved to be the most challenging scene, where the DL methods mainly suffered from over-segmentation. Here, the model could only detect 35% of the buildings, and only about 20% of those detections were correct.

Fig. 9 Accuracy of RSML buildings. Notation: G, M, and L represent Google's building dataset, Microsoft's building dataset, and our generated building dataset, respectively. Asterisks (*) indicate datasets manually adjusted by rubber sheeting

Fig. 10 Accuracy of RSML roads. Notation: M and L represent Microsoft's road dataset and our generated road dataset, respectively. Asterisks (*) indicate data manually adjusted by rubber sheeting

Fig. 11 Visual comparison of various datasets

Next, Fig. 10 presents the evaluation results of our generated road dataset and the Microsoft road dataset. These results revealed marginally higher precision and recall for our model compared to the raw version of the Microsoft road data. However, this does not indicate poor detection capabilities of the Microsoft road extraction model; rather, it points to issues with location shifts inherent in the images from which the data were generated. In fact, the adjusted dataset (M*) showed better precision and recall than our generated dataset. This finding suggests that while global datasets offer value, their usefulness is limited by challenges in controlling the quality of the input data. Furthermore, it suggests that transparently sharing the underlying methods could be more beneficial than simply sharing generated datasets. In the context of the model used to generate a global road dataset, this would allow users to apply or fine-tune the method on their own high-quality data, potentially achieving even better local results.

Overall, and in contrast to building extraction, DL-based methods achieved precision and recall of more than 70% across all testing sites, which represents a noteworthy accomplishment in road extraction. In particular, the good precision and recall achieved by the Microsoft road dataset highlight the robustness of DL-based methods for road extraction, even at very large scales. The underperformance of our model compared to the one used to generate Microsoft's road dataset can be attributed to potential inaccuracies in the labels used for initial training: these labels were automatically generated from OpenStreetMap centrelines, which makes them prone to imprecision.

Figure 11 provides a visual comparison of our RSML building and road datasets with the open global datasets, enabling a more nuanced evaluation of both data sources. Examining the predicted building shapes, we noticed that all models perform well on sites‑1 and -2, moderately well on site‑4, and poorly on the remaining sites, consistent with the quantitative evaluation results. Figure 11 also highlights an interesting observation in site‑6: bottom-up methods may achieve good semantic segmentation but struggle to separate individual buildings. In this site, the Microsoft model performed better in segmentation despite showing low precision in the quantitative evaluation (Fig. 9). Lastly, visually comparing our generated road network with Microsoft's exposes a key challenge: inconsistencies in post-processing that lead to variations in data quality.

In a nutshell, our evaluation underscores the practical value of RSML datasets by demonstrating sufficient performance of deep learning models in extracting buildings and roads across various areas. According to Mayer et al. (2006), a recall of 60–70% and a precision of 75–85% are considered the minimum accuracy for the practical utility of RSML road datasets. Interestingly, the data generated by the locally fine-tuned road model appear usable in four out of six sites. Considering the limitations arising from imprecise labels in the initial training, improved data could lead to performance gains, potentially matching or surpassing the results achieved with the manually adjusted Microsoft dataset. Applying the same standards as for RSML roads to RSML buildings, the locally fine-tuned model's generated datasets for sites‑1, -2, and -4 would be suitable for some practical applications. Notably, these sites contain primarily single-detached houses, a building type that, according to a recent study (Dieye et al. 2023), represents nearly 85% of all buildings in Rwanda, which implies good local applicability of our model.

7 Conclusion

Within this study, we assessed the local accuracy of large-scale RSML building and road datasets. Beyond assessing the accuracy of open, global RSML datasets, we proposed an efficient framework to generate more accurate local datasets. Our framework combines bottom-up and end-to-end segmentation for building detection. For road detection, it combines various regionally optimised road models that capture different road types. It also capitalises on the huge number of publicly available samples from across the world together with a limited number of locally sourced samples. Our framework tackles several limitations of global models, including the lack of local context, difficulty in controlling input data quality, and inherent inaccuracies. Our evaluation demonstrated the effectiveness of the proposed framework. Notably, the generated building data achieved higher accuracy compared to open global datasets, while the road extraction model showed moderate performance. Our study emphasised the importance of disaggregated evaluation of RSML datasets to capture potential localised discrepancies and identify areas of application. Beyond this, it underscores aspects in which further exploration is needed.

Our study is not without limitations. Our evaluation was constrained by inconsistencies in the implementation of data generation pipelines, especially post-processing, owing to the lack of transparency and methodological details, particularly for the Microsoft datasets, potentially affecting the fairness of the comparison. Despite that, the evaluation remains valuable: it illuminated the potential and limitations of open global RSML datasets and, most importantly, enabled the exploration of an alternative, efficient pipeline for generating data with enhanced local accuracy.

Accurate building data can be utilised for various downstream applications, including wealth and population mapping. In future work, we intend to explore additional image sources, potentially incorporating LIDAR data into our building detection framework. We believe that this will improve the detection and classification of building instances into groups, ultimately allowing the underlying variation in household wealth status to be inferred.