Training phase of CNN models per UTM grid zone
Hyper-parameters tuning
During the training of the model for each UTM grid zone, 10% of the training data were reserved for validation in order to prevent over-fitting of the CNNs. The input Sentinel-2 composite data were rescaled to the range [0, 1]. The number of training epochs was set to 25. The weights were initialized from a uniform distribution with bounds [−0.1065, 0.1065]. Finally, Adam stochastic optimization with a learning rate of 0.0001 was used to minimize the binary cross-entropy (log loss) function:
$$L\left( y, \hat{y} \right) = - \frac{1}{N}\sum_{n = 1}^{N} \left[ y_{n} \log \hat{y}_{n} + \left( 1 - y_{n} \right)\log \left( 1 - \hat{y}_{n} \right) \right]$$
(1)
where N is the number of training samples, y is the vector of the true target values of the training set in binary coding, and \(\hat{y}\) is the vector of the model responses in the continuous range [0, 1]. The cross-entropy loss has a fast convergence rate and is numerically stable when coupled with sigmoid normalization [85].
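As a minimal sketch of the loss and initialization described above, the following NumPy snippet evaluates the standard binary cross-entropy, whose second term is (1 − y) log(1 − ŷ), on a few toy values. The function name, the clipping constant `eps`, the weight shape and the sample values are illustrative assumptions, not part of the GHS-S2Net implementation.

```python
import numpy as np


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over N samples.

    eps is an assumed safeguard (not from the text) that keeps log() finite
    when a prediction is exactly 0 or 1.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_sample = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(per_sample))


# Uniform weight initialization with the bounds quoted in the text;
# the (3, 3) shape and the seed are arbitrary illustrative choices.
rng = np.random.default_rng(0)
weights = rng.uniform(-0.1065, 0.1065, size=(3, 3))

# Toy labels and predictions (illustrative values only).
y = [1, 0, 1, 0]
y_hat = [0.9, 0.1, 0.8, 0.3]
loss = binary_cross_entropy(y, y_hat)
print(f"loss = {loss:.4f}")
```

Confident, correct predictions drive the loss toward zero, while a constant prediction of 0.5 gives log 2 regardless of the labels.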
Performance evaluation
To evaluate the classification performance of the models during training and to prevent overfitting, 10% of the training data were used for validation. Figure 7 shows the average loss curves of the 485 GHS-S2Net models during training and validation, which lasted 25 epochs. Every model corresponds to one UTM grid zone: 485 of the 615 grid zones cover landmass with built-up areas present according to the learning sets. The learning curves show that both the average training loss (green curve) and the average validation loss (red curve) decrease rapidly and stabilize, converging at around 12 epochs. The gap between the two curves is very small even during the first 5 epochs and disappears completely after around 12 epochs, which shows that the size of the training sets, selected following the two-stage training approach, is optimal and that the models have good generalization capacity [86].
Computational performance of the GHS-S2Net models during the training and prediction phases
Both training and prediction were performed on GPUs, and their runtimes are reported in Fig. 8. The elapsed times are reported separately for UTM grid zones predominantly covered by land (204 grid zones) and for zones predominantly covered by water (281 grid zones). In inland zones, more training samples are usually fed to the GHS-S2Net, while in water zones the number of training samples is smaller. The stacked bar plots show that the average training time is around 3600 s, while the average prediction time is around 15,500 s. For inland zones, the average training time is 3900 s and the average prediction time is 16,400 s, while for water zones the processing is shorter, with an average training time of 3100 s and an average prediction time of 15,000 s.
These results show that the GHS-S2Net-based multi-modeling approach scales seamlessly on a distributed multi-GPU platform. For processing at a global scale, our main constraint was the limited number of concurrently available GPUs: we employed the 6 GPU modules for the training phase and the 2 modules for the prediction phase that were available at the time of deployment. Despite these limitations, we managed to scale up the GHS-S2Net-based multi-modeling approach and to process a dataset with global coverage at 10 m spatial resolution thanks to: (i) an efficient partitioning of the processing per UTM grid zone, (ii) the two-stage training approach with a subsampling of non-built-up patches within the selected tiles containing training samples, and (iii) the optimal size of input data (i.e., 100 × 100 km tiles) used for both training and prediction. Increased GPU capacity can significantly reduce both the training and the prediction time of the GHS-S2Net model, and activating early stopping, which ends training once the loss function stops improving, can further reduce the number of training epochs.
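The early stopping mentioned above was not activated in this work; the following is only a hedged sketch of the general technique, written in plain Python. The `patience` and `min_delta` parameters and the validation loss values are hypothetical and are not taken from Fig. 7.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs. If that never happens,
    all epochs are used.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Hypothetical validation losses that flatten out after a few epochs.
val_losses = [0.50, 0.40, 0.35, 0.33, 0.33, 0.34, 0.33, 0.32]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch)
```

With a loss curve that converges around epoch 12, as reported for the average GHS-S2Net curves, a rule of this kind would end training well before the fixed 25 epochs. The exact saving depends on the chosen patience.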
Qualitative assessment of the models' predictions
The results of the GHS-S2Net implementation on the Sentinel-2 global mosaic were assessed visually. Compared to the training sets, the built-up detection results showed a significant reduction in commission errors, omission errors and other artifacts that were observed in the training sets (see Sect. 2.2). In addition, GHS-S2Net produced a refined mapping of built-up areas and open spaces within urban areas and, most importantly, detected new settlements that had not been annotated in the training sets or identified in any other global-scale dataset. Figure 9 illustrates some examples of each type of improvement obtained with the GHS-S2Net models. Figures 9, 10 and 11 show, for selected cities, the enhanced detection of built-up areas, represented as continuous-range outputs (probability), in comparison with the best available training sets. The most notable improvements relate to the detection of built-up areas that are omitted from the training sets, under the assumption that the initial purpose of these datasets was to map completely the contiguous areas they cover. These omissions are due either to a lack of data or to flaws and gaps in the training sets themselves, given that they were all extracted through automatic classification of satellite imagery. In the case of FB_HRS (Fig. 9a: 7.34 Latitude, 3.90 Longitude), the most critical omissions were systematically observed in dense built-up areas (often corresponding to urban cores), while in ESM_BU (Fig. 9b: 51.44 Latitude, −0.97 Longitude), the omissions were essentially due to a lack of input satellite data in some countries (mainly the United Kingdom and Ireland). In the case of MS_BFP (Fig. 9c: 43.11 Latitude, −79.05 Longitude), most of the omissions concerned large industrial buildings, but several small buildings were also not detected in this training data. For GHSL_BU (Fig. 9d: 30.51 Latitude, 120.67 Longitude), underdetections were mainly observed in rural areas, in particular in small scattered settlements, because the built-up structures were too small to be captured at the sensor’s spatial resolution.
Figure 10 is another example, highlighting the capacity of the GHS-S2Net to significantly reduce the commission errors observed in the training sets that were fed to the models. In the case of MS_BFP, overdetections were mainly observed in mountainous areas with bare rocks or in agricultural areas with bare fields (Fig. 10a: 33.25 Latitude, −90.62 Longitude). In the case of ESM_BU, overdetections were frequently identified in sand dunes (Fig. 10b: 43.36 Latitude, 16.65 Longitude), rocky beaches, bright bare soils and riverbeds.
The visual comparison of the GHS-S2Net probabilistic output against the best available training sets provides clear evidence of the refined detection of built-up areas from the Sentinel-2 image composite. Figure 11 is an example of such enhanced capabilities, covering the city of Sassari (Italy). It compares the ESM_BU training set, derived from VHR satellite data at a spatial resolution of 2 m, to the results obtained by the GHS-S2Net trained with ESM_BU. These results illustrate the unprecedented performance of GHS-S2Net for pixel-wise classification of 10 m Sentinel-2 data and for detecting urban structures in complex urban environments. Not only is the classification of built-up areas more refined, despite the coarser spatial resolution of Sentinel-2 data (10 m) in comparison with the VHR imagery used for producing ESM_BU (2 m) (Fig. 11b), but it is also almost possible to identify single buildings as well as open spaces in the urban layout. In addition, the probabilistic output appears to be closely related to the patterns of built-up areas, suggesting that the GHS-S2Net output may serve as a proxy measure for building densities.
These examples provide experimental findings that support the generalization capacity of the GHS-S2Net model, which was already evidenced during the training phase (3.1.2). With a relatively small number of parameters (1,447,042 trainable parameters) and a very large number of samples (511,502,073 built-up patches in total; see Supplementary material R1 for the training samples per UTM zone), the model proved to be robust to noise and missing data in the training sets, while effectively capturing the essential patterns and salient features, resulting in a precise mapping of built-up areas.
Validation of the model predictions and assessment of generalization performance
Two approaches were implemented for the validation of the GHS-S2Net output, both based on comparison with independent cartographic data of building footprints that were not used to train the models:
- Continuous assessment: by testing the GHS-S2Net output as predictor of the built-up densities at the spatial resolution of 10 m through least-square linear regression;
- Binary assessment: by evaluating the contingency table between the binarized outputs of GHS-S2Net after the application of a probability cut-off value, and the binarized reference data used as a “ground-truth.”
For the validation of pixel-wise predictions, a reference spatial database was developed that includes single building delineations derived from digital cartography at a nominal scale of 1:10,000. The suitability of this database for the global-scale validation of built-up products derived from remote sensing data was previously evaluated in Corbane et al., 2019 [5]. The reference database consists of more than 40 million individual building polygons selected from 277 different areas of interest (AOI) around the globe. These are mostly local administrative units covering specific cities or full counties (for the United States of America) and are spread across different continents. While they do not cover all the combinations of geographical, environmental, and cultural conditions that determine settlement patterns, the reference data are spread across various landscapes. The reference years of the collected data range between 2012 and 2018, with the latter being the most frequent year of update. This makes the reference database suitable for the validation of the results derived from the Sentinel-2 pixel-based image composite produced for the reference year 2018. The building footprints span the whole spectrum of low-density and high-density human settlement patterns, representing typical rural, suburban and urban spatial patterns (see Supplementary material R2 for more information on the spatial distribution and characteristics of the reference dataset). To support the accuracy assessment, the reference data collected in vector format were converted into binary raster layers indicating the presence or absence of built-up areas. The rasterization of the vector cartographic data was performed at a spatial resolution of 10 m, corresponding to the spatial resolution of the Sentinel-2 image composite and of the outputs of the GHS-S2Net model.
Continuous assessment: validation of the model output as predictor of built-up densities
To analyze the performance of the GHS-S2Net model as a predictor of the densities of built-up areas, we performed a regression analysis between the probability of built-up areas given by the model as response and the reference built-up surface densities derived from the database of building footprints for the 277 different areas of interest. Knowing the systematic bias and gain parameters of the automatically classified built-up areas gives insight into the capacity of the GHS-S2Net model to capture the patterns and densities of built-up areas, and helps to identify a suitable threshold for the binarization of the output probabilities in the subsequent accuracy assessment step.
The strength of the linear relation between the automatically generated built-up probabilities and the reference data is assessed through the Pearson correlation coefficient (r). The gain factor (slope) allows the user to model, retro-fit and compare the results obtained from the GHS-S2Net model for the different AOIs. In addition, the slope of the regression is an indicator of the optimal threshold for translating the built-up probabilities to binary values for the pixel-based accuracy assessment.
The results of the regression analysis at 10 m for all AOI sites showed an average correlation coefficient r of 0.67 and an average slope of 0.52 (Fig. 12).
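The regression quantities used here (Pearson r, slope and bias) can be computed per AOI along the following lines. This is a hedged sketch on synthetic values: the function name is hypothetical, and fitting the reference density as a linear function of the model probability is an assumption about the direction of the regression, chosen so that the slope translates probabilities into densities.

```python
import numpy as np


def regression_summary(prob, density):
    """Least-squares fit density ~ slope * prob + bias, plus Pearson r.

    prob    : model output probabilities in [0, 1] for one AOI
    density : reference built-up surface densities for the same pixels
    """
    prob = np.asarray(prob, dtype=float).ravel()
    density = np.asarray(density, dtype=float).ravel()
    slope, bias = np.polyfit(prob, density, deg=1)  # highest degree first
    r = np.corrcoef(prob, density)[0, 1]
    return float(slope), float(bias), float(r)


# Synthetic, perfectly linear example (not data from the study).
prob = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
density = 0.5 * prob + 0.05
slope, bias, r = regression_summary(prob, density)
print(f"slope={slope:.2f} bias={bias:.2f} r={r:.2f}")
```

Averaging the per-AOI values of r and slope over all AOIs would then give summary figures of the kind reported above.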
The average correlation coefficient shows that the output probabilities of the GHS-S2Net models capture a substantial part of the structural variability in built-up areas (r = 0.67 corresponds to a coefficient of determination r² of about 0.45). The lowest correlation coefficients were observed for AOIs covering complete counties in the United States, where many buildings are smaller than 100 m² (the limit of detectability of the Sentinel-2 sensor) and the built-up density is very low, less than 0.5%. This is, for instance, the case of the Matanuska-Susitna Borough AOI, a borough located in the state of Alaska covering an area of 9492.46 km² with a built-up density of 0.1% and an average building size of 140 m² (Supplementary material R2). The output probabilities of the GHS-S2Net models seem to capture building densities better in urban areas and high-density AOIs, where the correlation coefficients were greater than 0.6. This is the case, for example, of the AOI covering the city of San Francisco, with an area of 194 km² and a building density of 26.4%.
It is also worth noting that the gain factor (slope) translating the built-up probabilities derived from Sentinel-2 data into built-up surface densities derived from the reference cartographic data is not constant across AOIs. The slope has an average of 0.2 in low-density AOIs, in particular those covering full counties in the United States (e.g., San Juan County). In high-density AOIs covering cities, the slope (gain) is higher (e.g., in the city of Rome, where the slope is close to 0.8), with an average around 0.54.
According to these findings, it is not straightforward to define one general-purpose threshold to binarize the output of the GHS-S2Net models into the two classes “built-up” and “non-built-up.” A threshold of 0.2 would be a good compromise for large areas that include scattered settlement patterns, in particular rural areas, while a more conservative threshold of 0.5 would be more suitable for areas largely dominated by high-density built-up areas (i.e., city centers). Following this finding, both thresholds were applied to the outputs of the GHS-S2Net models to assess the quality of the classifications with a pixel-wise accuracy method.
Binary accuracy assessment
The thresholds 0.2 and 0.5 identified in the previous regression analysis were used to binarize the probabilistic output, as required by the pixel-wise binary accuracy assessment at the spatial resolution of the sensor. Standard accuracy and error metrics derived from the confusion matrix were calculated for the binary results obtained with the two thresholds. Given the lack of a single universally accepted measure of agreement, we use a combination of two main performance metrics to give a complete picture of the performance of the GHS-S2Net models: the Balanced Accuracy and the Kappa coefficient, which were introduced to the remote sensing community and recommended by Congalton, 2011 [87]. Both are measures of classification accuracy. The Balanced Accuracy provides information about the rate of correctly classified pixels in an unbalanced setting where non-built-up pixels are predominant compared to built-up pixels. The Kappa coefficient compensates for random chance in the pixel assignment.
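To make the two metrics concrete, the sketch below binarizes a probability array at a given cut-off and derives the Balanced Accuracy and Cohen's Kappa from the resulting confusion matrix. The function name and toy arrays are illustrative assumptions, as is the choice of `>=` for the cut-off comparison. The code also assumes that both classes are present in the reference data.

```python
import numpy as np


def binary_metrics(prob, reference, threshold):
    """Balanced accuracy and Cohen's kappa for a thresholded probability map.

    Assumes the reference contains both built-up (1) and non-built-up (0)
    pixels; otherwise sensitivity or specificity is undefined.
    """
    pred = np.asarray(prob, dtype=float).ravel() >= threshold
    ref = np.asarray(reference).ravel().astype(bool)

    tp = int(np.sum(pred & ref))    # built-up correctly detected
    tn = int(np.sum(~pred & ~ref))  # non-built-up correctly rejected
    fp = int(np.sum(pred & ~ref))   # commission errors
    fn = int(np.sum(~pred & ref))   # omission errors
    n = tp + tn + fp + fn

    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    balanced_accuracy = (sensitivity + specificity) / 2.0

    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    kappa = (observed - expected) / (1.0 - expected)
    return {"balanced_accuracy": balanced_accuracy, "kappa": kappa}


# Toy probabilities and reference labels (illustrative values only).
prob = [0.1, 0.3, 0.6, 0.9, 0.05, 0.25]
reference = [0, 1, 1, 1, 0, 0]
metrics_02 = binary_metrics(prob, reference, threshold=0.2)
metrics_05 = binary_metrics(prob, reference, threshold=0.5)
print(metrics_02, metrics_05)
```

Lowering the threshold trades omission errors for commission errors. In this toy example the two effects cancel, so both cut-offs give the same scores.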
The results of the per-pixel accuracy assessment with the two binary outputs are summarized in Fig. 13, disaggregated per continent. The figure shows the averages and standard deviations of the Balanced Accuracy and Kappa coefficients per binary output and per continent. The 277 AOIs were grouped by continent to highlight major improvements, especially in areas where previous global products failed to produce satisfactory results. Overall, both binary classifications produce good results, with an average Balanced Accuracy greater than 0.7 and an average Kappa greater than 0.5. However, compared to the binary outputs derived with the 0.5 probability threshold, the classification with the less conservative threshold of 0.2 agrees better with the reference data, consistently for all continents. The best results for the least conservative classification outputs (threshold of 0.2) were obtained in Oceania and Asia, with an average Balanced Accuracy of 0.91, followed by North America and Africa, where the mean Balanced Accuracies were equal to 0.86 and 0.85, respectively.
The results of the per-pixel accuracy assessment, in particular those obtained by applying a low threshold to the probability outputs, constitute strong evidence of the modeling power of the GHS-S2Net and of the reliability of its outputs. They also confirm the merit of the new classification framework for identifying settlements in challenging landscapes such as those in Africa and Asia. Finally, they suggest that a low probability threshold is recommended for generating a global binary classification from the probabilistic output of the models, in particular if the purpose is to capture all the scattered settlements in rural landscapes such as those in Africa. In this particular context, the binary outputs obtained with a threshold of 0.2 significantly outperform those derived from the conservative threshold.
Comparison between the results of close range and far range transfer learning
When computing the GHS-S2Net predictions at the global scale, the majority of the UTM grid zones, and in particular of the 100 × 100 km tiles, were processed with close range transfer learning. However, to mitigate the scarcity and quality issues in the training dataset, 28 UTM grid zones were classified with far range transfer learning, and the outputs were compared to those obtained by direct close range transfer learning. Figure 14a illustrates the differences between close range (middle figure) and far range transfer learning (bottom figure) in areas suffering from a lack of training samples (e.g., in Ethiopia). It shows the capacity of far range transfer learning to discover undetected built-up features in UTM grid zone 37P on the basis of the parameters of the model trained in the neighboring UTM grid zone 37M. In such a situation, close range transfer learning was less effective in identifying those scattered settlements because of insufficient training samples in UTM grid zone 37P.
Figure 14b shows another example, for the city of Moscow, of the added value of far range transfer learning in areas where only the GHS_BU low-resolution training data were available (UTM grid zone 37U). The example highlights the generalization capacity of the GHS-S2Net when trained on a UTM grid zone where detailed training samples are available (e.g., UTM grid zone 34U) and then applied to the nearby zone. The generalization capacity of the model is reflected here in: (i) reproducing fine-scale settlement structures in dense built-up areas, (ii) reducing overdetections of roads and other impervious features and (iii) enhancing the sharp delineation of buildings and open spaces in the built-up areas.
Moscow is one of the cities where detailed building footprints were available in the reference database used in the validation exercise. The availability of “ground-truth” data made it possible to conduct a quantitative binary accuracy assessment of the results of far range transfer learning in comparison with those obtained with close range transfer learning. The results are reported in Table 2 for the binary outputs with cut-off values of 0.2 and 0.5. They show higher overall and balanced accuracy values resulting from the application of far range transfer learning. These results are additional evidence of the enhanced mapping capabilities of the well-designed far range transfer learning approach deployed in this work.
Table 2 Summary of the results of the binary accuracy assessment of close range and far range transfer learning in the city of Moscow, based on detailed building footprints
The encouraging results were decisive for expanding the application of far range transfer learning, which was finally implemented on a total of 28 UTM grid zones. The selection of source and target UTM grid zones was mainly driven by spatial adjacency or by similarities in the landscape and in the type of built-up areas.
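The choice between close range and far range transfer learning can be pictured as a simple lookup from target zone to source zone. The sketch below is a hypothetical illustration in Python: the dictionary holds only the two source/target pairs named in the text (not the full list of 28 zones), zone "32T" is an arbitrary example, and the function name is an assumption.

```python
# Target zone -> source zone whose trained model is applied (far range).
# Only the two pairs mentioned in the text are listed here.
FAR_RANGE_SOURCES = {
    "37P": "37M",  # Ethiopia example
    "37U": "34U",  # Moscow example
}


def model_zone_for(target_zone, far_range_sources=FAR_RANGE_SOURCES):
    """Return the UTM grid zone whose model should predict target_zone.

    Zones without a far range entry fall back to their own model
    (close range transfer learning).
    """
    return far_range_sources.get(target_zone, target_zone)


print(model_zone_for("37P"))  # far range: model trained on another zone
print(model_zone_for("32T"))  # close range: the zone's own model
```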