Convolutional Neural Networks for Global Human Settlements Mapping from Sentinel-2 Satellite Imagery

Spatially consistent and up-to-date maps of human settlements are crucial for addressing policies related to urbanization and sustainability especially in the era of an increasingly urbanized world. The availability of open and free Sentinel-2 data of the Copernicus Earth Observation programme offers a new opportunity for wall-to-wall mapping of human settlements at a global scale. This paper presents a deep-learning-based framework for a fully automated extraction of built-up areas at a spatial resolution of 10 meters from a global composite of Sentinel-2 imagery. A multi-neuro modelling methodology, building on a simple Convolution Neural Networks architecture for pixel-wise image classification of built-up areas is developed. The deployment of the model on the global Sentinel-2 image composite provides the most detailed and complete map reporting about built-up areas for reference year 2018. The validation of the results with an independent reference dataset of building footprints covering 277 sites across the world, establishes the reliability of the built-up layer produced by the proposed framework and the model robustness. The results of this study contribute to cutting-edge research in the field of automated built-up areas mapping from remote sensing data and establish a new reference layer for the analysis of the spatial distribution of human settlements across the rural-urban continuum


Introduction
New ways to map and measure the built-up environment over large areas are critical to answering a wide range of research questions and to addressing policies related to urbanization and sustainability. This is particularly true in the era of an increasingly urbanized world [1]. Earth Observation (EO) has become a promising tool to provide up to date geospatial information on the status and dynamics of built-up areas and human settlements [2]. With the routine acquisition of satellite imagery and the availability of different satellite collections, several efforts have focused on mapping built-up areas at a global scale in the last decade. The most recent datasets include the Global Urban Footprint (GUF) with its 12 m product derived from TerraSAR-X imagery acquired in 2011-2013 [3]; the Global Human Settlement Layer (GHSL) with the 30 m multitemporal datasets derived from Landsat archives and showing the evolution of built-up areas in four epochs 1975,1990,2000 and 2014 [4], [5]; the World Settlement Footprint (WSF) with the 10 m resolution datasets based on Landsat-8 and Sentinel-1 sensors for reference year 2015 [6] and the FROM-GLC10 landcover map which includes a dedicated class for artificial surfaces derived from Sentinel-2 data acquired in 2017 [7]. Unlike the GUF which was generated from commercial imagery, all the other products were derived from free and open-access satellite image datasets, primarily from Landsat and the European Copernicus Sentinel missions. The advantages of these products are numerous and are mainly related to their free availability (absence of restrictions on their use for multiple types of applications) and most of all for the relatively low cost of their systematic update.
The methods used to produce these products and in general to extract built-up areas or artificial surfaces from remote sensing data include statistically derived indices and both supervised and unsupervised learning approaches. The first group of methods covers typically spectral indices [8]- [10], spectral mixture analysis [11], [12] and local/contextual image contrast/texture analysis [13], [14]. The latter includes regression analysis [15]- [17] and machine learning techniques, comprising mostly decision trees and random forests [18]- [20], support vector machines [7] and associative rule learning [4], [5].
Although some of these methods have proved to be suitable for large-area mapping of human settlements from satellite imagery, several limitations must be considered when using the information products generated from public satellite data for analytical purposes. These limitations are mostly related to accuracy, sensor-scale dependency, mapping of the extrema of the settlement density range, and the continuous monitoring of urban land cover changes. A non-exhaustive list follows below:  High false positive and false negative error rates from the automated detection of urban land cover classes when compared to non-urban classes (e.g. bare rocks, sand dunes, bare agricultural fields, river bank lines) due to the limited actual extent of built-up areas and the discontinuous surface they compose [21];  High disagreement on total land cover surface estimates of different sensor-derived products and high dependency on input sensor resolution of the urban land cover total estimates [22]- [24];  Unsatisfactory mapping of the extrema of the settlement spatial patterns at the very low-density rural areas and the very high-density urban areas [25]- [27];  Lack of a commonly-approved methodology and/or a machine-based automatic and reproducible solution which allows consistent and continuous monitoring of global urban land cover changes across time and across different sensors [2], [28], [29].
Compelling challenges and opportunities still lie ahead in high-resolution mapping and accurate classification of built-up areas over large areas. A key issue in this context is up-to-date and reliable information on the status and development of the human settlements. The availability of free and open remotely sensed big data streams has brought significant innovations in the field of automatic information extraction from satellite imagery. There is an increasing need to mine the large amount of earth observation data delivered in a free and open way by some of the new generation of satellites, especially the Sentinel missions. Operational since 2017, the Sentinel-2 mission of the European Copernicus programme provides a 5-day repeat cycle and a span of 13 spectral bands at a spatial resolution as high as 10 m. Sentinel-2 has great potential for mapping and monitoring built-up areas on a global scale [7], [30], [31]. Novel approaches for mapping human settlements are needed to deal with the increased spatial and temporal resolution of Sentinel-2.

Background
Advances in Deep Learning (DL) has led to leaps in the fields of computer vision, speech recognition and natural language processing. Whereas the task of built-up areas extraction from remote sensing data has a number of unique challenges, primarily related to the sensor and the features to be detected, it draws concepts and theories from computer vision, signal processing, statistics and machine learning [32]. Recent applications in remote sensing have used DL approaches for image classification tasks at which the purpose was the labeling of single pixels or regions of an image according to two or more classes [33]- [35]. DL methods have experimentally proved to outperform state-of-the-art machine learning methods (e.g. Support Vector Machines, Random Forests) [36] for the classification of both optical (hyperspectral and multispectral imagery) [35], [37], radar imagery [38], change detection [39] and for the extraction of different land cover types such as roads [40], crop types [34] and buildings [41]. Ball at al. (2017) [32] provide a comprehensive survey of image classification works in remote sensing that rely on DL approaches while the review paper of Ma et al., 2019 [42] on DL approaches covers nearly every application and technology in the field of remote sensing, ranging from preprocessing to image fusion, object detection and land cover mapping. A recent study suggested that deep learning is suitable for capturing the fine features of complex urban areas, and performs better than conventional thresholdbased methods, traditional supervised classifications and machine learning approaches [43]. In particular architectures building on Convolutional Neural Networks (CNNs) have become viable solutions for remote sensing image classification where traditional handcrafted feature engineering and domain-knowledge methods fail because of the limited generalization capabilities of the algorithms, the inter-class similarity, the intra-class variability as well as the changing image acquisition conditions [44], [45]. Differently from other DL approaches, deep CNNs were specifically designed for image classification, nevertheless they can be easily adapted to solve image segmentation problems by performing pixel-wise classification [46]. The hierarchical features of the input image data are modeled naturally by the CNN hierarchical structure, a fact that boosts the CNN performance in satellite image classification in general and facilitates the extraction of built-up features in particular. Another main advantage of CNN architectures over other established methods used for generating the global maps of built-up areas is their capacity to be integrated with mature frameworks of image pre-processing and standardization tools providing shift-invariant and contrast-invariant image local transforms [47].
Recognizing the inherent advantages of convolution operations in the characterization of the built-up environment in remote sensing data, a significant amount of works have recently explored the potential of diverse CNN architectures for mapping built-up areas from different types of sensors and different spatial resolutions: Synthetic Aperture Radar [38], high and very high spatial resolution imagery [43], [48], [49] and aerial imagery [50] (i.e. with a ground sampling distance equal to or even less than 1 m). However, little effort has been directed towards the challenge of large-scale built-up areas mapping with CNN from data of lower spatial resolution such as the ones powered by Sentinel-2. The works of [51], [52] represent a significant advancement in that direction. In particular, the framework of human settlements mapping proposed at 20 m by [52] is a step-forward towards a global scale model. Despite the demonstrated generalization and upscaling capabilities of their proposed framework, the authors failed to implement the CNN model in rural areas, which represent one of the main challenges in built-up areas mapping from satellite data at global scale.

Challenges addressed in this work
When deploying CNNs on large geographical areas or at global scale, four main issues should be taken into consideration:  The necessity to develop a model flexible enough to be applied to a global carpet of satellite data entailing the design of a sound training approach, a strategy for transfer learning and a plan for the consistency verification of the classification output.  The substantial amount of training data required for training complex models. In the case of builtup classification, the training samples should cover different building types (e.g. residential and industrial buildings of different sizes, colors and rotations) in various types of landscapes (e.g. dense urban areas, rural areas, desert landscapes, built-up areas mixed with neighborhood green spaces);  The increased need for computational processing resources, especially for adjusting and fine-tuning multiple and/or complex models;  The requirement for CNN architectures that are robust to noise in satellite imagery (e.g. presence of snow, clouds, haze) and to other seasonal effects. This feature would enable the generalization capacity of the models over large areas and the extraction of built-up areas with comparable efficacy along the urban-rural continuum.
In this work, we propose a Neural Computing framework tailored for global scale mapping human settlements at a spatial resolution of 10 m, from a cloud-free composite of Sentinel-2 data for reference year 2018. The output is a global map of built-up areas expressed in terms of a probability grid.
The main contributions of the work can be summarized as follows:  A new framework for pixel-wise large-scale classification of built-up areas from a Sentinel-2 image composite at a spatial resolution of 10 m has been developed, named GHS-S2Net (GHS stands for Global Human Settlements, S2 refers to the Sentinel-2 satellite);  A multi-neuro modelling methodology is proposed following the Universal Transverse Mercator (UTM) grid zones schema and a systematic sampling within each UTM grid zone;  Transfer learning is implemented following two separate approaches depending on the availability of reliable training data at the different UTM zones: a close range transfer learning within each UTM grid zone and a far range transfer learning from one UTM grid zone to neighboring data-poor zones. In this work, transfer learning does not obey the most dominant definition of using the weight values of pre-trained models from different domains. As a concept herein, it is closer to the verification of the generalization capacity of the models when the training and testing data do not necessarily follow similar statistical distribution;  An extensive assessment of the models output, that is based on an independent validation using fine-scale digital cartographic reference data reporting the footprint of every single building for 277 sites around the globe.
The new framework leverages the JRC Big Data Platform (JEODPP) [53] for the storage of the global input data and for optimized fast parallel processing using the high performance Graphical Processing Units (GPUs). This dedicated infrastructure allows tackling the challenges of large scale processing, boosting the CNN training, and enhancing the prediction accuracy through duly fine-tuning of the models.

Sentinel-2 cloud-free image composite
The input data for human settlements mapping over the entirety of the landmass (excluding Antarctica) consists in a global cloud-free image composite for reference year 2018 derived from Sentinel-2 satellite data of the European Copernicus Earth observation programme. Sentinel-2 mission offers a great potential for fine scale mapping and monitoring of built-up areas thanks to high spatial and temporal resolutions, with a five-day revisit time and decametric resolution [31]. However, the selection of the best available scenes, their download from the dedicated data hubs together with the requirements in terms of storage and computing resources pose restrictions for large-scale mapping. Pixel-based compositing is an approach to leverage the large volumes of available data, whilst effectively mitigating cloud and aerosol contamination as well as data gaps in the archive [54]. This method has been recognized for being a valuable tool for large area applications using high spatial resolution optical data [55]. Accordingly, the image composite was generated in and exported from Google Earth Engine [56]. The methodology used for the selection of the satellite imagery and for image compositing is based on a data driven approach which uses a summary statistic for aggregating the pixel time series (i.e. the 25th percentile). A detailed description of the workflow is presented in [57]. The output image composite consists of a global scale raster grid of four spectral bands derived from top of atmosphere Sentinel-2 image tiles (B2: Blue, B3: Green, B4: Red and B8: Visible and Near Infrared) with a spatial resolution of 10 m. It was produced and tiled following the UTM system with each tile having the projection of the UTM zone (UTM/WGS84 projection) to which it corresponds to. There are in total 615 grid zones with data covering mostly mainland and islands ( Figure  1). The full dataset has a total volume of 15 TB and is hosted on the Big Data platform of the Joint Research Centre (JEODPP). The raster data have been stored in 16-bit geotiff format. The data set can be freely accessed and downloaded from the Open Data Catalogue of the Joint Research Centre of the European Commission 1 [58].

Model input data: learning sets
A sensitive point regarding CNNs is the amount of training data required to properly adjust the network parameters. A large source of free and open access datasets describing built-up areas was collected with different levels of details, completeness, consistency and accuracy. Since the aim is to achieve a stable and at the same time detailed and accurate delineation of built-up areas, the most detailed datasets describing built-up areas were compiled from public sources: The Global Human Settlement Layer (GHSL_BU), the European Settlement Map (ESM_BU), the Facebook high resolution settlement data (FB_HRS) and the Microsoft building footprints (MS_BFP) described hereafter.
2.2.1. Global Human Settlement Layer built-up areas GHSL_BU was derived from automatic classification of Landsat 30 m-resolution data of the year 2014 as described in [59]. The method for mapping built-up areas from Landsat data at global scale builds on the Symbolic Machine Learning (SML) classifier which automatically generates inferential rules linking the image data to available high-abstraction semantic layers used as training sets [60]. The product is provided with a spatial resolution of 30 m. Despite the overall good performance in depicting built-up areas at global scale, the GHSL_BU suffers from under-detection problems in sparsely built-up areas and mainly in rural African landscapes.

European Settlement Map
ESM_BU is the 2 m resolution land cover class "built-up area" produced by the automatic classification of the Copernicus VHR_IMAGE_2015 collection which covers 39 European countries (EEA39) with various earth observation sensors. The built-up areas extraction has been achieved through supervised learning with the SML classifier along with textural and morphological features [61]. The ESM_BU is currently the most detailed map of built-up areas available for Europe. The main issue in this layer is the presence of false alarms, in particular over mountainous areas and sand beaches as well as the absence of cloud-free satellite data in some regions resulting in large data gaps observed in certain urban areas (e.g. United Kingdom (Manchester, Peterborough, Reading, Luton, Coventry) and Ireland (Dungarvan)).

Facebook high-resolution settlement data
The FB_HRS data used in the study are derived from the high resolution settlement grids produced by Facebook [62]. The data set was made available for public use in the frame of "Data for Good" Facebook program that supports international humanitarian efforts [63]. The settlement areas of FB_HRS were automatically delineated by a Convolutional Neural Network classifier working over sub-meter resolution optical satellite imagery and using fine-scale open-source training data as Open Street Map (OSM) [64]. The 30 meter spatial resolution derived data [63] have been used in the present study. At the time we compiled the FB_HRS data, 150 countries were covered by the FB_HRS including large parts of South America, Africa, Europe and Asia. According to the information available on a subset of 194 countries, the image data supporting the FB_HRS spatial delineation were collected in the time range from 2002 to 2017, with a temporal surface-weighted average centered in the year 2013. Based on our internal quality control procedure, the precision of these data was particularly remarkable in rural areas flagging (at a spatial resolution of 30 m) the presence of single isolated houses and small rural hamlets precisely. Commission errors were noticed occasionally in rural areas, especially in correspondence with dense forest patterns. The mapping of large urban areas as accounted by the FB_HRS data turned to be more problematic; in these areas, remarkable systematic omission errors were noticed.

Microsoft building footprint data
The MS_BFP 10m-resolution data derived from the work of the Microsoft map team and are available for public use in the OpenStreetMap community. The data were automatically extracted by the Open Source CNTK Unified Toolkit developed by Microsoft. CNTK and the ResNet34 with RefineNet up-sampling layers were applied to detect building footprints from the Bing imagery that may include VHR satellite and airborne sensors [65]. The MS_BFP data were made available in vector format at a nominal scale of 1:10.000, thus supporting a detailed rasterization at 1x1 m of spatial resolution successively aggregated to 10x10 m resolution used in this study. At the time we compiled the MS_BFP data, information about four countries was available: United States, Canada, Uganda and Tanzania. Despite the detailed representation of single buildings, the MS_BFP data suffers from omission errors referring to large industrial buildings and fewer errors related to over-detections of buildings in mountainous and agricultural areas. Table 1 gives an overview of the specific training sets used for adjusting the models with respect to the following characteristics: spatial resolution, coverage, source image collection date used for layer production, identified issues as well as the number of pixels (total and relative percentages) used as training samples. Figure 2 displays the selected information sources for training the models by geographic area.
Due to the overall quality and spatial detail of the training data and to the variability in both the spatial coverage and the type of issues associated with each dataset, a hierarchical process was implemented for selecting the best data available locally: the priority was given first to MS_BFP and ESM_BU which are the closest proxies to the built-up areas to be derived from 10 m resolution satellite data. They were followed by the FB_HRS and finally by the GHSL_BU, which is the least detailed representation of built-up areas.

GHS-S2Net building blocks
The purpose of the proposed CNN model named here GHS-S2Net is to perform pixel-wise classification of built-up areas at a spatial resolution of 10 m. The concept of "built-up area" applied here is consistent with the definition adopted in the framework of GHSL which is "the union of all the satellite data samples that corresponds to a roofed construction above ground which is intended or used for the shelter of humans, animals, things, the production of economic goods or the delivery of services" [66].
Pixel-wise grouping is equivalent to the standard image segmentation process, i.e. partitioning of the image into multiple segments corresponding to individual pixels or homogenous areas. GHS-S2Net architecture builds on the CNN configurations described in [67]. A schematic representation of the GHS-S2Net is visualized in Figure 3. The two major drivers that framed the design of this CNN model are explained below:  Firstly, given that the target to be recognized ranges in size from single residences until block of contiguous buildings, the model capacity should allow the collection and distillation of the fine information provided by either the single pixels or the small sized groups of pixels consisting of homogeneous characteristics. Unlike popular tasks for natural image segmentation and object localization where there exist sizeable image regions with common characteristics (colour, texture, connectivity, etc.), the size of the objects to be recognized herein varies from 10 m (the finest resolution associated with a single pixel) to some dozens of meters. Consequently, the contextual information that surrounds one pixel and accommodates the prominent features can be expressed by narrow image windows (patches) having a size of few pixels. An extensive experimentation specifically for Sentinel-2 imagery with respect to the optimal size of an image patch at which the convolution performs efficiently is presented in [67]. In the present study, an image patch of size 5×5 has been selected as input image to the CNN, whereas the convolution of the image is achieved through successive kernels of size 2x2 with stride 1x1. At this narrow representation and with the intention of avoiding losing essential information, no pooling layers have been employed to reduce further the spatial size.  Secondly, the motivation was to design a lightweight model that could serve adequately the chosen multi-modelling approach and allow several degrees of flexibility in terms of distributed computing. The total number of model parameters is 1,448,578 (1,447,042 trainable and 1,536 nontrainable), 95 times less than VGGNet [68] and 2.7 times less than GoogleNet [69] (indicative CNNs). While the number of 2D convolutional layers is limited to 4 layers and the number of flattened layers to 2, the number of parameters has been increased due to the high number of filters. Tests showed that the specific CNN topology can perform quite well even if the number of filters is smaller, yet we decided to keep the number of filters high in order for the model to capture very subtle details. This lightweight topology facilitates the algorithm execution across heterogeneous GPU modules throughout the prototyping and operational phase. Additionally, it enables smoothly the multi-modelling deployment at which a different model has been trained over every UTM zone, capturing more precisely the local characteristics and the variance along similar geographical regions.
The 2DCONV block as shown in Figure 3 comprises two successive stacks where 2×2 convolution takes place and the linear and the hyperbolic tangent activation functions (tanh) respectively transform the signals across the network layers. Although the rectifier activation function and its variants have been used widely in the various deep neural network architectures due to their robustness against the vanishing gradient problem [70] our experimentation indicated that by using a smaller number of neural network layers, the functional mapping via tanh activations captures better the complexity of the features with respect to the Sentinel-2 imagery. Besides, the tanh function is more suitable in the case of optimization with stochastic gradient descent where sigmoid function shows sharp damp gradients during backpropagation as well as gradient saturation [71]. The alternation with linear mappings results in a cost-effective solution in terms of computations. Speed-up of the training process and remedy to the effect of the internal covariate shift is provided through data batch normalization operations [72]; at each data batch, transformation is performed by keeping mean activation close to 0 and the activation standard deviation close to 1. A subsequent dropout regularization layer [73] has been used to prevent overfitting, with a ratio of 0.1 of neurons not considering at each update during the training phase.
The sigmoid function has been employed only for the last layer and maps the model output into the range [0,1], giving rise to the probability of a pixel to belong to the class built-up. We propose a two-stage training approach at which a single model per each UTM grid zone has been trained in accordance with the zones used for generating the Sentinel-2 image composite. This multi-modelling approach aims at capturing the variations in the Sentinel-2 data and the diverse characteristics of human settlements (in terms of size, shape, morphology and structure). Furthermore, rather than training a very complex single model that would need big volumes of representative data, the training of several relatively light CNNs facilitates the modelling of local features and distributes effectively the computational load into several machines by increasing significantly the total throughput. Each UTM grid zone covers an average area of 447,650 km 2 (area calculated in equal area projection). This type of data splitting is prone to containing various types of built-up areas and settlement patterns across heterogeneous landscapes even within the same UTM zone. Besides, the semantic classes of "built-up" and "non built-up" are unevenly distributed spatially and their frequencies are highly varying. The class "built-up" is very rare compared to the non-built-up class (See Supplementary material R1) (2% of the training samples (5x5 pixel blocks) represent built-up while 98% represent non built-up). To tackle this uneven distribution of training samples, each UTM grid zone was split into tiles of 100x100 km 2, which is consistent with the tile size of the Sentinel-2 granules (purple cells in Figure 4). The two stages are described below: 1) select systematically 50% of the ~100x100 km 2 tiles of the UTM grid zones for the model training (orange boxes in Figure 4); 2) consider all built-up patches (5x5 blocks of pixels of 10 m containing at least one built-up pixel) falling within the selected 100x100 km 2 tiles and randomly sample 60% of the non-built-up patches uniformly with respect to their frequency in the tile (checkerboard in Figure 4). The training of the models per UTM grid zones is done by grouping the built-up and non-built-up patches into minibatches of 200,000 samples (where the steps per epoch depend on the training size of each UTM zone) as a compromise between computational constraints and the need to converge to a global optimum. A special attention is given to UTM grid zones largely covered by water surfaces and no data in the Sentinel-2 image composite. In such cases, all the tiles of the valid data domain are considered in the training phase without applying any sampling approach. 2.3.2. Per-tile predictions As described previously, the CNN model consists of encoding layers solely, through which the information existing into image blocks of size 5 rows x 5 columns x 4 bands is multiplexed and transformed to a single value, denoting the probability of the central pixel of the 5 x 5 block to belong to the built-up class. The prediction phase has been performed with tiles of size 10,000 rows x 10,000 columns x 4 bands. A sliding window of size 5 x 5 pixels has been applied to produce the 5 rows x 5 columns x 4 bands input blocks. Constant-value image padding has been also implemented in order for the pixels at the image border to be correctly inserted into the 5 x 5 x 4 input blocks. The predictions of the model are given in vector format having exactly the same size as the rows and columns product of the original input tile.

Close Range and Far-range transfer learning
Transfer learning is a paradigm in DL to solve a target problem by reusing the learning with minor modifications from a different but related source problem. Qin et al., [74] review transfer learning in remote sensing applications and categorize the methods into four families depending on what is being transferred:  instance-based transfer which uses partial training samples in the source domain to improve the performance of the model of the target domain [75];  feature representation-based transfer [76] which assists the target domain classifier to learn a more effective feature expression from the source domain and improve its performance;  relational knowledge transfer [77] where knowledge among the data in the source domain is transferred to the target domain;  parameter-based transfer [78] considers that the source domain classifier and target domain classifier have the same optimal parameters, which can be found from the source domain classifier and then used for the target domain classifier.
Another more general classification of transfer learning methods considers the availability of labeled data and categorizes the methods into three sub-settings [79]: inductive transfer learning, when labeled data in the target domain are available; transductive transfer learning, when solely labeled data in the source domain are available; and unsupervised transfer learning, when labeled data do not exist in either the source or target domain.
One of the goals of this work is to address the following aspects of the pixel-wise classification: the computation time for training a big number of models for every UTM grid zone and the availability and precision of the training data. Parameter-based transfer learning was adopted in a transductive transfer learning framework tailored to the training strategy described in the section 2.3.1. This includes a close range and a far range transductive transfer of model parameters ( Figure 5):  The close range transfer learning consists in training the model with a subset of the input data in a given UTM grid zone (following the method described in section 2.3.1) and applying it to all the 100x100 km 2 tiles falling within the same UTM grid zone. This approach allows speeding up the training process of 615 different models and producing the predictions of a total of 30,000 tiles. It also helps overcoming overfitting issues;  The far range transfer learning consists in training the model with detailed samples such as MS_BFP and FB_HRS in a given UTM grid zone and applying it to a neighboring zone or to zones with similar landscape and built-up typology, at which labeled samples are scarce or zones where only GHSL_BU training datasets are available. This approach allows refining the predictions and testing the generalization capabilities of the GHS-S2Net model. 2.4. Processing infrastructure The computing-intensive workflow was executed on the JEODPP infrastructure. The JEODPP is a versatile platform with multi-petabyte scale storage (14 PiB currently) co-located with computational capabilities [53]. The platform is based on commodity hardware and open-source software stack including the EOS storage technology developed by the European Organization for Nuclear Research (CERN) [80]. The platform has been recently upgraded with a series of GPU nodes to speed-up machine/deep learning applications. Currently, there are 5 GPU nodes equipped with different types of GPU modules and memory per module. For the training of the GHS-S2Net models, as well as for the prediction phase, 2 GPU nodes were used: the first with 4 Quadro RTX 6000 with 24.2 GB of memory and the second with 2 Tesla V100-PCIE with 32.5 GB of memory. Dedicated Docker images integrating the necessary deep learning packages were created to run all the experiments.

Training phase of CNN models per UTM grid zone
3.1.1. Hyper-parameters tuning During the training phase of the model per each UTM grid zone, 10% of the training data was reserved for validation in order for the CNNs to prevent over-fitting. The input Sentinel-2 composite data were rescaled in the range [0,1]. The number of epochs to train the models was set to 25 iterations. The weights were initialized based on uniform distribution with bounds [-0.1065, 0.1065]. Finally, the Adam stochastic optimization with a learning rate of 0.0001 has been used to optimize the binary cross-entropy, log loss function: where N is the number of training samples, y is the vector of the real target values of the training set in binary coding, and ̂is the vector of the model responses in the continuous range [0, 1]. The cross-entropy loss has fast convergence rate and is numerically stable when coupled with sigmoid normalization [81].
3.1.2. Performance evaluation For evaluating the classification performance of the models during the training and prevent overfitting, a fraction representing 10% of the training data was used for validation. Figure 6 shows the progress of the average loss curves produced by 485 GHS-S2Net models during their training and validation which last 25 epochs. Every model corresponds to one UTM grid zone, resulting in 485 out of 615 grid zones that refer to landmass with presence of built-up according to the learning sets. The learning curves show that both the average training loss (green curve) and validation loss (red curve) decrease rapidly to a point of stability with a convergence around 12 epochs. The fact that the gap between the two curves is very small even for the first 5 iterations and that it completely disappears around 12 iterations after, shows that the size of the training sets, selected following the two-stage training approach, is optimal and that the models have good generalization capacity [82].  Figure 7. The reported elapsed time refers to every UTM grid zone predominantly covered by land (204 grid zones) and those zones predominantly covered by water (281 grid zones). In inland tiles, more training samples are usually fed to the GHS-S2Net while in water tiles the number of training samples is smaller. The stacked bar plots show that the average training time is around 3,600 seconds while the prediction time is around 15,500 seconds. For inland zones, the average training time is 3,900 seconds and the prediction time is 16,400 seconds while for water zones, the processing time is shorter with an average training time of 3,100 seconds and prediction time of 15,000 seconds. These results show that the GHS-S2Net-based multi-modelling approach scales seamlessly in a distributed multi-GPU platform. For the processing at a global scale, our main constraint was the limited amount of concurrently available GPUs: we employed 6 GPU modules for the training phase and 2 modules for the prediction phase that were available at the time of deployment. Despite these limitations, we managed to scale up the GHS-S2Net-based multi-modelling approach and achieved to process a data set having global coverage at 10 meter spatial resolution thanks to: i) an efficient partitioning of the processing per UTM grid zone, 2) the two-stage training approach with a subsampling of non-built-up patches within the selected tiles containing training samples, and 3) the optimal size of input data (i.e. 100 x100 km tiles) used for both the training and prediction. Increased GPU capacities and activation of early stopping during the training in order to reduce the number of iterations (epochs) when the loss function stops improving, can significantly reduce both the training and the prediction time of the GHS-S2Net model.  2L  8U  2W  25M  20T  55V  14P  54G  28W  32W  19T  18R  42Q  29V  33W  35H  26W  56V  15P  41K  6V  22T 22V  29S  58G  32K  47M  32L 30V  21T  58U  20F  6W  56U  57M  7V  26S  29T 29N  18G  18K  21U  36S  34H  50K  57L  28S  1K  58N  58M  30U  18W  52R 46P  50L  2Q  22N  21P  46N  39K  32M  55T  36J  49P  55U  17Q  20P  27R  58P  19Q  51L  54U  56M  39P 56H  53S 18Q  44P  23J  53T  55L 40Q  56J  25L  55M  15R  18S UTM grid zone

Training_time Prediction_time
Processing time per UTM grid zone water

Qualitative assessment of the models predictions
The results of the GHS-S2Net implementation on the Sentinel-2 global mosaic were assessed visually. Compared to the training sets, the results of built-up detection showed a significant reduction of both commission and omission errors and other artifacts that were observed in the training sets (see section 2.2). In addition, GHS-S2Net resulted in a refined mapping of built-up areas and open spaces within urban areas and most importantly the detection of new settlements, never annotated so far in the training sets or identified in any other global scale dataset. Figure 8, illustrates some examples of each type of improvement obtained with the GHS-S2Net models. Figure 8, Figure 9 and Figure 10 show, for selected cities, the enhanced built-up areas detection, represented in the form of continuous-range outputs (probability), in comparison to the best available training sets. The most notable improvements relate to the detection of built-up areas which are omitted from the training sets, under the assumption that the initial purpose of these data sets was to map completely the contiguous areas they cover. These omissions are either due to lack of data or to flaws and gaps in the training sets themselves given that they were all extracted through automatic classification of satellite imagery. In the case of FB_HRS (Figure 8a

Validation of the model predictions and assessment of generalization performance
Two approaches were implemented for the validation of the GHS-S2Net output that are based on comparison with independent cartographic data of building footprints, not employed for the training of the models:  Continuous assessment: by testing the GHS-S2Net output as predictor of the built-up densities at the spatial resolution of 10m through least-square linear regression;  Binary assessment: by evaluating the contingency table between the binarized outputs of GHS-S2net after the application of a probability cut-off value, and the binarized reference data used as a "ground-truth". The strength of the linear relation between the automatically-generated built-up probabilities and the reference data is assessed through the Pearson correlation coefficient (r). The gain factor (slope) allows the user to model, retro-fit and compare the results obtained from the GHS-S2Net model for the different AOIs. In addition, the slope of the regression is an indicator of the optimal threshold for translating the built-up probabilities to binary values for the pixel-based accuracy assessment.
The results of the regression analysis at 10 m for all AOI sites showed an average correlation coefficient r of 0.67 and an average slope of 0.52 ( Figure 11). The average correlation coefficient shows that the output probabilities from GHS-S2Net models are capable of capturing around 67% of the structural variability in built-up areas. The lowest correlation coefficients were observed for AOIs covering complete counties in the United States where there are a lot of building sizes below 100 m 2 (which is the limit of the detectability of the Sentinel-2 sensor) and the built-up density is very low, less than 0.5%. This is for instance the case of the Matanuska-Susitna Borough AOI which is a borough located in the state of Alaska, covering an area 9492.46 km 2 with a built-up density of 0.1% and an average size of buildings of 140 m 2 (Supplementary material 2). The output probabilities of the GHS-S2Net models seem to better capture building densities in urban areas and high density AOIs where the correlation coefficients were greater than 0.6. This is the case for example of the AOI covering San Francisco city with an area of 194 km 2 and a building density of 26.4%.
It is also worth noting that the gain factor (slope) translating the built-up probabilities as derived from Sentinel-2 data to built-up surface densities as derived from the reference cartographic data is almost constant. The slope has an average of 0.2 in low density AOIs, in particular those covering full counties in the United States (e.g. San Juan County). In high-density AOIs covering cities, the slope (bias) is higher (e.g. city of Rome where the slope is close to 0.8) with an average around 0.54.
According to these findings it is not straightforward to define one general-purpose threshold to binarize the output of the GHS_S2Net models into two classes 'built-up' and 'non-built-up'. A threshold of 0.2 would then be good compromise targeting large areas including scattered settlement patterns, in particular rural areas, while a more conservative threshold of 0.5 would be more suitable for areas largely dominated by high-density built-up areas (i.e. city centers). Following this finding, both thresholds were applied to the outputs of the GHS-S2Net models for assessing the quality of the classifications following a pixel-wise accuracy method.

Binary accuracy assessment
The thresholds 0.2 and 0.5 identified in the previous regression analysis were used to binarize the probabilistic output as required by the pixel-wise binary accuracy assessment at the spatial resolution of the sensor. Standard accuracy and error metrics derived from the confusion matrix were calculated for the binary results obtained with the two thresholds. Given the lack of a single universally accepted measure of agreement, we use a combination of two main performance metrics to give a complete picture of the performance of the GHS-S2Net models: the balanced accuracy and the Kappa coefficient that were introduced to the remote sensing community and recommended by Congalton, 2011 [83] . The Balanced Accuracy and Kappa are measures of classification accuracy, the former providing information about the rate of correctly classified pixels in an unbalanced setting where non-built-up pixels are predominant compared to built-up pixels. The latter compensates for random chance in the pixels assignment.
The results of the per-pixel accuracy assessment with the two binary outputs are summarized in Figure 12 and disaggregated per continent. The figure shows the average and standard deviations of the Balanced Accuracy and Kappa coefficients per binary output and per continent. The 277 AOI were grouped by continent to evidence major improvements especially in areas where previous global products failed to produce satisfactory results. Overall, both binary classifications produce good results with an average Balanced Accuracy greater than 0.7 and an average Kappa greater than 0.5. However, when compared to the binary outputs derived with the 0.5 probability threshold, the classification with a less conservative threshold of 0.2 produces better agreement with the reference data, consistently for all continents. The best results in the least conservative classification outputs (threshold of 0.2) were obtained in Oceania an Asia with an average Balanced Accuracy of 0.91, followed by North America and Africa where the mean Balanced Accuracies were equal to 0.86 and 0.85 respectively.
The results of the per-pixel accuracy assessment, in particular those obtained by applying a low threshold to the probability outputs, constitute a strong evidence of the modeling power of the GHS-S2Net and the reliability of the outputs. They are also a confirmation of the merit of the new classification framework for identifying settlements in challenging landscapes such as in Africa and Asia. They also suggest that for the generation of a global binary classification from the probabilistic output of the models, a low probability threshold is recommended, in particular if the purpose is to capture all the scattered settlements in rural landscapes such as in Africa. In this particular context, the binary outputs obtained with a threshold of 0.2 outperform significantly those derived from the conservative threshold.

Comparison between the results of close range and far range transfer learning
When computing the GHS-S2Net predictions at the global scale, the majority of the UTM grid zones and in particular the 100 x100 km 2 tiles were processed with the close range transfer learning. However, to allay the scarcity and quality issues in the training dataset, 28 UTM grid zones were classified according to the far range transfer learning and the outputs were compared to those obtained by the direct close range transfer learning. Figure 13a illustrates the differences between close range (middle figure) and far range transfer learning (bottom figure) in areas suffering from the lack of training samples (e.g. in Ethiopia). It shows the capacity of the far range transfer learning in discovering undetected built-up features in UTM grid zone 37P, on the basis of the parameters of the model trained in the neighboring UTM grid zone 37M. In such a situation, the close range transfer learning was less effective in identifying those scattered settlements due to insufficient training samples in the UTM grid zone 37P. Moscow is one of the cities where detailed building footprints were available in the reference database used in the validation exercise. The availability of "ground-truth" data enabled to conduct a quantitative binary accuracy assessment of the results of far range transfer learning in comparison to those obtained with the close range transfer learning. The results are illustrated in Table 2 for the binary outputs with cut-off values of 0.2 and 0.5. They show higher overall and balanced accuracy values resulting from the application of far range transfer leaning. These results are an additional evidence of the enhanced mapping capabilities of a well-designed far range transfer learning approach deployed in this work. The encouraging results were determinant for expanding the application of far range transfer learning which was finally implemented on a total of 28 UTM grid zones. The selection of source and target UTM grid zones was mainly driven by spatial adjacency or similarities in the landscape and in the type of built-up areas.

Discussion and future work
In this paper, we presented a novel end-to-end framework for large-scale pixel-wise classification of builtup areas from high-resolution satellite imagery. The developed multi-model approach designated by GHS-S2Net builds on a relatively simple CNN architecture. The implementation of the models on a global cloudfree Sentinel-2 image composite provides the most detailed and complete map reporting about built-up areas in the form of probability outputs (i.e. probability of a pixel to belong to the class 'built-up'). The results confirm the high generalization capacity of the model and its ability to not only detect new built-up areas in difficult landscapes (i.e. in Africa and Asia) without site specific training sets, but also its potential to mitigate commission errors in the best available training sets reporting about built-up areas across the globe.
The implementation of the developed framework for large classification of human settlements was achieved thanks to three main building blocks:  The multi-neuro modelling methodology, which follows the UTM grid zones schema and the systematic sampling within each UTM grid zone. This approach of training sub-models at global scale allows decomposing the optimization phase into smaller tasks, which are then solved in parallel. The adopted sampling approach meets the three following criteria: class balance, diversity,

b) a)
and representativeness. It shows to be suitable for an optimal learning of the models at a global scale without compromising performance;  The transfer learning including both the close range and the far range transfer learning. Both approaches benefit from parameter-based transfer methods where the optimal parameters found in the source domain classifier are used for the target domain. The novelty of the approach implemented in the paper was the use of the close range transfer learning within the same UTM grid zone in a way to alleviate the computational burden and avoid overfitting issues. The far range transfer learning leverages the optimal parameters found when training the models with detailed and high quality training sets in a given UTM grid zone and then applying them to neighboring zones subject to training data scarcity. The far range transfer learning allowed allaying the scarcity and quality issues in the training sets while achieving outstanding performance in the reduction of commission and omission errors found in the best available data and in the refinement of built-up areas detection;  The deployment of the high-throughput processing, including data preparation, learning and inference on the multi-petabyte scale JEODPP platform. The big data multi GPU platform enables: i) the efficient storage of the large volume of input satellite data (15 TB) and the output (1.5 TB) maps encoded in 16 bits, ii) the parallel training of the models on an heterogeneous cluster of GPUs, and the iii) optimal load balance in terms of data retrieval and processing from and to the distributed system due to the efficient co-location of the data with the processing units.
The validation of the results with an independent reference dataset of building footprints covering 277 sites across the world, establishes the reliability of the built-up layer produced by the GHS-S2Net approach and the model robustness against both the variable conditions in the satellite imagery and the heterogeneity in the landscapes and built-up characteristics. The most noticeable achievement is the capacity of the model to classify built-up areas in remote areas (e.g. in Africa and in Asia), reported in none of the global products (i.e. GUF, WSF, FROM GLC10). Another significant result is the strong relationship between the output probabilities and the building densities suggesting that the model outputs can be used as proxy measures for building densities without additional calibration or modeling. Despite the unprecedented results obtained by the proposed approach on an extremely challenging dataset in terms of spatial coverage, resolution and spectral variability, some challenges need to be considered, especially if the aim is to regularly update the built-up layer for continuous monitoring of human settlements with Copernicus Sentinel-2 data. The challenges pertain to methodological choices when designing the model and during its scaling to the classification of the global composite: -The choice of patch size: in general, assessment of CNN accuracy indicates that using larger patch sizes yields higher accuracies because the network is able to learn more contextual features. In the case of the Sentinel-2 pixel-based classification, the experiments performed by [67] on Sentinel-2 data showed that larger patch sizes (e.g.15x15) did not yet yield significant improvement in the model accuracy.
In this work, we tested a 10x10 patch size resulting in a deeper network topology, yet the loss function did not improve during the training phase whereas the prediction accuracy worsened. -The far range transfer learning: the strategy for implementing the far range transfer learning was based on criteria related to spatial adjacency of UTM grid zones or similarities in the landscape and in the type of built-up areas. The potential of this approach for mitigating problems in the training data and for deriving fine-grained classification outputs was clearly demonstrated in the classification results. Nevertheless, the added-value of this approach was not fully exploited in the context of this work. Additional work should focus on the analysis of spatial patterns of landscape features and typologies of built-up areas and their influence on the outputs of the classification with GHS-S2Net. The ultimate goal is to unveil the underlying rules and associations for designing a more systematic approach to identify the source and the target UTM grid zones candidate for the far range transfer learning.
-The variable quality of the training data: despite their outstanding learning capability, the lack of accurate training data might limit the applicability of CNN models in realistic remote-sensing contexts [84]. For our global scale application, the strategy was to collect the best publicly available training data and reporting about built-up areas. The higher the spatial resolution of the training data, the more detailed is the output of the classification. Ideally, the spatial resolution of the input training data should be equal or better to that of the input Sentinel-2 imagery. As described in section 2.2, the reference data sources have variable spatial resolutions. In addition, the trustworthiness of samples is highly variable across the different sources but also within the same reference data source. The lack of consistency in the training data produces outputs with variable qualities depending on the input data used for training the models. This was reflected by the results of the validation when disaggregated per continent. One approach to deal with imperfect training data was to use the far range transfer learning. However, this approach has a limited applicability at global scale since it supposes that the target UTM grid zones have similar characteristics (in terms of landscape and types of built-up areas) with the source zones. Another approach is to use a two-step training approach in which the models are first initialized by using a large amount of possibly inaccurate reference data, and then refined on a small amount of accurately labeled data, similarly to the method developed in Maggiori et al., [84]. In the context of our large-scale classification, it is perfectly reasonable to use the output produced by the GHS-S2Net to train a new model. The use of high quality and consistent outputs produced for the reference year 2018 by the application of the GHS-S2Net model at global scale is a key for frequent updates of built-up layers from Sentinel-2 Copernicus data and for continuous monitoring of built-up areas.

Conflicts of interest/Competing interests:
The authors have no competing interests.

Availability of data and material:
The final map of built-up areas has been uploaded in Google Earth Engine for visualization. The values for the probabilities have been rescaled to 8 bits in the range 0 -100 with no data values set to 255. https://code.earthengine.google.com/6a1457205bd295a44902a6c2eb266204?hideCode=true