Introduction

Impervious surfaces are usually defined as the entirety of impermeable surfaces, such as roads, buildings, parking lots, and other urban infrastructure, through which water cannot infiltrate into the ground (Sun et al. 2011). Urbanization increases impervious surface cover, which in turn exacerbates urban environmental problems such as intensified urban heat islands (Ma et al. 2016), surface runoff (Sun et al. 2014), water contamination (Kim et al. 2016), and air pollution (Touchaei et al. 2016). With rapid urbanization worldwide, these environmental concerns have triggered a surge of interest in impervious surface studies.

Satellite imagery with various spatial and temporal resolutions has been widely employed to map impervious surfaces. Medium- and low-spatial-resolution images, such as Landsat and MODIS data, offer rich spectral information and high temporal resolution, which makes them suitable for large-scale impervious surface mapping (Xu et al. 2018a, b; Zhang et al. 2018). However, mixed pixels complicate the extraction of impervious surfaces in large-area mapping. High-spatial-resolution images provide detailed land-cover and land-use information, but the spectral similarity of different objects and the shadows of tall buildings or large trees limit impervious surface extraction (Guo et al. 2014). Hyper-spectral images alleviate the problems of spectral similarity between different objects and spectral heterogeneity within the same land class, but their small coverage and high cost obstruct their application (Weber et al. 2018). SAR imagery captures land information regardless of cloud cover and is helpful for extracting impervious surfaces beneath large tree crowns (Guo et al. 2014); however, its coherent speckle noise remains a significant problem for impervious surface extraction. Single-source imagery therefore has various restrictions on urban impervious surface mapping. More recently, the integration of multiple datasets from different image acquisition mechanisms has been considered a promising way to address these uncertainties (Xu et al. 2017), including medium–low-spatial-resolution images with high-spatial-resolution images (Parece and Campbell 2013a, b), optical images with SAR images (Zhang et al. 2014), and high-spatial-resolution images with light detection and ranging (LiDAR) data (Im et al. 2012). In particular, LiDAR data can improve impervious surface extraction by providing height information that distinguishes between objects with similar spectral characteristics (Im et al. 2012). For example, although buildings, roads, and bare soil often have similar spectral features, their height differences are large, so LiDAR height information helps separate spectrally similar objects. Moreover, building roofs are flatter than tree crowns, so the local variance of LiDAR heights helps distinguish buildings from trees, as the sketch below illustrates.
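As a minimal illustration of this height-variance cue, the following sketch computes a local variance map from an nDSM raster; the file name and the 5 × 5 window size are assumptions for illustration, not values from this study. Flat roofs yield low local variance, tree crowns high variance.

```python
import numpy as np
from scipy.ndimage import generic_filter

ndsm = np.load("ndsm.npy")                         # above-ground heights (placeholder file)
height_var = generic_filter(ndsm, np.var, size=5)  # local 5x5 height variance per pixel
flat_mask = height_var < 0.5                       # low variance suggests roofs (assumed threshold)
```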

Various approaches have been developed to extract and quantify urban impervious surfaces from satellite images (Sun et al. 2017). The index analysis approach is usually applied to large-scale mapping and estimation of impervious surfaces; different indices demand different image pre-processing, bands, and acquisition times, and index analysis is sensitive to thresholds (Sun et al. 2017). Regression models are also applied to large-scale estimation of impervious surfaces from different remote sensing images (Akpona et al. 2018). The spectral mixture analysis method addresses the problem of mixed pixels, but it is mostly used with medium–low-spatial-resolution images (Wang et al. 2018). Decision tree methods, including the CART algorithm and rule-based decision methods, are also commonly applied to impervious surface extraction (Xu 2013). However, the decision tree is a weak learner that is sensitive to data noise and errors in the training samples, and rule-based methods are uncertain because of their varying index thresholds. Machine learning classifiers have been widely applied to impervious surface extraction, including the artificial neural network (Hu and Weng 2009), support vector machine (Sun et al. 2011), and random forest (Xu et al. 2018a, b). These approaches perform well; however, as shallow learning methods, they train on raw images and manually extracted features, which often contain superfluous and redundant information. More recently, deep learning (Hinton and Salakhutdinov 2006; LeCun et al. 2015) has become a hot topic in many research areas, including urban remote sensing. As a typical deep learning model, convolutional neural networks (CNNs) apply a set of trainable filters and local neighbourhood pooling operations to raw images, yielding a hierarchy of increasingly complex features (Ji et al. 2013). Compared with other neural networks, CNNs take advantage of weight sharing and local connections, which reduce the total number of trainable parameters and lead to more efficient training and more effective pattern recognition (Chen et al. 2016a). Another benefit of CNNs is pooling, which yields features that are slightly invariant to translation and rotation, a desirable property for natural signals (Längkvist et al. 2016). CNNs have therefore demonstrated excellent performance in image classification (Maggiori et al. 2016; Marmanis et al. 2016; Chen et al. 2016b; Scott et al. 2017; Kussul et al. 2017) and target recognition (Sevo and Avramovic 2016; Ding et al. 2016; Cheng et al. 2016; Zhang et al. 2017). However, most of this research used one-dimensional (1D) or two-dimensional (2D) CNNs for image classification. Three-dimensional (3D) CNNs, by virtue of their 3D convolution operation, can simultaneously model spatial, textural, spectral, and other information. At present, 3D CNNs are beginning to be applied to videos and volumetric images (Ji et al. 2013; Chen et al. 2016b; Tran et al. 2015), but their performance in extracting urban impervious surfaces from satellite imagery has not yet been demonstrated.

The objective of this study is to explore, for the first time, the potential of 3D CNNs in extracting urban impervious surfaces from high-resolution (HR) imagery. To test their effectiveness, the results are compared with the classification outputs of the commonly applied support vector machine (SVM) and 2D CNN methods. We also illustrate the influence of different parameters of the proposed 3D CNN model on impervious surface extraction. Two HR datasets are utilized to examine the effects of single-source (WorldView-2, WV-2) and multi-source (WV-2 and LiDAR) inputs in this study.

Methodology

3D Convolutional Neural Networks

CNNs are multi-stage feed-forward neural networks that hold state-of-the-art performance in remote sensing. Although CNNs were proposed many years ago, only in recent years has it become possible to train and deploy large CNNs in remote sensing, thanks to computational progress such as high-performance GPUs, rectified linear units (ReLU) that greatly speed up training, and dropout and data augmentation techniques that reduce overfitting (Chen et al. 2016a). In this study, we perform 3D convolutions with a 3D kernel in the CNNs to compute various features from the HR WV-2 and LiDAR data.

Typically, the 3D CNN layers consist of convolutional, nonlinear, and pooling operators. The input and output of each layer are called feature maps. Given an input image \( x \in R^{m \times m \times c \times d} \) with number of lines (\( m \)), columns (\( m \)), channels (\( c \)), and depth (\( d \)), the output image \( y \in R^{m' \times m' \times c' \times t} \) is composed of a number of lines (\( m' \)), columns (\( m' \)), output features (\( t \)), and depth (\( c' \)). The convolution of the input \( x \) with a 3D kernel \( w \in R^{n \times n \times c \times t} \) is calculated as (Ji et al. 2013):

$$ y_{i'j'r'}^{lh} = f\left( b_{lh} + \sum_{g} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{r=1}^{c} W_{ijr}^{lhg} \times X_{S_{n}(i') + i - P_{m},\; S_{n}(j') + j - P_{m},\; r}^{g(l-1)} \right) $$
(1)

where \( y_{i'j'r'}^{lh} \) is the neuron at position \( (i', j', r') \) of the \( h \)th feature map in the \( l \)th layer; \( b_{lh} \) is the bias of the \( h \)th feature map in the \( l \)th layer; \( n \) is the height and width of the spatial convolution kernel; \( c \) is the size of the 3D kernel along the spectral and elevation dimension (equal to the number of channels of the input image); \( g \) indexes the feature maps in the \( (l-1) \)th layer connected to the current (\( h \)th) feature map; \( W_{ijr}^{lhg} \) is the weight at position \( (i, j, r) \) connected to the \( g \)th feature map; \( P_{m} \) is the padding of the input image; \( S_{n} \) is the sub-sampling stride of the output image; \( X_{S_{n}(i') + i - P_{m},\; S_{n}(j') + j - P_{m},\; r}^{g(l-1)} \) is the neuron in the \( (l-1) \)th layer connected to the neuron \( y_{i'j'r'}^{lh} \); and \( f(\cdot) \) is a nonlinear activation function, typically the hyperbolic tangent or the rectified linear unit (ReLU). Here, the ReLU activation function (Eq. 2) (Ding et al. 2016) is used, which achieves the best CNN performance under supervised training on large labelled datasets.

$$ y_{i,j,r} = \max \left( 0, x_{i,j,r} \right) $$
(2)

The pooling layer decreases the resolution of the output features to make them less sensitive to input shifts and distortions. We adopt max-pooling in this study: it computes the maximum response of each feature channel in a \( p \times p \) sub-window, acting as a sub-sampling operation. The max-pooling layer can be written as (Vedaldi and Lenc 2015):

$$ y_{i',j',r} = \max_{1 \le i \le p,\, 1 \le j \le p} \left( x_{S_{n}(i'-1) + i - P_{m},\; S_{n}(j'-1) + j - P_{m},\; r} \right) $$
(3)
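To make these operators concrete, the following minimal PyTorch sketch chains a 3D convolution (Eq. 1), ReLU (Eq. 2), and max-pooling (Eq. 3) on one random input patch; the patch and kernel sizes here are illustrative assumptions, not the optimal values selected later.

```python
import torch
import torch.nn as nn

m, c = 25, 10                   # patch size and band count (assumed for illustration)
x = torch.randn(1, 1, c, m, m)  # (batch, feature maps, bands, height, width)

conv = nn.Conv3d(1, 50, kernel_size=(2, 8, 8))    # 3D kernel: n = 8, depth 2;
                                                  # P_m = 0, S_n = 1 (Eq. 1)
relu = nn.ReLU()                                  # f(x) = max(0, x) (Eq. 2)
pool = nn.MaxPool3d((1, 4, 4), stride=(1, 2, 2))  # p = 4 sub-window, S_n = 2 (Eq. 3)

y = pool(relu(conv(x)))
print(y.shape)                  # torch.Size([1, 50, 9, 8, 8]): 50 3D feature maps
```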

Architecture of 3D CNNs for Image Classification

The architecture of the 3D CNNs for extracting impervious surfaces is depicted in Fig. 1. It is composed of one input layer, multiple convolutional layers, multiple max-pooling layers, and one softmax-loss layer (output layer). The size of the input layer is \( m \times m \times c \times d \), where \( c = 10 \) is the number of input image channels (eight WV-2-fused multi-spectral bands, one panchromatic WV-2 band, and the LiDAR-derived nDSM band). The number of CNN convolutional layers \( L \) is designed for extracting features from the image. For each CNN layer, the size of the convolutional kernel is \( n_{l} \times n_{l} \times d_{l} \) (\( l = 2,3,4, \ldots ,L \)), where \( d_{l} \) is the depth (\( d \)) of the convolutional kernel in the \( l \)th layer, and the size of the sub-window in the max-pooling layer is \( p_{l} \times p_{l} \) (\( l = 2,3,4, \ldots ,L \)). \( t_{l} \) is the number of output feature maps in the \( l \)th layer. The optimal selection of the CNN parameters (\( m, n, p, L, t \)) will be discussed in Sect. 2.3. ReLU activations are used in each convolutional layer to reduce training time, and dropout (rate 0.5) is employed after each max-pooling layer to reduce overfitting during network training (Chen et al. 2016a). During the architecture construction of the 3D CNNs, the padding parameter \( P_{m} \) and the stride parameter \( S_{n} \) are set to \( P_{m} = 0 \) and \( S_{n} = 1 \) in the convolutional layers, and \( P_{m} = 0 \) and \( S_{n} = 2 \) in the sub-sampling layers. Following Vedaldi and Lenc (2015), the Adam optimizer is used, the learning rate is set to 0.0001, the number of training epochs is set to 100, and the mini-batch size for each epoch is set to 8. After the CNN convolutions, a full-connection (FC) layer is built, which is similar to a convolutional layer except that the size of its output map must be \( 1 \times K \), where \( K \) is the number of land-cover classes. A sketch of this pipeline follows.
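The sketch below assembles the Fig. 1 pipeline under the stated settings (dropout 0.5, \( P_{m} = 0 \), \( S_{n} = 1 \) for convolutions, \( S_{n} = 2 \) for pooling). It is a minimal illustration, not the authors' code; `nn.LazyLinear` is a modern PyTorch convenience (not available in PyTorch 0.4) used here only to infer the FC input size.

```python
import torch
import torch.nn as nn

class CNN3D(nn.Module):
    """Fig. 1 pipeline: L x (3D conv -> ReLU -> max-pool -> dropout), then FC."""
    def __init__(self, specs, num_classes, dropout=0.5):
        # specs: one (t_l, (d_l, n_l, n_l), (1, p_l, p_l)) tuple per CNN layer
        super().__init__()
        blocks, in_maps = [], 1
        for t_l, kernel, pool in specs:
            blocks += [nn.Conv3d(in_maps, t_l, kernel),       # P_m = 0, S_n = 1
                       nn.ReLU(),                             # Eq. (2)
                       nn.MaxPool3d(pool, stride=(1, 2, 2)),  # Eq. (3), S_n = 2
                       nn.Dropout(dropout)]                   # dropout rate 0.5
            in_maps = t_l
        self.features = nn.Sequential(*blocks)
        self.fc = nn.LazyLinear(num_classes)   # FC layer with a 1 x K output

    def forward(self, x):                      # x: (B, 1, c, m, m)
        return self.fc(self.features(x).flatten(1))   # class scores (logits)
```

Training would pair this module with the Adam optimizer (learning rate 0.0001, mini-batch 8, 100 epochs) as stated above.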

Fig. 1 Architecture of 3D CNNs with input features of elevation, spectral, and spatial information for impervious surface extraction (CF—convolution filter; SW—sub-window; and FC—full connection)

The size of the output layer corresponds to the total number of land-cover classes in our study area: buildings, roads/other impervious surfaces, trees, grasslands, and bare soils. The softmax nonlinearity is employed to conduct multi-class logistic regression in the output layer. Its output is a \( K \)-dimensional vector in which each element corresponds to the probability of one class. Within a mini-batch \( (B) \) of input samples \( \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(B)}, y^{(B)})\} \), for an input sample \( (x^{(b)}, y^{(b)}) \), \( b \in \{1,2, \ldots ,B\} \), its label is \( y^{(b)} \in \{1,2, \ldots ,K\} \). The probability of the \( k \)th \( (k \in \{1,2, \ldots ,K\}) \) class is estimated as follows (Chen et al. 2016b):

$$ P\left( y^{(b)} = k \mid x^{(b)}; \theta \right) = \frac{\exp \left( \theta_{k} x^{(b)} \right)}{\sum_{i=1}^{K} \exp \left( \theta_{i} x^{(b)} \right)} . $$
(4)

\( \theta \) denotes the model parameters, which are adjusted by minimizing the softmax-loss function \( {\mathcal{L}}(\theta) \). \( {\mathcal{L}}(\theta) \) compares a prediction \( P(y^{(b)} = k \mid x^{(b)}; \theta) \) with the ground-truth class label \( k \). The focal loss function is selected to counteract the imbalance among the training samples of each class. Focal loss is an effective extension of the cross-entropy loss: it weights each sample so as to reduce the weight of easily classified samples and majority-class samples, and to raise the weight of hard samples and minority-class samples. Easily classified samples have larger \( P(y^{(b)} = k \mid x^{(b)}; \theta) \), and hard samples have smaller \( P(y^{(b)} = k \mid x^{(b)}; \theta) \). The weight is \( (1 - \alpha^{(b)})(1 - P(y^{(b)} = k \mid x^{(b)}; \theta))^{\gamma} \), where \( 1 - \alpha^{(b)} \) is a coefficient for balancing the imbalanced samples: the more samples a class has, the larger its \( \alpha^{(b)} \) and the smaller \( 1 - \alpha^{(b)} \). In our study, \( \alpha^{(b)} \) is set equal to the ratio between the number of samples of class \( b \) and the total number of samples, giving \( \alpha^{(0)} = 0.56 \), \( \alpha^{(1)} = 0.38 \), and \( \alpha^{(2)} = 0.06 \). \( \gamma \) adjusts the relative weight of easy and hard samples: as \( \gamma \) increases, the weight of easy samples decreases relative to that of hard samples. The focal loss is computed as follows (Chen et al. 2016; Lin et al. 2017):

$$ {\mathcal{L}}(\theta) = - \frac{1}{B} \sum_{b=1}^{B} \sum_{k=1}^{K} \left( 1 - \alpha^{(b)} \right) \left( 1 - P\left( y^{(b)} = k \mid x^{(b)}; \theta \right) \right)^{\gamma} 1\left\{ y^{(b)} = k \right\} \log P\left( y^{(b)} = k \mid x^{(b)}; \theta \right) . $$
(5)
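A hedged PyTorch sketch of Eq. (5) follows. `focal_loss` is our illustrative name, `alpha` holds the per-class sample fractions reported above, and the default \( \gamma = 2 \) is an assumption taken from Lin et al. (2017), since the exact value used in this study is not restated here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, labels, alpha, gamma=2.0):
    """logits: (B, K); labels: (B,) int64; alpha: (K,) class sample fractions.
    gamma=2.0 is the common choice of Lin et al. (2017), assumed here."""
    log_p = F.log_softmax(logits, dim=1)                  # log P(y = k | x; theta)
    log_p_true = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)
    p_true = log_p_true.exp()                             # probability of true class
    weight = (1 - alpha[labels]) * (1 - p_true) ** gamma  # focal weight of Eq. (5)
    return -(weight * log_p_true).mean()                  # mean over the mini-batch

# e.g. alpha = torch.tensor([0.56, 0.38, 0.06]) for the three classes in this study
```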

Optimal Selection of the 3D CNN Hyper-Parameters

The initial parameters of the 3D CNN model have an important influence on feature extraction and classification results. Therefore, the optimal selection of the CNN hyper-parameters is a key step for achieving better training accuracy of the CNN model. In this study, the hyper-parameters are selected using the training and validation samples. Five hyper-parameters, namely the size of the input image \( m \), the convolution kernel size \( n \), the pooling dimension \( p \), the number of feature maps \( t \), and the number of CNN layers \( L \), are evaluated to assess their influence on classification accuracy. The optimal hyper-parameters are then used to construct the optimal 3D CNN model.

First, given input image layers \( m \times m \times c \times d \) (\( c = 10 \)), with \( m = 25 \), \( L = 1 \), and \( t = 50 \), we examine the accuracy of the 3D CNNs using different convolutional kernel sizes \( n = [2,4,6,8,10,12,14,16,18,20] \) and pooling dimensions \( p = [2,4,6,8,10,12] \); by monitoring the validation error over the training epochs, the optimal combination of \( n \) and \( p \) is identified. Second, based on the optimal \( n \) and \( p \) and the predetermined \( L = 1 \), we examine the accuracy using different input image sizes \( m = [5,15,25,35,45] \) and different numbers of output features \( t = [50,100,150,200] \); the optimal \( m \) and \( t \) are thus iteratively determined. Finally, based on the optimal combination of \( n \), \( p \), \( m \), and \( t \), we evaluate the influence of different input image sizes \( m = [3,5,7, \ldots ,47,49,51] \) and numbers of CNN layers \( L = [1,2,3] \) on classification accuracy, and the optimal \( m \) and \( L \) are iteratively determined. The staged search is sketched below.
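The staged selection can be written as a simple grid loop; `train_and_validate` and `make_model` are hypothetical helpers (assumed, not from the paper) that train a candidate configuration and return its validation error.

```python
import itertools

def grid_search(grid_a, grid_b, build):
    """Return the (a, b) pair with the lowest validation error."""
    best = None
    for a, b in itertools.product(grid_a, grid_b):
        err = train_and_validate(build(a, b))   # hypothetical helper
        if best is None or err < best[0]:
            best = (err, a, b)
    return best[1], best[2]

# Stage 1: fix m=25, L=1, t=50 and search kernel size n and pooling dimension p
n_opt, p_opt = grid_search([2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
                           [2, 4, 6, 8, 10, 12],
                           lambda n, p: make_model(n=n, p=p, m=25, L=1, t=50))
# Stages 2-3 re-use the winners while searching (m, t) and then (m, L)
```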

Experiments and Discussion

Datasets

The experimental site of this study is located in Pingdingshan City, Henan Province, China. A WV-2 scene was acquired in August 2014, and the airborne LiDAR data were collected in August 2013. The 11-bit WV-2 image consists of one panchromatic (PAN) band with a 0.5 m pixel size and eight multi-spectral (MS) bands (Coastal Blue, Blue, Green, Yellow, Red, Red Edge, NIR1, NIR2) covering the spectral region from 400 to 1040 nm with a 2 m spatial resolution. The WV-2 MS bands were pan-sharpened with the PAN band to the 0.5 m pixel size using the haze- and ratio-based (HR) fusion scheme. The airborne LiDAR data were provided in ASCII format, including the \( X \), \( Y \), and \( Z \) coordinates of first-return points and their intensities, with an average point cloud density of 23.52 points m−2. The DSM and the DEM were generated from the LiDAR point cloud using ENVI software (version 5.1), and the nDSM image with 0.5 m spatial resolution was produced by subtracting the DEM from the DSM. Finally, the nDSM, together with the eight WV-2-fused multi-spectral bands and the panchromatic WV-2 band, is input to the 3D CNN model, conceptually as follows.
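The nDSM was generated in ENVI; the following numpy sketch (file names are placeholders) shows the conceptually equivalent raster arithmetic and the stacking of the ten input channels.

```python
import numpy as np

dsm = np.load("dsm_0p5m.npy")         # LiDAR-derived surface model (placeholder)
dem = np.load("dem_0p5m.npy")         # LiDAR-derived terrain model (placeholder)
ndsm = dsm - dem                      # nDSM: above-ground height per pixel

wv2 = np.load("wv2_9band_0p5m.npy")   # 8 pan-sharpened MS bands + PAN, (9, H, W)
stack = np.concatenate([wv2, ndsm[None]], axis=0)   # (10, H, W) network input
```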

Based on the V–I–S model, we defined three land classes over the whole study area: impervious surfaces, vegetation, and bare soil. For the 3D CNN classification, the manually and randomly selected training, validation, and test sample pixels are listed in Table 1. The number of samples is comparable to that of other studies applying deep learning to remote sensing image classification (Yue and Ma 2016; Gao et al. 2018), which used roughly 2000–7000 training samples. The training samples are used to train the 3D CNN model, the validation samples to tune the hyper-parameters, and the test samples for the final accuracy evaluation.

Table 1 Training, validation, and test sample pixels of each class in our study area

Optimal 3D CNN Hyper-Parameters

To identify the optimal 3D CNN parameters for model development, the candidate settings were evaluated using randomly selected training and validation samples (Fig. 2).

Fig. 2 Influence of the different 3D CNN model parameters on pre-training CNN model accuracies. a Influence of the parameters n and p; b performance of the parameters m and t; c performance of the parameters m and L (the unit of input image size is pixels)

Figure 2a shows the influence of the hyper-parameters \( n \) and \( p \) on the CNN model accuracy for \( m = 25 \), \( L = 1 \), and \( t = 50 \). It can be observed that: (1) as \( n \) increases, the CNN model errors (the ratio of misclassified to total classified validation samples at the 300th epoch) generally decrease and then increase; (2) when \( n = [6, 8, 10] \), relatively low errors are obtained for every \( p \); and (3) when \( n < p \), the model errors are high, whereas when \( n \in [2p, 3p] \), the errors are lowest. Therefore, the optimal combination of \( n \) and \( p \) satisfies \( n \in [2p, 3p] \).

Figure 2b examines the performance of the hyper-parameters \( m \) and \( t \) for \( L = 1 \). When \( m = 5 \), the model errors are highest; when \( m = [25,35,45] \), the errors are below 10%. In addition, for a given \( m \), the number of output features \( t \) barely affects the CNN model errors.

Figure 2c examines the hyper-parameters \( m \) and \( L \). When \( m \le 15 \), the model errors are high; however, a larger \( m \) requires more computation time. From Fig. 2b, c, the parameter \( m \) performs best in the range of 20–40. When \( L = 1 \), either a larger or a smaller \( m \) results in higher model error. In general, when \( L = 2 \) and \( L = 3 \), the pre-training CNN model errors are fairly stable, and the model accuracies with \( L = 2 \) are superior to those with \( L = 1 \) and \( L = 3 \).

In short, for the 3D CNN model in this study, the optimal input image size is set to \( m = 37 \), \( d = 3 \), \( c = 10 \), and the number of layers to \( L = 2 \). For the first layer, the size of the 3D convolutional kernel is \( 8 \times 8 \times 2 \) and the sub-sampling kernel is \( 4 \times 4 \). For the second layer, the 3D convolutional kernel and sub-sampling kernel are \( 3 \times 3 \times 2 \) and \( 2 \times 2 \), respectively, and the numbers of output feature maps are set to 50 and 100, respectively. To compare the 2D CNNs and 3D CNNs fairly, the hyper-parameters of the 2D CNNs are set equal to those of the 3D CNNs, except that the convolution kernels are two-dimensional: for the first layer, the 2D convolutional kernel is \( 8 \times 8 \) with a \( 4 \times 4 \) sub-sampling kernel, and for the second layer, \( 3 \times 3 \) with a \( 2 \times 2 \) sub-sampling kernel. Our 3D CNN and 2D CNN implementations are based on the PyTorch 0.4 platform and an NVIDIA GTX 1080Ti GPU. The SVM method is used for comparison, with hyper-parameters consistent with Guo et al. (2014): the penalty coefficient C is 100, and gamma is 0.1 for the WorldView-2 + LiDAR image (10 bands) and 0.11 for the WorldView-2 image (9 bands). A sketch of this configuration follows.
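Under these settings, the CNN3D sketch from Sect. 2.2 can be instantiated and shape-checked as follows; the SVM line uses scikit-learn as an assumed stand-in for the original SVM implementation.

```python
import torch
from sklearn.svm import SVC

model = CNN3D(specs=[(50,  (2, 8, 8), (1, 4, 4)),    # layer 1: 8x8x2 kernel, 4x4 pool
                     (100, (2, 3, 3), (1, 2, 2))],   # layer 2: 3x3x2 kernel, 2x2 pool
              num_classes=3)                         # IS, vegetation, bare soil
logits = model(torch.randn(8, 1, 10, 37, 37))        # mini-batch 8, c = 10, m = 37
print(logits.shape)                                  # torch.Size([8, 3])

svm = SVC(kernel="rbf", C=100, gamma=0.1)            # WV-2 + LiDAR setting (0.11 for WV-2 only)
```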

Urban Impervious Surface Extraction

Figure 3 visually compares the impervious surface area (ISA) extracted with the 3D CNN, 2D CNN, and SVM methods from WV-2 only and from WV-2 + airborne LiDAR, respectively. Accuracy assessment is conducted using the error matrix approach, and the accuracy measures, i.e. producer's accuracy (PA), user's accuracy (UA), overall accuracy (OA), and overall kappa coefficient (OK), are calculated (Table 2); a sketch of these measures follows.
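A minimal numpy sketch of these measures, computed from an error (confusion) matrix of test labels and predictions, assuming integer class codes 0..k-1:

```python
import numpy as np

def accuracy_report(y_true, y_pred, k):
    """PA, UA, OA, and kappa from integer label arrays (classes 0..k-1)."""
    cm = np.zeros((k, k), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)        # rows: reference; cols: mapped
    n = cm.sum()
    pa = np.diag(cm) / cm.sum(axis=1)         # producer's accuracy per class
    ua = np.diag(cm) / cm.sum(axis=0)         # user's accuracy per class
    oa = np.trace(cm) / n                     # overall accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n ** 2   # chance agreement
    kappa = (oa - pe) / (1 - pe)              # overall kappa coefficient
    return pa, ua, oa, kappa
```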

Fig. 3 WV-2 image (a) and airborne LiDAR nDSM (b); extracted ISA distributions from WV-2 only using SVM (c), 2D CNNs (e), and 3D CNNs (g); and from WV-2 + LiDAR using SVM (d), 2D CNNs (f), and 3D CNNs (h)

Table 2 Error matrix and accuracy comparison between the SVM, 2D CNN and 3D CNN classifications with WV-2 only and WV-2 + LiDAR

As shown in Table 2, the 3D CNNs provide significantly higher accuracy for impervious surface extraction than the SVM and 2D CNNs. With the WV-2 image only, the OA and OK of the 3D CNNs are 93.08% and 0.89, compared with 89.08% and 0.83 for the SVM and 85.12% and 0.77 for the 2D CNNs. With WV-2 + LiDAR, the 3D CNNs reach an OA of 93.20% and an OK of 0.89, while the SVM achieves an OA of 91.44% and an OK of 0.87, and the 2D CNNs an OA of 91.32% and an OK of 0.86. Notably, the 3D CNNs using the WV-2 image only still outperform the 2D CNNs and SVM using WV-2 + LiDAR; when both datasets are combined, the performance gains of the 2D CNNs and SVM are larger than that of the 3D CNNs.

The 3D CNN method can automatically extract multi-scale spectral, spatial, textural, and elevation features through a series of convolution steps, which results in higher within-class similarity and better separability among classes; it is therefore superior to the pixel-based SVM classifier. Furthermore, the correlation among bands is considered better in the 3D CNNs than in the 2D CNNs, because the three-dimensional convolution extracts features across neighbouring bands, as the following shape comparison illustrates.
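The band-correlation point is visible directly from the kernel shapes: a 2D convolution mixes all ten input bands into each output map in one step, whereas a 3D kernel of depth 2 slides along the band axis and learns local inter-band structure (sizes as used above, for illustration only).

```python
import torch.nn as nn

conv2d = nn.Conv2d(10, 50, kernel_size=8)          # weight: (50, 10, 8, 8)
conv3d = nn.Conv3d(1, 50, kernel_size=(2, 8, 8))   # weight: (50, 1, 2, 8, 8)
print(conv2d.weight.shape, conv3d.weight.shape)
```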

Regarding the producer's and user's accuracies of each class, the producer's accuracy of impervious surfaces and the user's accuracy of bare soil are lower for the SVM and 2D CNN methods than for the 3D CNN method, especially in the WorldView-2-only classification. The main reason is that many impervious surfaces are misclassified as bare soil owing to the lack of height information. When the LiDAR height features are added, the 3D CNNs represent the features better than the SVM and 2D CNNs.

To better demonstrate the effectiveness of the proposed 3D CNNs in impervious surface mapping, the detailed results in various local areas (Fig. 4) are further compared. In high-resolution optical images of complex urban landscapes, the same objects made of different materials (e.g. rooftops) show different spectral signatures. In Fig. 4I, II, some building roofs with different materials are misclassified as bare soil from the WV-2 image only using the SVM method (Fig. 4b); the height information extracted from the LiDAR data resolves this problem (Fig. 4c). The 2D CNNs improve the extraction of buildings by learning the spectral and spatial features of the WorldView-2 image (Fig. 4d), although some roads are misclassified as bare soil; adding the height information alleviates this confusion (Fig. 4e). The 3D CNNs identify the rooftops completely with both the WV-2 image only and WV-2 + LiDAR (Fig. 4f, g), because they not only extract spectral, spatial, and height features but also consider the correlation of the neighbourhood spectrum. Moreover, in Fig. 4III, the SVM and 2D CNN methods cannot completely extract the narrow roads in the park; the 3D CNNs with the WorldView-2 image perform best, whereas the 3D CNNs with WV-2 + LiDAR somewhat overestimate the impervious surfaces in tree crown areas.

Fig. 4 Impervious surface (IS) distribution of three subsets in the study area. a The true colour WV-2 image; b IS results derived from WV-2 only using SVM; c IS results derived from WV-2 only using 2D CNNs; d IS results derived from WV-2 only using 3D CNNs; e IS results derived from WV-2 + LiDAR using SVM; f IS results derived from WV-2 + LiDAR using 2D CNNs; and g IS results derived from WV-2 + LiDAR using 3D CNNs

Conclusions

In this study, a 3D CNN approach is proposed and employed for extracting urban impervious surfaces from HR WV-2 and airborne LiDAR datasets. We further evaluate the influences of different 3D CNN parameters on impervious surface extraction. Several findings are summarized.

Through deep learning, our proposed 3D CNN method automatically extracts spectral, spatial, textural, and elevation features via multi-step convolutional, ReLU, and pooling operators, resulting in better extraction of impervious surfaces (especially building roofs and roads).

Our results show that different 3D CNN parameters have significant effects on impervious surface extraction. The optimal combination of convolutional kernel size \( n \) and pooling dimension \( p \) is \( n \in \left[ {2p, 3p} \right] \). For input image size m, a smaller m results in lower accuracy, while a larger \( m \) requires more computation time. The range from 20 to 40 is an optimal choice for parameter \( m \). Additionally, the performance of impervious surface extraction is the most stable when the number of CNN layers \( L \) is set to 2.

Pixel-based classifiers often rely on multi-source datasets such as WV-2 and LiDAR to improve their performance. For the 3D CNN method, the improvement from multi-source data is less dramatic: even with a single-source HR image, the 3D CNN model can extract impervious surfaces with high accuracy.