Introduction

Land-use and land-cover change (LUCC), which is closely related to global climate change and to changes in ecosystems and biodiversity, reflects the effects of human activities and climate change on the ecological environment of the Earth’s surface (Blasi et al. 2008; Yang et al. 2014). Since the 1990s, the Food and Agriculture Organization, the International Geosphere-Biosphere Programme, the International Institute for Applied Systems Analysis, and other research institutions have launched a series of LUCC-related projects (Sands and Leimbach 2003), and the international community has come to regard LUCC as a core topic of global environmental change research. Remote sensing is an effective tool for monitoring the Earth’s surface and a basic component of applications that use classification and recognition technologies to investigate land-use status (Song et al. 2012).

Numerous land-use classification standards exist. These systems include many classes and must account for the complex features of land-use and land-cover types, characteristics that pose difficulties for accurate classification. In the classification of remote sensing images, a classification strategy must first be determined, followed by the selection of an appropriate classifier. Classification strategies include supervised or unsupervised classification, the direct use of original spectral information or the extraction of additional features from it, and hard or soft classification. In particular, classification strategies can be divided into pixel-based and object-oriented classifications according to the basic unit of classification (Zheng et al. 2010). In terms of classifier selection, the traditional approach is the statistical method for low-level feature extraction, including distance (Tzeng 2006), K-nearest neighbor (Meng et al. 2007), maximum likelihood (Bruzzone and Prieto 2001), and logistic regression (Lee 2005) classifiers. With the rapid development of aerospace, sensor, and computer technologies over the past decade, high-resolution remote sensing (HRRS) images have been increasingly applied in land-use classification (Hu et al. 2015). In HRRS images, spectral confusion increases both the diversity of objects within a given class and the similarity of objects across different classes. These properties reduce the effectiveness of traditional classification methods based on low-level features (Paisitkriangkrai et al. 2016). Therefore, methods based on mid-level feature modeling have been developed on the basis of the low-level feature methods (Bosch et al. 2007). Three types of mid-level feature extraction methods describe image semantics: the bag-of-visual-words (BoVW), latent Dirichlet allocation (LDA), and machine learning models. In practical applications, however, the performance of BoVW-based methods relies on the extraction of handcrafted local features (Alkhawlani et al. 2015), and LDA modeling methods rely on K-means clustering to produce a visual dictionary, which limits their expression of mid-level semantic features. Machine learning models independently perform data expression and feature extraction (Campsvalls 2008) and do not extract features according to predetermined rules (Tuia et al. 2013; Lin et al. 2017); thus, they obtain improved classification results when applied to complex images. Commonly used machine learning methods include sparse coding (Jiang et al. 2014), neural networks (Yuan et al. 2009), support vector machines (Blanzieri and Melgani 2006; Dai et al. 2007), and deep learning (Zhang et al. 2016). Deep learning networks are composed of multiple nonlinear mapping layers; they represent a new approach to intelligent pattern recognition and an important new direction in remote sensing image processing (Zhao et al. 2015).

Convolutional neural networks (CNNs) are a basic deep learning model: biologically inspired multistage architectures composed of convolutional, pooling, and fully connected layers (Längkvist et al. 2016). A CNN uses the low-level features contained in an image to form high-level features through a multilayer abstraction mechanism (Zhao et al. 2016), which effectively reduces the gap between low-level image features and high-level semantic features. Research applying CNNs to remote sensing images has emerged in recent years. The Hinton team won an overwhelming victory in the ImageNet image classification competition, reducing the top-5 classification error rate on the 1000-class task from 26.2 to 15.3% (Krizhevsky et al. 2017). Hu et al. used a CNN model to classify HRRS images for the first time. Chen et al. adopted a CNN classification method that incorporates pixel spectral and spatial information and studied the importance of spatial information in classifying HRRS images (Chen et al. 2016). Qi et al. (2017) presented a Multiscale Deeply Described Correlation-based algorithm that jointly incorporates appearance and spatial information at multiple scales to perform land-use-type classification. In these applications, CNNs have surpassed traditional pattern recognition and machine learning algorithms in both performance and accuracy.

In general, CNN-based classification is executed pixel by pixel. Because land-use types are numerous and spatially intermixed, the classification results are easily confused in the transitional zones between land types, and this approach is not conducive to identifying the types of small land parcels. To overcome these difficulties, traditional methods increase the training set size or increase the model depth and the number of nodes, which places tremendous pressure on manual labeling (Lin et al. 2016). Object-oriented classification strategies classify objects composed of homogeneous multi-pixel regions and jointly use the spectral, spatial, shape, and other features of images to judge types, thereby breaking through the limitations of pixel-based classification. In addition, improvements in training sample set construction and in the deep learning method itself can reduce the dependence of a deep learning model on training sample size. Therefore, this study makes improvements in two aspects: the classification strategy and the deep learning model. The major contributions of this research are as follows:

  1. The object-oriented method is combined with the deep learning method. On the one hand, the object-oriented method is used to construct a multi-scale sample set to provide high-precision training data for deep learning model training. On the other hand, on the basis of the object-oriented concept, the method avoids the processing of mixed pixels in the classification process and enhances the typicality of classification objects in deep learning.

  2. The CNN model structure is modified to improve classification performance, and the training algorithm is optimized to avoid the overfitting phenomenon that occurs during training on small datasets.

The remainder of this paper is organized as follows: The “Methodology” section introduces the proposed framework that combines an object-oriented approach with a deep convolutional neural network (COCNN) for land-use-type classification. The “Experiment and Results” section presents the experimental results and analysis. The “Conclusions” section offers concluding remarks and perspectives on future work.

Methodology

The general process of remote sensing image classification mainly consists of feature extraction and classification based on image features. The traditional object-oriented method establishes fuzzy rules in accordance with the feature differences of various class objects, focusing on improving feature extraction. The object features include color and spectral characteristics [e.g., luminance value, normalized difference water index (NDWI), and normalized difference vegetation index (NDVI)] as well as shape and texture features (e.g., boundary index, compactness, and aspect ratio) (Chen et al. 2006; Su et al. 2007; Robertson and King 2011). However, the feature extraction of the object-oriented method cannot cover all feature types; when the performance of the classifier is not improved, the extracted feature information alone is insufficient to support the classification and recognition of class objects. Deep learning combines low-level features to form more abstract high-level representations, which gives it strong expressive capability and outstanding classification performance. However, because deep learning is usually performed on RGB images, the multi-band characteristics of remote sensing images are not fully considered. In addition, deep learning requires a large number of labels, so the manual identification workload is large. Table 1 compares the advantages and disadvantages of the two methods.

Table 1 Advantages and disadvantages of object-oriented and deep learning methods

The advantages of combining the object-oriented approach with deep learning include two aspects. On the one hand, through the object-oriented construction of a feature rule set, land-use objects can be initially extracted, and the training sample sets required for deep learning can then be constructed from these objects. On the other hand, the performance of deep learning in practical applications is affected by the number of features, especially when the sample set is relatively small (Mares et al. 2016). After combination with the object-oriented method, large-scale spatial context information can be considered by extracting object units, and additional feature rules and prior knowledge can be integrated into the deep learning process (Zeiler and Fergus 2014). In addition, the classification result can be corrected in accordance with the feature rule set of the object-oriented method; this optimization of the feature extraction strategy further improves the classification effect. On the basis of this analysis, a land-use-type classification method (COCNN) that draws on the technical characteristics of both the object-oriented and deep learning approaches is proposed. The method is explained in detail in the following sections.

COCNN Land-Use-Type Classification Framework

The general flowchart of the COCNN framework (Fig. 1) comprises three stages. First, after preprocessing of the remote sensing images (e.g., image fusion), the multi-scale segmentation algorithm is used to segment the image. Second, on the basis of the object-oriented segmentation results, a typical rule set for construction land, roads, water bodies, vegetation, and other land-use types is constructed, and the segmentation objects are classified and extracted to obtain training samples, forming a typical object sample set. Finally, CNN model training is performed on the sample set, and the multi-scale segmentation results are further classified with the trained model.

Fig. 1
figure 1

Flowchart of the COCNN method

Multi-scale Image Segmentation Based on Object-Oriented Method

Image Preprocessing

Preprocessing of remote sensing images includes radiometric calibration, geometric correction, and image fusion. The main purpose of preprocessing is to restore the information contained in the image so that it reflects the actual surface state as closely as possible. Radiometric calibration eliminates image distortion caused by radiation errors. Geometric correction requires that the absolute error of the corrected position be less than one pixel. For image fusion, the NNDiffuse Pan Sharpening algorithm is selected because it effectively preserves the color, texture, and spectral information of the image.

Multi-scale Image Segmentation

The multi-scale object-oriented segmentation algorithm considers an image to be a region adjacency graph consisting of topological relationships between regions (Wang and He 2011). The algorithm can segment the image at a specified scale to ensure that image regions (image objects) with high homogeneity (or minimal heterogeneity) are generated, which is suitable for the optimal separation and representation of objects (Woodcock and Strahler 1987).

The algorithm is roughly divided into two steps: (1) initial segmentation and (2) object merging. In the initial segmentation step, starting from a single pixel, the difference measure with respect to neighboring cells is calculated, and the image is segmented in accordance with heterogeneity (Jin et al. 2018). Heterogeneity is determined by the differences in spectrum and geometry between objects and is calculated with formula (1).

$$ f = w_{1} x + \left( {1 - w_{1} } \right)y. $$
(1)

In the formula, \( f \) is the heterogeneity value; w1 represents the weight, 0 ≤ w1 ≤ 1; x denotes the spectral heterogeneity; and y refers to the shape heterogeneity. The calculation of x and y follows formulas (2) and (3).

$$ x = \mathop \sum \limits_{i = 1}^{n} p_{i} \sigma_{i} , $$
(2)
$$ y = w_{2} u + \left( {1 - w_{2} } \right)v. $$
(3)

In the formula, pi is the weight of the ith image layer; σi indicates the standard deviation of the ith image layer spectral value; u represents the overall tightness of the image region; v denotes the image region boundary smoothness; and w2 stands for the weight, 0 ≤ w2 ≤ 1. The calculation of u and v follows formulas (4) and (5).

$$ u = \frac{E}{\sqrt N }, $$
(4)
$$ v = \frac{E}{L}. $$
(5)

In the formula, E is the actual boundary length of the image region; N denotes the total number of pixels of the image region; and L represents the total length of the rectangular boundary, including the range of the image region.
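To make the composition of formulas (1)–(5) concrete, the following minimal NumPy sketch evaluates the heterogeneity of a single region; the function name, argument layout, and example values are illustrative assumptions rather than part of the original algorithm description.

```python
import numpy as np

def heterogeneity(stds, layer_weights, E, N, L, w1, w2):
    """Heterogeneity f of an image region, following formulas (1)-(5).

    stds          -- per-layer standard deviations sigma_i of the spectral values
    layer_weights -- per-layer weights p_i
    E, N, L       -- actual boundary length, pixel count, and bounding-rectangle
                     boundary length of the region
    w1, w2        -- spectral-vs-shape and compactness-vs-smoothness weights
    """
    x = float(np.sum(np.asarray(layer_weights) * np.asarray(stds)))  # formula (2)
    u = E / np.sqrt(N)                                               # formula (4)
    v = E / L                                                        # formula (5)
    y = w2 * u + (1 - w2) * v                                        # formula (3)
    return w1 * x + (1 - w1) * y                                     # formula (1)

# Illustrative call for a 4-band region (all values are arbitrary examples):
f = heterogeneity(stds=[3.1, 2.7, 4.0, 3.5], layer_weights=[0.25] * 4,
                  E=120, N=900, L=124, w1=0.7, w2=0.5)
```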

The object merging step starts from the regions in the region adjacency graph. Region pairs that satisfy the local optimal merging condition are determined, the two regions are merged, and the feature values of all regions connected to the original two regions are updated. When two adjacent regions are merged, the heterogeneity of the newly generated large image region object is calculated using formula (6).

$$ f^{{\prime }} = w_{1} x^{{\prime }} + \left( {1 - w_{1} } \right)y^{{\prime }} . $$
(6)

In the formula, f′ is the heterogeneity value of the newly merged large image region object; x′ and y′ represent the spectral and shape heterogeneities of the newly merged large image region, respectively. The calculation of x′ and y′ follows formulas (7) and (8).

$$ x^{{\prime }} = \mathop \sum \limits_{i = 1}^{n} p_{i} \left[ {N^{{\prime }} \sigma_{i}^{{\prime }} - \left( {N_{1} \sigma_{i1} + N_{2} \sigma_{i2} } \right)} \right], $$
(7)
$$ y^{{\prime }} = w_{2} u^{{\prime }} + \left( {1 - w_{2} } \right)v^{{\prime }} . $$
(8)

In the formula, N′ denotes the total number of pixels in the merged image region; σi′ refers to the standard deviation of the ith layer spectral value of the merged image; N1 and N2 are the total numbers of image pixels in adjacent regions 1 and 2 before the merge, respectively; σi1 and σi2 refer to the standard deviations of the spectral values of the ith layer of adjacent regions 1 and 2 before the merge, respectively. The calculation of u′ and v′ follows formulas (9) and (10).

$$ u^{{\prime }} = N^{{\prime }} \frac{{E^{{\prime }} }}{{\sqrt {N^{{\prime }} } }} - \left( {N_{1} \frac{{E_{1} }}{{\sqrt {N_{1} } }} + N_{2} \frac{{E_{2} }}{{\sqrt {N_{2} } }}} \right), $$
(9)
$$ v^{{\prime }} = N^{{\prime }} \frac{{E^{{\prime }} }}{{L^{{\prime }} }} - \left( {N_{1} \frac{{E_{1} }}{{L_{1} }} + N_{2} \frac{{E_{2} }}{{L_{2} }}} \right). $$
(10)

In the formula, E′ and L′ are the actual boundary length of the merged image region and the total length of the circumscribed rectangle boundary of the region, respectively; E1 and L1 denote the actual boundary length of adjacent region 1 before the merge and the total boundary length of region 1’s circumscribed rectangle, respectively; and E2 and L2 indicate the corresponding quantities for adjacent region 2. Figure 2 shows the results of image segmentation at different scales, and a sketch of the merge computation follows the figure.

Fig. 2
figure 2

Multi-scale segmentation result diagram. a Original image; b–d segmentation results at scales of 75, 120, and 300, respectively, each with a shape weight of 0.3 and a compactness weight of 0.5
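For the merging step, the following sketch evaluates formulas (6)–(10) for one candidate pair of adjacent regions under the same naming assumptions as the previous sketch; in a full segmentation pass, the returned value would be compared against a scale-dependent threshold to decide whether the merge is accepted.

```python
import numpy as np

def merge_heterogeneity(p, stds_m, stds1, stds2,
                        E_m, N_m, L_m, E1, N1, L1, E2, N2, L2,
                        w1, w2):
    """Heterogeneity f' of a candidate merge of two adjacent regions,
    following formulas (6)-(10); names are illustrative."""
    p, s_m, s1, s2 = map(np.asarray, (p, stds_m, stds1, stds2))
    x = float(np.sum(p * (N_m * s_m - (N1 * s1 + N2 * s2))))    # formula (7)
    u = (N_m * E_m / np.sqrt(N_m)
         - (N1 * E1 / np.sqrt(N1) + N2 * E2 / np.sqrt(N2)))     # formula (9)
    v = N_m * E_m / L_m - (N1 * E1 / L1 + N2 * E2 / L2)         # formula (10)
    y = w2 * u + (1 - w2) * v                                   # formula (8)
    return w1 * x + (1 - w1) * y                                # formula (6)
```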

Sample Set Construction Based on Multi-scale Rules

The hierarchical structure of the sample classification is established through the correspondence between the feature information of objects and land use. Multi-scale hierarchical segmentation is used, and different land uses are segmented at different scales. Classification rules are then set in accordance with the spectral, geometric, texture, and topological features of the land-use objects.

In the large-scale segmentation layer, brightness, NDWI, and NDVI indexes are used as the basis for assessment (Zhu et al. 2017), and the first-level classes, such as construction land, road, water body, and vegetation, are initially extracted. Within the image objects of the first-level classes, an appropriate segmentation scale is selected, and the subclasses are segmented by considering the shape indexes of the objects, such as the boundary index, compactness, and aspect ratio. Table 2 shows the multi-scale object rule set, and a minimal sketch of such index-based rules follows the table.

Table 2 Multi-scale object rule set
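As a hedged illustration of how such index-based rules can be applied to a segmented object, the sketch below assigns a first-level class from mean brightness, NDVI, and NDWI; the band order and every threshold are hypothetical stand-ins for the actual values in Table 2.

```python
import numpy as np

def first_level_class(obj_pixels):
    """Assign a first-level class to one segmented object from its mean
    spectral values. The band order (blue, green, red, NIR) and all
    thresholds are hypothetical; the actual rule set is given in Table 2."""
    green = obj_pixels[..., 1].mean()
    red = obj_pixels[..., 2].mean()
    nir = obj_pixels[..., 3].mean()
    brightness = obj_pixels.mean()
    ndvi = (nir - red) / (nir + red + 1e-9)      # vegetation index
    ndwi = (green - nir) / (green + nir + 1e-9)  # water index
    if ndwi > 0.2:           # hypothetical threshold
        return "water body"
    if ndvi > 0.3:           # hypothetical threshold
        return "vegetation"
    if brightness > 0.5:     # hypothetical threshold (reflectance in [0, 1])
        return "construction land"
    return "road / other"
```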

On the basis of these object judgment rules, a set of typical remote sensing image features, covering cultivated land, woodland, water, roads, and buildings, is established by tracing the sample boundaries under each category, and the training sample set is thereby obtained. Table 3 shows an example of the training sample set.

Table 3 Training sample set example

Construction of Deep CNN Model

Modeling Method

The deep CNN model is selected for deep learning by using the sample images in the sample set as the training data. The characteristics of the samples are automatically obtained through deep learning, and the object-oriented segmentation results are used to realize the automatic classification of typical land-use types. Therefore, the structural design of the CNN is the key issue.

A deep CNN is formed by stacking multiple basic network structures. To obtain more accurate classification results, adding nodes to the model is necessary; however, the resulting model complexity requires additional training samples for support, and the training samples available in practical applications are limited. A comparison with traditional CNN training methods shows that improving the learning algorithm used in training is more effective.

The important structural parameters and training strategies are optimized to improve the classification effect of the deep CNN:

  1. The rectified linear unit (ReLU) activation function accelerates model convergence.

The ReLU function is one of the most popular neuronal activation functions in the deep learning field (Shang et al. 2016). The commonly used sigmoid function, by contrast, is a nonlinear activation function that displays a saturation effect, causing a loss of gradient information for large and small input values (Chen et al. 2013). The output of the sigmoid function is also not centered on zero, resulting in convergence fluctuations during the gradient descent phase; when the number of layers is large, the gradient propagated to the front layers becomes small, and the network weights are not effectively updated. The tanh activation function likewise has a small gradient value at saturation, leading to inefficient training (Nambiar et al. 2014; Gulcehre et al. 2016).

The gradient of the ReLU function is a constant equal to 1 when x > 0. Thus, the problem of vanishing gradients is alleviated during backpropagation (Zhang et al. 2017). Moreover, the ReLU function is sparsely activated through simple thresholding. In comparison with other activation functions (e.g., sigmoid and tanh), the ReLU function increases the convergence speed of the CNN, as illustrated by the sketch below.
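The following minimal sketch shows the thresholding behavior and the constant unit gradient discussed above.

```python
import numpy as np

def relu(x):
    """ReLU activation: simple thresholding yields sparse activation."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """The gradient is a constant 1 for x > 0, so it does not vanish
    during backpropagation, unlike the saturating sigmoid/tanh."""
    return (x > 0).astype(float)
```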

  2. Regularization is used to prevent overfitting.

Regularization reduces model complexity by restricting the range of the parameters, thereby reducing disturbances caused by noisy inputs and alleviating overfitting to a certain extent (Fanany 2017). L2 regularization is realized by modifying the cost function, whereas the dropout technique is realized by modifying the neural network itself. The key concept of dropout is to randomly suppress neurons in the target layer with a certain probability during every iteration of model training (Zheng et al. 2017). This process considerably reduces complex mutual adaptation among neurons and suppresses overfitting. A minimal sketch of both techniques follows.
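The sketch below illustrates both mechanisms: an L2-regularized gradient step (with an assumed, illustrative regularization coefficient) and an inverted-dropout layer whose keep probability lies in the 0.5–0.9 range considered in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_sgd_step(w, grad, lr=0.0005, lam=1e-4):
    """One gradient-descent step with L2 regularization: the lam * w term
    pulls weights toward zero (lam is an illustrative coefficient; the
    learning rate matches the constant 0.0005 used in this paper)."""
    return w - lr * (grad + lam * w)

def dropout(a, keep_prob=0.5, training=True):
    """Inverted dropout: suppress each neuron with probability 1 - keep_prob
    during training and rescale the survivors; identity at test time."""
    if not training:
        return a
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob
```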

  3. The local response normalization (LRN) layer enhances generalization.

The LRN layer mimics the lateral inhibition mechanism of biological nervous systems and creates a competitive environment among the activities of local neurons (Li et al. 2015). This behavior enhances relatively large response values and suppresses neurons with small feedback, thereby improving the model’s generalization capability. Furthermore, because LRN selects large feedback from the responses of multiple nearest-neighbor convolution kernels, it is well suited to the ReLU activation function, which has no upper bound. A minimal NumPy sketch of cross-channel LRN follows.
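The hyperparameter values in the sketch below are common AlexNet-style defaults and are assumptions, since this paper does not report its LRN parameters.

```python
import numpy as np

def lrn(fmaps, depth_radius=2, k=2.0, alpha=1e-4, beta=0.75):
    """Cross-channel local response normalization on an (H, W, C) array:
    each response is divided by a term summing the squares of its
    nearest-neighbor channels, so large responses suppress small ones."""
    H, W, C = fmaps.shape
    out = np.empty_like(fmaps, dtype=float)
    for c in range(C):
        lo, hi = max(0, c - depth_radius), min(C, c + depth_radius + 1)
        denom = (k + alpha * np.sum(fmaps[:, :, lo:hi] ** 2, axis=2)) ** beta
        out[:, :, c] = fmaps[:, :, c] / denom
    return out
```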

CNN Model Structure

Figure 3 presents the deep CNN framework constructed in this paper. The initial weights of the network are drawn from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. At the training stage, the sliding step (stride) is 1, and gradient descent is performed with a constant learning rate of 0.0005. The core of the deep CNN is composed of 7 convolutional layers (Conv1–Conv7), 1 pooling layer (pooling1), and 7 LRN layers (norm1–norm7); a hedged sketch of this architecture is given below.

Fig. 3
figure 3

Structure of the deep CNN
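The following tf.keras sketch assembles the components of Fig. 3 as described in the next paragraph; the input shape, the position of the single pooling layer, and the weight-decay coefficient are assumptions, and the code is an illustration rather than the authors’ implementation.

```python
import tensorflow as tf

def build_cocnn(input_shape=(30, 30, 4), num_classes=10,
                weight_decay=1e-4, drop_rate=0.5):
    """Minimal tf.keras sketch of the architecture in Fig. 3: seven 3 x 3
    convolutional layers with 8-256 kernels, LRN after each convolution,
    one pooling layer, two FC-1024 layers with dropout, and a softmax
    classifier. Input shape, pooling position, and weight decay are
    assumptions; this is not the authors' released code."""
    reg = tf.keras.regularizers.l2(weight_decay)
    init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.01)
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for i, filters in enumerate([8, 16, 32, 64, 128, 256, 256]):
        x = tf.keras.layers.Conv2D(filters, 3, strides=1, padding="same",
                                   activation="relu",
                                   kernel_initializer=init,
                                   kernel_regularizer=reg)(x)
        x = tf.keras.layers.Lambda(tf.nn.local_response_normalization)(x)
        if i == 0:  # single pooling layer; its exact position is assumed
            x = tf.keras.layers.MaxPooling2D()(x)
    x = tf.keras.layers.Flatten()(x)
    for _ in range(2):
        x = tf.keras.layers.Dense(1024, activation="relu",
                                  kernel_regularizer=reg)(x)
        x = tf.keras.layers.Dropout(drop_rate)(x)  # rate = fraction dropped
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=5e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Here LRN is applied through a Lambda layer wrapping tf.nn.local_response_normalization, since Keras provides no built-in LRN layer.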

Four important parameters and functions in the deep CNN-based model are described as follows: (1) Size and number of the local receptive fields and activation functions. The convolution kernels are 3 × 3, 5 × 5, or 7 × 7 pixel blocks, and convolutional layers 1–7 have 8, 16, 32, 64, 128, 256, and 256 kernels, respectively. Different sizes and numbers of convolution kernels are used to investigate the effects of the feature sampling density on model performance. After each convolution operation, the ReLU activation function is applied. (2) Initial weight regularization. L2 regularization is added to the initialization parameters of each layer in the network. (3) Fully connected layers with dropout. The model contains two fully connected layers, each with 1024 outputs. Because the fully connected layers FC1–FC2 are densely connected, the dropout technique is applied to them to control overfitting; the dropout probability is optimized within the range of 0.5–0.9. (4) Loss function of the classification layer. The softmax loss function constructs the corresponding classifier in the classification layer. Each node in the output of the CNN represents the probability that the input belongs to a certain class i as follows:

$$ P\left( {Y = i |x,w,b} \right) = {\mathop{\text{softmax}}\limits_{i}} \left( {wx + b} \right) = \frac{{e^{{w_{i} x + b_{i} }} }}{{\mathop \sum \nolimits_{j} e^{{w_{j} x + b_{j} }} }}, $$
(11)

where w is the weight parameter in the last layer and b denotes the corresponding bias parameter.
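A numerically stable evaluation of formula (11) is sketched below; subtracting the maximum logit before exponentiation is a standard implementation detail that leaves the probabilities unchanged.

```python
import numpy as np

def softmax_probs(x, w, b):
    """Class probabilities of formula (11), computed in a numerically
    stable way; `w` has one row per class, `b` one bias per class."""
    logits = w @ x + b
    logits = logits - logits.max()  # stabilization; cancels in the ratio
    e = np.exp(logits)
    return e / e.sum()
```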

Experiment and Results

Experimental Data and Environment

Experimental data, comprising an optical remote sensing image, a high-quality land-use classification vector layer, and classified field information, are derived from the land-cover classification results of the National Geoinformation Survey. In particular, the remote sensing image has a scale of 1:10,000, measures 8386 × 5772 pixels, and has a pixel resolution of 1 m. The image (Fig. 4) shows the area surrounding Fuxian Lake in Yunnan Province, China. On the basis of the classification system of the National Geoinformation Survey, the land-use types shown in the image are divided into ten classes: residential building, industrial land, other construction lands, cultivated land, garden plots, grassland, forest land, bare land, roads, and water bodies. The sample set is constructed on the basis of multi-scale rules, and a part of it is selected as test data. Table 4 shows the data volume of the sample and test sets for different land-use types.

Fig. 4
figure 4

Remote sensing image of the study area

Table 4 Data volume of the datasets

The indexes for evaluating the experimental results include precision (P) and the kappa index (K), calculated using formulas (12) and (13) (Wang et al. 2012), in which nst refers to the number of samples shared between annotation result s and classification result t; nt denotes the number of samples in classification result t; r represents the number of rows in the confusion matrix; xii is the number of observations on the diagonal of the matrix; xi+ indicates the number of observations in row i; x+i refers to the number of observations in column i; and N stands for the total number of observations.

$$ P\left( {s,t} \right) = \frac{{n_{st} }}{{n_{t} }}, $$
(12)
$$ K = \frac{{N\mathop \sum \nolimits_{i = 1}^{r} x_{ii} - \mathop \sum \nolimits_{i = 1}^{r} \left( {x_{i + } \times x_{ + i} } \right)}}{{N^{2} - \mathop \sum \nolimits_{i = 1}^{r} \left( {x_{i + } \times x_{ + i} } \right)}}. $$
(13)
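Both indexes follow directly from the confusion matrix, as in the sketch below; the orientation (rows as annotation s, columns as classification t) is an assumption, and only the precision depends on it.

```python
import numpy as np

def precision_per_class(conf):
    """Per-class precision P(s, t) = n_st / n_t of formula (12), with rows
    of `conf` as annotation classes s and columns as classification t."""
    conf = np.asarray(conf, dtype=float)
    return np.diag(conf) / conf.sum(axis=0)

def kappa(conf):
    """Kappa index K of formula (13) for an r x r confusion matrix."""
    conf = np.asarray(conf, dtype=float)
    N = conf.sum()                                                # total observations
    diag = np.trace(conf)                                         # sum of x_ii
    chance = float(np.sum(conf.sum(axis=1) * conf.sum(axis=0)))   # sum of x_i+ * x_+i
    return (N * diag - chance) / (N ** 2 - chance)
```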

The experiment uses the Windows Server 2012 R2 operating system with an NVIDIA Tesla K80 GPU for acceleration. Other major hardware elements include an Intel(R) Xeon(R) CPU E5-2630 processor and 128 GB of memory. The deep CNN model is developed on the TensorFlow open-source framework. The main software versions are CUDA 8.0, cuDNN 6.0, and tensorflow_gpu_1.2.0.

Setting of the Experiment

In this experiment, the COCNN method is compared with a method based solely on CNN. In the CNN-only method, the land-use classification of images is based solely on the deep CNN model: a window of 30 × 30 pixels is used to extract spatial information for land-use-type classification, and the entire image is scanned by moving the window, as sketched below.
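The paper specifies only the window size, so the non-overlapping stride in the following sketch is an assumption.

```python
import numpy as np

def sliding_windows(image, size=30, stride=30):
    """Yield (row, col, patch) windows for the pixel-wise CNN baseline."""
    H, W = image.shape[:2]
    for r in range(0, H - size + 1, stride):
        for c in range(0, W - size + 1, stride):
            yield r, c, image[r:r + size, c:c + size]
```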

The experimental results are compared from two perspectives. On the one hand, under the condition that the structure and parameters of the deep CNN model remain unchanged, the difference of classification accuracy between the COCNN method and the method based solely on CNN is compared. On the other hand, on the basis of the joint object-oriented method, the structure and parameters of the deep CNN model are adjusted, and the classification effects under different structural and parameter conditions are compared.

Different framework selections and parameter settings affect the classification accuracy; Table 5 lists the baseline parameter configuration for COCNN. Comparative experiments are conducted by changing some parameters while keeping the remaining settings unchanged. Model training uses a batch of 100 training samples per iteration, and 1500 training iterations are performed. The network state is tested every 100 iterations.

Table 5 Basic parameter settings for COCNN

Results and Analysis

Influence of Classification Strategies on Classification Results

The classification results of the images are obtained with the COCNN method, and their accuracy is evaluated. The land-use map of the area surrounding Fuxian Lake (Fig. 5) contains ten land-use classes with class-specific confusion matrices (Table 6). The P and K coefficients of the land-use classes are 96.2% and 0.96, respectively. The water body type has the highest producer’s accuracy (99.5%), whereas the industrial land type has a relatively low value of 91.2%.

Fig. 5
figure 5

Detailed land-use map of the area surrounding Fuxian Lake. a-1, b-1, c-1, and d-1 Classification of regions A, B, C, and D based on COCNN, respectively. a-2, b-2, c-2, and d-2 Classification of regions A, B, C, and D based solely on CNN, respectively

Table 6 Confusion matrix of the land-use-type classification based on COCNN

When based solely on the CNN method, the P and K of the classification results are 87.22% and 0.86, respectively; thus, the classification accuracy (Table 7) of the CNN method is lower than that of the COCNN method. Apart from a slight decrease in the accuracy for water bodies, the classification accuracy of the other land-use types is markedly reduced. Inspection of the land-use map shows that the COCNN classification results are relatively complete, with few faults in large patches such as vegetation and construction land. In addition, COCNN helps solve the problems of incomplete extraction of linear features and of small crop plots. The accuracy of land-use information extracted with the CNN-only method is somewhat lower, and several patches are assigned inaccurate classification types. Moreover, the CNN-only classification results are relatively fragmented, so the classified plots show evident spatial heterogeneity. Most errors occur among land-use types with the same natural attributes; for example, when the residential building type is misclassified, the incorrectly selected type is often other construction land. These results show that the COCNN method not only fully uses the spectral information of remote sensing images but also considers the spatial distribution characteristics and correlations of geographic objects. On the one hand, noise generated by heterogeneity and spectral differences in pixel-based classification is effectively avoided. On the other hand, a multi-feature sample set constructed from correlation rules assists the deep learning of the CNN.

Table 7 Confusion matrix of the land-use-type classification based solely on CNN

Influence of Deep CNN Structure on Classification Results

  1. Influence of different convolution kernel parameters on network performance.

The convolution kernel is the most sensitive element of the CNN and is responsible for directly extracting the lowest-level features from the original input image. The effects of the size and number of convolution kernels on the recognition accuracy of COCNN (Fig. 6) show that model performance increases as the convolution kernel size decreases; the verification accuracy is highest for a 3 × 3 kernel. When the convolution kernel is large, information from coarse-grained features (e.g., edge features) becomes mixed, and excessive detail is lost from the information passed to the convolution kernels of the higher layers; this matters because the distinction between similar land-use types often depends on the description of local textures.

Fig. 6
figure 6

Effects of different sizes (a) and numbers (b) of convolution kernels on model performance

With the convolution kernel size fixed at 3 × 3, the experiment verifies that models with fewer convolution kernels per layer and more layers classify more effectively than models with more kernels per layer and fewer layers: the seven-layer model with 8, 16, 32, 64, 128, 256, and 256 convolution kernels is more accurate than the four-layer model with 64, 128, 256, and 512 convolution kernels. Because the dataset covers a relatively small number of classes and samples, the CNN requires a sufficient number of low-level features to fit the data complexity caused by factors such as the variety of land types. Therefore, increasing the depth of the CNN improves network performance.

In a CNN, the feature maps of each layer are different combinations of the feature maps extracted by the previous layer; thus, the output data of one layer are the input data of the next. To verify that no redundancy exists in the convolution results of each layer, all convolution kernels are visualized (Fig. 7). No repetitive or random convolution kernels are found in the visualization results; thus, the convolution kernels are effectively trained in all cases.

Fig. 7
figure 7

Visualization result of convolution kernel

  2. Use of regularization and dropout to suppress overfitting.

In COCNN, regularization and dropout suppress overfitting during model training, and the effectiveness of the two techniques was tested separately. The model without L2 regularization displays overfitting after about 900 training iterations. The L2 regularization term has no effect on the updating of the bias b in each layer of the model but affects the updating of the weight w (Fig. 8): when w is positive, the updated w decreases, and when w is negative, the updated w increases toward zero. The effect of L2 regularization is thus to pull w closer to 0, which is equivalent to reducing the network weights and lowering the complexity of the network, thereby avoiding overfitting.

Fig. 8
figure 8

Effects of L2 regularization on model performance

The effects of the dropout probability on model performance show that the classification accuracy peaks when the dropout probability is 0.50 (Fig. 9). When the dropout value is large and the training data are insufficient, excessive feature information extracted by the model is retained, resulting in overfitting. As the dropout value decreases, model performance also decreases; this outcome is a consequence of deleting too many neurons, which leaves the subnetworks insufficiently trained and reduces the capability of the model to fit the data, so the model has difficulty establishing mapping relationships between the image data and the land-use types. Comparing the effects of regularization and dropout shows that relying on either technique alone provides limited protection against overfitting; improved effectiveness is achieved when regularization and dropout are combined.

Fig. 9
figure 9

Effects of different dropout probabilities on model performance

Conclusions

To realize the accurate recognition of land-use types from remote sensing images, the COCNN method is proposed. COCNN constructs a set of typical feature samples obtained by a general rule set on the basis of the multi-scale segmentation of images. Sample training and feature extraction are then performed by the deep learning method, and the learned sample characteristics are finally applied to the segmentation results to complete the land-use classification of the remote sensing images. For the classification statistics, the P and K coefficients are 96.2% and 0.96, respectively. Regarding the deep CNN structure, increasing the depth of the CNN improves network performance. In addition, relying on regularization or dropout alone provides limited protection against overfitting, whereas combining the two improves effectiveness. The experimental results show that the COCNN method reasonably and efficiently combines the object-oriented and deep learning approaches and can comprehensively utilize the spectral, geometric, and texture information of image objects. The method not only fully uses the correlation between neighboring pixels, yielding classification results with low spatial heterogeneity, but also has a strong anti-noise capability that effectively reduces pixel spectral confusion. However, the present work remains imperfect in the construction of the general rule set and in the design of the deep learning structure, and these aspects require further improvement.