Introduction

As one of the world’s largest renewable natural resources, mixed forest-grassland resources directly affect the development of agriculture, forestry and other industries (Langley et al. 2001; Ma et al. 2010). According to Scurlock et al. (2002) and Dong et al. (2017b), mixed forest-grassland ecosystems cover approximately 3.2 billion hectares, about 40% of the global land area. Using remote sensing to classify land cover in a mixed forest-grassland ecosystem can provide detailed grassland and woodland information over large areas (Fang et al. 2010).

In the past decade, UAV image analysis has been widely applied to identification and classification tasks in forest and grassland resource surveys (Chen 2019). Mounting high-resolution cameras (Huseyin et al. 2019), LiDAR (Yang et al. 2020), thermal infrared sensors (Crusiol et al. 2019) and hyperspectral cameras (Clark et al. 2018) on UAVs has increasingly become a practical way to collect field information for land classification. Christian and Christiane (2014) compared forest point clouds derived from UAV images and from airborne LiDAR and concluded that the UAV image data captured more information. Zhang et al. (2020a) used aerial hyperspectral images to classify tree species on forest farms in China and obtained an accuracy of 93.1%. However, hyperspectral imaging may be of limited use over grassland with low color contrast because it generates a large amount of redundant data (Grigorieva et al. 2020). In addition, wind has a considerable influence on LiDAR acquisition, producing noise and ghost points around the detected targets (Yun et al. 2016; Xu et al. 2018). High-resolution cameras on UAVs are therefore among the preferred instruments for classifying land cover in a mixed forest-grassland ecosystem.

Traditional segmentation methods for remote sensing imagery include pixel-based segmentation (Bhadoria et al. 2020), object-based analysis (José et al. 2013), and random forest segmentation (Fei et al. 2015). Pixel-based segmentation considers only the color information of individual pixels and ignores the semantic information of the classified objects, so it performs poorly in multi-object classification (Zhang et al. 2020c). Numerous researchers have studied forestry classification algorithms that combine object-based analysis, random forests and manual feature extraction. Ke et al. (2010) applied an object-based approach to evaluate the synergy between high spatial resolution multispectral imagery and low-posting-density LiDAR data for forest species classification. Random forest segmentation was applied to classify tree species in satellite images of temperate forests in Austria with an overall accuracy of 82% (Immitzer et al. 2012). In practice, these approaches require extensive manual labeling to achieve accurate feature extraction, which is labor-intensive (Wolf and Bochum 2013; Dalponte et al. 2015).
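As an illustration of such pipelines, pixel-wise classification with a random forest over hand-crafted features can be sketched as follows. This is a minimal sketch: the feature set and class labels are illustrative assumptions, not those used in the cited studies.

```python
# Minimal sketch of pixel-based classification with a random forest.
# Features (RGB values plus simple local statistics) and labels are
# illustrative placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Suppose each labeled pixel is described by hand-crafted features:
# [R, G, B, local_mean, local_std]
X_train = rng.random((1000, 5))           # placeholder feature vectors
y_train = rng.integers(0, 3, size=1000)   # placeholder labels: 0=grass, 1=tree, 2=road

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

X_new = rng.random((4, 5))                # features of unlabeled pixels
print(clf.predict(X_new))                 # predicted class per pixel
```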

With the development of deep learning and convolutional neural networks (CNNs) (Zhang et al. 2020b; Lou et al. 2021), numerous semantic segmentation algorithms are now available for automatic classification (Fu and Qu 2018; Braga et al. 2020). U-net (Ronneberger et al. 2015) is a semantic segmentation model based on a fully convolutional network and was initially used for biomedical image segmentation (Dong et al. 2017a; Rad et al. 2020). Compared with other deep learning networks such as fully convolutional networks (FCN) (Long et al. 2015) and DenseNet (Huang et al. 2017), U-net has a clear advantage in overall accuracy when trained on small data sets (Liu et al. 2020). In this context, U-net was used to extract complex terrain features to classify hills and ridges of the Loess Plateau in China (Li et al. 2020a). Owing to its reliability and excellent segmentation quality, researchers have also applied U-net to hyperspectral satellite images to map the distribution of trees in the Sahara and Sahel regions of West Africa (Brandt et al. 2020).

Numerous studies have indicated that defects occur during U-net’s feature extraction process (Freudenberg et al. 2019; Cao and Zhang 2020; Li et al. 2020b). Because U-net’s down-sampling relies on a stack of Conv-BN-ReLU (CBR) modules, extraction scales can vary at different depths, amplifying classification errors (Cicek et al. 2016). The following improvements have been made to correct these defects:

(1) Replace CBR modules with the residual convolution unit (RCU) of ResNet (He et al. 2016). The residual shortcut directly connects the input and output of each unit and helps prevent the loss of encoded information across layers (Zahangir et al. 2017). For example, a ResNet-based building-extraction algorithm for remote sensing imagery demonstrated outstanding performance in an urban setting (Xu et al. 2018).

(2) Add a loop convolution unit (LCU) to the down-sampling path of feature extraction. LU-net is the combination of U-net and LCU, in which the number of convolutions at each layer is increased while the convolutional dimension is reduced. Alom et al. (2018) proposed a recurrent convolutional neural network based on the U-net structure that exhibited superior performance on skin cancer segmentation tasks. Minimal sketches of both units are given below.
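The following Keras sketch illustrates the three building blocks discussed above: the plain CBR module, a residual convolution unit (RCU), and a loop convolution unit (LCU) with a loop step of 3. It is a minimal interpretation of the units as described in the text, not the authors’ released code; details such as the placement of the shortcut projection are assumptions.

```python
# Minimal Keras sketches of the CBR, RCU and LCU building blocks
# (Keras functional API). These follow the textual descriptions;
# the exact arrangements in the published models may differ.
from keras.layers import Activation, Add, BatchNormalization, Conv2D

def cbr(x, filters):
    """Conv-BN-ReLU (CBR): the stock U-net encoder module."""
    x = Conv2D(filters, 3, padding='same')(x)
    x = BatchNormalization()(x)
    return Activation('relu')(x)

def rcu(x, filters):
    """Residual convolution unit: two convolutions plus a shortcut."""
    shortcut = Conv2D(filters, 1, padding='same')(x)  # match channel count
    y = cbr(x, filters)
    y = Conv2D(filters, 3, padding='same')(y)
    y = BatchNormalization()(y)
    y = Add()([y, shortcut])                          # residual connection
    return Activation('relu')(y)

def lcu(x, filters, steps=3):
    """Loop convolution unit: one shared convolution applied recurrently,
    each step fed with the sum of the unit input and the previous output
    (after Alom et al. 2018)."""
    x = Conv2D(filters, 1, padding='same')(x)   # project input channels
    conv = Conv2D(filters, 3, padding='same')   # shared recurrent convolution
    y = Activation('relu')(BatchNormalization()(conv(x)))
    for _ in range(steps - 1):
        y = Activation('relu')(BatchNormalization()(conv(Add()([x, y]))))
    return y
```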

However, while the above methods perform well in binary classification for medical and urban-building domains, classifying land cover in a complex forest and grassland ecosystem remains a major challenge. This study therefore proposes an improved U-net model for accurate land cover classification in a mixed forest-grassland ecosystem. The objectives are: (1) to propose an LResU-net model for land cover classification, built on the U-net framework and combining RCU and LCU; (2) to evaluate the classification accuracy of U-net, ResU-net, LU-net and LResU-net in a mixed forest-grassland ecosystem; and (3) to calculate the actual areas of various land covers using the best model of this study.

Materials and methods

Study area

The study area (Fig. 1) is located near the Lv Shui Animal Breeding Farm of Horqin, Xing’an League, Inner Mongolia Autonomous Region (46°42ʹ51″ N, 120°30ʹ1″ E) at an altitude of 230–300 m. The local climate is mid-temperate, semi-arid continental monsoon, with an average annual temperature of 13 °C, annual rainfall of 420 mm, and relative humidity of 18%. The area consists of forests, grasslands and cultivated lands, and contains a variety of land covers such as natural grasslands, trees, roads, rivers, and buildings.

Fig. 1

a, b Study area: natural grasslands near Xing’an League, Inner Mongolia Autonomous Region; c orthophoto synthesized from UAV images; d UAV flight path generated from the satellite planning route

Field survey and acquisition of UAV image data

The field investigation, conducted September 23–28, 2020 near the Lv Shui Animal Breeding Farm, involved determining the land cover classes and collecting UAV image data. The river bed has been eroded over many years, so some land cover is not identifiable. Aerial images were taken with a DJI Mavic 2 Pro drone equipped with a Hasselblad one-inch 20-megapixel CMOS sensor (Table 1). The flight area was 1210 m × 600 m, flown at an altitude of 260 m with a forward overlap of 85% and a side overlap of 80%. A total of 798 photos were taken.

Table 1 DJI Mavic 2 Pro UAV flight parameters

Data preprocessing

The usual way to obtain an orthophoto map is three-dimensional (3D) reconstruction over the entire study area. The 3D reconstruction proceeds in three steps (Fig. 2). First, the structure-from-motion (SfM) algorithm detects and matches feature points to obtain a sparse point cloud. Second, a dense point cloud is derived from the sparse point cloud using the multi-view stereo (MVS) algorithm. Third, 3D imagery (Fig. 3) is generated by surface reconstruction and texture mapping over the dense point cloud.

Fig. 2

3D reconstruction workflow based on OpenMVG library and OpenMVS library

Fig. 3

An orthophoto derived from the 3D imagery using the Context Capture platform; details of roads, rivers and buildings are displayed in high resolution 3D imagery

During the reconstruction process, the open multiple view geometry (OpenMVG) library was used for the SfM step to obtain the sparse point cloud. The subsequent procedures, including MVS, surface reconstruction and texture mapping, were implemented with the open multiple view stereo (OpenMVS) library. Finally, the 3D reconstructed model was compressed to an orthophoto on the Context Capture platform. The complete orthophoto, with a resolution of 20,167 × 13,534 pixels, was used to establish a high resolution data set for land cover classification.
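For orientation, a typical OpenMVG/OpenMVS pipeline covering these steps is sketched below. This is a hedged outline rather than the exact invocation used in this study: the binary names follow the public distributions of the two libraries around 2020, and flags, paths and defaults vary across versions.

```python
# Hedged sketch of the SfM -> MVS pipeline described above, driven from
# Python. Treat binary names and flags as version-dependent.
import subprocess

IMAGES, OUT = "uav_images/", "recon/"

steps = [
    # OpenMVG: sparse reconstruction (SfM)
    ["openMVG_main_SfMInit_ImageListing", "-i", IMAGES, "-o", OUT + "matches"],
    ["openMVG_main_ComputeFeatures", "-i", OUT + "matches/sfm_data.json", "-o", OUT + "matches"],
    ["openMVG_main_ComputeMatches", "-i", OUT + "matches/sfm_data.json", "-o", OUT + "matches"],
    ["openMVG_main_IncrementalSfM", "-i", OUT + "matches/sfm_data.json",
     "-m", OUT + "matches", "-o", OUT + "sfm"],
    # Bridge the OpenMVG output into the OpenMVS scene format
    ["openMVG_main_openMVG2openMVS", "-i", OUT + "sfm/sfm_data.bin", "-o", OUT + "scene.mvs"],
    # OpenMVS: dense point cloud, mesh and texture
    ["DensifyPointCloud", OUT + "scene.mvs"],
    ["ReconstructMesh", OUT + "scene_dense.mvs"],
    ["TextureMesh", OUT + "scene_dense_mesh.mvs"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)  # stop if any stage fails
```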

Production of data sets

To prevent data loss, the data sets were produced by overlapped cropping of the entire orthophoto. The orthophoto, with a resolution of 20,167 × 13,534 pixels, was first reshaped into an original data set of 1024 × 1024 pixel images. The data sets were then divided into training, validation and test sets at a ratio of 6:2:2 to keep the subsets mutually independent and the model robust. To expand the data sets and reduce the demands on the graphics processing unit (GPU), 128 × 128 pixel images were cropped from the original data sets with a step of 64, as sketched below. The training, validation and test sets contained 13,145, 4598 and 4596 images, respectively (Table 2).

Table 2 Land cover classification classes and related data sets
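The overlapped cropping amounts to a sliding window over each source image. A minimal sketch, assuming images are held as numpy arrays in height × width × channel order:

```python
# Sliding-window cropping of a large image into overlapping tiles,
# as described above (tile size 128, step 64, i.e., 50% overlap).
import numpy as np

def crop_tiles(image, tile=128, step=64):
    tiles = []
    h, w = image.shape[:2]
    for top in range(0, h - tile + 1, step):
        for left in range(0, w - tile + 1, step):
            tiles.append(image[top:top + tile, left:left + tile])
    return np.stack(tiles)

img = np.zeros((1024, 1024, 3), dtype=np.uint8)  # one 1024 x 1024 source image
print(crop_tiles(img).shape)                     # (225, 128, 128, 3)
```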

In the field investigation, some complex land covers were difficult to assign to a category, such as the mixed landscape of swamps and the eroded lands around rivers. However, since the image grid is very large, it was difficult to label whole images without gaps; therefore, unclassified areas were defined as one of the label categories. Based on visual interpretation, ten categories of land cover were recognized and marked with different colors as classification objectives (Fig. 4). The orthophoto and the labeled image were then divided into samples and objectives for the training, validation and test sets using Photoshop software.

Fig. 4

a Map of the drone's orthophoto; b Map of ten different category labels

LResU-net network

The backbone of LResU-net (Fig. 5) combines the sampling characteristics of ResU-net and LU-net. On the one hand, because the encoder layers of U-net are relatively shallow, LCU was added to the down-sampling path of feature extraction. Compared with U-net, LCU increases model depth and improves the capture of sample details during feature extraction. On the other hand, the closed-loop feedback mapping of RCU effectively avoids gradient overflow and vanishing gradients, i.e., when the network’s loss reaches its lowest value, ResU-net ensures that the next layer of the network still operates in an optimal state.

Fig. 5

a Backbone of the U-net feature extraction structure: a common CBR module (Conv → BN → ReLU); b backbone of the ResU-net feature extraction structure with RCU added; c backbone of the LU-net feature extraction structure with LCU added; d backbone of the LResU-net feature extraction structure with both RCU and LCU added

At the same time, the number of convolution kernels per layer was reduced from 64 → 128 → 256 → 512 → 1024 to 32 → 64 → 128 → 256 → 512, halving the overall kernel count relative to U-net throughout training. According to previous studies (Liang and Hu 2015; Alom et al. 2018), a loop step of 3 in the LCU gives the best trade-off between feature extraction and training time. The entire LResU-net structure is shown in Fig. 6, and a schematic sketch of the encoder path follows the figure.

Fig. 6

LResU-net model structure with the number of convolution kernels halved
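Combining the units sketched earlier, the encoder path with halved kernel counts can be outlined as follows. This reuses the rcu() and lcu() sketches from the earlier block; the pairing of one RCU followed by one LCU per layer is an assumption based on the description of Fig. 5d, and the decoder, skip connections and softmax head are omitted.

```python
# Schematic LResU-net encoder with halved kernel counts
# (32 -> 64 -> 128 -> 256 -> 512), reusing rcu() and lcu() defined above.
from keras.layers import Input, MaxPooling2D
from keras.models import Model

inputs = Input(shape=(128, 128, 3))
x, skips = inputs, []
for filters in (32, 64, 128, 256):
    x = lcu(rcu(x, filters), filters, steps=3)  # RCU followed by LCU
    skips.append(x)                             # kept for decoder skip links
    x = MaxPooling2D(2)(x)                      # halve spatial resolution
x = lcu(rcu(x, 512), 512, steps=3)              # 512-kernel bottleneck
encoder = Model(inputs, x)
encoder.summary()
```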

Comparison with U-net

LResU-net differs from U-net in three ways:

1) The original feature extraction backbone, CBR, was abandoned and replaced with the RCU of ResNet.

2) LCU was added to the encoding-decoding feature extraction structure. Meanwhile, the number of convolutions at each layer can be adjusted according to the difficulty of feature extraction.

3) The total number of convolution kernels was halved during training.

LResU-net has three advantages over U-net:

1) The modified network solves the problems of gradient overflow and gradient disappearance.

2) The misjudgment rate in segmenting images with low color contrast is reduced by improving the accuracy of detailed feature extraction.

3) Training time is shortened by optimizing network parameters and removing redundant convolution kernels.

Loss function and accuracy evaluation index

The loss function estimates the inconsistency between the model’s classification and the reference data during network training. Let the pixel set be \(i \in \left\{ {1,2, \ldots ,N} \right\}\) and the category set \(c \in \left\{ {1,2, \ldots ,M} \right\}\), with reference labels \(y_{c}^{i}\). After feature extraction, the predicted probabilities of the categories for each pixel form an M-dimensional tensor \(p_{c}^{i} \in \left[ {0,1} \right]\), giving the multi-category cross-entropy loss:

$$CELoss = - \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{c = 1}^{M} y_{c}^{i} \log \left( {p_{c}^{i} } \right)$$
(1)
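As a numerical check of Eq. 1, the loss can be computed directly; the sketch below uses made-up labels and probabilities for two pixels and three categories.

```python
# Direct computation of the multi-category cross-entropy of Eq. 1
# for a toy case of N = 2 pixels and M = 3 categories.
import numpy as np

y = np.array([[1, 0, 0],          # one-hot reference labels, shape (N, M)
              [0, 0, 1]])
p = np.array([[0.8, 0.1, 0.1],    # predicted probabilities, shape (N, M)
              [0.2, 0.2, 0.6]])

ce_loss = -np.mean(np.sum(y * np.log(p), axis=1))
print(round(ce_loss, 4))          # ~0.367
```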

In this study, producer’s and user’s accuracy (Story and Congalton 1986; Olofsson et al. 2013; Shao et al. 2019) and the kappa coefficient were used to evaluate classification accuracy.
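All three indices derive from the confusion matrix of reference versus predicted pixel labels. A minimal sketch using scikit-learn, with made-up label vectors for illustration:

```python
# Producer's accuracy (per-class recall), user's accuracy (per-class
# precision) and the kappa coefficient from a confusion matrix.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

reference = np.array([0, 0, 1, 1, 2, 2, 2, 1])   # made-up reference labels
predicted = np.array([0, 1, 1, 1, 2, 2, 1, 1])   # made-up predictions

cm = confusion_matrix(reference, predicted)      # rows: reference, cols: predicted
producers = np.diag(cm) / cm.sum(axis=1)         # producer's accuracy per class
users = np.diag(cm) / cm.sum(axis=0)             # user's accuracy per class
kappa = cohen_kappa_score(reference, predicted)

print(producers, users, round(kappa, 3))
```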

Network training

To facilitate a performance comparison among networks, four models were trained and evaluated: U-net, ResU-net, LU-net, and LResU-net. The learning rate was set to 1 × 10⁻⁴ and the batch size to 32. A total of 60 epochs (18,000 steps) allowed model accuracy to reach its maximum during training. The software platform used tensorflow-gpu 1.15 and keras 2.3.1 on a Linux operating system, and all code was written in Python; the corresponding Keras calls are sketched after Table 3. The hardware platform (Table 3) comprised an Intel Xeon E5-2650 processor, an Nvidia GTX-1070 GPU, a 2 TB hard disk, and four 8 GB RAM modules.

Table 3 Computer hardware attributes
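Under Keras 2.3.1, these settings correspond to compile-and-fit calls of the following form. The model constructor and the data arrays are hypothetical placeholders.

```python
# Training-configuration sketch matching the settings above
# (keras 2.3.1 / tensorflow-gpu 1.15). `build_lresunet`, x_train/y_train
# and x_val/y_val are hypothetical placeholders.
from keras.optimizers import Adam

model = build_lresunet(input_shape=(128, 128, 3), n_classes=10)
model.compile(optimizer=Adam(lr=1e-4),            # learning rate 1e-4
              loss='categorical_crossentropy',    # Eq. 1
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=32,                          # batch size 32
          epochs=60,                              # 60 epochs (~18,000 steps)
          validation_data=(x_val, y_val))
```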

Large-scale remote sensing imagery prediction and real area calculation

Because inputting the entire orthophoto directly into the model for prediction could cause memory overflow, the image was cropped into a group of 128 × 128 slices. After predicting the slices, a composite image was spliced together in the order in which the slices were cropped. However, this clip-predict-splice approach can produce obvious segmentation edges. An alternative, clipping overlapping images and ignoring their edges (Wang et al. 2020), mitigates this: the area ratio between the ignored edge and the stitched image determines the overlap among the image slices. For the real area calculation, the ratio of the UAV’s actual flight area (70.54 ha) to the total pixel area was used to convert the pixel counts of the ten land covers into ground areas.
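The pixel-to-area conversion is a simple proportion; the sketch below uses an illustrative pixel count chosen to reproduce the grassland figure reported in Table 7.

```python
# Converting predicted pixel counts to ground area: each land cover's
# area is its pixel share of the orthophoto times the 70.54 ha flight
# area. The pixel count below is an illustrative value.
TOTAL_PIXELS = 20167 * 13534     # full orthophoto resolution
FLIGHT_AREA_HA = 70.54           # real UAV flight area

grass_pixels = 105_600_000       # e.g., pixels predicted as grassland
grass_area = grass_pixels / TOTAL_PIXELS * FLIGHT_AREA_HA
print(f"{grass_area:.1f} ha")    # ~27.3 ha, cf. Table 7
```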

Results

Accuracy of land cover classification using different networks

The reference data were derived from the pixel areas of each land cover in the labeled orthophoto, and the classified data from the pixel areas of each land cover predicted by the different models.

Kappa coefficient and overall accuracy

The first step is an accuracy assessment of the different models for image classification: U-net, ResU-net, LU-net, and LResU-net. Tables 4 and 5, respectively, show the kappa coefficients and classification accuracies with and without unclassified areas over the whole data sets. The unclassified area strongly affects the accuracy assessment because unclassified pixels were predicted as other land covers.

Table 4 Classification evaluation coefficients of four models with unclassified areas
Table 5 Classification evaluation coefficients of four models without unclassified area

Based on the data in Table 4, the increases in kappa coefficient and overall accuracy produced by ResU-net and LU-net indicate that both RCU and LCU played a positive role in improving U-net. At the same time, the accuracy of LResU-net was markedly higher, reflecting the positive effect of combining RCU and LCU.

When the unclassified areas are excluded, the trends in Tables 4 and 5 are consistent, i.e., the kappa coefficient and overall accuracy improved to different extents in the modified models with RCU and LCU. As expected, LResU-net achieved the best accuracy on the test set (kappa coefficient = 0.86, overall accuracy = 93.7%), which is attributed to the combined advances of ResU-net and LU-net.

Producer’s and user’s accuracy derived from LResU-net

The producer’s and user’s accuracy obtained from the LResU-net model are presented in Table 6. For most categories, both are highly favorable. For example, trees have the highest values (producer’s = 0.98, user’s = 0.93), followed by harvested crops (producer’s = 0.94, user’s = 0.91). However, there are obvious differences between producer’s and user’s accuracy for some categories, including harvested grassland (producer’s = 0.43, user’s = 0.99) and river (producer’s = 0.85, user’s = 0.96). These results arise because some undefined areas were classified into these classes.

Table 6 Population error matrix involving producer’s and user’s accuracy

Land cover classification results from different networks

Figure 7 shows the land cover classification maps produced by the four network models. Compared with U-net (Fig. 7a), the noise and misjudgment rate of ResU-net (Fig. 7b) were slightly reduced, and its building classification was significantly stronger. Similarly, LU-net (Fig. 7c) was superior to U-net in overall classification; despite some noise in the harvested grassland, its classification of roads, rivers and buildings was better than U-net’s. LResU-net (Fig. 7d) exceeded the others in classification performance, especially in suppressing noise in grassland, harvested grassland, trees, and river.

Fig. 7

a U-net model prediction result; b ResU-net model prediction result; c LU-net model prediction result; d LResU-net model prediction result

The real area of various land covers

The areas of the various land covers are presented in Table 7. Setting aside the unclassified area, the differences between the classified and reference data for the various land covers were insignificant. Excluding the unclassified area, grassland occupied the largest proportion (38.7%, 27.3 ha), followed by harvested grassland (28.7%, 20.3 ha); the entire grassland area accounted for 67.4% of the study area. The smallest area was buildings (0.2%, 4.8 ha). The proportion of forest area was 8.0%, a middling share among the land covers.

Table 7 Areas of each land cover in the reference data and the classified data

Discussion

Effect of unclassified areas on classification results

Unclassified areas change the attribution of land cover. According to Fig. 8 and Table 7, about 50% of the unclassified area was predicted consistently with its reference label, indicating that some unclassified areas have distinctive features and attributes, for example, swamp. Among the remaining unclassified areas, the regions containing fissures were predicted as plausible land covers, while the others were predicted as grassland, harvested grassland or forest. This is attributed to features that are identical or similar between the unclassified areas and these classes from LResU-net’s perspective.

Fig. 8

a Orthophoto map fused with label image; b orthophoto map fused with LResU-net model prediction image

Performance of RCU and LCU on the model training process

As shown in Table 8, the total parameters and training time of ResU-net were greater than those of U-net, a result of adding RCU. This is consistent with previous studies on improvements of U-net (Alom et al. 2018; Rad et al. 2020). In contrast, as the convolutional dimension decreased, LU-net markedly reduced parameters and training time. With RCU and LCU combined, the parameter count and per-epoch training time of LResU-net (25.11 million, 282 s) were somewhat lower than those of U-net (31.05 million, 358 s).

Table 8 Four different model parameters and training time in each epoch

The accuracy and loss curves (Fig. 9) show the overall error between the predicted and reference data during training. Near the 55th epoch, accuracy and loss become steady and further epochs are hardly needed; thus, all training was stopped at the 60th epoch. In addition, ResU-net converged fastest among the networks, which is attributed to the reduced encoding loss across its layers.

Fig. 9

a Training accuracy curves of four different models; b training loss curves of four different models

Comparison of user’s accuracy on train and test sets using different network models

Figure 10a, b shows that the user’s accuracy of ResU-net and LU-net improved similarly for grasslands, crops, and buildings when unclassified areas were excluded. However, their classification of harvested crops and river was less precise, indicating that modifications based on RCU or LCU alone still have shortcomings for classification in mixed forest-grassland ecosystems. At the same time, LResU-net produced the highest user’s accuracy, confirming the positive effect of combining RCU and LCU for land cover classification. Figure 10c, d shows that these findings also hold when unclassified areas are included, and illustrates the influence of the unclassified area on each class under the four models.

Fig. 10

a User’s accuracy without unclassified area on training set using four network models; b user’s accuracy without unclassified area on test set using four network models; c user’s accuracy with unclassified area on training set using four network models; d user’s accuracy with unclassified area on test set using four network models

Effect of background area on overall accuracy

Because they cover only small areas, background effects have been ignored in previous studies (Cao and Zhang 2020; Zhang et al. 2020c). In this study, the impact of background on classification can be analyzed through producer’s and user’s accuracy (Table 6). The accuracy of the background area (producer’s = 0.90, user’s = 0.94) was higher than that of other classes, which inflated the overall accuracy. However, the results in Fig. 10 show drastically different user’s accuracies for the background area under the U-net and ResU-net models. This difference is linked to the effect of LCU, which deepens the extraction of image feature points and further improves classification accuracy.

Failure classification

Figure 11 illustrates errors in classifying trees under shadow, the most common classification failure in the data sets. Environmental problems during sampling and image mosaicking were the main factors degrading classification performance. Under low light or shadow, the image features of some land covers change and become harder to distinguish from other land covers at the color level. In addition, the orthophoto obtained from the 3D reconstructed model may contain blurred edges (Skabek et al. 2020), which can also degrade classification.

Fig. 11

a UAV remote sensing image; b label image; c results of U-net classification; d results of LResU-net classification

Conclusions

Classifying land cover in a mixed forest-grassland ecosystem is an important use of remote sensing, particularly from unmanned aerial vehicles (UAVs), for managing forests and grasslands. This study presents a new model, LResU-net, for land cover classification based on U-net with residual convolution and loop convolution units. Built on the U-net framework, it adds RCU and LCU to improve the model while reducing the number of parameters and the training time. Compared with the other networks (U-net, ResU-net, LU-net), LResU-net achieved higher kappa coefficients and greater accuracy over the entire data sets. The analysis of producer’s and user’s accuracy indicates that LResU-net performed favorably across the various land covers. The classification results were affected by unclassified areas, and a solution for some unclassified lands was found. The areas of the various land covers, which can be used for landform statistics and analysis, were calculated. However, this study did not include height data; future research should use the 3D reconstructed model to incorporate height data into land cover classification.