Land cover classification in a mixed forest-grassland ecosystem using LResU-net and UAV imagery

Using an unmanned aerial vehicle (UAV) paired with image semantic segmentation to classify land cover within natural vegetation can support the management of forest and grassland resources. Semantic segmentation normally excels in medical and building classification, but its usefulness in mixed forest-grassland ecosystems in semi-arid to semi-humid climates is unknown. This study proposes a new semantic segmentation network, LResU-net, in which a residual convolution unit (RCU) and a loop convolution unit (LCU) are added to the U-net framework to classify land cover in high-resolution UAV imagery. The model enhances classification accuracy by improving gradient mapping via RCU, modifying the size of convolution layers via LCU, and reducing the number of convolution kernels. A group of orthophotos was taken at an altitude of 260 m over a natural forest-grassland ecosystem in Keyouqianqi, Inner Mongolia, China, and the results were compared with those of three other network models (U-net, ResU-net and LU-net). The results show that LResU-net produced both the highest kappa coefficient (0.86) and the highest overall accuracy (93.7%), and its producer's and user's accuracies exceeded 0.85 for most land covers. The pixel-area ratio approach was used to calculate the real areas of 10 different land covers, of which grasslands accounted for 67.3%. Analysis of the effect of RCU and LCU on training performance shows that the time per epoch was shortened from 358 s (U-net) to 282 s (LResU-net). In addition, to handle areas that could not be distinguished, an unclassified category was defined and its impact on classification assessed. LResU-net generated significantly more accurate results than the other three models and is regarded as the most appropriate approach for classifying land cover in mixed forest-grassland ecosystems.


Introduction
As one of the world's largest renewable natural resources, mixed forest-grassland resources directly affect the development of agriculture, forestry and other industries (Langley et al. 2001; Ma et al. 2010). According to Scurlock et al. (2002) and Dong et al. (2017b), mixed forest-grassland ecosystems cover approximately 3.2 billion hectares, accounting for 40% of the total land area. Using remote sensing to classify land cover in a mixed forest-grassland ecosystem can provide detailed grassland and woodland information over large areas (Fang et al. 2010).
In the past decade, UAV image analysis has been widely applied to identification and classification in forest and grassland resource surveys (Chen 2019). It has become increasingly common to mount high-resolution cameras (Huseyin et al. 2019), LiDAR (Yang et al. 2020), thermal infrared sensors (Crusiol et al. 2019) and hyperspectral cameras (Clark et al. 2018) on UAVs to better collect field information for land classification. Christian and Christiane (2014) compared forest point cloud data collected from UAV images and airborne LiDAR and concluded that more information was captured through UAV image data. Zhang et al. (2020a) used aerial hyperspectral images to classify tree species on forest farms in China and obtained an accuracy of 93.1%. However, hyperspectral imaging may be limited on grassland areas with low color contrast, as it creates a large amount of redundant data (Grigorieva et al. 2020). In addition, wind has considerable influence on LiDAR data, leading to noise and ghost points around detected targets (Yun et al. 2016; Xu et al. 2018). Therefore, a UAV high-resolution camera is one of the most suitable instruments for classifying land cover in a mixed forest-grassland ecosystem.
Traditional segmentation methods in remote sensing include pixel-based segmentation (Bhadoria et al. 2020), object-based analysis (José et al. 2013), and random forest segmentation (Fei et al. 2015). Pixel-based segmentation considers only the color information among pixels, ignoring the semantic information of the classified objects, and performs poorly in multi-object classification (Zhang et al. 2020c). Numerous researchers have studied forestry classification algorithms that combine object-based analysis, random forests and manual feature extraction. Ke et al. (2010) applied an object-based approach to evaluate the synergy of high spatial resolution multispectral imagery and low-posting-density LiDAR data for forest species classification. Random forest segmentation was applied to classify tree species using satellite images of temperate forests in Austria, with an overall accuracy of 82% (Immitzer et al. 2012). In practice, these approaches require extensive manual labeling to achieve high-accuracy feature extraction, which is labor-intensive (Wolf and Bochum 2013; Dalponte et al. 2015).
With the development of deep learning and convolutional neural networks (CNN) (Zhang et al. 2020b; Lou et al. 2021), numerous semantic segmentation algorithms now exist for automatic classification (Fu and Qu 2018; Braga et al. 2020). U-net (Ronneberger et al. 2015) is a semantic segmentation model based on a fully convolutional network and was initially used for biomedical image segmentation (Dong et al. 2017a; Rad et al. 2020). In comparison to other deep learning networks such as fully convolutional networks (FCN) (Long et al. 2015) and DenseNet (Huang et al. 2017), U-net has a clear advantage in overall accuracy when trained on small data sets (Liu et al. 2020). In this context, U-net was used to extract complex terrain features to classify hills and ridges of the Loess Plateau in China (Li et al. 2020a). Owing to its reliability and segmentation quality, some researchers have applied U-net to hyperspectral satellite images to map the distribution of trees in the Sahara and Sahel regions of West Africa (Brandt et al. 2020).
Numerous studies have indicated that defects occur during U-net's feature extraction process (Freudenberg et al. 2019; Cao and Zhang 2020; Li et al. 2020b). Since U-net's down-sampling depends on a stack of Conv-BN-ReLU (CBR) modules, extraction scales may vary at different depths, amplifying classification errors (Cicek et al. 2016). To correct these defects, the following improvements have been made: (1) Replace the CBR modules with the RCU of ResNet (He et al. 2016). ResNet directly connects the encoder and decoder within a sample and can prevent the loss of encoded information across layers (Zahangir et al. 2017). For example, a building-extraction algorithm based on ResNet applied to remote imagery demonstrated outstanding performance in an urban setting (Xu et al. 2018).
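The residual idea behind RCU can be sketched in a few lines: two convolutions whose output is added back to the input, so the gradient has a direct path around the convolution stack. The sketch below is illustrative only, assuming a single-channel feature map, 3 × 3 kernels, and a plain NumPy convolution; the actual RCU operates on multi-channel tensors inside the network.

```python
import numpy as np

def conv3x3(x, w):
    """'Same'-padded 3x3 convolution (cross-correlation) on a single-channel 2-D map."""
    h, wd = x.shape
    xp = np.pad(x, 1)                         # zero-pad one pixel on every side
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def rcu(x, w1, w2):
    """Residual convolution unit: two conv layers plus an identity
    shortcut, so gradients can bypass the conv stack entirely."""
    y = np.maximum(conv3x3(x, w1), 0.0)       # first Conv-ReLU
    y = conv3x3(y, w2)                        # second conv
    return np.maximum(y + x, 0.0)             # add the shortcut, then ReLU
```

With zero kernels the conv branch contributes nothing and the unit reduces to `ReLU(x)`, which is exactly the identity-shortcut behavior the residual design guarantees.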
(2) Add an LCU to the down-sampling stage of feature extraction. LU-net is a combination of U-net and LCU, in which the number of convolutions at each layer of the network is increased while the convolutional dimension is reduced. Alom et al. (2018) proposed a recurrent convolutional neural network based on the U-net structure that exhibited superior performance on skin cancer segmentation tasks. However, while these methods perform well in binary classification for medical and urban building domains, classifying land cover in a complex forest and grassland ecosystem remains a major challenge. This study therefore applies an improved U-net model to achieve accurate land cover classification in a mixed forest-grassland ecosystem. The objectives are: (1) to propose an LResU-net model applicable to land cover classification, based on the U-net framework combined with RCU and LCU; (2) to evaluate the classification accuracy of U-net, ResU-net, LU-net and LResU-net in a mixed forest-grassland ecosystem; and (3) to calculate the actual areas of the various land covers using the best model of this study.
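A loop (recurrent) convolution unit can likewise be sketched as the same kernels applied repeatedly, with the previous state fed back in, so effective depth grows without adding new weights. This is a minimal single-channel NumPy illustration with an assumed loop step of 3, not the network's actual implementation.

```python
import numpy as np

def conv3x3(x, w):
    """'Same'-padded 3x3 convolution (cross-correlation) on a single-channel 2-D map."""
    h, wd = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x, dtype=float)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def lcu(x, w_ff, w_rec, steps=3):
    """Loop convolution unit: the same feed-forward kernel is applied
    at every step, while a recurrent kernel feeds the previous state
    back in, deepening the receptive field with shared weights."""
    state = np.maximum(conv3x3(x, w_ff), 0.0)                 # step 0: plain Conv-ReLU
    for _ in range(steps - 1):                                # loop steps 1..steps-1
        state = np.maximum(conv3x3(x, w_ff) + conv3x3(state, w_rec), 0.0)
    return state
```

Because the kernels are shared across steps, raising `steps` deepens the unit without increasing its parameter count, which is why a loop step of 3 can be tuned purely for accuracy and training time.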

Study area
The study area (Fig. 1) is located near the Lv Shui Animal Breeding Farm of Horqin, Xing'an League, Inner Mongolia Autonomous Region, at 46°42ʹ51″ N and 120°30ʹ1″ E, with an altitude of 230-300 m. The local climate is mid-temperate, semi-arid continental monsoon, with an average annual temperature of 13 °C, annual rainfall of 420 mm, and relative humidity of 18%. The area consists of forests, grasslands and cultivated lands, and provides a variety of land covers such as natural grasslands, trees, roads, rivers, and buildings.

Field survey and acquisition of UAV image data
The field investigation, from September 23rd to 28th, 2020, was conducted near the Lv Shui Animal Breeding Farm and involved determining land cover classes and collecting UAV image data. The river bed has been eroded over many years, so some land cover is not identifiable. Aerial images were taken by a DJI Mavic 2 Pro drone equipped with a Hasselblad one-inch 20-megapixel CMOS sensor (Table 1). The flight airspace was 1210 m × 600 m at an altitude of 260 m, with a forward overlap of 85% and a side overlap of 80%. A total of 798 photos were produced.

Data preprocessing
The usual way to obtain an orthophoto map is three-dimensional (3D) reconstruction over the entire study area. The steps of 3D reconstruction are shown in Fig. 2. First, the structure from motion (SfM) algorithm was applied to detect and match feature points and obtain sparse point clouds. Second, based on the sparse point clouds, the dense point cloud was acquired using multi-view stereo (MVS). During the reconstruction process, the open multiple view geometry (OpenMVG) library was used for the SfM algorithm to obtain the sparse point clouds. The subsequent procedures, including MVS, surface reconstruction and texture mapping, were implemented with the open multiple view stereo (OpenMVS) library. Finally, the 3D reconstructed model was compressed to an orthophoto on the Context Capture platform. The complete orthophoto, with a resolution of 20,167 × 13,534 pixels, was used to establish a high-resolution data set for land cover classification.

Production of data sets
To prevent data loss, the data sets were produced by overlapping and cutting the entire orthophoto. The orthophoto, with a resolution of 20,167 × 13,534 pixels, was first reshaped into an original dataset in which the image resolution was modified to 1024 × 1024 pixels. The data sets were then divided into training, validation and test sets at a ratio of 6:2:2 to ensure mutual independence of the data and to maintain the robustness of the model (Table 2). In the field investigation, some complex land covers were difficult to define as a single category, such as the mixed landscape of swamps and the eroded lands around rivers. However, since the image grid is very large, it was difficult to label the whole image without any gaps; therefore, unclassified areas were defined as one of the label categories. Based on visual interpretation, ten different categories of land cover were recognized using different colors as image classification objectives (Fig. 4). The original orthophoto and the labeled image were then divided into samples and objectives for the training, validation and test sets using Photoshop.
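The overlap-and-cut procedure above can be sketched as follows. The 1024-pixel tile size and 6:2:2 ratio follow the text; the stride and random seed are illustrative assumptions, and the last tile on each axis is snapped to the image edge so no border pixels are lost.

```python
import numpy as np

def tile_starts(length, tile, stride):
    """Top-left offsets along one axis; the final tile is snapped to the
    edge so the whole axis is covered (assumes length >= tile)."""
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)
    return starts

def tile_with_overlap(img, tile=1024, stride=896):
    """Cut an orthophoto into overlapping square tiles with no data loss."""
    h, w = img.shape[:2]
    return [img[t:t + tile, l:l + tile]
            for t in tile_starts(h, tile, stride)
            for l in tile_starts(w, tile, stride)]

def split_indices(n, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle tile indices and split them 6:2:2 into mutually
    independent training, validation and test sets."""
    idx = np.random.default_rng(seed).permutation(n)
    a = int(n * ratios[0]); b = a + int(n * ratios[1])
    return idx[:a], idx[a:b], idx[b:]
```

Choosing a stride smaller than the tile size is what produces the overlap; shuffling before the 6:2:2 split keeps spatially adjacent tiles from concentrating in one set.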

LResU-net network
The backbone of LResU-net (Fig. 5) combines the sampling characteristics of ResU-net and LU-net. On the one hand, because the encoder layers of U-net are relatively shallow, LCU was added to the down-sampling stage of feature extraction. Compared with U-net, LCU increases model depth and improves the capture of sample details during feature extraction. On the other hand, the closed-loop feedback mapping of RCU effectively avoids gradient explosion and vanishing, i.e., when the network's loss reaches its lowest value, ResU-net ensures that the next layer of the network still operates in its optimal state. At the same time, the number of convolution kernels was modified from 64 → 128 → 256 → 512 → 1024 to 32 → 64 → 128 → 256 → 512, halving the overall kernel count relative to U-net throughout training. According to previous studies (Liang and Hu 2015; Alom et al. 2018), a loop step of 3 in the LCU gives the optimal feature extraction effect and training time; the entire LResU-net structure is shown in Fig. 6.
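A back-of-the-envelope calculation shows why halving every channel width roughly quarters the convolution-weight budget (each 3 × 3 kernel bank scales with the product of its input and output widths). The per-level layout assumed below (two convolutions per encoder level, biases and the decoder ignored) is a simplification, so the totals will not match the parameter counts reported for the full networks.

```python
def conv_params(widths, in_ch=3, k=3):
    """Approximate conv-weight count for an encoder with two kxk
    convolutions per level at the given channel widths (biases ignored)."""
    total, prev = 0, in_ch
    for w in widths:
        total += k * k * prev * w   # conv entering the level
        total += k * k * w * w      # second conv within the level
        prev = w
    return total

unet = conv_params([64, 128, 256, 512, 1024])   # U-net widths
lres = conv_params([32, 64, 128, 256, 512])     # halved LResU-net widths
ratio = lres / unet                             # roughly 0.25
```

Halving every width quarters each `width × width` product, which is why the reduction in weights outpaces the nominal 50% cut in kernel count (the full networks report a smaller gap because RCU and LCU add parameters of their own).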

Comparison with U-net
LResU-net differs from U-net in three ways:
1) The original feature extraction backbone of CBR modules is abandoned and replaced with the RCU of ResNet.
2) LCU is added to the encoding-decoding feature extraction structure. Meanwhile, the number of convolutions at each layer can be adjusted according to the difficulty of feature extraction.
3) The total number of convolution kernels is halved during training.
LResU-net has three corresponding advantages over U-net:
1) The modified network solves the problems of gradient explosion and gradient vanishing.
2) The misjudgment rate on image segments with low color contrast is reduced through more accurate detailed feature extraction.
3) Training time is shortened by optimizing network parameters and removing redundant convolution kernels.

Loss function and accuracy evaluation index
The loss function estimates the inconsistency between the model's classification and the reference data over the pixel set and the category set during network training. In this study, producer's and user's accuracy (Story and Congalton 1986; Olofsson et al. 2013; Shao et al. 2019), as well as the kappa coefficient, were used to evaluate classification accuracy.
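These accuracy measures can all be derived from a single confusion matrix. The sketch below follows the standard definitions (producer's accuracy = per-class recall over the reference rows, user's accuracy = per-class precision over the predicted columns, kappa from observed versus chance agreement); it is a generic illustration, not code from the study.

```python
import numpy as np

def accuracy_metrics(cm):
    """Overall accuracy, per-class producer's/user's accuracy, and the
    kappa coefficient from a confusion matrix
    (rows: reference classes, columns: predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    overall = np.trace(cm) / n
    producers = np.diag(cm) / cm.sum(axis=1)   # recall w.r.t. reference data
    users = np.diag(cm) / cm.sum(axis=0)       # precision w.r.t. predictions
    # chance agreement expected from the row/column marginals
    expected = (cm.sum(axis=1) * cm.sum(axis=0)).sum() / n**2
    kappa = (overall - expected) / (1.0 - expected)
    return overall, producers, users, kappa
```

For a two-class matrix [[90, 10], [20, 80]] this gives an overall accuracy of 0.85 and a kappa of 0.70, illustrating how kappa discounts the agreement expected by chance.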

Network training
To facilitate a performance comparison among the networks, four models were trained and used for prediction: U-net, ResU-net, LU-net, and LResU-net. The learning rate was set to 1 × 10⁻⁴ and the batch size to 32. A total of 60 epochs with 18,000 steps allowed model accuracy to reach its maximum during training. For the software platform, tensorflow-gpu 1.15 and keras 2.3.1 on a Linux operating system were used as the learning framework, and all code was written in Python. For the hardware platform (Table 3), an Intel Xeon E5-2650 processor and an Nvidia GPU were used.

Large-scale remote sensing imagery prediction and real area calculation
Given that memory overflow may occur if the entire orthophoto is input directly into the model for prediction, all images were cropped into a group of 128 × 128 slices. After prediction, a composite image was spliced from these slices in the order in which they were cropped. However, this clipping-prediction-splicing approach can produce obvious segmentation edges. An alternative method of clipping overlapping images and ignoring the edges may mitigate this, i.e., the area ratio between the ignored edge image and the stitched image is used to calculate the overlapping area among the image slices. For real area calculation, the ratio of the UAV's real flight area to the pixel area was used to calculate the real area of each land cover.
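The pixel-area ratio itself is a one-line computation. The sketch below is illustrative; the 72.6 ha in the usage note is simply the nominal 1210 m × 600 m flight rectangle from the survey description and need not equal the effective study area behind the figures reported in Table 7.

```python
def real_area_ha(class_pixels, total_pixels, surveyed_area_ha):
    """Pixel-area ratio: a class's real area is its share of the
    classified pixels multiplied by the surveyed ground area."""
    return surveyed_area_ha * class_pixels / total_pixels
```

For example, `real_area_ha(387, 1000, 72.6)` scales a class occupying 38.7% of the pixels over a 72.6-ha survey to about 28.1 ha of ground.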

Accuracy of land cover classification using different networks
The reference data were derived from the pixel area of each land cover in the labeled orthophoto, and the classified data from the predicted pixel area of each land cover using different models.

Kappa coefficient and overall accuracy
The first step is an accuracy assessment of the different models for image classification: U-net, ResU-net, LU-net, and LResU-net. Tables 4 and 5, respectively, show the kappa coefficient and classification accuracy with and without undefined areas over the whole data sets.
The undefined areas have a strong impact on the accuracy assessment because unclassified areas were predicted as other land covers.
Based on the separate data analysis in Table 4, the rise in the kappa coefficient and overall accuracy generated by ResU-net and LU-net indicates that both RCU and LCU played a positive role in modifying U-net. At the same time, the accuracy of LResU-net was markedly higher, reflecting the positive effect of combining RCU and LCU.
When the unclassified areas are excluded, the trends in Tables 4 and 5 are consistent, i.e., the kappa coefficient and overall accuracy improved to different extents for the modified models with RCU and LCU added. As expected, the best accuracy assessment, from LResU-net (kappa coefficient = 0.86, overall accuracy = 93.7%), was found on the test set, which is attributed to combining the advances of ResU-net and LU-net.

Producer's and user's accuracy derived from LResU-net
The producer's and user's accuracies obtained from the LResU-net model are presented in Table 6. For most categories, both show highly favorable results. For example, trees have the highest values (producer's = 0.98 and user's = 0.93), with harvested crop second (producer's = 0.94 and user's = 0.91). However, there are obvious differences between producer's and user's accuracy for some categories, including harvested grassland (producer's = 0.43 and user's = 0.99) and river (producer's = 0.85 and user's = 0.96). These results are due to some undefined areas being classified into these classes. Figure 7 shows the land cover classification maps from the four network models. Compared with U-net (Fig. 7a), the noise and misjudgment rate of ResU-net (Fig. 7b) were slightly reduced, and its capacity for building classification was significantly strengthened. Similarly, LU-net (Fig. 7c) was superior to U-net in overall classification performance: even with some noise in the harvested grassland, its classification of roads, rivers, and buildings was better than that of U-net. LResU-net (Fig. 7d) exceeded the others in classification performance, especially in suppressing noise in the grassland, harvested grassland, tree, and river classes.

The real area of various land covers
The real areas of the various land covers are presented in Table 7. Excluding the unclassified area, the differences between the classification and the reference data for the various land covers were insignificant. According to the results including the unclassified area, grassland (38.7%, area = 27.3 ha) occupies the largest proportion, followed by harvested grassland (28.7%, area = 20.3 ha); the entire grassland area accounted for 67.4% of the study area. The smallest area was buildings (0.2%, area = 4.8 ha). The proportion of forest area was 8.0%, about average among all land covers.

Effect of unclassified areas on classification results
Unclassified areas can change the attributes of land cover. According to Fig. 8 and Table 7, about 50% of the unclassified areas were predicted with the same class as their reference labels, indicating that some unclassified areas have distinctive features and attributes, for example swamps. Among the other unclassified areas, regions containing fissures were predicted as the correct land cover, but the remaining areas were predicted as grassland, harvested grassland and forest. This is attributed to the same or similar features shared by the unclassified areas and these classes in LResU-net's view.

Performance of RCU and LCU on the model training process
As shown in Table 8, the total parameters and training time of ResU-net were greater than those of U-net, which resulted from the addition of RCU. This is consistent with previous studies on improvements to U-net (Alom et al. 2018; Rad et al. 2020). In contrast, as the convolutional dimension decreased, LU-net significantly reduced parameters and time. With RCU and LCU combined, the number of parameters and training time of LResU-net (25.11 million, 282 s) were slightly lower than those of U-net (31.05 million, 358 s). The curves of accuracy and loss (Fig. 9) show the overall error between the predicted data and the reference data during training. Figure 10a, b shows that the user's accuracy of ResU-net and LU-net improved similarly for grasslands, crops, and buildings when unclassified areas were excluded. However, their classification of harvested crops and river was less precise, indicating that modification based on RCU or LCU alone still has some defects for classification in mixed forest-grassland ecosystems. At the same time, LResU-net produced the highest user's accuracy, confirming the positive effect of combining RCU and LCU for land cover classification. Figure 10c, d shows that the above statement remains valid when unclassified areas are included, and also reveals the influence of the unclassified area on each classification for the four models.

Effect of background area on overall accuracy
Because it covers only a small area, the background effect was ignored in previous studies (Cao and Zhang 2020; Zhang et al. 2020c). In this study, the impact of background on classification can be analyzed through the producer's and user's accuracy (Table 6). The accuracy of the background area (producer's = 0.90, user's = 0.94) was higher than that of other classes, which led to a false improvement in overall accuracy. However, the results in Fig. 10 exhibit drastically different user's accuracies for the background area under the U-net and ResU-net models. This difference is linked to the effect of LCU, which can deepen the depth of image feature points and further improve classification accuracy. Figure 11 illustrates the error of classifying trees under shadows, the most common classification failure in the data sets. Environmental problems during sampling and image mosaicking were the main factors deteriorating classification performance. Under low light or shadow, the image features of some land covers change, blurring their distinction from other land covers at the color level. In addition, the orthophoto obtained from the 3D reconstructed model may have blurry edges (Skabek et al. 2020), which can degrade classification.

Conclusions
Classifying land cover in a mixed forest-grassland ecosystem is a significant use of remote sensing technology, particularly from unmanned aerial vehicles (UAV), for managing forests and grasslands. This study presents a new method, LResU-net, for land cover classification based on U-net with residual and loop convolution units. Adding RCU and LCU to the U-net framework improves the model and reduces the number of parameters and the training time. Compared with the other networks (U-net, ResU-net, LU-net), LResU-net has a higher kappa coefficient and greater accuracy across the entire data sets. The analysis of producer's and user's accuracy indicates that LResU-net performed favorably for the various land covers. The classification result was affected by unclassified areas, and a solution for some unclassified lands was found. The areas of the various land covers, which can be used for landform statistics and analysis, were calculated. However, this study does not include height data, and future research should use the 3D reconstructed model to incorporate height data into land cover classification.
Fig. 10 a User's accuracy without unclassified area on the training set using the four network models; b user's accuracy without unclassified area on the test set; c user's accuracy with unclassified area on the training set; d user's accuracy with unclassified area on the test set
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.