1 Introduction

Image classification for analysis of land use and land cover (LULC) was one of the earliest fundamental remote-sensing methods (Jensen 2005). Many studies have explored classification techniques for LULC mapping, but the ideal classifier for any given application remains unclear (Pandey et al. 2021). Traditional approaches, such as maximum likelihood, fuzzy logic, and object-oriented classifications, are referred to as shallow learning. These methods extract data based on spatial, spectral, textural, morphological, and other cues (Ball et al. 2017). In contrast, deep learning is multi-layered and learns from the data itself; its results can be significantly more accurate than shallow learning (Deng 2014) and can outperform manual human editing (Zhang et al. 2016).

For LULC classification within high-resolution imagery, the most accurate classification algorithms are Decision Trees (DT), Support Vector Machines (SVM), Random Forest (RF), and CNN (Ma et al. 2019). Many studies have compared deep learning techniques to shallow learning algorithms such as RF and SVM. Kussul et al. (2017) found an ensemble of CNN outperformed fully connected multilayer perceptrons and RF classifiers for land cover and crop types using Landsat 8 and Sentinel-2 imagery. Stoian et al. (2019) compared an altered U-Net with RF for mapping various land uses with a time-series of Sentinel-2 data. They found that U-Net had a similar or better accuracy than RF but took longer to train; however, the study did not use a Graphics Processing Unit (GPU) which would have made the training of the U-Net more efficient (Krizhevsky et al. 2012). Freudenberg et al. (2019) trained U-Net using a GPU to detect palm tree plantations in Worldview imagery and found U-Net to be > 10% more accurate and 100 times more efficient than AlexNet.

The integration of textural properties through object-based segmentation techniques has significantly improved classification results for remote-sensing applications (Blaschke et al. 2015). Classifications that use spatial as well as spectral information outperform those based purely on pixels (Pandey et al. 2021). As a result, CNN classifications, assisted by enhanced computing power in recent years (Ma et al. 2016), have higher accuracy compared to SVM and RF (Ma et al. 2019).

Deep learning is a subset of machine learning and a learning algorithm based on neural networks (Schmidhuber 2015). These networks consist of many layers which can transform images into categories through learning high-level features (Litjens et al. 2017). Neural networks are described as networks which simulate the processes of the human brain—interconnected neurons which process incoming information (Jensen 2005). The solution is obtained by nonalgorithmic and unstructured methods through adjustment of the weights connecting the neurons in the network (Rao & Rao 2003). Neural networks have the advantage of not requiring normally distributed data and can adaptively simulate complex and non-linear patterns (Jensen 2005; Zhu et al. 2017) such as those found in high-resolution aerial photography.

Convolutional Neural Networks (CNN) sit at the boundary between machine learning and computer vision, combining the power of deep learning with contextual image analysis. CNN have gained momentum for image classification since the AlexNet architecture won the ImageNet contest by a wide margin in 2012 (Krizhevsky et al. 2012).

CNN have been used in applications such as number plate reading (Mondal et al. 2017), facial recognition (Almabdy & Elrefaei 2019), and aerial image classification (Khalel & El-Saban 2018; Othman et al. 2016). Many previous studies applying deep learning to earth observation data have focused on the classification or categorisation of image chips but do not address the problem of per-pixel classification (Castelluccio et al. 2015; Cheng et al. 2015; Penatti et al. 2015). Although an advancement in computer vision for earth observation, these classifications are not useful in natural resource management applications as they do not define the precise spatial location and extent of LULC features. More recent studies have addressed this problem, and per-pixel classifications using deep learning appear to be outperforming traditional approaches in high-resolution imagery (Ma et al. 2019).

CNN are best suited to higher resolution imagery and a number of studies have found U-Net performs better using this imagery compared to medium resolution sensors such as Sentinel-2 (Liu et al. 2019; Wurm et al. 2019). Medium-resolution images (10–30 m) lack the fine structural components found in higher resolution imagery; it is difficult to apply CNN classification without this additional information on texture (Sharma et al. 2017). Wurm et al. (2019) found higher accuracy when using Quickbird imagery compared to lower resolution Sentinel-2 imagery for the detection of slums in Mumbai, India. Romero et al. (2016) found deep CNN architectures performed significantly better than shallow CNN, kernel-based Principal Component Analysis and spectral classification for land-use classification in aerial photography, multispectral and hyperspectral imagery.

As the benefits of CNN are realised within the earth observation community, the drive to find the optimal configuration for LULC classification has resulted in many suggested architectures for a wide range of applications (Huang et al. 2018; Zhang et al. 2018). Adding further complexity to the subject, existing architectures such as Inception (Szegedy et al. 2015) and ResNet (He et al. 2016) used for traditional photography classification have also been successfully adapted for mapping LULC using multispectral remote-sensing imagery (Ghosh et al. 2020; Liu et al. 2019; Mahdianpari et al. 2018).

The most important issue with some CNN architectures is that they contain large receptive fields, and evaluation of contextual information over multiple scales can result in classifications at a lower resolution than the original image. A remote-sensing and deep learning review article by Ma et al. (2019) suggested that creating symmetric architectures using unconvoluted layers and skip connections, such as U-Net (Ronneberger et al. 2015), results in classifications with the same spatial resolution as the input.

Originally developed for biomedical imaging (Ronneberger et al. 2015), U-Net has been adopted for use with optical earth observation data and has become one of the most popular architectures (Neupane et al. 2021) with overall accuracies of > 90% (Du et al. 2019; Flood et al. 2019; Gurumurthy et al. 2019; Kestur et al. 2019; Khryashchev et al. 2018; Wagner et al. 2019; Wei et al. 2019). There are numerous studies which have redesigned the U-Net specifically for extracting features from earth observation data (Kim et al. 2019; Ren et al. 2020); however, this has further complicated the issue as to which architecture is best to use for general earth observation applications.

Additional complexities exist when applying the U-Net architecture to earth observation data with selection of appropriate algorithmic options for kernel initialisation, activation, loss, loss optimiser, and learning rate. Further to this, the U-Net architecture can be defined with different numbers of initial filters, kernel size, and depth (number of layers); all of which can have profound impacts on how well a model learns the training data, if it is able to learn at all. For example, CNNs can experience a vanishing gradient problem when using certain activation functions with a small range of values (e.g., sigmoid) resulting in insufficient learning (Sun et al. 2019).

Globally consistent archives of regularly updated moderate-to-high spatial resolution satellite imagery are becoming increasingly available (e.g., the Google Earth Engine, Amazon Web Services, and Microsoft Planetary Computer data stores). The data are only one part of the mapping solution—appropriate mapping methods, data storage, and computational platforms at scale are all required to conduct useful data analysis within the field of "big data analysis" (Hey et al. 2009). Users have recognised the increased supply of imagery and raised their demand for spatial information accordingly (i.e., larger areas of interest, more-rapid response, and more-detailed classifications), but how the imagery relates to big data analysis remains an open research question.

The objective of this study is to assist in the development of a framework to use existing LULC information from one year to automatically classify LULC features from another point in time. Developing a framework to automatically map LULC would be of benefit to mapping programs globally. The automated and efficient classification of LULC features from high spatial resolution imagery will be extremely valuable to inform response to natural disasters or biosecurity incidents, as well as studies on agricultural productivity and sustainability, land-use planning, monitoring of and investment in natural resources, biodiversity conservation, and improving water availability and quality.

In this paper, we provide some clarity on the use of U-Net for earth observation applications by running a series of trials varying U-Net parameters to determine the optimal architecture configuration and algorithmic choices for LULC classification in aerial photography. Optimising the U-Net architecture for LULC mapping improves understanding and application of CNN for operational LULC mapping across large regions and assists in addressing the increased demand for rapid large-area classification using high spatial resolution imagery.

2 Method

2.1 Project Area

The project area is an agricultural region on the Atherton Tablelands in North Queensland, Australia (Fig. 1). The area includes the World Heritage listed Wet Tropics rainforests in the east and encompasses the towns of Mareeba in the north, Atherton in the south, and Dimbulah to the west.

Fig. 1
figure 1

Project area located in north Queensland, Australia. Basemap provided by ESRI

The major land uses of the area include production from relatively natural environments, conservation and natural environments, and production from irrigated agriculture and plantations (Table 1) (DSITI 2017).

Table 1 Project area primary land uses (DSITI 2017)

2.2 Multispectral Aerial Imagery

Orthorectified aerial images of the study area are periodically captured through the Queensland Government Spatial Imagery Subscription Plan. For this project, we used the 2018 orthorectified mosaic, with a spatial resolution of 20 cm, for training the models; this is referred to as the 2018 training image (Fig. 2b). The image was acquired between 1st and 27th August 2018, mostly using a Vexcel Ultracam Eagle camera, except for the southwestern corner, which was captured by an A3 Edge camera. The two cameras have different spectral properties, which can be seen in Fig. 2b.

Fig. 2
figure 2

Project orthorectified aerial imagery for 2015 (a) and 2018 (b). Data were supplied by the Queensland Government

To test the accuracy and transferability of the models, a 2015 mosaic captured between 17th July and 14th October using an A3 Edge camera at a spatial resolution of 25 cm was used; this is referred to as the 2015 test image (Fig. 2a).

Both images consisted of three bands (red, green, and blue) and were resampled to 50 cm using cubic convolution for spatial resolution consistency between the images and to minimise the amount of data required for training and inference.
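As a minimal sketch, the resampling step could be performed with GDAL's Warp utility along these lines; the file names are placeholders rather than the project's actual paths:

```python
from osgeo import gdal

# Resample a mosaic to 50 cm using cubic convolution; input and output
# file names are illustrative only.
gdal.Warp(
    "mosaic_50cm.tif",
    "mosaic_20cm.tif",
    xRes=0.5,
    yRes=0.5,
    resampleAlg="cubic",  # cubic convolution
)
```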

2.3 Training Data Collection

The nine LULC classes selected for this project included banana plantations, sugarcane crops, mature tree crops, young tree crops, tea tree plantations, forestry plantations, berry crops, and vineyards, with all remaining areas classified as other uses. These classes were chosen based on the ability to identify each class within the RGB imagery and in the field, and on existing data from previous work (Clark & McKechnie 2020). These data were manually digitised using the 2015 test image and 2018 training image within a Geographic Information System (GIS). Table 2 lists the total class area and proportion of the project area for 2015 and 2018. The data were stored as polygons within an ESRI shapefile format.

Table 2 Class area and proportion of the project area for each class for 2015 and 2018

A field verification trip to the project area was undertaken in January 2020, collecting 1,582 LULC observations for the eight main classes along with other land uses. The data collected in the field were used to ensure that the training data were accurate, with the temporal difference between image acquisition and field observation (17 months) taken into consideration.

2.4 Training Data Generation

Semantic segmentation using CNNs involves a patch-based training approach where image chips of a certain size are fed to the model and convolutional filters attempt to extract abstract information from the data to ultimately define features according to colour, texture, and other contextual information. The purpose of the training data generation stage was to provide the model with many different examples of LULC features.

LULC data intrinsically contain an imbalance in class area, which was evident within our project (Table 2). As a result, any systematic approach (e.g., grid based) to data sampling will yield a disproportionate amount of data between large- and small-area classes. To counteract this imbalance, a stratified random sampling approach was implemented.

To achieve this, we calculated the number of patches per class by multiplying the total number of patches by the log of the class area divided by the sum of the logs of all class areas. The result was rounded up to the nearest integer (Eq. 1)

$$N_{cp} = \left\lceil N_{p} \frac{\log (a_{c})}{\sum\nolimits_{c=1}^{n} \log (a_{c})} \right\rceil ,$$
(1)

where Ncp is the number of class patches, Np is the total number of required training patches, and ac is the class area.

For each feature, we distributed the number of class patches based on the proportion of the area that the feature represents rounded up to the nearest integer (Eq. 2)

$$N_{fp} = \left\lceil N_{cp} \frac{a_{f}}{a_{c}} \right\rceil ,$$
(2)

where Nfp is the number of feature patches, Ncp is the number of class patches, af is the feature area, and ac is the class area.
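A minimal sketch of Eqs. 1 and 2 in python follows; the function names are illustrative, and class areas are assumed to be supplied in units (e.g., square metres) large enough that their logs are positive:

```python
import math

def patches_per_class(total_patches, class_areas):
    """Eq. 1: allocate training patches across classes by log of class area.

    class_areas: dict mapping class name to area (assumed > 1 so logs are
    positive, e.g., areas in square metres).
    """
    log_sum = sum(math.log(area) for area in class_areas.values())
    return {
        cls: math.ceil(total_patches * math.log(area) / log_sum)
        for cls, area in class_areas.items()
    }

def patches_per_feature(class_patches, feature_area, class_area):
    """Eq. 2: distribute a class's patches to a feature by area proportion."""
    return math.ceil(class_patches * feature_area / class_area)
```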

Once we calculated the number of patches per feature, we randomly generated patch locations ensuring that the centre of the patch was located within the feature. Table 3 shows the distribution of training patches per class. Although the centre of each class patch was within the target feature, patches also contained other surrounding class features.

Table 3 Distribution of training patches per class generated for the 2018 training data

This approach counteracts class imbalance, improving the overall classification accuracy, particularly for underrepresented classes covering small areas. For our project, we chose a patch size of 512 × 512 pixels. The patches were stored as polygons within an ESRI shapefile format.

Figure 3 illustrates the geographical distribution of the patches and demonstrates the effect of the stratified sampling approach, creating a more balanced distribution between larger and smaller area classes.

Fig. 3
figure 3

Project training patches distribution for the 2018 training data generation

During the training process, a data generator uses the spatial extent of each patch to generate the training image chip and corresponding labels on demand. The labels are a rasterised version of the training data in a one-hot representation, where each band represents one of the nine classes. Within each band, 1 represents the presence of the class and 0 its absence.
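As a minimal sketch, assuming the rasterised labels arrive as a zero-based (H, W) class array, the one-hot encoding could be produced as follows:

```python
import numpy as np

def to_one_hot(label_raster, n_classes=9):
    """Convert a (H, W) raster of zero-based class indices into a
    (H, W, n_classes) one-hot array: 1 marks presence of the class."""
    return np.eye(n_classes, dtype=np.uint8)[label_raster]
```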

The data generator can optionally scale each batch of image chips between 0 and 255 and apply random augmentations to the data. Data scaling and augmentation are useful for capturing atmospheric, climatic, seasonal, and sensor variations present within earth observation data (Wieland et al. 2019), preventing overfitting of the data (Diakogiannis et al. 2020; Shorten and Khoshgoftaar 2019), and creating more generalised models (Kattenborn et al. 2021).

To apply the random augmentations to the patches, the python package imgaug v0.4.0 (Alexander Jung 2020) was used. Augmentations were chosen based on their suitability to earth observation data. The augmentations included:

  1. contrast and colouration manipulation (gamma, sigmoid, AllChannelsCLAHE, linear, multiply, AllChannelsHistogramEqualization);

  2. noise addition (salt and pepper, element-wise multiply, additive Gaussian, additive Poisson, per-channel multiply);

  3. geometric and scale alterations (affine, elastic transformation, vertical and horizontal flips);

  4. image blur and adding artificial clouds/fog/smoke.

One augmentation from each of the contrast, noise, and geometric groups was applied each time an image chip was created, and 50% of the time either blur or artificial clouds/fog was also applied, as sketched in the example below.
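A minimal sketch of this augmentation scheme with imgaug follows; the parameter ranges are illustrative assumptions rather than the values used in the study:

```python
import imgaug.augmenters as iaa

# One augmenter from each group per chip; blur or clouds/fog 50% of the time.
augmenter = iaa.Sequential([
    iaa.OneOf([  # contrast and colouration
        iaa.GammaContrast((0.7, 1.3)),
        iaa.SigmoidContrast(gain=(5, 10)),
        iaa.AllChannelsCLAHE(),
        iaa.LinearContrast((0.8, 1.2)),
        iaa.Multiply((0.8, 1.2)),
        iaa.AllChannelsHistogramEqualization(),
    ]),
    iaa.OneOf([  # noise
        iaa.SaltAndPepper(0.01),
        iaa.MultiplyElementwise((0.9, 1.1)),
        iaa.AdditiveGaussianNoise(scale=(0, 10)),
        iaa.AdditivePoissonNoise((0, 10)),
        iaa.Multiply((0.9, 1.1), per_channel=True),
    ]),
    iaa.OneOf([  # geometric and scale
        iaa.Affine(scale=(0.9, 1.1), rotate=(-10, 10)),
        iaa.ElasticTransformation(alpha=(0, 30), sigma=5),
        iaa.Fliplr(1.0),
        iaa.Flipud(1.0),
    ]),
    iaa.Sometimes(0.5, iaa.OneOf([  # blur or artificial clouds/fog
        iaa.GaussianBlur(sigma=(0.5, 1.5)),
        iaa.Clouds(),
        iaa.Fog(),
    ])),
])

# Geometric augmenters must be applied jointly to images and label masks,
# e.g., augmenter(image=chip, segmentation_maps=segmap) using imgaug's
# SegmentationMapsOnImage, so chips and labels stay aligned.
```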

Figure 4 shows examples of the augmentations applied to each patch. Note that the examples provided do not represent the patch size used in this study but are intended to demonstrate examples of the augmentations applied to each patch.

Fig. 4
figure 4

Example of augmentations over an area consisting of a mature tree crop, including the original image (top left) and three augmented versions. Note that these examples do not represent the patch size used in this study but demonstrate the augmentations applied to each patch

2.5 Training

The objective of the training was to produce a model labelling each pixel in the image as a class. To achieve this, the U-Net was chosen as the base architecture. The U-Net is a symmetrical encoder–decoder architecture and the resulting classification has the same resolution as the input data.

The training involved creating a model and iteratively providing it with batches of image chips and corresponding labels. This allowed the model to determine the relevant colour, texture, and contextual attributes of each class (Zhang et al. 2016) that maximised the overall accuracy of the classification.

Figure 5 is a graphical representation of the U-Net architecture from Clark and McKechnie (2020) showing the main components.

Fig. 5
figure 5

The U-Net architecture (Clark & McKechnie 2020; Ronneberger et al. 2015)
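To make Fig. 5 concrete, a configurable encoder–decoder of this form could be assembled in TensorFlow/Keras as sketched below. This is an illustrative reconstruction, not the authors' published implementation; the parameter names and defaults are assumptions drawn from the trials in Sect. 2.7:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_unet(input_shape=(512, 512, 3), n_classes=9, initial_filters=56,
               kernel_size=5, depth=5, activation="selu",
               kernel_initializer="truncated_normal"):
    """Configurable U-Net: filters double at each encoder level and the
    decoder restores the input resolution via skip connections."""
    inputs = layers.Input(shape=input_shape)

    def conv_block(x, filters):
        for _ in range(2):
            x = layers.Conv2D(filters, kernel_size, padding="same",
                              activation=activation,
                              kernel_initializer=kernel_initializer)(x)
        return x

    # Encoder path, saving the output of each level for the skip connections.
    skips, x = [], inputs
    for level in range(depth - 1):
        x = conv_block(x, initial_filters * 2 ** level)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)

    # Bottleneck.
    x = conv_block(x, initial_filters * 2 ** (depth - 1))

    # Decoder path: upsample, concatenate the matching skip, convolve.
    for level in reversed(range(depth - 1)):
        filters = initial_filters * 2 ** level
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skips[level]])
        x = conv_block(x, filters)

    # Per-pixel class probabilities (softmax class activation).
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```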

2.6 Prediction

During the prediction stage, the model was applied to the whole aerial image to produce pixel-wise inference and classification. Mirroring the one-hot training labels, the model attempts to assign each pixel a value of 1 or 0 for each class. As the models are not perfect, the output predictions contain values between 0 and 1, with each class represented by a raster band. We used the class with the highest predicted value to classify each pixel, resulting in a thematic classification with values between 1 and 9 representing each LULC class.
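In python this reduces to an argmax over the class bands, for example (a trivial sketch; the variable names are illustrative):

```python
import numpy as np

# probs: (H, W, 9) array of per-class predictions in [0, 1]
# classification: (H, W) thematic raster with values 1-9
classification = np.argmax(probs, axis=-1) + 1
```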

One of the limitations of a patch-based learning approach is that the edge of the patch has a lower accuracy than the centre (Sun et al. 2019). This created edge effects between patches, evident when the resulting classification was compiled. To counteract this issue, we employed a two-pass ensemble inference strategy. The strategy involved systematically iterating through the aerial image, extracting patches in a grid pattern. The inference was applied to the original patch and three rotated versions (90, 180, and 270 degrees), with the resulting classifications averaged. Once the model had been applied to the project area, we repeated the process with a half-patch offset (256 pixels) to the grid, so the centre of each patch in the second pass was the join between four patches from the first. We applied a weighted average based on the distance from the centre to combine the classifications. Pixels classified towards the centre of the patch were given a higher weight than those generated at the edge. The resulting classification was much smoother, which can result in higher accuracy.
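A minimal sketch of the rotation averaging and centre weighting follows; it assumes a Keras-style model.predict and omits the grid iteration, half-patch offset, and mosaicking of the full image:

```python
import numpy as np

def rotation_averaged(model, patch, n_classes=9):
    """Average predictions over a patch and its three 90-degree rotations."""
    probs = np.zeros(patch.shape[:2] + (n_classes,), dtype=np.float32)
    for k in range(4):
        pred = model.predict(np.rot90(patch, k)[np.newaxis])[0]
        probs += np.rot90(pred, -k)  # rotate the prediction back
    return probs / 4.0

def centre_weights(size=512):
    """Per-pixel weights that decay with distance from the patch centre,
    used to blend the offset second pass with the first."""
    ax = np.abs(np.arange(size) - (size - 1) / 2.0)
    dist = np.sqrt(ax[:, None] ** 2 + ax[None, :] ** 2)
    return 1.0 - dist / dist.max()
```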

2.7 Deep Learning Trials

The objective of this study was to determine the optimal U-Net architecture configuration for earth observation applications. To achieve this objective, we performed a grid search on: the number of filters; kernel sizes; number of layers; kernel initialisations; convolution and class activations; loss functions; optimisation functions; and learning rates. These trials cover two overall aspects: architectural trials, which alter the architecture itself (number of initial filters, filter kernel size, and number of layers), and algorithmic trials, which relate to the training of the model (kernel initialisation, convolution activation function, loss function, loss optimiser, learning rate, and class activation).

Table 4 lists the different parameters, the test values, and the default value used when the parameter was not being tested. The only parameters altered within each test were the ones being trialled. All other parameters remained constant for the training. No data scaling or augmentations were applied to the parameter trials to minimise the random effects these would cause.

Table 4 Parameter values and defaults for the deep learning trials

Within each parameter trial, five models were trained for 20 epochs for each test value, resulting in 385 models. It was decided that repeating each test five times captured the random variance intrinsic to the training process. Twenty iterations of the training data were not sufficient to create fully trained models; however, this provides an indication of how well the parameter values fit the data while reducing resource requirements. Further to this, as data augmentations were not applied, the models were not generalised and cannot be transferred to the 2015 test image. At this stage of the project, we compared the learning performance on the training image as an indication of the effectiveness of each parameter. Applying the model to the test image occurred after the parameter trial results had been compiled and a model had been fully trained using the optimised parameters.

For the deep learning trials, we produced the classification in a single pass, without the rotation averaging, to avoid the data smoothing caused by the two-pass inference strategy. As with excluding the random augmentations in the training stage, we wanted to evaluate parameter performance without any post-processing steps.

2.8 Accuracy Assessment

To evaluate the output classifications, we employed an independent accuracy assessment using 10,000 randomly generated points. Each point was assigned one of the nine classes for the 2018 training image and 2015 test image. A simple random sampling strategy was employed to remove reliance on prior knowledge of the class distribution.

Figure 6 shows the randomly generated points over the project area coloured according to the 2018 class.

Fig. 6
figure 6

Validation points used to assess the accuracy of the output model classifications coloured according to the 2018 observation value

The disadvantage of using a random sampling strategy in a landscape where classes have disproportionate areas was that most of the points fell within the classes with the largest area and few within those with small areas. As a result, 94% of our validation points fell within the 'other uses' class and some of the smaller classes consisted of 5 or 6 points. However, as the models were trained and assessed five times, this resulted in a minimum of 25 observations in each class for each parameter trial. Although it would be ideal to assess the accuracy using additional points, time and resource considerations limited this capacity, and previous work has shown no significant differences between sampling methods at this number of sample points (Ramezan et al. 2019).

Using the random points, we assessed the accuracy of each parameter trial by calculating Cohen's kappa coefficient, Jaccard Index, user accuracy (precision), producer accuracy (recall), and F1-Score for each class. The results were ranked from one to the total number of parameter tests by ranking the average user and producer accuracies and the time taken to complete twenty epochs, rounded to the nearest 15 min, for each trial test. The parameter trials were scored by adding the user, producer, and time ranks, with a final parameter rank based on this score.
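These per-class statistics map directly onto scikit-learn's metrics API; a hedged sketch follows (scikit-learn is an assumption, as the paper does not name the metrics library):

```python
from sklearn.metrics import (cohen_kappa_score, jaccard_score,
                             precision_score, recall_score, f1_score)

def trial_metrics(y_true, y_pred):
    """Accuracy statistics at the validation points for one trained model.

    y_true, y_pred: 1-D arrays of class labels (1-9) at the 10,000 points.
    average=None returns one value per class.
    """
    return {
        "kappa": cohen_kappa_score(y_true, y_pred),
        "jaccard": jaccard_score(y_true, y_pred, average=None),
        "user": precision_score(y_true, y_pred, average=None),    # precision
        "producer": recall_score(y_true, y_pred, average=None),   # recall
        "f1": f1_score(y_true, y_pred, average=None),
    }
```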

2.9 Extended Training

Once the top-performing parameters were determined, an optimised U-Net model was trained using these values. In addition, a model was trained using the original U-Net parameters for comparison with the optimised version. The original U-Net parameters are the same as the default parameters shown in Table 4, with the exception of the number of initial filters, which was set to 64 to match the original U-Net architecture published in Ronneberger et al. (2015). The models were trained until the model improvement was < 0.05% over three epochs (63 epochs for the optimised model and 67 epochs for the original model). The prediction was applied to the 2015 test image to produce a classification on unseen data.
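In Keras, this stopping rule can be approximated with an EarlyStopping callback; a sketch under the assumption that validation accuracy is the monitored quantity (the paper does not specify the metric):

```python
import tensorflow as tf

# Stop when improvement is < 0.05% (min_delta=5e-4) over three epochs.
# The monitored metric is an assumption; the study does not state it.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_categorical_accuracy",
    min_delta=5e-4,
    patience=3,
    restore_best_weights=True,
)

# model.fit(train_generator, validation_data=val_generator,
#           epochs=100, callbacks=[early_stop])
```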

The accuracy of the optimised U-Net classification was compared to the original U-Net classification to determine whether the performance of the U-Net on earth observation data had improved. This was achieved through analysis of the kappa, Jaccard Index, user (precision), producer (recall), and F1-Score statistics along with confusion matrices.

2.10 Computing Infrastructure and Software

The training of the models was performed on Nvidia Tesla V100 GPUs. The processing of vector and image data was conducted using the Geospatial Data Abstraction Library (GDAL) version 3.1.0 (https://gdal.org/) and the deep learning part of the project used TensorFlow 2.1.0 (Abadi et al. 2016). Image augmentations were done using the python library imgaug 0.4.0 (https://imgaug.readthedocs.io/en/latest/).

3 Results and Discussion

Appendix 2 contains the full results and lists the trials, trial parameters, average training time, kappa, F1-Score, user (precision), producer (recall) rank, and corresponding confidence intervals. Table 5 shows a summary of the results and lists the top performing models for each of the parameter trials.

Table 5 Top-performing results of the parameter trials based on the 2018 training image

3.1 Architecture Trials

When using CNNs on earth observation data, the model learns attributes of the target features, such as the colour and roughness of each class, using convolutions or filters. These filters attempt to extract abstract data from the imagery through linear operations using kernels. The number of convolutional filters, the kernel size, and the number of overall convolutions or network depth are defined when building the architecture.

Results revealed that more complex models, with more convolution filters, a larger kernel size, and a deeper architecture, improve the output classification accuracy but take longer to train. The optimal number of initial filters when taking training time into account (Appendix 2) was found to be 56, with a kappa score of 0.87 (Table 5). Although a larger number of initial filters slightly improved the model accuracy (Fig. 7), for example 80 filters with an average kappa score of 0.89, it dramatically increased the training time over 20 epochs (56 filters: 5.2 h; 80 filters: 8.7 h) (Appendix 2). The original U-Net architecture contained 64 filters, which achieved a kappa score of 0.86 and took an average of 0.9 h longer to train over 20 epochs.

Fig. 7
figure 7

Kappa statistic box and whisker plots for the initial number of filter trials. Each filter number includes the accuracy for each of the five trained models

For our data, we found a kernel size of 5 × 5 to be optimal (Table 5) with a kappa score of 0.87. Increasing the complexity of the filters beyond 5 × 5 did not have a substantial effect on accuracy (Fig. 8) but substantially increased training time from 3.9 h for a 5 × 5 kernel to 9 h for a 9 × 9 kernel (Appendix 2). The original U-Net architecture contained a kernel size of 3 × 3 and for our data achieved a kappa score of 0.79. As a result, we recommend the use of a 5 × 5 kernel size for earth observation data.

Fig. 8
figure 8

Kappa statistic box and whisker plots for kernel size trials. Each kernel size includes the accuracy for each of the five trained models

Experiments varying the number of layers, or architecture depth, indicated that the original U-Net architecture consisting of five layers was optimal. There were slight accuracy gains from increasing the depth of the architecture to six or seven layers (Fig. 9); however, these gains were not substantial (Appendix 2) and came at the cost of increased training time.

Fig. 9
figure 9

Kappa statistic box and whisker plots for U-Net architecture depth trials. Each layer depth includes the accuracy for each of the five trained models

3.2 Algorithmic Trials

Within CNNs, we attempt to fit an algorithm to complex non-linear data by finding optimal weight values for thousands or millions of neurons. The weight values determine whether a neuron activates and passes data to the next layer, with the magnitude of the output calculated using an activation function. The output prediction is compared to the labelled data by measuring the classification error using a loss function. The optimisation function alters the weight values through backward propagation, with the learning rate determining the extent to which the weight values can be changed.

The initial values of the kernel weights, the core of the convolution filters, are determined by an initialiser function. For the kernel initialiser trials, our results showed the truncated normal function achieved a slightly higher accuracy score in all metrics (Appendix 2). However, many initialisers produced similar results, with only He normal, He uniform, and random normal performing poorly (Fig. 10).

Fig. 10
figure 10

Kappa statistic box and whisker plots for the kernel weights initialiser trials. Each kernel initialiser includes the accuracy for each of the five trained models

For the activation function trials, our results showed that numerous functions achieved similar accuracies and, as a result, the same rank using the user, producer, and time-based score (Appendix 2). Further analysis using the kappa statistic (Fig. 11) indicated that the scaled exponential linear unit (selu) had the highest accuracy. However, this function was penalised in the rankings as its average training time was slightly higher than the top-ranked results (Appendix 2).

Fig. 11
figure 11

Kappa statistic box and whisker plots for the activation function trials. Each activation function includes the accuracy for each of the five trained models

Based on these results, there are numerous activation functions which can successfully train a deep learning model for earth observation applications and, as time was quite similar between the top performing models, we recommend the use of selu.

As with the kernel weights initialiser, many loss functions performed well on our data (Appendix 2). The Huber loss function (kappa: 0.80) marginally outperformed the other functions, but there are many viable options (Fig. 12). The functions which resulted in the lowest accuracy were hinge, categorical hinge, mean absolute error (MAE), and mean absolute percentage error (MAPE).

Fig. 12
figure 12

Kappa statistic box and whisker plots for the loss function trials. Each loss function includes the accuracy for each of the five trained models

The RMSprop optimiser was the top-ranked and had an average kappa score of 0.77. Adam (kappa: 0.79) and NAdam (kappa: 0.77) performed similarly to RMSprop (Fig. 13) but took slightly longer to train (Appendix 2). Any of these three optimisers could be used for earth observation applications, but we recommend the RMSprop optimiser.

Fig. 13
figure 13

Kappa statistic box and whisker plots for the loss optimiser trials. Each optimiser includes the accuracy for each of the five trained models

The learning rate was a substantial factor in the model's ability to learn the training data (Fig. 14; Appendix 2). We found a learning rate of 1e-4 to be optimal, and choosing an incorrect learning rate resulted in some trials failing to train. Learning rates larger than 1e-4 produced weight changes too large to find the optimal values, whereas smaller learning rates resulted in slower learning and, as a result, lower accuracy after 20 epochs of training.

Fig. 14
figure 14

Kappa statistic box and whisker plots for the learning rate trials. Each learning rate includes the accuracy for each of the five trained models

As our data consisted of multiple classes, the softmax class activation function was best suited and recommended for similar earth observation applications (Fig. 15; Appendix 2). However, for binary classifications, future studies should consider sigmoid or hard sigmoid functions.

Fig. 15
figure 15

Kappa statistic box and whisker plots for the class activation function trials. Each class activation includes the accuracy for each of the five trained models
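Taken together, the recommended algorithmic settings correspond to a Keras model compilation along these lines; this is a hedged sketch, where the metric is an assumption and build_unet refers to the illustrative builder sketched in Sect. 2.5:

```python
import tensorflow as tf

# Recommended settings from the trials: RMSprop at 1e-4 with Huber loss;
# the metric choice is an assumption for illustration.
model = build_unet()  # selu activation, truncated normal initialiser, softmax
model.compile(
    optimizer=tf.keras.optimizers.RMSprop(learning_rate=1e-4),
    loss=tf.keras.losses.Huber(),
    metrics=["categorical_accuracy"],
)
```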

3.3 Optimised U-Net and Original U-Net

Using the top performing trials, we compared an optimised U-Net architecture against the original U-Net architecture, and found the optimised version achieved a higher accuracy (kappa: + 0.12; F1-Score: + 0.08; user: + 0.06; producer: + 0.04). Table 6 shows the overall and class results for the two models along with a human-derived classification for the same image. Tables 7 and 8 show the confusion matrix for the optimised model and original model, respectively.

Table 6 Optimised model results for 2015 test image
Table 7 Optimised U-Net model confusion matrix
Table 8 Original U-Net model confusion matrix

For the optimised model, there were substantial improvements in some individual classes (Table 6) such as plantation forestry (user: + 0.63), berry crops (producer: + 0.25), and banana crops (producer: + 0.19). However, not all classes improved, with user accuracy for vineyards and tea tree plantations decreasing by 0.12 and 0.06, respectively. In some cases, the U-Net models were more accurate than the human-derived classification (mature tree crops), which indicates that CNN models can account for some level of error in the training data (Mnih 2013). Other studies (e.g., Burke et al. 2021) have noted that these errors can potentially result in the model being penalised for achieving a higher accuracy than the data used to assess its performance.

Despite the changes in overall accuracies, the confusion matrices (Tables 7 and 8) showed similar confusion in the two models. The main confusion was between mature and young tree crops and misclassification of targeted classes with the other uses class, particularly for sugarcane and forestry plantation.

Although some studies found deep neural networks are quite tolerant of noise in the training data (Mnih 2013), we found confusion in the training data propagated through to the model's ability to classify classes. Table 9 shows the accuracy assessment for the 2018 human-derived training data and Table 10 shows the corresponding confusion matrix. The tables demonstrate the confusion between young and mature tree crops, and between the other uses class and various classes such as sugarcane crops, young tree crops, and plantation forestry. This demonstrates that labelling some LULC classes for both training and evaluation can be subjective; for example, the threshold between young and mature tree crops, and the stage at which a newly planted sugarcane crop can be detected within the image resolution.

Table 9 Accuracy assessment for the 2018 human-derived training data used to train all models
Table 10 Confusion matrix for 2018 human-derived training data

As a comparison, Table 11 presents the confusion matrix for the 2015 human-derived classification and demonstrates that the same confusion existed in the 2015 classification as in the 2018 training data.

Table 11 Confusion matrix for 2015 human classification

Figure 16 shows the project area output classifications for the human-derived classification (Fig. 16a), optimised U-Net (Fig. 16b), and original U-Net (Fig. 16c). At this scale, the confusion between sugarcane crops and other land uses is evident in both U-Net classifications, while the confusion between plantation forestry and other land uses can be seen in the original U-Net classification.

Fig. 16
figure 16

Model classification comparison for the project area for human classification (a), optimised U-Net (b), and original U-Net (c)

Figure 17 shows the outputs for the human-derived classification along with the optimised and original U-Net classifications, and visually shows the errors discussed. The 2018 training image was captured in August, early in the sugarcane harvest season when fields were either fully mature or in fallow. The 2015 test image was captured as late as October and, in some areas, fields harvested earlier in the year contained young sugarcane. Digitising the data for 2015 was kept consistent with the 2018 training data, and only mature sugarcane was included. However, for the accuracy assessment, we decided to label a validation point as sugarcane if the canopy of the crop was predominantly closed. Although no fields of young sugarcane were present in the training data, the models successfully classified these areas within the 2015 image (Fig. 17).

Fig. 17
figure 17

2015 test image, 2015 human-derived, optimised U-Net, and original U-Net classifications. The circles indicate the location of the accuracy assessment points, with the colour in the imagery column representing the class according to the legend

Further to this, due to human error, some tree crop features were missed in the human classification but were successfully identified by the models (Fig. 17).

3.4 Limitations and Future Research

All hyper-parameter trials were tested independently to identify the top performing architectural configuration and learning algorithms. However, we did not test for interdependencies between the parameters. Conducting this analysis would take a considerable amount of time and was beyond the scope of this study. Future research could build upon these findings and restrict such an analysis to a smaller sample of parameters. It is recommended to avoid a simple grid search and to use an optimisation framework such as KerasTuner (https://keras.io/keras_tuner/).

This study restricted the analysis to three band 50 cm imagery. These results may not be applicable for training a model for earth observation applications for different spatial and spectral resolutions. Furthermore, this study was conducted over the same geographic area with similar LULC features present in each image. Transferring the models to a different region will present challenges particularly where there are unseen LULC features which will likely confuse the model. Future research should focus on scaling this study up to include a broader geographical area and test the ability of the models to transfer to a different region with varying LULC features.

In addition to the trials conducted in this study, there are additional hyper-parameters and tests which could be considered for future research. These include (but are not limited to) regularisation such as dropout and batch normalisation layers in the U-Net architecture.

Additional research could investigate different approaches to producing the output classifications. In this study, we have used overlapping patches and a weighted mean to counteract patch edge effects when producing the output classification; however, there are alternatives such as trimming of edge pixels and ensuring the patches overlap by the same number of pixels as presented in Flood et al. (2019).

4 Conclusion

In this paper, we trialled the effectiveness of different parameters for defining and training deep learning models based on the U-Net architecture. We found that more complex models, containing a larger number of convolutional filters and a larger kernel size, achieved higher accuracies. Based on our results, we recommend the use of 56 convolutional filters with a kernel size of 5 × 5, while maintaining the original U-Net depth of five layers, for earth observation LULC applications at this scale. We also recommend the use of the truncated normal kernel weights initialiser, a selu activation function, the Huber loss function, the RMSprop loss optimiser, and a learning rate of 1e-4. The combination of these parameters increased the U-Net model kappa score from 0.73 for the original model to 0.84 for the optimised model. The equivalent human-derived classification obtained a kappa score of 0.93; the subjectivity of classifying some LULC classes caused confusion and resulted in the propagation of errors through to the model classification.