1 Introduction

Semantic segmentation is the process of assigning a class label to each pixel of an image [1]. Unlike classification and object detection, semantic segmentation is a high-level task that paves the way toward complete scene understanding [2].

The increasing availability of remotely sensed images, driven by the rapid advancement of remote sensing technology, expands the range of imagery sources we can choose from [3]. The available sources differ in spectral, spatial, radiometric and temporal resolution and are thus suitable for different purposes. Remote sensing data from satellite sensors provide continuous datasets that can be used to detect and monitor various earth phenomena. They are employed to measure a variety of environmental parameters such as the area and potential yield of given crop types, the height and density of forest stands, the fraction of photosynthetically active radiation, soil, snow and water content, and surface and cloud-top reflectance [4].

Images captured by current satellite sensors have high spectral and spatial resolutions that make it easier to extract information. A range of air-borne and space-borne sensors have been used to acquire remote sensing data, and the image data recorded by these sensors differ in spatial, temporal, spectral and radiometric resolution [3].

This work aims to develop a land cover classification system for Gambella National Park (GNP) from satellite images. Gambella National Park was first established in 1973, covering an area of approximately 5061 km2, which made it the largest national park in Ethiopia at the time [5]. Its size and borders were modified in 2011, and it currently covers an area of 4575 km2 [5]. It is a biodiversity hotspot that hosts several wildlife species not found elsewhere in Ethiopia. As documented in [6], the park is rich in wildlife resources, with 69 species of mammals, 327 species of birds, 7 species of reptiles, 493 species of plants and 92 species of fish.

Currently, the park is shrinking and its natural resources are being overexploited due to agricultural expansion, fuelwood collection, deforestation and animal grazing [7]. Our motivation for this research stems from the following pressures exerted by the settlers. The Anuak and Nuer people are among the dominant ethnic groups living around the national park [8]. The Anuak are agriculturalists, fisherfolk, and hunters, whereas the Nuer are pastoralists and agro-pastoralists [5]. These two dominant ethnic groups living around GNP constantly compete for resources, which leads them into conflict, and GNP can be regarded as a third party in this competition [5]. Population growth around the park increases the demand for agricultural land and forest products, forcing people to clear woodland and natural forest for settlement and the expansion of farmland. Research results show that approximately 140,000 hectares of natural forest around GNP were cleared by nearby residents for settlement purposes [8]. Before the relocation of the park's boundary, both regional and federal governments allotted large areas of land within the park's old boundary to foreign and local investors. As a result, a great extent of land use change was observed in these areas; large areas of virgin land have been transformed into plantations for rice, sugar cane, and palm oil by foreign-owned agri-business ventures and companies [9].

Global Monitoring for Environment and Security (GMES), later renamed the Copernicus programme, is the earth observation (EO) program coordinated and managed by the European Commission in partnership with the European Space Agency (ESA). Under this program, ESA developed the Sentinel missions to replace older EO missions [10]. The Sentinel satellite constellation provides free data for global-level monitoring of the earth's resources. Each Sentinel mission is based on a paired-satellite model and provides datasets focused on different aspects of EO, including atmospheric, oceanic, and land monitoring.

Sentinel-1: comprises two C-band Synthetic Aperture Radar (SAR) satellites (Sentinel-1A and Sentinel-1B) that ensure data continuity with the ERS and ENVISAT satellite missions [10, 11].

Sentinel-2: a land monitoring constellation of two satellites designed to provide high-resolution optical imagery and to ensure data continuity with, and enhancement of, the multispectral imagery provided by the Landsat and SPOT missions [12].

Sentinel-3: a marine and land mission based on a two-satellite constellation, Sentinel-3A and Sentinel-3B, launched in February 2016 and April 2018 respectively [13].

In line with this, the following research questions were explored in this research.

  • How can appropriate machine learning and deep learning models be built for land cover classification?

  • To what extent is performance improved by machine learning classifiers and by deep learning models with different architectures?

The main aim of this study is to perform land cover classification from satellite images by semantically segmenting them into land cover classes and evaluating the performance of the segmentation results. To do this, we performed semantic segmentation of the pre-processed satellite images using classical machine learning classifiers. We also used a deep neural network architecture, LinkNet, with a ResNet34 backbone pre-trained on the ImageNet dataset. The LinkNet architecture is designed for semantic image segmentation (pixel-level classification) [14].

The authors in [15] compared the performance of a pixel-based classification method, the maximum likelihood classifier (MLC), with neural networks for forest cover change detection. Preprocessed two-date satellite images from 2000 and 2009 were classified using the MLC and an artificial neural network (ANN). The identified land cover classes were forest, oil palm, urban area, rubber, and water bodies. The resulting classified images were then used to detect forest cover changes between 2000 and 2009 using the post-classification change detection technique.

The research works in [16, 17] used satellite data with spatial resolutions of 30 m, 56 m, and 250 m. In [16], the MLC was used to classify input images into land cover classes. This classifier uses only pixel reflectance values as input features to recognize the land cover class of a particular pixel, and can therefore easily be affected by within-class spectral variability. Generally, traditional pixel-based analysis approaches are strongly affected by spectral variability and geo-referencing effects and are prone to errors in land cover change detection. This is because these methods depend entirely on the pixel values of the spectral images and are agnostic to the contextual information around the pixels.

On the other hand, the works in [18,19,20] adopted patch-based classification using deep learning models, for which selecting an optimal patch size can be difficult. For study areas where heterogeneous land cover classes may occur within small areas, such as GNP, selecting an optimal patch size is particularly hard. Pixel-level classification using machine learning approaches can therefore be a solution to the problems that come with patch-based classification. Many researchers have worked on land cover classification from satellite images, but their approaches and the satellite data they used were not appropriate for such study areas. For example, the works in [16, 17, 21] employed low-resolution satellite images with 56 m, 30 m, and 250 m spatial resolutions, in which a single pixel can contain more than one land cover class and the measured radiance of such a pixel is the integration of the radiance of all objects present in the pixel.

The study [22] uses functional magnetic resonance imaging (fMRI) data with an SVM classifier, which was shown to be the most common choice among researchers; it also observes that researchers tend to use the same data pre-processing methods for similar data modalities. The authors analyzed machine learning classifiers, sample sizes, and accuracies by applying one-way ANOVA and the Tukey-Kramer test to studies spanning 2011 to April 2021, a total of 590 papers. The study [23] explores a variety of machine learning and deep learning (transfer learning) algorithms for rice disease identification, covering three major rice diseases: bacterial blight, rice blast, and brown spot. The authors provide a detailed comparison showing that transfer learning approaches are superior to traditional machine learning techniques; InceptionResNetV2 performs best, followed by XceptionNet. The findings could be used to help farmers detect rice disease early. The study [24] proposed a framework that hosts transfer learning architectures in an ensemble learning framework to diagnose rice plant deficiencies. Six transfer learning architectures were used: Xception, DenseNet201, InceptionResNetV2, InceptionV3, ResNet50V2, and VGG19; the best results were attained with Xception (95.83%) and InceptionResNetV2 (90%). The paper [25] covers several unsupervised machine learning algorithms and focuses on automated land use land cover (LULC) classification, which may provide an authentic database of information to policymakers in various domains. K-means, FCM, SOM, mean-shift, GMM, and HMM were used for land cover (LC) categorization of Sentinel-2 data in Assam, India. Mean-shift correctly categorized the continuous stretch of vegetation in study area 2; in study area 1, however, it misclassified vegetation and fallow land.

The research in [26] focuses on autism spectrum disorder (ASD) and considers two different scenarios. The first is the ideal case, in which the test cases have no missing data; artificial neural network (ANN), support vector machine (SVM), and random forest (RF) classifiers are trained and assessed on the pre-processed dataset. In the second scenario, missing values for the fields 'age', 'gender', 'jaundice', 'autism', 'used app before', and their three combinations are included in the test dataset. A recursive feature elimination (RFE) algorithm based on support vector machine, random forest, decision tree, and logistic regression was used for feature selection.

To implement land cover classification using satellite images, study [27] proposes a multi-scale fully convolutional network (MSFCN) with a multi-scale convolutional kernel as well as a channel attention block (CAB) and a global pooling module (GPM) to exploit discriminative representations from two-dimensional (2D) satellite images. The MSFCN is also capable of harnessing each land cover category's time-series interactions from reshaped spatio-temporal remote sensing images. To verify the effectiveness of the proposed MSFCN, the authors conducted experiments on two spatial datasets and two spatio-temporal datasets.

In study [28], the authors develop a multilevel LC contextual (MLCC) framework that can adaptively integrate the effective global context with the local context for LC classification. The MLCC framework comprises two modules: a DCNN-based LC classification network (DLCN) and a multilevel context integration module (MCIM). Through a well-designed deep network, DLCN can enhance the effective global context features while weakening ambiguous representations.

In study [29], the authors developed the first and largest joint optical and SAR land use classification dataset, WHU-OPT-SAR, covering an area of approximately 50,000 km2, and designed a multimodal-cross attention network (MCANet). MCANet comprises three core modules: a pseudo-siamese feature extraction module, a multimodal-cross attention module, and a low-high level feature fusion module, which are used for independent feature extraction from optical and SAR images, second-order hidden feature mining, and multi-scale feature fusion, respectively.

The study in [30] proposes a split depth-wise (DW) separable graph convolutional network (SGCN). First, the authors split DW-separable convolution to obtain channel and spatial features, enhancing the expressive ability of road features. They then present a graph convolutional network to capture global contextual road information in the channel and spatial features, using the Sobel gradient operator to construct an adjacency matrix of the feature graph.

In [31], a framework for scene classification network architecture search based on multi-objective neural evolution (SceneNet) is proposed. In SceneNet, network architecture coding and searching are achieved using an evolutionary algorithm, which enables a more flexible hierarchical extraction of remote sensing image scene information. The effectiveness of SceneNet is demonstrated by experimental comparisons with several deep neural networks designed by human experts.

The study [32] reviews advances in fine-scale land use and land cover classification in open-pit mine areas (LCCMA) from the following aspects. First, it analyzes and proposes classification thematic resolutions for LCCMA. Second, it summarizes remote sensing data sources, features, feature selection methods, and classification algorithms for LCCMA. Third, it discusses three major factors that affect LCCMA: significant three-dimensional terrain features, strong feature variability, and homogeneity of spectral-spatial features.

The study [33] proposes a hyperspectral classification framework based on a joint channel-space attention mechanism and a generative adversarial network (JAGAN). To relearn feature-based weights, a higher priority is assigned to important features; this is achieved by integrating a joint channel-space attention model that obtains the most valuable features via an attention weight map.

The work in [21] is selected as the benchmark paper for this study. Its authors proposed a system to monitor deforestation in Sumatra during the period 2000-2012. We have identified several gaps in that work: the dataset used, the classifier adopted, and the number of land cover classes identified were not appropriate. Terra MODIS satellite data with 250 m spatial resolution was used as the dataset, the traditional pixel-based MLC was employed as the land cover classifier, and only four land cover classes (forest cover, non-forest cover, water bodies and cloud cover) were selected. In addition, deforestation was estimated at 6-year intervals, which is too coarse.

The proposed workflow comprises the following steps: downloading Sentinel-2 images of our study area, preparing a novel semantic segmentation dataset, developing semantic segmentation models using both traditional machine learning classifiers with deep features and deep learning approaches, and comparing the obtained results. We address the gaps introduced in many previous works by their use of methods and satellite data that are inappropriate for study areas where heterogeneous land cover classes may occur within small areas. In this work, we employ Sentinel-2 satellite images with 10 m spatial resolution, which is assumed to be sufficient to resolve land cover classes within small areas, together with semantic segmentation (per-pixel classification) of land cover classes, thereby addressing the identified gaps.

The contributions of this work to new knowledge or science can be summarized as follows:

  • A novel semantic segmentation dataset was developed using the freely available Sentinel-2 satellite images with 10 m spatial resolution and 5-day temporal resolution.

  • Land cover semantic segmentation models using classical machine learning classifiers such as RF and SVM were developed to perform pixel-level land cover classification.

  • Land cover semantic segmentation models using deep learning, namely LinkNet-ResNet34, were developed to perform pixel-level land cover classification.

  • We developed and compared models based on LinkNet-ResNet34, CNN-RF, and CNN-SVM.

The upcoming Sect. 2 presents the methodology, including dataset details, data collection and dataset preparation for land cover classification, the design of the proposed models, and performance evaluation. The results and a discussion of the performance metrics of the findings are presented in Sect. 3, and the paper concludes with possible future extensions of this work.

2 Methodology

The methods that we employed are as follows:

Data collection We gathered multispectral satellite images of our study area (GNP) captured during leaf-off and leaf-on seasons by the Sentinel-2 satellites. Sentinel-2 images are freely available, and we downloaded Sentinel-2A and Sentinel-2B images from the USGS website.

Dataset creation We created a labelled semantic segmentation dataset containing image patches of 128 × 128 pixels and their corresponding 128 × 128 pixel masks.

Data preprocessing Techniques such as image normalization, label encoding, feature extraction, and feature selection were used to make our data fit the machine learning models more easily.

Designing and developing semantic segmentation models Machine learning classifiers such as Random Forest (RF) and Support Vector Machine (SVM) with CNN features, and a deep learning model, LinkNet with ResNet34 as its backbone, were designed and developed to perform semantic image segmentation.

Performance evaluation Performance metrics such as the confusion matrix, precision, recall, F-measure, and accuracy were used to evaluate the developed semantic segmentation models.

2.1 Data Collection and Dataset preparation

In this study, the Gambella National Park (GNP) was selected as our study area; it is located in the Gambella People's National Regional State of Ethiopia, 850 km west of Ethiopia's capital, Addis Ababa. The park is situated between 6°17′ and 8°42′ North latitude and 32°59′ and 35°23′ East longitude (Fig. 1).

Fig. 1
figure 1

Location Map of GNP

According to the Ethiopian Wildlife Conservation Authority (EWCA), in 2008 an area of approximately 438,000 hectares was allotted to large-scale farming ventures in the vicinity of the park without any assessment of environmental impacts [9]. Data from the region's investment office show that the investors are cutting and clearing the forest and savanna across the entire area they acquired; however, they are not actively engaged in operations, or they operate only a small portion of the total land [9].

Here, we prepared a novel dataset for the development of a semantic segmentation model that is used as a classifier for the land cover types in our study area. There are semantic segmentation datasets such as PASCAL VOC and the Stanford background dataset, which were mainly developed for segmenting object classes such as roads, trees, and buildings. Similarly, in the field of remote sensing there is a dataset called BigEarthNet [1], developed from Sentinel tiles covering 10 different European countries. It was mainly developed for land use land cover classification applications, but we believe that such datasets are not suitable for land cover classification of specific study areas. For this reason, we developed our own semantic segmentation dataset specific to our study area. We believe this dataset has many advantages over existing datasets for applications in this particular study area: datasets like BigEarthNet cover a wide range of countries and are broad in scope, whereas our new dataset is tailored to our particular study area.

As a classification scheme, we chose to develop a semantic segmentation model that assigns a particular land cover class to each pixel of a given image, which requires a labeled dataset. To prepare our dataset, we first downloaded multispectral Sentinel-2 satellite images from the USGS website, extracted patches of representative areas from the downloaded images, preprocessed the extracted patches, and labeled them manually to obtain the corresponding segmentation masks used as ground truth for the development of the segmentation model.

The overall dataset preparation process can be grouped into two major steps:

  A. Satellite image acquisition We gathered multidate multispectral satellite images of our study area captured during leaf-off and leaf-on seasons, with images captured on multiple dates (2-3 days) in each season by the Sentinel-2 satellites. Sentinel-2 images are freely available, and we downloaded Sentinel-2A and Sentinel-2B images from the USGS website. We used Level-1C products, which provide orthorectified top-of-atmosphere reflectance in cartographic geometry (combined UTM projection and WGS84 ellipsoid) with sub-pixel multi-spectral and multi-date registration. These products are radiometrically and geometrically corrected, and their ground sampling distance is 10 m, 20 m or 60 m, depending on the spectral band.

  B. Dataset creation Using the gathered satellite images, we created a labelled semantic segmentation dataset containing image patches of 128 × 128 pixels and their corresponding 128 × 128 pixel masks. Our dataset contains a total of 12,250 RGB images and their corresponding ground truth labels.

The actual dataset creation process is as follows: we performed band selection and compositing, image patch extraction, image enhancement, contrast adjustment, noise removal and smoothing, and image labelling.

We selected bands suitable for our case. Research results in [34] showed that the blue (B2), green (B3), and red (B4) bands with 10 m spatial resolution outperform all other bands of Sentinel-2 images for the task of land use land cover classification. Thus, we selected the red, green and blue bands to create RGB band-composite images, as sketched below.
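
As an illustration, the following is a minimal sketch of stacking the B4/B3/B2 bands of a Sentinel-2 Level-1C granule into an RGB composite. The file names are hypothetical placeholders for the actual granule band files, and rasterio is assumed as the reader; the paper does not specify the tooling used for this step.

```python
import numpy as np
import rasterio

# Hypothetical paths to the 10 m bands of a Sentinel-2 L1C granule.
band_paths = {
    "red": "T36PWB_20200115_B04.jp2",    # B4 (red)
    "green": "T36PWB_20200115_B03.jp2",  # B3 (green)
    "blue": "T36PWB_20200115_B02.jp2",   # B2 (blue)
}

def make_rgb_composite(paths):
    """Stack the red, green and blue bands into an (H, W, 3) array."""
    bands = []
    for key in ("red", "green", "blue"):
        with rasterio.open(paths[key]) as src:
            bands.append(src.read(1).astype(np.float32))
    rgb = np.stack(bands, axis=-1)
    # Scale the reflectance values to 0-255 for an 8-bit composite.
    rgb = np.clip(rgb / rgb.max() * 255.0, 0, 255).astype(np.uint8)
    return rgb
```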

Finally, we have tried to identify and group the land cover classes that are found in our study area into the following list of classes given in Table 1.

Table 1 Landcover classes in GNP and their descriptions

To perform the dense pixel-level image labeling, we used a labeling tool called LabelMe. We ended up with segmentation masks for each 512 × 512 RGB image in our dataset [2]. Fig. 2 shows sample 512 × 512 RGB images and their corresponding labels/masks.

Fig. 2
figure 2

Sample 512 × 512 RGB images and their corresponding labels

In the image labeling step, the 512 × 512 pixel images were labelled and their corresponding ground truth labels, or masks, were generated. However, the images and their masks were too large to be used directly in the semantic segmentation model development, given the limited computational resources available. We therefore split both the images and their corresponding masks into non-overlapping patches of 128 × 128 pixels, applying the same splitting to each image and its mask (a sketch is given below).
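
A minimal sketch of this non-overlapping tiling, assuming the images and masks are NumPy arrays; since 512 is evenly divisible by 128, each 512 × 512 image yields 16 aligned patch pairs.

```python
import numpy as np

def split_into_patches(image, mask, patch=128):
    """Split an image and its mask into aligned, non-overlapping patches."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            img_p = image[top:top + patch, left:left + patch]
            msk_p = mask[top:top + patch, left:left + patch]
            patches.append((img_p, msk_p))
    return patches

# A 512 x 512 image and mask yield 16 aligned 128 x 128 pairs.
image = np.zeros((512, 512, 3), dtype=np.uint8)  # stand-in image
mask = np.zeros((512, 512), dtype=np.uint8)      # stand-in mask
pairs = split_into_patches(image, mask)
assert len(pairs) == 16
```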

2.2 Data pre-processing

The data preprocessing steps are as follows:

  A. Data normalization We rescaled the pixel values of the images in our dataset by dividing all pixel values in each channel by 255, so that the pixel values fall in the range 0 to 1.

  B. Feature extraction Features were extracted and used to train and test our classical machine learning models, RF and SVM. We used 1000 selected RGB images from our dataset for semantic segmentation model development with these classical machine learning algorithms. The feature extractor network comprised three convolutional layers, each with 3 × 3 kernels, zero padding, and stride 1; the numbers of filters in the convolutional layers were 8, 16 and 32, respectively.

The network was trained from scratch. It uses the ReLU activation function but no pooling layers, since we want to extract features for each pixel of the input image. Table 2 shows the CNN parameters of the feature extraction network used in this work.

Table 2 Convolutional neural network (CNN) parameters for the feature extraction network used in this study

As can be seen in Table 2, the input shape of the feature extraction network is (128, 128, 3), meaning the input images are RGB images of size 128 × 128. These images pass through the three convolutional layers, which extract deep features for each pixel and produce feature vectors as output. The output of the last convolutional layer (convolution 3) is the feature map obtained from the feature extraction process; its shape is (None, 128, 128, 32), so each pixel has a 32-dimensional feature vector. Passing the 1000 selected RGB images through the CNN feature extractor gives an output of shape (1000, 128, 128, 32), i.e. a total of 1000 × 128 × 128 pixels, each with a 32-dimensional feature vector. A sketch of this extractor is given below.
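
The paper does not name the deep learning framework used for the extractor, so the following is a minimal Keras sketch of a network matching the description in Table 2: three 3 × 3 convolutions with 8, 16 and 32 filters, stride 1, zero ("same") padding, ReLU, and no pooling, so per-pixel 32-dimensional features are produced. The random input array merely stands in for the normalized image patches.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_feature_extractor():
    """Three 3x3 convolutions (8, 16, 32 filters), stride 1, zero padding,
    ReLU, and no pooling, so the 128 x 128 spatial size is preserved."""
    inputs = keras.Input(shape=(128, 128, 3))
    x = layers.Conv2D(8, 3, strides=1, padding="same", activation="relu")(inputs)
    x = layers.Conv2D(16, 3, strides=1, padding="same", activation="relu")(x)
    x = layers.Conv2D(32, 3, strides=1, padding="same", activation="relu")(x)
    return keras.Model(inputs, x, name="pixel_feature_extractor")

extractor = build_feature_extractor()

# 1000 RGB patches normalized to [0, 1] -> per-pixel 32-dim features.
images = np.random.rand(1000, 128, 128, 3).astype("float32")  # stand-in data
features = extractor.predict(images, batch_size=32)           # (1000, 128, 128, 32)
pixel_features = features.reshape(-1, 32)                     # (1000*128*128, 32)
```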

  C. Feature selection We used the recursive feature elimination technique to remove less important features and select the optimal number of features, using the RFECV() feature selection tool in Scikit-learn with RandomForestClassifier() as the estimator. The tool selected the feature columns most important to the accuracy of the model; the optimal number of features was found to be 10. That is, 10 elements of the 32-dimensional feature vector of each pixel, assumed to be important for predicting the target variable, were selected.

  D. Data splitting The train-valid-test split approach is easy and fast to use, and a 70/20/10 split was used to divide our dataset into train, validation, and test sets. Samples were assigned to the subsets by random selection from the original dataset, so that the training, validation, and testing sets contain samples representative of the original dataset. A combined sketch of this feature selection and splitting is given below.

2.3 Tools used

To perform semantic segmentation of satellite images into land cover classes, we first experimented with semantic segmentation model development using classical machine learning classifiers. The classical machine learning models with RF and SVM classifiers were built using the sklearn library on a virtual machine from Google Colab with 12 GB RAM, a 64 GB hard drive, and a free Tesla K80 GPU.

2.4 Semantic segmentation model building

Support vector machine (SVM) is an effective machine learning classifier for small datasets and gives good results compared with other types of classifiers [35]. Here, the convolutional neural network (CNN) features chosen in the feature selection step were used to train a multi-class SVM classifier to identify pixel class labels.

A set of pixels from all the training images in our dataset, together with their corresponding labels, was randomly selected and used as training instances for the SVM classifier. Hyperparameter tuning is fundamental to building robust machine learning models; the RandomizedSearchCV() method from the sklearn Python library was used to select optimal values for the SVM parameters. The values kernel = 'rbf', gamma = 'auto' and C = 10 gave better results than the other alternatives.
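
A minimal sketch of this randomized search for the SVM; the candidate values are assumptions, since the paper reports only the selected ones, and the feature arrays are stand-ins for the selected 10-dimensional per-pixel features.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Stand-ins for the selected 10-dim per-pixel features and their labels.
X_train = np.random.rand(2000, 10)
y_train = np.random.randint(0, 8, size=2000)

# Candidate values are assumptions; the paper reports only the chosen ones.
param_distributions = {
    "kernel": ["rbf", "poly", "linear"],
    "gamma": ["scale", "auto"],
    "C": [0.1, 1, 10, 100],
}
search = RandomizedSearchCV(SVC(), param_distributions,
                            n_iter=10, cv=3, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)  # reported best: kernel='rbf', gamma='auto', C=10
```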

Similar to the SVM model, we used the selected CNN features with the random forest classifier. Features for a set of pixels randomly selected from all the training images, together with their corresponding labels, were used to train the random forest classifier. The hyperparameters of the random forest model were tuned to optimize performance, again using Scikit-Learn's RandomizedSearchCV() method to evaluate a range of values for each hyperparameter and select the best. The most important settings, namely the number of trees (n_estimators), the maximum number of features considered when splitting a node (max_features), and the maximum depth of each decision tree (max_depth), were tuned to n_estimators = 100, max_features = 'sqrt' and max_depth = 10; the other parameters were left at their default values.
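
An analogous sketch for the random forest search; again, the candidate value lists are assumptions around the tuned values reported above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X_train = np.random.rand(2000, 10)            # stand-in features
y_train = np.random.randint(0, 8, size=2000)  # stand-in labels

# Candidate values are assumptions; the paper reports only the tuned result.
rf_distributions = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2"],
    "max_depth": [5, 10, 20, None],
}
rf_search = RandomizedSearchCV(RandomForestClassifier(), rf_distributions,
                               n_iter=10, cv=3, random_state=42)
rf_search.fit(X_train, y_train)
print(rf_search.best_params_)  # reported: n_estimators=100, max_features='sqrt', max_depth=10
```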

Structure of the LinkNet-ResNet34 model.

In this work, we use a pre-trained ResNet34 encoder, see Fig. 3. The encoder starts with an initial block that performs convolution with a 7 × 7 kernel and stride 2 [36].

Fig. 3
figure 3

Link-Net with ResNet-34 Architecture [36]

We used LinkNet with ResNet-34 as the backbone, as shown in Fig. 3. Residual networks (ResNets) are currently state-of-the-art CNN models with very deep layers, surpassing humans in some image recognition tasks [37]. We used ResNet-34 to bring the high recognition performance of residual networks into the LinkNet architecture. Developing deep learning models from scratch requires a large amount of labelled data, which is difficult to obtain; transfer learning can overcome this scarcity so that deep learning models can be built with the available limited labelled data. The LinkNet model was therefore built with a ResNet34 encoder pre-trained on the ImageNet dataset, and we further fine-tuned the ImageNet weights of the ResNet34 architecture. However, the model is limited by the reduced feature-map resolution caused by repeated strided pooling and convolutions, as well as by the multiple scales of the target objects.
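
The paper does not state which implementation was used to build the model; one common realization is the segmentation_models Keras package, sketched below under that assumption (the number of classes is a placeholder).

```python
import segmentation_models as sm

NUM_CLASSES = 8  # placeholder: set to the number of land cover classes in Table 1

# LinkNet decoder with an ImageNet-pretrained ResNet34 encoder; leaving the
# encoder trainable fine-tunes the ImageNet weights on the land cover data.
model = sm.Linknet(
    backbone_name="resnet34",
    input_shape=(128, 128, 3),
    classes=NUM_CLASSES,
    activation="softmax",
    encoder_weights="imagenet",
    encoder_freeze=False,
)
```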

3 Results and discussion

3.1 Performance metrics

After the segmentation models are built, we need to evaluate their performance. We used the following performance metrics to evaluate our segmentation models.

A confusion matrix is an N × N matrix, where N is the number of target classes, used for evaluating the performance of machine learning models.

Four key terms come with the confusion matrix: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). In our case, the key terms are defined as follows.

True Positive (TP): for a given class x, the total number of pixels correctly classified as x.

False Positive (FP): for a given class x, the total number of pixels incorrectly classified as x.

True Negative (TN): for a given class x, the total number of pixels correctly classified as not x.

False Negative (FN): for a given class x, the total number of pixels incorrectly classified as not x.

Pixel accuracy: the proportion of pixels correctly classified by the model. This metric is simple and the easiest to understand conceptually. Its formula is:

$$\mathrm{Accuracy }=\frac{TP+TN}{TP+TN+FP+FN}$$
(1)

Precision: the positive predictive value, i.e. the fraction of positive predictions that are true positives. This describes the purity of the positive predictions. Its formula is:

$$\mathrm{Precision }=\frac{TP}{TP+FP}$$
(2)

Recall: the true positive rate, i.e. the fraction of actual positives that are predicted as positive. This describes the completeness of the positive predictions. Its formula is:

$$\mathrm{Recall }=\frac{TP}{TP+FN}$$
(3)

Dice coefficient (F1-score): combines precision and recall into a single metric; it is calculated as twice the area of overlap between the prediction and the ground truth, divided by the total number of pixels in the prediction and the ground truth. Its formula is:

$$\mathrm{F1\text{-}Score} = \frac{2 \cdot intersection}{union + intersection} = \frac{2TP}{2TP + FP + FN}$$
(4)
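
For concreteness, a small sketch of how these per-class metrics can be computed from a confusion matrix; the row/column convention used here is an assumption.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall and F1, plus overall pixel accuracy, from an
    N x N confusion matrix (rows = true class, columns = predicted class),
    following Eqs. (1)-(4)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as x but actually not x
    fn = cm.sum(axis=1) - tp  # actually x but predicted as not x
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    f1 = np.divide(2 * tp, 2 * tp + fp + fn,
                   out=np.zeros_like(tp), where=(2 * tp + fp + fn) > 0)
    accuracy = tp.sum() / cm.sum()  # overall pixel accuracy, Eq. (1)
    return precision, recall, f1, accuracy
```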

3.2 Performance evaluation of the random forest with convolutional neural network features (CNN-RF) and support vector machine with convolutional neural network features (CNN-SVM) models

The CNN-RF and CNN-SVM models were built using pixels randomly selected from the images in our dataset, along with their labels. Specifically, about 1.5 million data points (pixels), each with a 10-dimensional feature vector, were randomly selected to build and evaluate the CNN-RF model; the same data points were used to build and evaluate the CNN-SVM model.

We used 150,000 pixels (10% of the total data points) to evaluate the models; the performance measures obtained are provided in Table 3.

Table 3 Summary of performance measures for CNN-RF and CNN-SVM models

As the performance report in Table 3 shows, we obtained 83% overall pixel accuracy and an 82% average F1-score for the CNN-RF model, and 82% overall pixel accuracy and an 81% average F1-score for the CNN-SVM model. The forest class has the highest recall and F1-score in both models. We believe this is because the forest class has unambiguous features, its spectral signature is easily distinguishable from the other classes in our dataset, and it appears in many of the images. Conversely, both models perform poorly on the road class: the road class has an almost identical spectral signature to the bare-ground class, so some road pixels are wrongly classified as bare-ground, and the number of training pixels for the road class was small compared to the other classes. In general, interclass feature similarity is what leads the models to perform poorly on some land cover classes, as can easily be observed in the confusion matrices of the CNN-RF and CNN-SVM models in Figs. 4 and 5. The overall performance measures show that the CNN-RF model outperforms the CNN-SVM model by a margin of 1% in overall pixel accuracy and average F1-score. This may be because our dataset exhibits interclass feature similarity, and SVMs perform poorly with overlapping classes.

Fig. 4
figure 4

Confusion matrix for CNN-RF model

As can be seen from the confusion matrix in Fig. 4, about 37% of the road pixels in our testing images are predicted as bare-ground pixels, a consequence of the limited visual distinction between the road and bare-ground classes. The model is also somewhat confused in distinguishing the water bodies class from the other classes; the predictions on water-body pixels from our testing images are spread among many of the classes, except the road class.

The confusion matrix in Fig. 5 shows per-class pixel accuracies on its diagonal. The forest class achieves 92% pixel accuracy, making it a highly recognized class, but the road class is not recognized by the model: road pixels in the testing set are mainly classified as bare-ground and cloud pixels. The road class remains the most difficult class to segment in all our experiments. We believe this is due to two main reasons: the visual similarity of the road class to other land cover classes such as bare-ground, and the under-representation of the road class in our dataset. In general, the strong values on the diagonals of the confusion matrices show that CNN-RF performs better than CNN-SVM in segmenting all the main land cover classes except the road class. Unlabeled pixels are still classified as bare-ground and forest.

Fig. 5
figure 5

Confusion matrix for CNN-SVM model

3.3 LinkNet-ResNet34 model

This model incorporates the pre-trained ResNet34 as the encoder of the LinkNet model. It was trained and evaluated on the novel semantic segmentation dataset prepared in this work. Training was performed for 50 epochs with an early stopping criterion of 10 epochs' patience on the validation loss. The model uses categorical cross-entropy as the loss function, F1-score as the evaluation metric, and Adam with a learning rate of 0.0001 as the optimizer (a sketch of this configuration is given below).
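
A sketch of the training configuration just described, continuing from the model-building sketch above and again assuming the Keras-based segmentation_models stack; the batch size and the one-hot mask variables are illustrative assumptions.

```python
import segmentation_models as sm
from tensorflow import keras

model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),  # Adam, lr = 0.0001
    loss="categorical_crossentropy",
    metrics=[sm.metrics.FScore()],                        # F1-score metric
)

# Early stopping on the validation loss with a patience of 10 epochs.
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)

history = model.fit(
    X_train, y_train_onehot,                    # one-hot encoded masks assumed
    validation_data=(X_valid, y_valid_onehot),
    epochs=50,
    batch_size=16,                              # assumed; not reported
    callbacks=[early_stop],
)
```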

The testing set was used to evaluate the LinkNet-ResNet34 model; the performance report obtained is given in Table 4. The model achieves an 87.4% average F1-score, an improvement over the performance achieved by the classical machine learning classifiers. As Table 4 shows, all land cover classes except the road class achieve an F1-score above 82%, so the LinkNet-ResNet34 model delivers the most improved performance for all land cover classes except the road class compared with the other models developed above. The model scores above 91% in F1-score, precision and recall for the forest class, whose visually distinct features make it easy to distinguish from the other classes. The road class, however, remains unrecognized: the model confuses road pixels in our testing images with other classes that have almost similar visual features. The confusion matrix in Fig. 6 also shows how the LinkNet-ResNet34 model performs on the testing data.

Table 4 Summary of Performance Measures for LinkNet-ResNet34 Model
Fig. 6
figure 6

Confusion matrix for LinkNet-ResNet34 model

The confusion matrix shows the high accuracies that the model achieves for the main land cover classes on its diagonal. The road class remains unrecognized: its testing pixels are predicted mainly as bare-ground and forest, with 33.4% of road pixels predicted as bare-ground, 48.4% as forest, and the rest spread among the other classes. The reason is essentially the under-representation, in pixel count, of the road class in the testing set. Most of the unlabeled pixels are again classified as forest and bare-ground.

3.4 Discussion of the results

Table 5 shows that LinkNet-ResNet34 stands out for its prediction accuracy.

Table 5 Comparison of models on test data

As Table 5 shows, the LinkNet-ResNet34 model outperforms the other models in precision, recall and F1-score, achieving 84% precision, 84% recall, 87.4% F1-score, and 88.2% accuracy. The CNN-RF model has almost comparable performance, achieving 83% precision, 83% recall, 82% F1-score, and 83% accuracy. Among all the tested models, the CNN-SVM model achieves the lowest performance: 81% precision, 82% recall, 81% F1-score, and 82% accuracy.

In this study, we experimented with designing an appropriate land cover semantic segmentation model. During experimentation, we designed models using RF and SVM classifiers and a LinkNet model with ResNet34 as the backbone. The developed semantic segmentation models were evaluated using the novel dataset prepared in this work and compared to select the best model. For the comparison, we preferred the precision, recall, and F1-score values that the models achieve on the test data; these metrics are better suited to evaluating semantic segmentation models than metrics such as overall pixel accuracy.

The best model was also evaluated qualitatively using sample input image patches, as shown in Fig. 7. The 1st and 4th columns show the input image patches; the 2nd and 3rd columns are the respective ground truth and prediction for the images in the 1st column, and the 5th and 6th columns are the ground truth and prediction for the images in the 4th column. The prediction outputs show that the model performs well in segmenting the main land cover classes. There are wrong predictions in the 1st row of the last column, where grassland (dark purple) is wrongly predicted as forest (yellow-green). Similarly, in the 1st row, 3rd column, areas labeled as bare ground (navy) are predicted as cutting areas (green); this may not be a wrong prediction but wrong labeling, suggesting the model can identify areas that were wrongly labeled. The last row, 1st column of Fig. 7 shows an input image patch taken from the leaf-off season; the prediction output for this patch is as good as the ground truth label.

Fig. 7
figure 7

Sample input images with their ground truth and predicted outputs

This model improves performance by a margin of 4-5% accuracy over the models with RF and SVM classifiers. Although the designed semantic segmentation model performs well in recognizing most of the main land cover classes, some classes remain difficult to recognize. This is mainly due to two basic reasons: the difficulty of discriminating classes with no strong spectral differences, and the imbalanced data representation of some classes. This could be addressed by increasing the representation of the smaller classes in the dataset and adopting class-weight optimization during model training.

The main importance of this work is to fill the gaps left by previous works that used low-resolution satellite images, traditional pixel-based classification, or patch-based classification where deep learning approaches were used.

4 Conclusions

This work aimed to develop a land cover classification system from satellite images based on semantic segmentation. To achieve this goal, experiments were conducted to build an appropriate land cover semantic segmentation model to be used as a classifier. The novel semantic segmentation dataset, prepared for this study from the freely available high-resolution Sentinel-2 satellite images of our study area, was used to build the semantic segmentation models. The RF and SVM classifiers with CNN features, and LinkNet with a ResNet34 backbone, achieve 83%, 82%, and 88.2% overall pixel accuracy, respectively.

The experimental results show that the proposed semantic segmentation model provides promising results in classifying the main land cover classes in our study area at the pixel level. We believe that even better recognition of the land cover classes can be achieved if the approaches outlined in the future works section are adopted.

5 Future works

The limitations of this work and directions for future research can be summarized as follows:

  • Visual similarity between land cover classes was a major problem in this work; using additional spectral bands such as NIR, alongside the red, green and blue bands, could mitigate this problem.

  • The dataset used was relatively small for building large deep learning models from scratch. Increasing the size of the dataset and balancing the representation of the land cover classes would boost the performance of the developed models.

  • The dataset could be extended to detect illegal tree cutting from satellite images using deep learning techniques, helping to protect the park's forests.

  • The proposed method treats thick cloud as a land cover class, and whatever lies beneath it remains masked; land cover changes therefore cannot be monitored during cloud cover. This can be addressed by employing SAR imaging sensors, which record reflectance from the earth's surface even in the presence of clouds; fusion of Sentinel-1 SAR images with Sentinel-2 optical images could overcome this drawback.