1 Introduction

Agriculture is an important sector of the world economy and provides humanity with essential products like food (Pandey et al. 2022). The increasing global population is accompanied by a growing demand for different varieties of food. To ensure global food security, food production must substantially increase and, in parallel, the negative environmental footprint of agriculture ought to be minimised (Foley et al. 2011). This aspiration is aptly captured within the second goal (“End hunger, achieve food security and improved nutrition and promote sustainable agriculture”) of the United Nations’ Sustainable Development Goals (SDGs) (United Nations 2015). To sustainably achieve food security, policymakers must design agricultural policies that incentivise farmers to use sustainable agricultural practices while ensuring a decent standard of living for the farmers.

To effectively create and monitor sustainable agricultural policies, spatially explicit information about all agricultural lands is required. Further, the dynamic nature of agricultural lands requires them to be monitored in near real-time to sustainably optimise agricultural practices and react to any emerging environmental threats (Weiss et al. 2020). Remote sensing (RS) is primed for near real-time monitoring of agricultural lands (Atzberger 2013; Weiss et al. 2020). The use of RS for mapping agricultural lands has been demonstrated in the literature at the regional (Sun et al. 2020; You et al. 2021), national (Boryan et al. 2011; Blickensdörfer et al. 2022), and continental (d’Andrimont et al. 2021) scales. The aforementioned studies classified the land-use (LU) types of agricultural lands at the pixel level. Although pixel-based image analysis is computationally efficient, especially for wide-area monitoring of agricultural lands, object-based image analysis (OBIA) generally produces more accurate LU maps, as demonstrated by several studies (Castillejo-González et al. 2009; Gilbertson et al. 2017; Belgiu and Csillik 2018). Using OBIA to map agricultural lands from RS images involves the segmentation of agricultural fields followed by the assignment of a LU type to each segmented field. The object-based crop type maps generated through OBIA enable the effective assessment of the agricultural practices being used at the field level and also allow for the accurate computation of agricultural statistics such as field sizes and shapes.

Image segmentation is the building block of OBIA (Blaschke 2010). There is a direct correlation between image segmentation quality and object-based classification accuracy (Liu and Xia 2010; Gao et al. 2011; Akcay et al. 2018). The traditional approach to segmenting agricultural fields from RS images involves the use of edge-based methods (Ji 1996; Turker and Kok 2013; Graesser and Ramankutty 2017; North et al. 2019; Wagner and Oppelt 2020), region-based methods (Möller et al. 2007; García-Pedrero et al. 2017; Belgiu and Csillik 2018; Nasrallah et al. 2018; Tetteh et al. 2020a; Luo et al. 2021), and hybrid methods (Rydberg and Borgefors 2001; Li and Xiao 2007; Yan and Roy 2014; Watkins and van Niekerk 2019) that combine edge-based and region-based methods. Choosing which method to use for image segmentation largely depends on the application needs of the user. According to Kotaridis and Lazaridou (2021), the region-based methods, particularly the multiresolution segmentation (MRS) (Baatz and Schäpe 2000) algorithm in eCognition (Trimble Germany GmbH 2019), are by far the most widely used for segmentation within the OBIA paradigm. The study of Ma et al. (2017) also revealed the popularity of the MRS algorithm.

Lately, the use of deep neural networks (DNNs) for various RS tasks like image segmentation has been gaining popularity (Ma et al. 2019). The extensive usage of DNNs in RS, particularly for supporting the SDGs, was recently reviewed by Persello et al. (2022). The popularity of DNNs has been facilitated by several factors including the availability of high-performance graphic cards, cloud computing, increased public availability of annotated data, and the superior performance of DNNs over shallow models (Kattenborn et al. 2021). Kotaridis and Lazaridou (2021) showed that, in 2020, more peer-reviewed studies used DNNs for segmentation than any other image segmentation method. Without any manual feature engineering, DNNs, particularly deep convolutional neural networks, can exploit hierarchical relationships between high-level and low-level features in an image, thereby making them suitable for delineating agricultural fields (Waldner and Diakogiannis 2020). Different DNNs have been used in the literature to delineate agricultural fields from RS images (García-Pedrero et al. 2019; Persello et al. 2019; Lv et al. 2020; Masoud et al. 2020; Aung et al. 2020; Waldner and Diakogiannis 2020; Meyer et al. 2020; Yang et al. 2020; Taravat et al. 2021; Waldner et al. 2021; Zhang et al. 2021; Wang et al. 2022; Jong et al. 2022; Long et al. 2022). U-Net (Ronneberger et al. 2015) and its various derivatives were the most used DNNs. The U-Net model and its derivatives are geared towards semantic segmentation; hence, they do not differentiate between objects belonging to the same class. This problem can be resolved through instance segmentation. To extract agricultural fields through instance segmentation, the most widely used DNN was Mask R-CNN (He et al. 2017).

The superiority of DNNs over shallow machine learning models such as support vector machines and random forests for various RS tasks such as land-cover and land-use classification has been highlighted in the literature (Ma et al. 2019; Kattenborn et al. 2021). However, it remains unclear how different DNNs compare with each other and with more traditional segmentation methods like MRS for the delineation of agricultural fields from RS images. Both Yang et al. (2020) and Taravat et al. (2021) compared different DNNs for the semantic segmentation of agricultural fields, but instance segmentation was not evaluated. Further, neither study compared its results to a more traditional segmentation method like MRS. Even though Masoud et al. (2020) compared their DNN with MRS for segmenting agricultural fields, their study had a small geographical scope (only ten tiles) and they did not evaluate any DNN for instance segmentation.

In this study, we present a large-scale comparison of the MRS algorithm with three different DNNs that have already been used in the literature to segment agricultural fields. For MRS, we used the optimised approach that was proposed by Tetteh et al. (2020a) for the segmentation of agricultural fields. Regarding the three DNNs, we selected (1) U-Net for its popularity and widespread usage for semantic segmentation, (2) Mask R-CNN for being the foremost model when it comes to instance segmentation, and (3) FracTAL ResUNet (Diakogiannis et al. 2021) for its recent usage for the effective segmentation of agricultural fields on a large scale as evidenced by these studies (Waldner et al. 2021; Wang et al. 2022).

2 Study Area and Data

As the study area, we chose Lower Saxony (Fig. 1). With about 62% of its total landmass being used as agricultural land (Tetteh et al. 2020a), Lower Saxony plays an important role in Germany’s economy regarding food production. Its agricultural areas are mostly covered by grasslands, cereals, potatoes, winter rapeseed, and sugar beet (Tetteh et al. 2020a). Lower Saxony has the largest acreage of potatoes and sugar beets in Germany, which reemphasises its key contribution to food production in Germany.

Fig. 1
figure 1

The geographical location of the study area (Lower Saxony, which is labelled NI on the map above). The coordinates are in EPSG:3035. BB Brandenburg, BE Berlin, BW Baden-Württemberg, BY Bavaria, HB Bremen, HE Hesse, HH Hamburg, MV Mecklenburg-Western Pomerania, NW North Rhine-Westphalia, RP Rhineland-Palatinate, SH Schleswig-Holstein, SL Saarland, SN Saxony, ST Saxony-Anhalt, TH Thuringia

Sentinel-2 (S2) images covering Lower Saxony acquired in May of 2018 were used in this study. As suggested by Tetteh et al. (2020a), we selected May because field boundaries become more visible in this month, hence easier to delineate. Similar to Tetteh et al. (2021), the top-of-atmosphere (TOA) S2 images provided by the European Space Agency (ESA) were converted to bottom-of-atmosphere (BOA) images using the FORCE (Framework for Operational Radiometric Correction for Environmental monitoring) (Frantz 2019) processing software. The red, green, blue, and near-infrared bands, which have the highest spatial resolution (10 m) of S2, were extracted from each BOA image. For each of those four bands, a mean band was created by averaging the spectral values of all pixels over the month. The four mean bands were then stacked together to create a monthly mean composite (MMC) image for May. This MMC was used in subsequent processes.
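
As an illustration of this compositing step, the following minimal Python sketch builds an MMC by averaging co-registered BOA scenes band-wise; the file paths are hypothetical, and the scenes are assumed to be resampled to a common 10 m grid with the four bands in a fixed order:

```python
# Minimal sketch: build a monthly mean composite (MMC) from BOA scenes.
# Assumes all scenes share the same grid and carry the four 10 m bands
# (red, green, blue, near-infrared); the paths are hypothetical.
import numpy as np
import rasterio

def build_mmc(boa_paths, out_path):
    stacks = []
    for path in boa_paths:
        with rasterio.open(path) as src:
            profile = src.profile
            data = src.read().astype("float32")    # shape: (bands, rows, cols)
            if src.nodata is not None:
                data[data == src.nodata] = np.nan  # ignore nodata pixels
            stacks.append(data)
    # Average the spectral values of each pixel over the month, per band
    mmc = np.nanmean(np.stack(stacks), axis=0)
    profile.update(dtype="float32")
    with rasterio.open(out_path, "w", **profile) as dst:
        dst.write(mmc)
```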

To limit the segmentation process to only agricultural areas, we masked out all non-agricultural areas from each MMC. In this study, agricultural areas equate to arable lands and grasslands. Following our previous studies (Tetteh et al. 2020a, 2020b, 2021), we extracted polygons belonging to the arable lands and grasslands in Lower Saxony from the digital landscape model (DLM) of the German Official Topographic Cartographic Information System (ATKIS). The DLM is a spatial database containing the land cover of Germany. All pixels spatially falling outside the arable lands and grasslands were removed from the MMC images.
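
A minimal sketch of this masking step, assuming the arable-land and grassland polygons extracted from the ATKIS DLM are stored in a hypothetical file `atkis_agriculture.gpkg`, could look as follows:

```python
# Minimal sketch: mask out non-agricultural areas from the MMC.
# The vector and raster file names are hypothetical.
import geopandas as gpd
import rasterio
from rasterio.mask import mask

agri = gpd.read_file("atkis_agriculture.gpkg")   # arable lands + grasslands
with rasterio.open("mmc_may_2018.tif") as src:
    agri = agri.to_crs(src.crs)                  # align coordinate systems
    # Pixels outside the agricultural polygons are set to the nodata value
    masked, _ = mask(src, agri.geometry, crop=False, nodata=0)
    profile = src.profile
profile.update(nodata=0)
with rasterio.open("mmc_may_2018_masked.tif", "w", **profile) as dst:
    dst.write(masked)
```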

As reference data, we used the agricultural parcels of the Geospatial Aid Application (GSAA). To access the subsidies of the Common Agricultural Policy (CAP) (European Commission 2017), farmers within the European Union (EU) declare the boundaries of their agricultural parcels and the corresponding LU types through the GSAA. This declaration is usually done in May of a particular year. We used the GSAA parcels of 2018. The size of the agricultural parcels ranges from 0.1 ha to 155 ha with an average size of 3 ha (Tetteh et al. 2020a).

3 Methodology

Figure 2 shows the workflow that was employed in this study. The main components of the workflow are explained in the following subsections.

Fig. 2
figure 2

Overview of the workflow used in this study. ATKIS German Official Topographic Cartographic Information System, MMCs monthly mean composites, GSAA Geospatial Aid Application, DNNs deep neural networks, MRS multiresolution segmentation

3.1 Data Preparation

To ensure the efficient segmentation of the agricultural fields by the three DNNs using a graphic processing unit (NVIDIA GRID T4-16Q) with a dedicated memory of 14 GB, Lower Saxony was partitioned into 8417 tiles with each tile being 2.56 km × 2.56 km (256 pixels × 256 pixels). On average, the number of GSAA parcels per tile is 140. To ensure that there are enough GSAA parcels per tile for both the training and testing, we removed the tiles with fewer than 50 parcels. This reduced the number of tiles to 7169.
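
The tile-filtering step could be sketched as follows; this is a hedged example in which the file names are hypothetical and the tile and parcel layers are assumed to share the same coordinate reference system:

```python
# Minimal sketch: count GSAA parcels per tile and drop tiles with fewer
# than 50 parcels; file names are hypothetical.
import geopandas as gpd

tiles = gpd.read_file("tiles_2560m.gpkg")    # 8417 tiles of 2.56 km x 2.56 km
parcels = gpd.read_file("gsaa_2018.gpkg")
joined = gpd.sjoin(tiles, parcels, how="left", predicate="intersects")
# count() skips the NaN entries of tiles that match no parcel
counts = joined["index_right"].groupby(joined.index).count()
tiles = tiles[counts >= 50]                  # 7169 tiles remain
```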

From the 7169 tiles, a stratified random sampling approach was used to split the tiles into 70% training tiles (5018) and 30% test tiles (2151). The stratification involved two steps. Following the approach of Tetteh et al. (2021), we first computed the shape factor (SF) per tile as shown in Eq. (1):

$$SF = \frac{1}{n}\sum_{i=1}^{n} \frac{4\pi \times \mathrm{Area}\left(X_{i}\right)}{\left(\mathrm{Perimeter}\left(X_{i}\right)\right)^{2}}$$
(1)

where \(X\) is a GSAA parcel and \(n\) is the number of GSAA parcels per tile. The SF, which is based on the method of Polsby and Popper (1991), is a measure of the level of compactness per tile. It ranges from 0 (lowest compactness) to 1 (highest compactness). A tile with low compactness indicates that the agricultural fields in that tile are more elongated, whereas a tile dominated by more circular fields has high compactness. Second, after some visual analysis, we categorised the SFs of the tiles into three classes namely low compactness (0.0 < SF ≤ 0.4), medium compactness (0.4 < SF ≤ 0.6), and high compactness (0.6 < SF ≤ 1.0). The stratification was done based on this categorisation. Figures 3 and 4, respectively, show the training and test tiles, where each tile is coloured by its corresponding SF class.
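
The SF computation and the subsequent categorisation can be sketched as follows, assuming `parcels_in_tile` is a GeoDataFrame of the GSAA parcels of one tile in a metric projection:

```python
# Minimal sketch: Polsby-Popper shape factor per tile (Eq. 1) and the
# compactness class used for the stratified split.
import numpy as np

def shape_factor(parcels_in_tile):
    geoms = parcels_in_tile.geometry
    # Per-parcel compactness: 4 * pi * Area / Perimeter^2
    pp = 4.0 * np.pi * geoms.area / geoms.length ** 2
    return pp.mean()          # average over all parcels of the tile

def compactness_class(sf):
    if sf <= 0.4:
        return "low"
    if sf <= 0.6:
        return "medium"
    return "high"
```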

Fig. 3
figure 3

The tiles that were selected for training. Each tile is coloured by its shape factor (SF) class

Fig. 4
figure 4

The tiles that were selected for testing. Each tile is coloured by its shape factor (SF) class

For each tile, a corresponding image chip was clipped out from the masked MMC images. In all, 5018 training images and 2151 test images were created.

3.2 Segmentation Methods

3.2.1 U-Net

U-Net was initially designed for the semantic labelling of pixels in biomedical images. It is now widely used for the semantic segmentation of different types of images including RS images. Its use for delineating agricultural fields from RS images has been demonstrated in the literature (García-Pedrero et al. 2019; Aung et al. 2020; Yang et al. 2020; Taravat et al. 2021). U-Net has two parts: a contracting path (encoder) for extracting features from an input image and an expansive path (decoder) for precise localisation and upsampling of the extracted features to the same dimension as the input image. The contracting path is a typical convolutional network consisting of the repeated application of two convolutions, each followed by a rectified linear unit (ReLU), and max pooling. Every step in the expansive path consists of up-convolution and concatenation followed by the application of two convolutions with a ReLU. As the final layer, a convolution is applied to translate the extracted features to the desired number of classes, and an activation function (softmax in our study) is used to assign class probabilities to each pixel. Further details about U-Net can be found in Ronneberger et al. (2015).
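
For illustration, a compact Keras sketch of this layout is given below; it is not the exact configuration used in this study, and the depth and filter counts are illustrative:

```python
# Minimal U-Net sketch: contracting path with two convolutions + ReLU and max
# pooling per step, expansive path with up-convolutions and skip connections,
# and a final 1x1 convolution with softmax over the classes.
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_unet(input_shape=(256, 256, 4), n_classes=3):
    inputs = layers.Input(input_shape)
    # Contracting path (encoder)
    c1 = conv_block(inputs, 32)
    p1 = layers.MaxPooling2D()(c1)
    c2 = conv_block(p1, 64)
    p2 = layers.MaxPooling2D()(c2)
    c3 = conv_block(p2, 128)                      # bottleneck
    # Expansive path (decoder) with skip connections
    u2 = layers.Conv2DTranspose(64, 2, strides=2, padding="same")(c3)
    c4 = conv_block(layers.Concatenate()([u2, c2]), 64)
    u1 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(c4)
    c5 = conv_block(layers.Concatenate()([u1, c1]), 32)
    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(c5)
    return Model(inputs, outputs)
```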

3.2.2 FracTAL ResUNet

Following the encoder–decoder style of U-Net, Diakogiannis et al. (2020) proposed ResUNet-a, a novel network for semantic segmentation. The encoder and decoder blocks of ResUNet-a are composed of residual blocks of convolutional layers (He et al. 2016) followed by pyramid scene parsing pooling (Zhao et al. 2017). In each residual block, multiple parallel atrous convolutions (Chen et al. 2017a, 2017b) with different dilation rates were used. A more detailed explanation of ResUNet-a can be found in Diakogiannis et al. (2020). In a change detection study, Diakogiannis et al. (2021) defined a new model by introducing a self-attention mechanism to the ResUNet-a architecture. Each residual block of ResUNet-a with the atrous convolutions was replaced by a residual block with a Fractal Tanimoto Attention Layer (FracTAL). The authors consequently named this new network FracTAL ResUNet. This network was used by Waldner et al. (2021) and Wang et al. (2022) for agricultural field delineation from satellite images.

3.2.3 Mask R-CNN

Mask R-CNN is an extension of Faster R-CNN (Ren et al. 2016). It maintains the bounding box recognition and classification branches of Faster R-CNN and in parallel adds a branch for predicting binary segmentation masks on each Region of Interest (RoI) (He et al. 2017). Therefore, Mask R-CNN is meant for instance segmentation (semantic segmentation and object detection). The Mask R-CNN architecture has two components: a backbone and a head. The backbone uses a convolutional neural network (CNN), typically ResNet-101 (He et al. 2015), and the Feature Pyramid Network (FPN) (Lin et al. 2017) to extract feature maps from the input image. The head section uses a Region Proposal Network (RPN) for extracting the RoIs, an RoI alignment layer for aligning the RoIs with the corresponding regions in the input image, fully connected layers for bounding box regression and softmax classification, and a fully convolutional network (FCN) for generating a binary segmentation mask for each user-defined class. More details about Mask R-CNN can be found in He et al. (2017). Some researchers (Lv et al. 2020; Meyer et al. 2020) have used Mask R-CNN for segmenting agricultural fields.

3.2.4 Optimised MRS

Unlike the DNNs, the MRS algorithm does not require training. It can simply be applied to any image of interest to generate corresponding segments. The outcome of the algorithm is controlled by three main parameters namely scale, shape, and compactness. With each parameter taking varying input values, an endless number of parameter combinations could be generated. Determining the optimal parameter combination to use for the segmentation of each test image could be done through supervised or unsupervised optimisation. Supervised optimisation involves the use of reference data while unsupervised optimisation involves the direct use of the image content to identify the optimal combination. We demonstrated in our previous study (Tetteh et al. 2020b) that optimising the MRS parameters in an unsupervised manner produces significantly lower segmentation accuracies when compared to supervised optimisation. Therefore, in this study, we used the supervised segmentation optimisation (SSO) approach proposed by Tetteh et al. (2020a). The core of that SSO approach is the use of the MRS algorithm in eCognition, Bayesian optimisation, and supervised segmentation evaluation. To use Bayesian optimisation, one would have to define an objective function to optimise (maximise or minimise). An objective function is a function that takes some input (here, a combination of scale, shape, and compactness) and then returns a metric. In the SSO approach, this metric was computed through supervised segmentation evaluation, which involves the geometric comparison of segments created with the MRS algorithm with their corresponding GSAA parcels. The specific metric that we computed was the area-weighted average of the Jaccard index (Jaccard 1901). The Jaccard index is popularly known as intersection over union (IoU). The parameter combination with the highest area-weighted IoU value is considered the best combination and the corresponding segmentation result is returned by the SSO.
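
Conceptually, the SSO loop can be sketched as follows; `run_mrs` (a wrapper around the MRS algorithm in eCognition) and `area_weighted_iou` are hypothetical helpers, and the search bounds are illustrative:

```python
# Hedged sketch of the supervised segmentation optimisation (SSO) loop using
# Bayesian optimisation; helper functions and search bounds are illustrative.
from skopt import gp_minimize
from skopt.space import Integer, Real

space = [Integer(10, 500, name="scale"),        # illustrative bounds
         Real(0.1, 0.9, name="shape"),
         Real(0.1, 0.9, name="compactness")]

def objective(params):
    scale, shape, compactness = params
    segments = run_mrs(image, scale, shape, compactness)  # hypothetical wrapper
    # gp_minimize minimises, so return the negated area-weighted IoU
    return -area_weighted_iou(segments, gsaa_parcels)     # hypothetical helper

result = gp_minimize(objective, space, n_calls=50, random_state=42)
best_scale, best_shape, best_compactness = result.x
```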

3.3 Segmentation Experiments

To train the DNNs, we generated two classes namely field (class label = 1) and boundary (class label = 2) from the GSAA parcels per training tile. For each GSAA parcel, we applied an inward buffer of 5 m. The inwardly buffered polygons represented the field layer. The geometric difference between the GSAA parcels and the field layer constituted the boundary layer. Those two layers were subsequently rasterised to create a reference image per training tile. Using all four bands, each training and test image had a size of 256 × 256 × 4. For all DNNs, the number of training epochs was set at 50.
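
A minimal sketch of this reference-image generation, assuming `parcels` is a GeoDataFrame of the GSAA parcels of one training tile in a metric projection and `transform` is the affine transform of the corresponding image chip:

```python
# Minimal sketch: derive the field (1) and boundary (2) classes from the GSAA
# parcels and rasterise them into a 256 x 256 reference image.
from rasterio.features import rasterize

fields = parcels.geometry.buffer(-5)              # 5 m inward buffer
boundaries = parcels.geometry.difference(fields)  # rings along parcel edges
shapes = [(geom, 1) for geom in fields if not geom.is_empty] + \
         [(geom, 2) for geom in boundaries if not geom.is_empty]
reference = rasterize(shapes, out_shape=(256, 256),
                      transform=transform, fill=0)  # background stays 0
```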

Following the approach of García-Pedrero et al. (2019), we compiled U-Net using the Adam optimiser (Kingma and Ba 2017) with a learning rate of 0.0001. For the loss function, we used categorical cross-entropy as is usually done in the literature when it comes to multiclass classification with DNNs. The U-Net model was trained in TensorFlow (Abadi et al. 2016) using a batch size of 20. The trained U-Net model, when applied to any test image, returns a pixel-wise probability image in which each pixel is allocated the probabilities of the field and boundary classes. The actual class label per pixel is then determined as the arg max of the probability image. The outcome of this arg max is an image in which each pixel is either assigned to a field or a boundary.
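
The training setup and the arg max step could be sketched as follows; the `build_unet` sketch from Sect. 3.2.1 is assumed, and `train_y` is assumed to be one-hot encoded over the background, field, and boundary classes:

```python
# Hedged sketch of the U-Net training configuration and arg max inference;
# build_unet, train_x, train_y, and test_image are assumed to exist.
import numpy as np
import tensorflow as tf

model = build_unet()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy")
model.fit(train_x, train_y, batch_size=20, epochs=50)

probs = model.predict(test_image[np.newaxis, ...])  # (1, 256, 256, n_classes)
labels = np.argmax(probs[0], axis=-1)               # per-pixel class label
```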

Regarding FracTAL ResUNet, we used the model and corresponding hyperparameters that were defined in Waldner et al. (2021). To train FracTAL ResUNet, three reference images must be generated for each training image. The three reference images are the extent mask, boundary mask, and distance image. The extent mask is a binary image, where all field pixels (class label 1) are one and other pixels are zero. The boundary mask is also a binary image with boundary pixels (class label 2) being one and other pixels being zero. The distance image is created by applying a distance transform to the extent mask and then normalising the resultant image between zero and one. The training of the model was done with the MXNet (Chen et al. 2015) deep learning library. Here, the batch size was reduced to four to enable MXNet to run without raising memory errors. When the trained FracTAL ResUNet model is applied to any test image, it generates three output layers namely an extent (field) probability image, a boundary probability image, and a distance image. To delineate the agricultural fields, Waldner et al. (2021) used the extent and boundary probability images as inputs to hierarchical watershed segmentation. The quality of the delineated agricultural fields depends on the specific dynamics threshold (\({t}_{b}\)) applied to the edge-weighted graph generated from the boundary probability image and the extent threshold (\({t}_{e}\)) applied to the extent probability image. Following Waldner et al. (2021), we set \({t}_{b}\) to 0.2 and \({t}_{e}\) to 0.4. The outcome of the hierarchical watershed segmentation is an image in which a unique number is assigned to all pixels belonging to each detected field instance.
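
Deriving the three reference layers from the rasterised reference image of the earlier sketch could look as follows; global normalisation of the distance image is assumed here:

```python
# Minimal sketch: extent mask, boundary mask, and normalised distance image
# for training FracTAL ResUNet; `reference` is the rasterised reference image
# (field = 1, boundary = 2) from the sketch above.
from scipy import ndimage

extent_mask = (reference == 1).astype("float32")    # field pixels -> 1
boundary_mask = (reference == 2).astype("float32")  # boundary pixels -> 1
distance = ndimage.distance_transform_edt(extent_mask)
if distance.max() > 0:
    distance /= distance.max()                      # normalise to [0, 1]
```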

For Mask R-CNN, we used the TensorFlow implementation of Abdulla (2017). To enable Mask R-CNN to correctly learn the variable field sizes and shapes contained in a satellite image, we followed the approach of Meyer et al. (2020) by changing the RPN anchor scales from (32, 64, 128, 256, 512) to (8, 16, 32, 64, 128) and the anchor ratios from (0.5, 1, 2) to (0.1, 0.5, 1, 2, 4). Further, we changed the maximum number of ground truth instances to use per image from 100 to 554 to ensure that all available GSAA parcels per image are used during training. We used 554 because it is the maximum number of GSAA parcels per tile. The number of image channels to use was changed from three to four. We set the number of classes to one corresponding to class label 1, given that we are only interested in field instances. The batch size was set to eight, as higher values raised memory errors in TensorFlow. After applying the trained Mask R-CNN model to any test image, a binary image is created in which pixels belonging to each detected field instance are assigned a value of one and non-field pixels are set to zero.
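
The configuration changes described above could be expressed as follows in the Config class of the Matterport implementation (Abdulla 2017); the attribute names follow that code base, and the mean pixel values are placeholders:

```python
# Hedged sketch of the Mask R-CNN configuration changes; attribute names
# follow the Matterport implementation (Abdulla 2017).
import numpy as np
from mrcnn.config import Config

class FieldConfig(Config):
    NAME = "fields"
    IMAGES_PER_GPU = 8                         # batch size of eight
    NUM_CLASSES = 1 + 1                        # background + field
    IMAGE_CHANNEL_COUNT = 4                    # red, green, blue, near-infrared
    MEAN_PIXEL = np.zeros(4)                   # placeholder; one value per band
    RPN_ANCHOR_SCALES = (8, 16, 32, 64, 128)   # anchors for smaller fields
    RPN_ANCHOR_RATIOS = [0.1, 0.5, 1, 2, 4]    # include elongated fields
    MAX_GT_INSTANCES = 554                     # max GSAA parcels per tile
```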

To effectively use the SSO approach in delineating agricultural fields in any input image, Tetteh et al. (2021) proposed an image masking approach in which the agricultural land-cover polygons extracted from the DLM of ATKIS were first inwardly (negatively) buffered by 5 m to create a separation between adjacent polygons. These inwardly buffered polygons were then used to mask out the non-agricultural areas. This masking process pre-segmented the input image. We adopted this masking approach in this study to mask the test images before applying the SSO to segment the fields.

3.4 Evaluation of Segmentation Accuracy

In semantic and instance segmentation tasks, the accuracy of a segmentation output is usually measured at the pixel level from a confusion matrix. However, we are interested in the geometric accuracy of only the segmented fields; hence, we opted for object-based accuracy assessment (OBAA). Before the OBAA, we first created field polygons by simply vectorising only the field pixels of the segmented output image generated by each DNN. The output of MRS is a vector layer; hence, no vectorisation was needed. For each method, we calculated two OBAA metrics commonly used in computer vision tasks to assess the geometric similarity between target objects (vectorised field layers) and their corresponding reference objects (GSAA parcels) per test tile. The first metric was the IoU (Eq. (2)):

$$IoU= \frac{\mathrm{Area}\left(X \cap Y\right)}{\mathrm{Area}\left(X \cup Y\right)}$$
(2)

where X refers to all GSAA parcels, Y refers to all vectorised fields, ∩ is the spatial intersection operator, and ∪ represents the spatial union operator. The IoU metric ranges from 0 (no geometric match) to 1 (complete geometric match). The IoU metric is generally more sensitive to smaller fields than to bigger ones, especially where there is a small spatial misalignment between the fields and their corresponding reference objects (Tetteh et al. 2021). Therefore, as a second metric, we computed F-score, which is captured as F1 in Eq. (3):

$${F}_{1}=2\times \frac{\mathrm{Precision}\times \mathrm{ Recall}}{\mathrm{Precision}+ \mathrm{Recall}}$$
(3)

where Precision (Eq. (4)) measures the level of under-segmentation in the segmentation output and Recall (Eq. (5)) measures the level of over-segmentation:

$$\mathrm{Precision}= \frac{\mathrm{Area}\left(X \cap Y\right)}{\mathrm{Area}\left(Y\right)}$$
(4)
$$\mathrm{Recall}= \frac{\mathrm{Area}\left(X \cap Y\right)}{\mathrm{Area}\left(X\right)}$$
(5)

The variables and symbols in Eqs. (4) and (5) have the same meaning as in Eq. (2). The F-score, precision, and recall metrics also range from 0 (worst segmentation) to 1 (perfect segmentation).
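
For one test tile, these metrics can be computed from the dissolved geometries as sketched below; `gsaa` and `fields` are assumed GeoDataFrames in the same projected CRS:

```python
# Minimal sketch: object-based accuracy metrics (Eqs. 2-5) for one test tile.
from shapely.ops import unary_union

X = unary_union(list(gsaa.geometry))    # all GSAA parcels of the tile
Y = unary_union(list(fields.geometry))  # all vectorised fields of the tile

intersection = X.intersection(Y).area
iou = intersection / X.union(Y).area                 # Eq. (2)
precision = intersection / Y.area                    # Eq. (4)
recall = intersection / X.area                       # Eq. (5)
f1 = 2 * precision * recall / (precision + recall)   # Eq. (3)
```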

4 Results

The performance of each method averaged over the 2151 test tiles is reported in Table 1. The distribution of the precision, recall, F-score, and IoU values can, respectively, be seen in Figs. 10, 11, 12, and 13 of the appendix. From Table 1, the FracTAL ResUNet method achieved the highest average recall, F-score, and IoU values. The performance of the optimised MRS approach was close to that of FracTAL ResUNet. Mask R-CNN obtained the worst performance in all metrics except precision. The optimised MRS and FracTAL ResUNet methods obtained the lowest precision, while U-Net achieved the highest.

Table 1 The performance achieved by each method averaged over the 2151 test tiles

Based on the F-score and IoU metrics, we analysed the performance of each method for the three SF classes created at the data preparation stage. Figures 5 and 6 are violin plots, respectively, showing the distribution of the F-score and IoU values per SF class for the four methods. In both figures, the density curves of the methods, particularly for Mask R-CNN, have wider spreads at the low compactness class but narrower spreads at the medium and high compactness classes. Regardless of the method, on average, the lowest F-score and IoU values were obtained for test tiles with low compactness, and the highest F-score and IoU values were obtained for test tiles with medium or high compactness.

Fig. 5
figure 5

Violin plots showing the distribution of the F-score values for each method for the test tiles with a low compactness, b medium compactness, and c high compactness

Fig. 6
figure 6

Violin plots showing the distribution of the intersection over union (IoU) values for each method for the test tiles with a low compactness, b medium compactness, and c high compactness

A visual inspection (Figs. 7, 8, 9) of the segmentation results of the methods at three tiles, respectively, selected from the three SF classes reaffirms the results shown in Figs. 5 and 6. The segmentation outcome for a tile with low compactness is shown in Fig. 7, the outcome for a tile with medium compactness is captured by Fig. 8, and the outcome for a tile with high compactness is shown in Fig. 9. For each of those three figures, the corresponding F-score and IoU values obtained by each method are also reported in Table 2. As discernible from Table 2, the lowest accuracies were obtained in the low compactness class, and the highest accuracies were achieved in the high compactness class.

Fig. 7
figure 7

The segmentation result obtained at a tile with low compactness. a The masked MMC image of the tile, b the GSAA parcels (cyan outlines) overlaid on the masked MMC image, c the segmentation result (yellow outlines) of Mask R-CNN, d the segmentation result (orange outlines) of U-Net, e the segmentation result (blue outlines) of FracTAL ResUNet, and f the segmentation result (red outlines) of the optimised MRS

Fig. 8
figure 8

The segmentation result obtained at a tile with medium compactness. a The masked MMC image of the tile, b the GSAA parcels (cyan outlines) overlaid on the masked MMC image, c the segmentation result (yellow outlines) of Mask R-CNN, d the segmentation result (orange outlines) of U-Net, e the segmentation result (blue outlines) of FracTAL ResUNet, and f the segmentation result (red outlines) of the optimised MRS

Fig. 9
figure 9

The segmentation result obtained at a tile with high compactness. a The masked MMC image of the tile, b the GSAA parcels (cyan outlines) overlaid on the masked MMC image, c the segmentation result (yellow outlines) of Mask R-CNN, d the segmentation result (orange outlines) of U-Net, e the segmentation result (blue outlines) of FracTAL ResUNet, and f the segmentation result (red outlines) of the optimised MRS

Table 2 The F-score and IoU values obtained by each method at the three test tiles, respectively, shown in Figs. 7, 8, and 9

5 Discussion

Looking at Tables 1 and 2, a positive correlation can be established between the F-score and IoU metrics. This correlation can be linked to the similar mathematical formulations of those two metrics (Maxwell et al. 2021). The F-score values were higher than the IoU values due to the higher weight put on correctly delineated areas (the intersection between the reference and target objects) by F-score. Regardless of which of those two metrics is opted for, FracTAL ResUNet proved to be the clear-cut winner among the DNNs and ultimately the best method, as it also outperformed the optimised MRS. Although the differences in the F-score and IoU values between FracTAL ResUNet and the optimised MRS (Table 1) were the smallest, paired t-tests performed with all the F-score and IoU values revealed that the differences were statistically significant (p value < 0.006 for F-score and p value < 0.001 for IoU). Overall, Mask R-CNN had the worst performance. Just like the segmentation results generated for France and Denmark by Meyer et al. (2020) from S2 images, the segments created with Mask R-CNN in this study, as captured in Figs. 7c, 8c, and 9c, were often wobbly and did not properly capture the spatial boundaries (edges) of the agricultural fields. Consequently, Mask R-CNN generally produced the most over-segmented results, as can be observed in Table 1, where it had the worst average recall. With very similar precision values (Table 1), all four methods generated segmentation results with acceptable under-segmentation rates.

The impact of the size and shape of agricultural fields on the accuracy of the subsequent segmentation process has been well documented in previous studies (Tetteh et al. 2020a, 2020b, 2021). In those previous studies, it was observed that in areas where the agricultural fields were small and/or elongated (i.e. low compactness), the segmentation accuracies were low, and in areas with big and more compact fields (high compactness), the segmentation accuracies were high. This observation is consistent with the results shown in Figs. 5 and 6, where the F-score and IoU values of all methods increased with increasing compactness. The negative impact of elongated fields on segmentation accuracies was most prominent in the results of the Mask R-CNN method. As visible in Fig. 7c, where the tile had low compactness, Mask R-CNN obtained the worst F-score and IoU values (see Table 2). Figure 7c clearly shows that Mask R-CNN was unable to detect numerous agricultural fields in that tile, which led to massive over-segmentation. Beyond the low compactness, the agricultural fields in the tile shown in Fig. 7a were mostly dominated by mowing pasture, thereby making it difficult to identify visible boundaries between the individual fields. In discussing their segmentation results, Waldner et al. (2021) noted that in areas where pasture was prevalent, the field delineation was less accurate. Therefore, tiles such as the one in Fig. 7a will pose problems for any segmentation algorithm because the spatial resolution of S2 does not allow for the proper resolution of the agricultural fields present in such tiles (Tetteh et al. 2020a, 2020b, 2021). In Fig. 7, although the results of U-Net (Fig. 7d), FracTAL ResUNet (Fig. 7e), and the optimised MRS (Fig. 7f) had numerous instances of under-segmentation, those three methods performed fairly well as they correctly delineated most of the agricultural fields, unlike Mask R-CNN. In Figs. 8 and 9, where the agricultural fields were more compact and had different LU types, the corresponding segments generated by all methods had better geometric matches to the GSAA parcels when compared with the segmentation results of Fig. 7, as aptly captured by the F-score and IoU values of Table 2. For both the F-score and IoU metrics, the highest leap in segmentation accuracy from low compactness to medium compactness was recorded for Mask R-CNN (see Table 2). From the low compactness to the medium compactness, the segmentation accuracies of U-Net, FracTAL ResUNet, and the optimised MRS remained fairly stable (see Table 2).

In this study, the two methods that stood out were FracTAL ResUNet and the optimised MRS. The performance of FracTAL ResUNet could be linked to (1) the use of residual convolution blocks to deal with the problem of vanishing or exploding gradients while training a DNN (Diakogiannis et al. 2020), (2) the use of the self-attention mechanism to emphasise important features in convolution layers (Waldner et al. 2021), and (3) the use of conditioned multitasking whereby a distance image is first predicted, then this information is used to predict boundaries, and finally, both predictions are used as the basis to predict extents. It is important to emphasise here that FracTAL ResUNet, as used by Waldner et al. (2021), could best be described as a feature engineering method to extract features (extent probability and boundary probability images) that are post-processed to generate the agricultural fields. It remains to be seen how FracTAL ResUNet would perform when it is directly used to extract agricultural fields through pixel-wise semantic labelling without applying any post-processing method like hierarchical watershed segmentation. As implemented by Waldner et al. (2021), the accuracy of the segmented agricultural fields largely depends on the specific dynamics threshold (\({t}_{b}\)) and extent threshold (\({t}_{e}\)) passed to the hierarchical watershed segmentation algorithm.

The performance of the MRS algorithm could largely be linked to the direct use of the reference GSAA parcels to guide the segmentation process at each test tile. Another factor that might have helped the MRS approach was the creation of the masks from the inwardly buffered land-cover polygons extracted from ATKIS and the subsequent use of those masks to pre-segment the test images. In the study of Tetteh et al. (2020a), where the size of each tile was 10 km × 10 km, the average segmentation accuracy achieved for Lower Saxony was lower than in this study. Largely, this can be attributed to the smaller sizes (2.56 km × 2.56 km) of tiles used in this study. It was reported by Drăguţ et al. (2019) that the segmentation accuracy achieved by the MRS algorithm inversely correlates with the spatial extent of the input image.

The three DNNs used in this study are supervised methods. The DNNs can be trained on some training images, the trained model can be saved, and then the saved model can be subsequently applied to segment unseen (test) images. This concept does not apply to the MRS algorithm because it is designed for unsupervised segmentation; hence, no training is required. Therefore, to ensure a fair comparison between the DNNs and the MRS algorithm, we used the SSO approach to optimise the MRS parameters. The optimal MRS parameter combination established for a tile with the SSO approach is specific to that tile and hence cannot be transferred in space to a different tile. This informed our decision to optimise the MRS parameters at each test tile. Optimising the MRS parameters at the training tiles and then transferring the parameters to the test tiles would produce very poor segmentation results because, in many instances, tiles in close spatial proximity or with the same SF class have completely different optimal MRS parameters. Unlike the MRS algorithm, a trained DNN model can be transferred in space to segment unseen images.

In their review paper, Persello et al. (2022) highlighted how DNNs and earth observation data can be applied to support the SDGs of the UN. Specific to the second goal of the SDGs, designing policies to sustainably achieve food security will require agricultural lands to be monitored at regional, national, and global scales. The ability of DNNs, particularly FracTAL ResUNet, to generalise well on unseen images once trained on sample images (Waldner et al. 2021; Wang et al. 2022) opens up the possibility of delineating agricultural fields on a large scale, even in areas where reference data are unavailable.

6 Conclusions

To determine the optimal method for delineating agricultural fields from Sentinel-2 images acquired in Lower Saxony (Germany), we evaluated three state-of-the-art deep neural networks (DNNs), namely Mask R-CNN, U-Net, and FracTAL ResUNet, against an optimised multiresolution segmentation (MRS) approach. Based on the agricultural parcels declared by farmers within the European Common Agricultural Policy (CAP) framework, the segmentation results generated by each method were evaluated using two main metrics namely F-score and intersection over union (IoU). With an average F-score of 0.808 and IoU of 0.683, FracTAL ResUNet combined with a post-processing approach called hierarchical watershed segmentation generated the best segmentation results. FracTAL ResUNet was closely followed by the optimised MRS approach with an average F-score of 0.805 and IoU of 0.678.

For researchers working on the large-scale object-based mapping of agricultural land-use types from satellite images, this study can serve as a guide regarding which segmentation method to use for the delineation of agricultural fields. Based on the outcome of this study, for large-scale segmentation of agricultural fields, we recommend the use of FracTAL ResUNet. Once the FracTAL ResUNet model has been trained, it generalises very well and can be transferred in space to effectively segment unseen images. This is in sharp contrast to the optimised MRS approach, which is not transferable in space. To segment any unseen image with the optimised MRS approach, reference data are always required.

Future work would focus on: (1) combining FracTAL ResUNet and the hierarchical watershed segmentation algorithm to delineate all agricultural fields in Germany based on multitemporal Sentinel-2 images, (2) the use of Bayesian optimisation to optimise the hyperparameters of FracTAL ResUNet and the hierarchical watershed segmentation algorithm, and (3) testing the temporal transferability of FracTAL ResUNet from one year to another year of interest.