1 Introduction

Precision agriculture is a technique that aims to increase crop productivity while reducing costs and environmental impact [1]. Sensing technology supports this goal by monitoring vast areas of land. With the advancement of convolutional neural networks (CNNs), this technology has become even more powerful [2]. CNN models are used in early disease detection, reducing yield losses by enabling fungicides to be applied at the right time [1]. Additionally, CNN architectures can identify trees and crops to maximize agricultural efficiency [3]. However, many CNN architectures [4] have high inference times, i.e., low frame rates measured in frames per second (fps), which makes them impractical for real-time applications.

Precision agriculture faces the challenge of balancing high accuracy with fast inference speed. Existing research on using CNN models for real-time precision agriculture focuses on a single specialized application [5,6,7] and does not provide sufficient accuracy [8,9,10]. Therefore, it is essential to adapt recent real-time models to provide high accuracy in various precision agriculture applications [1].

Fig. 1

Complexity-accuracy trade-off comparison on the DSTL image set in terms of Jaccard Index (JI), giga floating-point operations (GFLOPs), and model parameters. The circle size indicates the number of model parameters

Real-time CNN architectures generally adhere to an encoder–decoder framework. Architectures like SegNet [11] employ encoders based on established backbone networks. In contrast, ENet [12], LEDNet [13], and FSFNet [14] use lightweight modules to build efficient encoders, resulting in fewer parameters. These models, however, are less accurate than heavier architectures.

The decoder parts of real-time semantic segmentation models may also have different designs. SegNet and ESNet [15] have symmetrically designed decoders. In contrast, the DFANet [16] and FASSD-Net [17] architectures adopt asymmetric decoder structures to enhance inference speed. Recent transformer-based models such as UNetFormer [18] achieve good performance without sacrificing real-time speed.

Remarkably, by introducing a split-extract-merge bottleneck (SEM-B) in its backbone network, the real-time LMFFNet [19] architecture achieves high accuracy with fewer model parameters. A lightweight asymmetric decoder is used in the LMFFNet model to process multi-scale features, which improves inference time. However, with the challenging low latency and high accuracy requirements for various precision agriculture tasks, LMFFNet still needs improvement.

Realizing precision agriculture practices with high accuracy in real-time is challenging. In real-world remote-sensing images with high spatial resolution, capturing the intricate details of precision agriculture target objects poses considerable difficulties. This paper proposes the ResLMFFNet architecture to increase prediction accuracy and achieve a decent trade-off between high accuracy and fast inference speed. ResLMFFNet introduces the following novelties:

  • LMFFNet is the base model since it already achieves an adequate trade-off between accuracy and efficiency. However, residual connections are added to the SEM-B blocks in this study to further increase accuracy without affecting inference speed (Fig. 1). By preserving low-level features lost through deep SEM-B blocks, these connections further enhance the performance of LMFFNet. Residual connections are preferred to dense or attention connections since the element-wise summation operation does not introduce trainable weights.

  • A dropout layer is inserted in the decoder before upsampling, which improves the model's generalization ability and makes it better suited to a wide range of precision agriculture practices.

In the remainder of this paper, the details of the proposed architecture ResLMFFNet are described in Sect. 2. Section 3 presents the experimental results. Conclusions are given in Sect. 4.

2 Methods

Fig. 2

ResLMFFNet architecture

The ResLMFFNet model, an improved version of the LMFFNet architecture, is designed to increase accuracy while preserving real-time capabilities, as depicted in Fig. 2. Similar to the LMFFNet design, the ResLMFFNet model is composed of three core components: the SEM-B block, the feature fusion module (FFM), and the multiscale attention decoder (MAD).

The ResLMFFNet architecture achieves its novel contribution by implementing residual connections within the SEM-B blocks, as illustrated in Fig. 2. In addition, accuracy is boosted by the inclusion of a dropout layer in the decoder design. This section provides a detailed explanation of the essential components of the ResLMFFNet architecture.

2.1 SEM-B block

The SEM-B block is built upon the split-extract-merge bottleneck shown in Fig. 3. SEM-B applies a 3\(\times\)3 convolution, then splits the feature map into two branches, each with 1/4 channels of the input. One branch undergoes depthwise convolution, while the other employs depthwise dilated convolution so that SEM-B effectively captures fine spatial details and larger contextual information simultaneously. Following the concatenation of the branch outputs, another 3\(\times\)3 convolution is applied. This operation, leading to the original channel number, combines multi-scale features more cohesively. Finally, the output feature map is added to the input, yielding a more informative representation.
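For concreteness, the following is a minimal PyTorch sketch of the split-extract-merge idea described above. The channel reduction factor, dilation rate, and the omission of normalization and activation layers are simplifying assumptions for illustration, not the exact LMFFNet/ResLMFFNet implementation.

```python
import torch
import torch.nn as nn

class SEMB(nn.Module):
    """Sketch of a split-extract-merge bottleneck (SEM-B)."""
    def __init__(self, channels, dilation=2):
        super().__init__()
        half = channels // 2       # assumed channel reduction before the split
        quarter = channels // 4    # each branch works on 1/4 of the input channels
        self.reduce = nn.Conv2d(channels, half, 3, padding=1, bias=False)
        # depthwise branch: fine spatial details
        self.dw = nn.Conv2d(quarter, quarter, 3, padding=1, groups=quarter, bias=False)
        # depthwise dilated branch: larger contextual information
        self.dw_dil = nn.Conv2d(quarter, quarter, 3, padding=dilation,
                                dilation=dilation, groups=quarter, bias=False)
        self.merge = nn.Conv2d(half, channels, 3, padding=1, bias=False)

    def forward(self, x):
        y = self.reduce(x)                          # initial 3x3 convolution
        a, b = torch.chunk(y, 2, dim=1)             # split into two branches
        y = torch.cat([self.dw(a), self.dw_dil(b)], dim=1)
        y = self.merge(y)                           # back to the original channel count
        return y + x                                # bottleneck-internal addition (Fig. 3)
```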

Fig. 3

SEM-B

Fig. 4

Feature fusion modules

As depicted in Fig. 2, the architecture employs a pair of distinct SEM-B blocks. The initial block is responsible for capturing shallow features, whereas the subsequent one focuses on extracting deep features. SEM-B Block1 is composed of \(M \left( M> 0 \right)\) SEM-Bs, while SEM-B Block2 comprises \(N \left( N> 0 \right)\) of these bottleneck units.

2.2 FFM modules

Two FFM modules, namely FFM-A and FFM-B (depicted in Fig. 4a and b, respectively), are employed to fuse multiscale features. Within these modules, pointwise convolution enables the extraction of valuable information with few parameters.

In Fig. 4a, the initial block applies a 3\(\times\)3 convolution with a stride of 2, followed by two more 3\(\times\)3 convolutions to the input image \(x^{i}\in \mathbb {R}^{C\times H\times W}\). The output feature map of this initial block \(x^\textrm{init}\in \mathbb {R}^{C1\times H/2\times W/2}\) is then concatenated with the downsampled feature map \(x^{i'}\in \mathbb {R}^{C\times H/2\times W/2}\). The output of the FFM-A1 module \(x^\textrm{ffma1}\in \mathbb {R}^{\left( C1+C \right) \times H/2\times W/2}\) is derived as follows:

$$\begin{aligned} x^\textrm{ffma1}=f_{1\times 1 \textrm{conv}}\left( f_\textrm{concat}\left( x^\textrm{init},x^{i'} \right) \right) , \end{aligned}$$
(1)

where \(f_{1\times 1\textrm{conv}}\) represents the pointwise convolution operation and \(f_\textrm{concat}\) denotes the concatenation operation.
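A hedged sketch of Eq. (1) is given below, reusing the PyTorch imports from the SEM-B sketch. The three-convolution initial block follows the description above; the choice of average pooling for producing the downsampled input \(x^{i'}\) and the concrete channel sizes are assumptions.

```python
class FFMA1(nn.Module):
    """Sketch of FFM-A1 (Eq. 1): concatenate the initial-block output with the
    downsampled input and fuse them with a pointwise convolution."""
    def __init__(self, in_ch=3, c1=32):
        super().__init__()
        self.init_block = nn.Sequential(
            nn.Conv2d(in_ch, c1, 3, stride=2, padding=1, bias=False),  # -> H/2 x W/2
            nn.Conv2d(c1, c1, 3, padding=1, bias=False),
            nn.Conv2d(c1, c1, 3, padding=1, bias=False),
        )
        self.down = nn.AvgPool2d(2)                     # assumed downsampling for x^{i'}
        self.pw = nn.Conv2d(c1 + in_ch, c1 + in_ch, 1)  # pointwise fusion, (C1 + C) channels

    def forward(self, x):
        x_init = self.init_block(x)                     # x^init
        x_half = self.down(x)                           # x^{i'}
        return self.pw(torch.cat([x_init, x_half], dim=1))  # x^ffma1
```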

Fig. 5

Downsampling block

The downsampling block in Fig. 5 is applied to the output of the FFM-A1 block, concatenating the feature maps of a 3\(\times\)3 convolution (with a stride of 2) and a 2\(\times\)2 max pooling operation to retain more spatial information. As shown in Fig. 4b, the resulting output, \(x^{d}\in \mathbb {R}^{C2 \times H/4\times W/4}\), serves as input for both SEM-B Block1 (containing M SEM-Bs) and the partition-merge channel attention (PMCA) module. SEM-B Block1 is applied to this feature map \(x^{d}\) as follows:

$$\begin{aligned} x^{s1}=f_\textrm{semb1}\left( x^{d} \right) , \end{aligned}$$
(2)

where \(x^{s1}\in \mathbb {R}^{C2 \times H/4\times W/4}\) is the output of SEM-B Block1 and \(f_\textrm{semb1}\) represents the SEM-B Block1 operation.
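The downsampling block and the stacked SEM-B Block1 of Eq. (2) can be sketched as follows, reusing the SEMB class above. Splitting the output channels between the strided convolution and the pooling path, as well as the values M = 3 and 64 channels, are illustrative assumptions.

```python
class DownsamplingBlock(nn.Module):
    """Sketch of the downsampling block (Fig. 5): concatenate a strided 3x3
    convolution with 2x2 max pooling to retain more spatial information."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch - in_ch, 3, stride=2, padding=1, bias=False)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        return torch.cat([self.conv(x), self.pool(x)], dim=1)   # halves H and W

# SEM-B Block1 (Eq. 2): a stack of M SEM-B units applied to x^d
semb_block1 = nn.Sequential(*[SEMB(64) for _ in range(3)])      # M = 3 is illustrative
```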

Fig. 6

PMCA module

The PMCA module calculates a weighted sum by applying global average pooling to the partitioned regions, then utilizing adaptively learned neural network weights, as illustrated in Fig. 6. By integrating a squeeze-and-excitation (SE) block [20], PMCA allocates more attention to the informative features. The output feature map of this module \(x^{pmca1}\in \mathbb {R}^{C2 \times H/4\times W/4}\) is obtained as:

$$\begin{aligned} x^\textrm{pmca1}=f_\textrm{pmca}\left( x^{d} \right) , \end{aligned}$$
(3)

where \(f_\textrm{pmca}\) represents the operations in PMCA module.
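As a rough sketch of the channel-attention idea behind Eq. (3), the simplified SE-style module below reduces the partition-and-merge pooling of PMCA to a single global average pooling; the reduction ratio and layer layout are assumptions rather than the original PMCA design.

```python
class ChannelAttention(nn.Module):
    """Simplified SE-style channel attention standing in for PMCA (Eq. 3)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze: global context per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # excitation: per-channel weights
        )

    def forward(self, x):
        return x * self.fc(x)                               # reweight informative channels
```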

2.2.1 Residual connections

Due to the intricate details embedded in high-resolution remote-sensing images, the depth of the SEM-B blocks must be large enough to capture the nuanced differences in precision agriculture objects. However, increasing the number of SEM-B units in SEM-B blocks deepens the network, creating a problem of poor gradient flow during back-propagation. This leads to issues related to exploding and vanishing gradients, which reduce the model’s trainability and expressiveness, thereby decreasing its performance [21].

A novel aspect of the ResLMFFNet model is the incorporation of a residual connection from the input feature map \(x^{d}\) to the output feature map \(x^{s1}\) of the SEM-B block to mitigate this problem. Figure 7 illustrates this approach, in which the input feature map is added to the output feature map of the SEM-B block by element-wise summation.

As information flows directly through the SEM-B blocks, residual connections facilitate the capture of intricate details in remote-sensing images with high spatial resolution and prevent vanishing/exploding gradients [22]. Moreover, matrix addition in residual connections does not add learnable parameters. Thus, ResLMFFNet uses residual connections rather than dense connections or attention mechanisms. Referencing Fig. 4b, the output feature map \(x^{s1}\) from SEM-B Block1 is updated using a residual connection to obtain \(x^\textrm{s1res}\in \mathbb {R}^{C2\times H/4\times W/4}\):

$$\begin{aligned} x^\textrm{s1res}=x^{s1}+x^{d}. \end{aligned}$$
(4)
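In code, the residual connection of Eq. (4) is a single tensor addition around the stacked SEM-B units; the shapes below are illustrative and reuse the semb_block1 sketch from Sect. 2.2.

```python
# Residual connection around SEM-B Block1 (Eq. 4); no trainable weights are added.
x_d = torch.randn(1, 64, 56, 56)    # illustrative x^d at 1/4 scale
x_s1 = semb_block1(x_d)             # output of SEM-B Block1 (sketch above)
x_s1_res = x_s1 + x_d               # element-wise summation, same shape as x^d
```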
Fig. 7

Residual connection in SEM-B block

In the LMFFNet architecture, the output of the PMCA module \(x^\textrm{pmca1}\), the downsampled input \(x^{i''}\in \mathbb {R}^{C\times H/4\times W/4}\), and the output of SEM-B Block1 \(x^{s1}\) are concatenated. In ResLMFFNet, this concatenation includes \(x^\textrm{s1res}\) instead of \(x^{s1}\). Using pointwise convolution, the FFM-B1 block produces the output \(x^\textrm{ffmb1}\in \mathbb {R}^{\left( C3+C \right) \times H/4\times W/4}\) as follows:

$$\begin{aligned} {\begin{matrix} x^\textrm{ffmb1}= f_{1\times 1\textrm{conv}} \left( f_\textrm{concat}\left( x^\textrm{s1res},x^\textrm{pmca1},x^{i''} \right) \right) . \end{matrix}} \end{aligned}$$
(5)

Two FFM-B blocks at different levels are utilized to fuse shallow and abstract features. Using a residual connection allows for a deeper network with an unchanged number of trainable parameters, which is especially beneficial for preserving important features of objects at different scales. The connection is a simple element-wise summation, so it adds no parameters and has only a negligible effect on inference speed.
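A compact sketch of the FFM-B1 fusion in Eq. (5) is shown below; it follows the same concatenate-then-pointwise pattern as FFM-A1, and the channel arguments are assumptions.

```python
class FFMB1(nn.Module):
    """Sketch of FFM-B1 (Eq. 5): fuse the residual SEM-B features, the PMCA
    output, and the 1/4-scale input with a pointwise convolution."""
    def __init__(self, c2, in_ch, out_ch):
        super().__init__()
        self.pw = nn.Conv2d(2 * c2 + in_ch, out_ch, 1)   # x^s1res and x^pmca1 both have C2 channels

    def forward(self, x_s1_res, x_pmca1, x_quarter):
        return self.pw(torch.cat([x_s1_res, x_pmca1, x_quarter], dim=1))  # x^ffmb1
```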

2.3 MAD decoder

The attention-based MAD decoder, designed to recover multi-scale spatial details, is presented in Fig. 8. The output \(x^\textrm{ffmb1}\in \mathbb {R}^{\left( C3+C \right) \times H/4\times W/4}\) of the FFM-B1 block, at a quarter scale of the input, undergoes a pointwise convolution. This process yields the output feature map \(x^\textrm{ffmb1MAD}\in \mathbb {R}^{C5\times H/4\times W/4}\) with C5 channels as follows:

$$\begin{aligned} x^\textrm{ffmb1MAD}=f_{1\times 1 \textrm{conv}}\left( x^\textrm{ffmb1} \right) . \end{aligned}$$
(6)

The output \(x^\textrm{ffmb2}\in \mathbb {R}^{\left( C4+C \right) \times H/8\times W/8}\) from the FFM-B2 block, which is at 1/8 scale of the input, undergoes a pointwise convolution that maps it to C6 channels. This feature map is then doubled in size by upsampling, leading to \(x^\textrm{ffmb2MAD}\in \mathbb {R}^{C6\times H/4\times W/4}\) as:

$$\begin{aligned} x^\textrm{ffmb2MAD}=f_{up}\left( f_{1\times 1 \textrm{conv}} \left( x^\textrm{ffmb2} \right) \right) , \end{aligned}$$
(7)

where \(f_{up}\) represents the upsampling operation performed with bilinear interpolation. To capture more multi-scale spatial information, the feature maps \(x^\textrm{ffmb1MAD}\) and \(x^\textrm{ffmb2MAD}\) are concatenated and subjected to a 3\(\times\)3 depthwise separable convolution. This process refines the combined multi-scale information effectively. The resulting feature map is then passed through a sigmoid activation function to produce the multi-scale attention map \(M^{MAM}\in \mathbb {R}^{C\times H/4\times W/4}\) as follows:

$$\begin{aligned} {\begin{matrix} M^{\textrm{MAM}'}= f_\textrm{concat} \left( x^\textrm{ffmb1MAD}, x^\textrm{ffmb2MAD} \right) \\ M^\textrm{MAM}=\delta \left( f_\textrm{dwconv}\left( M^{\textrm{MAM}'} \right) \right) , \end{matrix}} \end{aligned}$$
(8)

where \(f_\textrm{dwconv}\) represents the depthwise separable convolution operation and \(\delta\) denotes the sigmoid activation function.
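The attention branch of Eqs. (6)–(8) can be sketched as follows; the channel sizes C5, C6, the attention-map channel count, and the depthwise separable convolution layout are illustrative assumptions.

```python
import torch.nn.functional as F

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a pointwise projection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pw(self.dw(x))

class MultiScaleAttention(nn.Module):
    """Sketch of the MAD attention branch (Eqs. 6-8)."""
    def __init__(self, ffmb1_ch, ffmb2_ch, c5=32, c6=32, out_ch=2):
        super().__init__()
        self.pw1 = nn.Conv2d(ffmb1_ch, c5, 1)    # Eq. (6)
        self.pw2 = nn.Conv2d(ffmb2_ch, c6, 1)    # Eq. (7), before upsampling
        self.dwconv = DepthwiseSeparableConv(c5 + c6, out_ch)

    def forward(self, x_ffmb1, x_ffmb2):
        a = self.pw1(x_ffmb1)                                         # 1/4 scale, C5 channels
        b = F.interpolate(self.pw2(x_ffmb2), scale_factor=2,
                          mode="bilinear", align_corners=False)       # 1/8 -> 1/4 scale
        m = torch.cat([a, b], dim=1)                                  # M^{MAM'}
        return torch.sigmoid(self.dwconv(m))                          # M^{MAM}, Eq. (8)
```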

Fig. 8

Decoder of ResLMFFNet—MAD

2.3.1 Dropout

A dropout layer is incorporated into the decoder part of the ResLMFFNet architecture to enhance its generalization capability. The FFM-B2 block’s output \(x^\textrm{ffmb2}\in \mathbb {R}^{\left( C4+C \right) \times H/8\times W/8}\) is reused in a second branch beyond its role in creating the \(M^\textrm{MAM}\) attention map. While the original LMFFNet design applies a 3\(\times\)3 depthwise separable convolution and upsampling to this feature map \(x^\textrm{ffmb2}\), the ResLMFFNet design (as depicted in Fig. 8) employs a dropout layer with a rate of 0.5 immediately after a 3\(\times\)3 depthwise separable convolution, followed by upsampling. This process yields the \(x^\textrm{ffmb2MAD2}\in \mathbb {R}^{C\times H/4\times W/4}\) feature map as:

$$\begin{aligned} x^\textrm{ffmb2MAD2}=f_\textrm{up}\left( f_\textrm{drop} \left( f_\textrm{dwconv} \left( x^\textrm{ffmb2} \right) \right) \right) , \end{aligned}$$
(9)

where \(f_\textrm{drop}\) represents the dropout layer with a rate of 0.5. The ResLMFFNet architecture fuses the attention map \(M^\textrm{MAM}\) from the first branch and the feature map \(x^\textrm{ffmb2MAD2}\) from the second branch using pointwise multiplication. The output \(x^\textrm{out}\in \mathbb {R}^{C\times H\times W}\) is acquired by upsampling the result of the pointwise multiplication to the original input size as:

$$\begin{aligned} x^\textrm{out}=f_\textrm{up}\left( M^\textrm{MAM}\odot x^\textrm{ffmb2MAD2} \right) , \end{aligned}$$
(10)

where \(\odot\) is the pointwise multiplication operation.
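The second decoder branch and the output fusion of Eqs. (9) and (10) are sketched below, reusing the DepthwiseSeparableConv class above. Whether dropout is applied per element or per channel (nn.Dropout2d), and the output channel count, are assumptions.

```python
class MADOutput(nn.Module):
    """Sketch of the MAD second branch and output fusion (Eqs. 9-10)."""
    def __init__(self, ffmb2_ch, out_ch=2, p=0.5):
        super().__init__()
        self.dwconv = DepthwiseSeparableConv(ffmb2_ch, out_ch)
        self.drop = nn.Dropout2d(p)        # dropout before upsampling (ResLMFFNet only)

    def forward(self, m_mam, x_ffmb2):
        y = F.interpolate(self.drop(self.dwconv(x_ffmb2)), scale_factor=2,
                          mode="bilinear", align_corners=False)       # Eq. (9): x^ffmb2MAD2
        out = m_mam * y                                               # pointwise multiplication
        return F.interpolate(out, scale_factor=4, mode="bilinear",
                             align_corners=False)                     # Eq. (10): back to H x W
```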

3 Experimental results

This section introduces the image sets, the evaluation metrics, and the implementation details. Subsequently, comprehensive experiments assess the real-time semantic segmentation performance of the ResLMFFNet architecture across various precision agriculture applications.

Fig. 9

Image set illustrations. a An example original image from the DSTL image set. b The corresponding ground truth image from the DSTL image set. c Original training image from the RIT-18 image set. d The corresponding ground truth image from the RIT-18 image set. e Original training image from the Wheat Yellow-Rust image set. f The corresponding ground truth image from the Wheat Yellow-Rust image set

3.1 Image sets

The experiments employ three remote-sensing image sets. One set comprises images obtained from satellite-based systems, while the other two consist of images acquired through UAV sensing systems. This section explains each of these image sets.

3.1.1 DSTL satellite imagery feature detection image set

The DSTL Kaggle [2] image set comprises 25 satellite images, each capturing a region of 1000 m \(\times\) 1000 m. An example image is presented in Fig. 9a, accompanied by the corresponding ground truth displayed in Fig. 9b for ten labeled classes. This study uses images with a spatial resolution of 1.24 m as input for real-time binary semantic segmentation of crop regions. The depicted light green pixels in Fig. 9b represent crops. Ground truth annotations are created by describing the target classes with polygons in GeoJSON, followed by normalizing geo-coordinates within specific ranges to obscure satellite image locations.

3.1.2 RIT-18 (The Hamlin State Beach Park) aerial image set

The RIT-18 [23] image set includes aerial images taken via an octocopter. The training image (Fig. 9c) has a 9393 \(\times\) 5642 pixel size with a high spatial resolution (0.047 m). This study uses the RIT-18 image set for real-time binary semantic segmentation of trees. Ground truth (Fig. 9d) for eighteen labeled classes shows tree pixels in blue. Ground truth annotations are created by manually delineating the target classes within each orthomosaic image utilizing ENVI software.

3.1.3 Wheat Yellow-Rust aerial image set

The Wheat Yellow-Rust [24] image set is a collection of aerial images captured by the DJI Matrice 100 (M100) quadcopter. The training image shown in Fig. 9e has dimensions of 1336 \(\times\) 2991 pixels and a spatial resolution of 0.013 m. This study performs real-time binary semantic segmentation of wheat yellow-rust disease. Affected regions, caused by the controlled introduction of yellow rust inoculum in 2 m \(\times\) 2 m regions, are highlighted in blue within the ground truth representation in Fig. 9f. Ground truth annotations are created by labeling target objects in each image using the MATLAB ImageLabeler tool.

Table 1 Performance comparison of real-time semantic segmentation architectures estimated on the RIT-18, DSTL, and Wheat Yellow-Rust image sets
Table 2 Tree semantic segmentation test results in terms of Jaccard Index (IoU) and F\(_{1}\) score for the different architectures with RIT-18 image set

3.2 Evaluation metrics

The Jaccard Index, also called the intersection over union (IoU), is the metric used in the experiments to evaluate the performance of the real-time semantic segmentation models. For a binary segmentation task, it is computed as the number of pixels in which the prediction and the mask overlap divided by the total number of pixels in their union, as follows:

$$\begin{aligned} \text {Jaccard Index}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}+\textrm{FN}}, \end{aligned}$$
(11)

where TP denotes correctly predicted pixels, FP represents incorrectly predicted pixels, and FN corresponds to missed pixels in the prediction.

Additionally, the F\(_{1}\) score is used as a complementary metric to show the performance of the proposed model. The F\(_{1}\) score combines precision and recall as their harmonic mean:

$$\begin{aligned} F_{1}=\frac{2\times \text {Precision}\times \text {Recall}}{\text {Precision}+\text {Recall}}, \end{aligned}$$
(12)

where the precision and the recall are calculated as follows:

$$\begin{aligned} \text {Precision} =\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}, \qquad \text {Recall} =\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}}. \end{aligned}$$
(13)
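For reference, a minimal NumPy sketch of both evaluation metrics for binary masks is given below; the small epsilon guarding against empty masks is an implementation assumption.

```python
import numpy as np

def binary_metrics(pred, mask, eps=1e-9):
    """Jaccard index (Eq. 11) and F1 score (Eqs. 12-13) for binary masks."""
    pred, mask = pred.astype(bool), mask.astype(bool)
    tp = np.logical_and(pred, mask).sum()    # correctly predicted pixels
    fp = np.logical_and(pred, ~mask).sum()   # incorrectly predicted pixels
    fn = np.logical_and(~pred, mask).sum()   # missed pixels
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return iou, f1
```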

3.2.1 Implementation details

The experiments involve training the semantic segmentation architectures with the adaptive moment estimation (Adam) algorithm on an NVIDIA Quadro RTX 5000 GPU using the PyTorch framework. A manual hyperparameter tuning process is adopted separately for each image set to find the best-performing values based on Jaccard Index measurements.

The mini-batch size is 8, and the number of epochs is 70. Weight initialization follows the Xavier uniform method, while the chosen loss function is binary cross-entropy with logits. For the DSTL and RIT-18 image sets, an initial learning rate of \(10^{-4}\) is adopted and decreased by 9% every five iterations. The Wheat Yellow-rust image set employs an initial learning rate of \(5\times 10^{-5}\), which undergoes a reduction of 9% every ten iterations.
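Assuming the reported settings map onto standard PyTorch APIs, the training configuration for the DSTL and RIT-18 sets might look as follows; the placeholder model, the interpretation of "iterations" as scheduler steps, and the exact decay factor of 0.91 (a 9% reduction) are assumptions.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Conv2d(3, 1, 3, padding=1)   # placeholder; substitute a ResLMFFNet instance
criterion = nn.BCEWithLogitsLoss()      # binary cross-entropy with logits
optimizer = optim.Adam(model.parameters(), lr=1e-4)                        # DSTL / RIT-18 setting
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.91)  # ~9% decrease every 5 steps

# Xavier uniform weight initialization for the convolutional layers
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_uniform_(m.weight)
```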

The images are partitioned into 224 \(\times\) 224 image patches, resulting in 5985 patches from the DSTL set, 1778 patches from the RIT-18 set, and 1299 patches from the Wheat Yellow-rust set. These patches are then assigned to training (72%), testing (20%), and validation (8%). The validation process utilizes 5-fold cross-validation.

The experiments are conducted using RGB and normalized difference vegetation index (NDVI) [24] images to demonstrate the generalization capacity of the ResLMFFNet architecture. By normalizing the difference between near-infrared and red reflectance, NDVI provides information on healthy green plants.
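NDVI itself is a simple band ratio; a sketch for computing it from co-registered near-infrared and red bands is shown below (the epsilon term is an assumption to avoid division by zero).

```python
def ndvi(nir, red, eps=1e-6):
    """Normalized difference vegetation index from near-infrared and red bands."""
    return (nir - red) / (nir + red + eps)
```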

Fig. 10

Real-time semantic segmentation test results. Light green represents a hit, dark green represents a miss, and red represents a false alarm. The first row shows crop predictions, the second row shows tree predictions, and the third row shows wheat yellow-rust predictions. a Ground-truth masks. b U-Net. c SegNet. d FSFNet. e DFANet. f FASSDNet. g ENet. h UNetFormer. i LMFFNet. j ResLMFFNet

3.3 Results

Table 1 outlines a comparative analysis between the ResLMFFNet model and state-of-the-art real-time semantic segmentation architectures using the RIT-18, DSTL, and Wheat Yellow-Rust image sets. The comparison examines inference speed, computational complexity, and memory requirement. Inference speed is measured in frames per second (FPS), while computational complexity is evaluated in terms of learnable parameters and floating-point operations (FLOPs), reported as GFLOPs. The maximum GPU memory required for inference is measured with the torch.cuda.max_memory_allocated() function. Notably, the ResLMFFNet model retains the same GFLOPs and trainable parameter values as LMFFNet, with only negligible variations observed in the FPS and memory requirement values.
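A rough sketch of how such FPS and peak-memory numbers can be obtained in PyTorch is given below; the warm-up count, batch size, and input resolution are assumptions and not necessarily those used for Table 1.

```python
import time
import torch

def measure_inference(model, input_size=(1, 3, 224, 224), runs=100):
    """Rough FPS and peak GPU memory measurement for a segmentation model."""
    model = model.eval().cuda()
    x = torch.randn(*input_size).cuda()
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        for _ in range(10):                 # warm-up iterations
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    fps = runs / (time.time() - start)
    peak_mem = torch.cuda.max_memory_allocated() / 1024**2   # MiB
    return fps, peak_mem
```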

Table 3 Crop semantic segmentation test results in terms of Jaccard Index (IoU) and F\(_{1}\) score for the different architectures with DSTL image set
Table 4 Wheat Yellow-Rust semantic segmentation test results in terms of Jaccard Index (IoU) and F\(_{1}\) score for the different architectures with UAV image set
Table 5 Tree semantic segmentation test results in terms of Jaccard Index (IoU) and F\(_{1}\) score for the different architectures with DSTL image set
Table 6 Ablation experiment results on dropout layer

The tree semantic segmentation test results, measured as Jaccard Index (IoU) and F\(_{1}\) score, are shown in Table 2. ResLMFFNet outperforms the other architectures. Compared with LMFFNet, the proposed architecture improves the Jaccard index by about 1% for RGB images and 2.1% for NDVI images, all while maintaining comparable inference speed and computational complexity.

Table 7 Ablation experiment results on different M and N parameter values
Table 8 Ablation experiment results on different L2-norm rates
Table 9 Ablation experiment results on data augmentation of scaling within the [0.95, 1.05] range

Table 3 shows semantic segmentation test results for the crop target object within the DSTL satellite image set. ResLMFFNet outperforms LMFFNet by achieving approximately 0.5% higher Jaccard index values for RGB and 1.4% for NDVI in segmenting large-scale crop objects.

Table 4 displays semantic segmentation test results for the Wheat Yellow-Rust aerial image set. ResLMFFNet surpasses the other architectures, achieving notable improvements of approximately 11.2% for RGB and 4.6% for NDVI in the Jaccard index compared to LMFFNet. This enhancement is remarkable, considering the challenging image set with limited training data. In addition, the tree class from the DSTL image set has limited labeled data. Therefore, Table 5 lists only test results for real-time models that converge on limited training samples. The ResLMFFNet model is superior to the other models and achieves improvements of 2.5% in RGB images and 3.6% in NDVI images compared to the LMFFNet model.

Fig. 11

Accuracy curves of training and validation sets in the training stage. a LMFFNet using RIT-18 image set. b ResLMFFNet using RIT-18 image set. c LMFFNet using DSTL image set. d ResLMFFNet using DSTL image set. e LMFFNet using Wheat Yellow-Rust image set. f ResLMFFNet using Wheat Yellow-Rust image set

Figure 10 illustrates the visual comparison of prediction results from the different models using sample images alongside their corresponding ground truth masks. Specifically, light green represents hit pixels, dark green denotes missed pixels, and red indicates false alarm pixels. The three rows display the prediction results for crop, tree, and wheat yellow-rust objects. The ResLMFFNet architecture, illustrated in Fig. 10i, demonstrates fewer false alarm and miss pixels for target objects of varying scales. These visual results indicate that the ResLMFFNet architecture improves segmentation accuracy while retaining real-time inference speed.

Figure 11 shows the accuracy curves of the ResLMFFNet and LMFFNet architectures obtained through training using the RIT-18, DSTL, and Wheat Yellow-Rust image sets. The fluctuations indicate that the LMFFNet architecture exhibits an unstable training process, probably due to problems such as vanishing/exploding gradients. Training becomes more stable with the proposed ResLMFFNet, which smooths these fluctuations, as reflected in Fig. 11b, d, and f. Therefore, ResLMFFNet can overcome possible vanishing/exploding gradients, thus improving overall segmentation performance.

3.3.1 Ablation study

The first ablation study investigates how the dropout layer in the MAD decoder affects performance. Table 6 reveals that ResLMFFNet and LMFFNet perform better when the dropout rate is 0.5. With a dropout rate of 0.7, the LMFFNet model exhibits subpar performance, whereas, with a rate of 0.3, the model’s performance does not improve from the baseline. Since there is no overfitting in the ResLMFFNet model, as shown in Fig. 11, ResLMFFNet appears robust to various dropout rates. The optimal dropout rate, however, remains 0.5 based on experimental results.

Table 7 shows experimental results using various M and N parameter values corresponding to the number of SEM-Bs in the SEM-B blocks. Increasing the depth of the SEM-B blocks within the LMFFNet model correlates with a decline in performance, a phenomenon already noted in the LMFFNet study [19]. This trend highlights the challenge of poor gradient flow inherent in deeper blocks, as evidenced by the accuracy curves depicted in Fig. 11. The ResLMFFNet model offers a solution by introducing residual connections to better leverage the potential of deeper SEM-B blocks. Notably, ResLMFFNet demonstrates performance improvements with increased M and N values. These results from the ablation study confirm that the ResLMFFNet model enhances gradient flow within deeper SEM-B blocks, helps preserve low-level features, and thereby boosts overall performance.

Based on the findings from the ablation study presented in Table 8, L2-norm regularization does not significantly affect overall performance. Accuracy curves in Fig. 11 show that the dropout layer used in the ResLMFFNet architecture already provides sufficient regularization and eliminates overfitting.

Scale transformation is selected as a data augmentation method to distinguish detail and global content features. The region may be scaled down or up by up to 5%, yet Table 9 results indicate no notable performance enhancement.

3.3.2 Limitations

The proposed model outperforms other architectures on all image sets, yet some failure modes may affect performance. The ResLMFFNet model produces inaccurate predictions, particularly for images prone to occlusion or containing complex details within target objects (Fig. 12).

Fig. 12

Segmentation results for ResLMFFNet in the complex detailed and occluded tree objects

To effectively address the occlusion problem, the literature employs the convolutional block attention module (CBAM) [25]. This module prioritizes the region of interest by weighting features in both the spatial and channel dimensions. In the context of the ResLMFFNet model, a direction for future research is to enhance occluded tree features by integrating CBAM into the decoder’s input maps sourced from various scales in the encoder. Using CBAM to extract global information from fine-grained features might reduce interference from background and occluded trees.

4 Conclusions

This study introduces ResLMFFNet, an improved version of the LMFFNet model. Its design addresses the challenge of balancing high accuracy with fast inference speed for various precision agriculture tasks. By incorporating residual connections into the SEM-B blocks and a dropout layer in the MAD decoder structure, ResLMFFNet outperforms LMFFNet in terms of the Jaccard index without changing the number of model parameters or significantly affecting inference time. Extensive experiments demonstrate its superiority over state-of-the-art architectures for real-time segmentation of crops, trees, and wheat yellow-rust. ResLMFFNet helps to preserve low-level features, improves generalization capability, and mitigates the possible problem of vanishing/exploding gradients. Therefore, the proposed model supports real-time precision agriculture applications with high accuracy and fast inference time. Future work can explore optimizing and quantizing the ResLMFFNet model for deployment on embedded systems such as the Jetson TX series mounted on quadcopters.