1 Introduction

Image segmentation is a fundamental task in computer vision that divides images (or video frames) into multiple segments and objects (Minaee et al. 2021). It plays a pivotal role in many crucial applications, such as scene understanding, image analysis, robotic perception, disease diagnosis (Pradhan et al. 2023), medical image analysis (Kaur et al. 2021), video monitoring, augmented reality, and image compression (Yu et al. 2023). The image segmentation task comprises three sub-tasks: semantic segmentation, instance segmentation, and panoptic segmentation. Semantic segmentation, also called scene labeling, refers to the process of assigning a semantic label (e.g., car, person, or road) to each pixel of an image (Yu et al. 2018a, b). Instance segmentation is designed to identify and segment the pixels that belong to each object instance (Gu et al. 2022). Going further, panoptic segmentation combines semantic segmentation and instance segmentation (Chuang et al. 2023). Recently, with the rapid development of information-technology-based modern agriculture, image segmentation has been widely used to monitor crop and soil health, predict the best time to sow, fertilize, and harvest, estimate crop yields, and detect plant diseases (Elbasi et al. 2023; Wang et al. 2022a, b). However, the practical application of image segmentation in agriculture is difficult due to the complexity of the agricultural environment. Disease-stage identification and labeling inconsistency are the two main challenges when image segmentation is applied to plant diseases. Early-, mid-, and late-stage diseases may appear similar in images, which makes it difficult to identify the disease stage from image data alone (Yao et al. 2022). At the same time, variable criteria for determining disease type and severity may result in inconsistent labeling of training datasets. Such inconsistencies can adversely affect model training and evaluation, consequently diminishing prediction accuracy (Chen et al. 2021a, b). In addition, leaf and root-system environments often fluctuate, leading to corresponding changes in morphological characteristics such as texture, size, and shape (Kang et al. 2021; Yan et al. 2023).

In this study, the Web of Science database was searched with strings designed to capture a wide range of articles on plant image segmentation: (“Crop” OR “Plant”) AND (“Image segmentation” OR “Semantic segmentation” OR “Instance segmentation”) AND (“Deep learning”). The time span was limited to the past five years. The search returned 2205 relevant articles, from which the top 1500 were selected. After removing articles outside the scope of this review, such as those on general machine learning or system design, 300 highly relevant articles were retained.

In early agricultural image segmentation, numerous traditional methods were used to address the above problems, including thresholding (Otsu 1979), edge-based segmentation (Rosenfeld 1981), region growing (Ikonomatakis et al. 1997), k-means clustering (Dhanachandra et al. 2015), watershed algorithms (Longzhe and Enchen 2011), and graph-based methods (Boykov and Jolly 2001). However, these methods place stringent requirements on image quality, and their recognition results may be significantly degraded or even invalidated if environmental conditions change during image acquisition. Their generality and robustness are therefore unsatisfactory, and their accuracy in practical applications is not guaranteed. In contrast, deep learning methods, which require less preprocessing and manual selection of potential features, have improved both accuracy and robustness. To address the segmentation issues arising from diverse growth stages and overlapping plant objects, an encoder-decoder deep network taking 14 vegetation indices as input was used for weed/crop/background segmentation, achieving a highest mean Intersection over Union (mIoU) of 88.91% (Wang et al. 2020a, b). To mitigate the effects of soil disturbance and minor color differences on plant root segmentation, an attention mechanism was incorporated into the DeepLabv3+ semantic segmentation model; trained on a mature cotton root dataset, the model scored 98.75% Intersection over Union (IoU) (Kang et al. 2021). In addition to CNN architectures, graph convolutional networks also perform well in image segmentation tasks. Pei et al. (2023) proposed a multiscale global graph convolutional neural network (MSG-GCN) by embedding a multi-scale graph convolutional block into the last layer of the U-Net++ encoder module; MSG-GCN outperforms U-Net and U-Net++ on the University of Tokyo Chiba Forest aerial dataset. In particular, advanced Transformers and generative adversarial networks have been widely used in plant disease detection (Douarre et al. 2019; Wu et al. 2022a, b, c) and weed identification (Espejo-Garcia et al. 2021).

The recent surge in image segmentation development underscores its growing significance in agriculture. However, a comprehensive understanding of image segmentation in agriculture requires a thorough examination of existing literature and related studies. To this end, numerous reviews have been conducted with varying focus areas. Yu et al. (2018a, b) conducted a comprehensive review of publicly available scene annotation datasets and of semantic segmentation methods based on hand-crafted features, learned features, and weakly supervised learning. Hu et al. (2018) reviewed common RGB-D datasets used for semantic segmentation, conventional machine learning approaches, and deep learning-based segmentation technologies relevant to RGB-D data. Guo et al. (2018) provided an overview of deep learning-based semantic segmentation, categorizing it into region-based, fully convolutional network-based, and weakly supervised segmentation. Asgari Taghanaki et al. (2021) organized medical and non-medical image segmentation techniques into six categories: deep architecture-based, data synthesis-based, loss function-based, ranking models, weak supervision, and multi-tasking, while elucidating the limitations of current methods and suggesting prospective research directions in semantic image segmentation. Minaee et al. (2021) offered a comprehensive review of deep learning-enabled image segmentation methods, summarizing seminal work in both semantic and instance segmentation; they also scrutinized widely utilized datasets, compared performance, and deliberated on the progression and challenges faced by image segmentation technology. Gu et al. (2022) synthesized existing fully supervised, weakly supervised, and semi-supervised instance segmentation techniques, dividing strongly supervised approaches into three subcategories based on the number of stages, and delineating the datasets and metrics relevant to instance segmentation.

Although numerous studies have comprehensively reviewed various methods for image segmentation (see Table 1), only four articles focus specifically on plant image segmentation. Hamuda et al. (2016) reviewed segmentation methods for plants in the field, including color-index-based, threshold-based, and learning-based segmentation. Maheswari et al. (2021) delved into deep learning semantic segmentation for smart orchard yield estimation and also discussed the challenging issues that arise in intelligent fruit yield estimation, such as sampling, collection, annotation and data enhancement, fruit detection, and counting. Buckner et al. (2021) showed how segmentation and classification methods differ due to the diversity of physical features found at different scales. Luo et al. (2023) reviewed two traditional segmentation methods and three deep learning methods, namely U-Net, SegNet, and DeepLab. Beyond these reviews, numerous studies have surveyed the broader applications of deep learning in agriculture, e.g., Hasan et al. (2020) on recent deep learning advances in plant disease detection; in such broad overviews, image segmentation is only a small part. In summary, existing reviews mainly focus on orchards, plant organs, and similar topics, and no comprehensive review covers agriculture as a whole. The segmentation methods they cover are mainly traditional approaches and early deep learning models, with no mention of the latest architectures such as Transformers and GANs. Moreover, no related article summarizes public plant image segmentation datasets. Therefore, we comprehensively summarize the network architectures commonly used in plant image segmentation, together with related studies built on these architectures. Public plant image segmentation datasets are summarized, and four applications, plant disease identification, weed identification, plant growth monitoring, and yield estimation, are reviewed in this paper. The specific contributions are as follows:

  • Based on network structure, we categorized the widely used agricultural segmentation models into eight different categories.

  • A number of common publicly available datasets for plant image segmentation are reviewed.

  • We evaluate and compare the performance of image segmentation algorithms widely used on benchmark datasets.

  • The applications of image segmentation in agriculture are presented including disease detection, weed identification, crop growth monitoring, and yield estimation.

  • Finally, future directions and challenges of image segmentation in agriculture are outlined.

Table 1 A summary of the review studies on image segmentation

2 Deep learning network architecture

With rapid developments in computer vision and deep learning theories and models, image segmentation and classification have become easier than with traditional approaches (Solanki et al. 2023a, b). This section reviews the deep neural network architectures widely used in the field of computer vision, including convolutional, generative adversarial, graph, and Transformer networks. The abbreviations used throughout the text are listed in Table 2.

Table 2 A summary of abbreviations

2.1 Convolutional neural networks (CNNs)

CNNs are a class of artificial neural networks that are widely used in a variety of computer vision tasks. They have achieved substantial success in areas such as image recognition, object detection, and image segmentation, solidifying their position as one of the most important components of deep learning. Compared to classical machine learning, CNNs have also improved in terms of speed and accuracy (Solanki et al. 2023a, b). The CNN structure is shown in Fig. 1. The core of a CNN comprises convolutional layers, pooling layers, and fully connected layers. Convolutional layers are responsible for extracting features via convolution operations, whereas pooling layers reduce the dimensionality of feature maps while bolstering their robustness. Nonlinear layers introduce nonlinearity, endowing the network with the ability to learn increasingly complex patterns and relationships. The merits of CNNs include exploiting spatial correlations between adjacent pixels, parameter sharing, invariance, large-scale data processing, and adaptive feature learning. Notable CNN architectures include VGGNet (Simonyan and Zisserman 2014), GoogLeNet (Szegedy et al. 2015), and ResNet (He et al. 2016).

Fig. 1
figure 1

The architecture of CNNs (Lecun et al. 1998)
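To make the layer roles above concrete, the following is a minimal CNN sketch in PyTorch, assumed here purely for illustration; the channel counts, input size, and class count are arbitrary choices rather than any of the cited architectures.

```python
# A minimal CNN sketch (illustrative only), assuming a PyTorch environment;
# layer sizes are arbitrary assumptions, not taken from the cited models.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Convolution extracts local features; pooling reduces spatial resolution.
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                      # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                      # 32x32 -> 16x16
        )
        # The fully connected layer maps pooled features to class scores.
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

scores = TinyCNN()(torch.randn(1, 3, 64, 64))     # -> shape (1, 10)
```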

2.2 Generative adversarial networks (GANs)

The GAN (Goodfellow et al. 2020) framework (Fig. 2) consists of two interconnected components: a generator and a discriminator. In traditional GAN architectures, the generator network G learns the mapping from a latent noise vector z to the target distribution y, approximating "real" samples in the process. Concurrently, the discriminator network D determines whether a given sample is genuine or artificially generated. Notable variants of GANs include Conditional GAN (CGAN) (Mirza and Osindero 2014), Deep Convolutional GAN (DCGAN) (Radford et al. 2015), CycleGAN (Zhu et al. 2017), Wasserstein GAN (WGAN) (Arjovsky et al. 2017), DualGAN (Yi et al. 2018), and Semi-Supervised GAN (SGAN) (Trinh and O’Brien 2020), which have been extensively applied and studied within the domain of computer vision.

Fig. 2
figure 2

The architecture of a GAN (Goodfellow et al. 2020)
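The adversarial interplay between G and D can be sketched as a single training step, assuming a PyTorch environment; the MLP generator and discriminator, the latent dimension, and the random stand-in for real data are illustrative assumptions rather than any cited model.

```python
# A minimal GAN training-step sketch (illustrative assumptions throughout).
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 64
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(32, data_dim)                  # stand-in for a batch of real samples
z = torch.randn(32, latent_dim)

# Discriminator step: real samples labeled 1, generated samples labeled 0.
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: try to make D label generated samples as real.
loss_g = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```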

2.3 Graph neural networks (GNNs)

Deep learning has demonstrated remarkable success in a variety of domains. However, researchers have recognized its limitations in addressing and solving all situations and problems. In particular, when processing graph-structured data within non-Euclidean spaces, there exists an inherent challenge in leveraging both structural and semantic information effectively. GNNs (Scarselli et al. 2009) have emerged as a prominent research focus in the realm of graph data structures due to their capacity to concurrently learn topological information and preserve structural attributes (Fig. 3). Among various GNN models, Graph Convolutional Networks (GCN) (Kipf and Welling 2017) constitute a distinctive convolutional neural network architecture that can directly operate on graphs while exploiting their inherent structural information. In addition, popular GNN techniques encompass advanced methodologies such as Graph Attention Networks (GAT) (Veličković et al. 2017) and Graph Generative Adversarial Networks (Graphical GAN) (Li et al. 2018), further enriching the landscape of graph-based learning algorithms.

Fig. 3
figure 3

The architecture of a GCN layer (Kipf and Welling 2017)
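A minimal sketch of the GCN propagation rule of Kipf and Welling (2017), H' = σ(D^-1/2 (A + I) D^-1/2 H W), is shown below; the toy graph and feature sizes are assumptions, and dense matrices are used purely for clarity.

```python
# A minimal sketch of one GCN layer; dense adjacency used for readability only.
import torch
import torch.nn as nn

def gcn_layer(H: torch.Tensor, A: torch.Tensor, W: nn.Linear) -> torch.Tensor:
    A_hat = A + torch.eye(A.size(0))              # add self-loops
    deg = A_hat.sum(dim=1)
    D_inv_sqrt = torch.diag(deg.pow(-0.5))        # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt      # symmetric normalization
    return torch.relu(W(A_norm @ H))              # aggregate neighbors, then transform

# Toy graph: 4 nodes, 8-dimensional features, 16 output channels (all assumed).
A = torch.tensor([[0., 1., 0., 0.],
                  [1., 0., 1., 1.],
                  [0., 1., 0., 0.],
                  [0., 1., 0., 0.]])
H = torch.randn(4, 8)
out = gcn_layer(H, A, nn.Linear(8, 16, bias=False))   # -> shape (4, 16)
```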

2.4 Transformer

In 2017, Google introduced the groundbreaking Transformer model (Vaswani et al. 2017), which caused a sensation in the natural language processing field. In recent years, a number of innovative research papers have successfully applied Transformer technology to computer vision tasks, as shown in Fig. 4, ushering in a new era in the visual domain. Dosovitskiy et al. (2020) introduced the ViT (Vision Transformer), a fully self-attention-based image classification approach. ViT divides an image into patches of fixed size, performs a linear transformation and positional encoding on each patch, and then feeds the patches into the Transformer encoder to extract features and classify the entire image. Compared to traditional CNNs, the ViT model relies entirely on the self-attention mechanism to capture relevant information within images, offering higher interpretability and transferability.

Fig. 4
figure 4

The architecture of Transformer (Vaswani et al. 2017)
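The patch-embedding step described above can be sketched as follows, assuming PyTorch; the patch size, embedding dimension, and two-layer encoder are illustrative assumptions (the full ViT-Large, for instance, uses 24 encoder layers).

```python
# A minimal ViT-style patch-embedding sketch (not the original implementation).
import torch
import torch.nn as nn

img, patch, dim = 224, 16, 768
num_patches = (img // patch) ** 2                 # 196 patches for a 224x224 image

# A strided convolution implements "split into patches + linear projection".
to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

x = torch.randn(1, 3, img, img)
tokens = to_patches(x).flatten(2).transpose(1, 2) # (1, 196, 768)
tokens = tokens + pos_embed                       # add positional encoding

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
    num_layers=2,                                 # shallow stack just for the sketch
)
features = encoder(tokens)                        # (1, 196, 768) global features
```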

3 Deep learning-based image segmentation models

Minaee et al. (2021) provided a comprehensive categorization of image segmentation algorithms into 11 classes based on their network structure, the most detailed demarcation in the existing literature. In agriculture, however, recurrent neural networks are applied less frequently and are gradually being replaced, while recent trends show an inclination towards Transformer- and GNN-based image segmentation methods. In light of this, we have tailored our classification to the agricultural domain, grouping image segmentation algorithms into eight distinct classes based on network structure.

3.1 Encoder-decoder network

The encoder-decoder network, a prominent deep learning architecture, has demonstrated success in various image processing tasks, including image classification, object detection, and semantic segmentation. Semantic segmentation requires assigning semantic labels to individual pixels in an image while carefully balancing granular pixel-level details and global contextual information. The architecture consists of two essential components: firstly, the encoder extracts salient features from the input image, condensing them into a compact, low-dimensional representation; secondly, the decoder expands this representation to the original dimensions, allocating meaningful labels to each pixel.

Shelhamer et al. (2017) proposed a streamlined, yet efficacious Encoder-Decoder Network architecture (Fig. 5), denoted as Fully Convolutional Network (FCN). Harnessing the VGG-16 network for feature extraction, this innovative design supplants the final fully connected layer with a convolutional layer to preserve crucial spatial information. The decoder upsamples low-resolution feature maps from the encoder output to the original image resolution and then fuses the feature maps of the final layer of the model with those of the previous layers via skip connections (Fig. 6). This sophisticated fusion facilitates the integration of semantic information derived from deeper, coarser layers with appearance information gleaned from shallower, finer layers, ultimately yielding highly precise and intricate segmentations.

Fig. 5
figure 5

The FCN learns to make pixel-accurate predictions (Shelhamer et al. 2017)

Fig. 6
figure 6

Skip connections combine coarse and fine information. From (Shelhamer et al. 2017)
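The skip-connection fusion described for the FCN can be sketched roughly as follows, in the spirit of FCN-16s; the channel counts, feature-map sizes, and the 21-class setting are assumptions for illustration, not the authors' code.

```python
# A minimal FCN-style skip-fusion sketch: coarse class scores are upsampled and
# added to scores computed from a finer backbone feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 21
score_pool4 = nn.Conv2d(512, num_classes, kernel_size=1)   # scores from pool4 (finer)

pool4 = torch.randn(1, 512, 28, 28)                         # stride-16 features (assumed)
coarse = torch.randn(1, num_classes, 14, 14)                # stride-32 score map (assumed)

up = F.interpolate(coarse, scale_factor=2, mode="bilinear", align_corners=False)
fused = up + score_pool4(pool4)                              # skip connection (addition)
out = F.interpolate(fused, scale_factor=16, mode="bilinear", align_corners=False)
# out: (1, 21, 448, 448) per-pixel class scores at roughly the input resolution
```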

Ronneberger et al. (2015) introduced the highly efficient U-Net for segmenting biological microscopy images, building upon the FCN (Fig. 7). The U-Net embodies the Encoder-Decoder paradigm, with the encoder employing pooling layers for systematic down-sampling and the decoder leveraging deconvolution for iterative up-sampling. This approach enables the progressive restoration of spatial information and edge details from the original input image, ultimately mapping low-resolution feature maps to pixel-level segmentation outcomes. To mitigate information loss during the encoding phase's down-sampling, U-Net incorporates skip connections to amalgamate corresponding feature maps from both encoder and decoder segments. This sophisticated fusion allows the decoder to access enhanced high-resolution information during up-sampling, refining the restoration of intricate details in the original image and bolstering segmentation accuracy. Exhibiting prowess in pixel-level segmentation and adeptly handling small datasets, the symmetric U-shaped architecture ensures comprehensive feature integration between the encoder and decoder. To effectively extrapolate from a limited annotated image repository, U-Net harnesses data augmentation as its training strategy cornerstone.

Fig. 7
figure 7

The U-Net model (Ronneberger et al. 2015)
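One U-Net decoder step, with its concatenation-based skip connection, might look like the following sketch; the channel sizes are assumptions, and the block is illustrative rather than the original implementation.

```python
# A minimal U-Net decoder-step sketch: up-sample, concatenate the encoder skip
# feature along the channel axis, then refine with two convolutions.
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)   # 2x up-sampling
refine = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)

decoder_feat = torch.randn(1, 256, 32, 32)       # from the deeper decoder stage (assumed)
encoder_feat = torch.randn(1, 128, 64, 64)       # matching encoder stage (skip, assumed)

x = up(decoder_feat)                              # (1, 128, 64, 64)
x = torch.cat([encoder_feat, x], dim=1)           # skip connection by concatenation
x = refine(x)                                     # (1, 128, 64, 64)
```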

Although U-Net demonstrates impressive performance, network redundancy still poses a substantial challenge. Owing to the patch-based per-pixel training approach, high similarity among adjacent patches results in redundancy and sluggish training. Furthermore, information loss remains pervasive within CNN architectures, irrespective of the number of optimizations applied during the down-sampling phase. U-Net++ (Zhou et al. 2018) improves on the original U-Net by introducing dense connections and a multi-scale feature fusion mechanism. Dense connectivity enables each decoder layer to directly access feature maps from all corresponding encoder layers, improving feature propagation and information flow; multi-scale feature fusion takes feature maps from different levels and fuses them into the decoder through concatenation and upsampling, improving the model's perception of the target and its segmentation accuracy. Drawing inspiration from residual and dense connections, Res-U-Net (Xiao et al. 2018a, b) and Dense-U-Net (Guan et al. 2020) respectively replace each U-Net submodule with a variant incorporating the corresponding connection type. Despite its inherent limitations, U-Net remains the preeminent segmentation model within the medical domain (Liu et al. 2021a, b, c).

Badrinarayanan et al. (2017) proposed SegNet, a bespoke encoder-decoder fully convolutional architecture devised explicitly for image segmentation (Fig. 8). Analogous to deconvolution networks, SegNet's core trainable segmentation engine comprises an encoder network, topologically identical to the 13 convolutional layers of VGG16, followed by a corresponding decoder network and a per-pixel classification layer. SegNet's key innovation resides in the decoder's strategy for upsampling low-resolution input feature maps: it employs the pooling indices computed during the corresponding encoder's max-pooling operation to perform nonlinear upsampling. By using this upsampling technique to recover contour and positional information, SegNet extracts and retains the positional data of target edge features within images, restores image dimensions and fine details, and achieves high-precision segmentation. SegNet excels in regional segmentation, rendering it particularly suitable for segmenting punctiform and block-like objects.

Fig. 8
figure 8

The SegNet model (Badrinarayanan et al. 2017)
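SegNet's index-based upsampling can be illustrated with PyTorch's pooling utilities; the feature-map sizes are assumptions, and the sketch only shows how pooling indices saved in the encoder are reused by the decoder.

```python
# A minimal sketch of SegNet-style unpooling: max-pooling indices from the encoder
# are reused by max_unpool2d in the decoder to place values back at their positions.
import torch
import torch.nn.functional as F

feat = torch.randn(1, 64, 32, 32)                 # an encoder feature map (assumed)
pooled, indices = F.max_pool2d(feat, kernel_size=2, stride=2, return_indices=True)

# ... decoder reaches the corresponding stage ...
unpooled = F.max_unpool2d(pooled, indices, kernel_size=2, stride=2)
# unpooled is sparse (non-max positions are zero); subsequent convolutions densify it.
print(unpooled.shape)                             # torch.Size([1, 64, 32, 32])
```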

Additional research adopting encoder-decoder frameworks for image segmentation comprises RefineNet (Lin et al. 2017a, b), premised on ResNet-inspired residual connections; global convolutional network (Peng et al. 2017), addressing the precision trade-off between localization and classification; lightweight network LinkNet (Chaurasia and Culurciello 2017); a skin lesion segmentation model implementing Jaccard distance as the loss function (Y. Yuan et al. 2017); Stack Deconvolutional Network (SDN) (Fu et al. 2019a, b); and Semantic Image Segmentation via Deep Parsing Network (DPN) (F. Yuan et al. 2019).

3.2 Multiscale model

Multiscale models represent an advanced category of computer vision architectures that are capable of handling visual information across different levels of granularity. By facilitating feature extraction at multiple scales, these models excel at recognizing the contextual foundations embedded in given images. The emergence of multiscale models has been driven primarily by challenges associated with scale variance and intricate contextual relationships, which traditional computer vision paradigms frequently fail to address due to their limited ability to extract features at fixed scales.

An excellent instance of multiscale models is the Feature Pyramid Network (FPN), as proposed by Lin et al. (2017a, b). FPN was initially created for object detection applications but has now been effectively expanded for segmentation tasks. This architecture exploits the inherent multiscale pyramid hierarchy found in deep CNNs to construct a feature pyramid with minimal supplementary computational burden. FPNs incorporate bottom-up pathways, top-down pathways, and lateral connections to amalgamate low-resolution and high-resolution features. Following this, cascaded feature maps are processed through 3 × 3 convolutional layers to produce outputs for each stage. At each point in the top-down pathway, predictions relevant to object detection are generated. For image segmentation, a pair of multilayer perceptrons (MLPs) is utilized to engender masks.
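The lateral connections and top-down fusion can be sketched as follows; the backbone stage shapes and the 256-channel pyramid width are common conventions assumed here for illustration, not FPN's reference code.

```python
# A minimal FPN top-down sketch: lateral 1x1 convolutions project backbone stages
# to a common width; the coarser level is up-sampled, added, and smoothed.
import torch
import torch.nn as nn
import torch.nn.functional as F

c4, c5 = torch.randn(1, 1024, 28, 28), torch.randn(1, 2048, 14, 14)  # backbone stages (assumed)
lat4, lat5 = nn.Conv2d(1024, 256, 1), nn.Conv2d(2048, 256, 1)        # lateral 1x1 convs
smooth = nn.Conv2d(256, 256, 3, padding=1)

p5 = lat5(c5)                                               # coarsest pyramid level
p4 = smooth(lat4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest"))
# p4 and p5 now share 256 channels and can feed detection or segmentation heads.
```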

Zhao et al. (2017) introduced the Pyramid Scene Parsing Network (PSPNet), an advanced multi-scale architecture that is adept at capturing global contextual representations in scenes (Fig. 9). The PSPNet uses a ResNet for feature extraction and enhances its capabilities with atrous convolutions to extract diverse patterns from input images. Feature maps are fed into a pyramid pooling module designed to capture patterns at different scales. Maps are integrated at four unique scales, corresponding to pyramid hierarchy levels, and processed through a 1 × 1 convolutional layer to reduce dimensionality. Outputs from pyramid levels are up-sampled and fused with the initial feature map, encapsulating both local and global contextual information. A convolutional layer then generates per-pixel predictions. PSPNet outperforms state-of-the-art models such as FCN, DeepLab-v2, DPN, and CRF-RNN on multiple datasets, demonstrating superior segmentation performance. However, handling occlusions between objects remains challenging, with edge segmentation being suboptimal in partially occluded areas. A key limitation of FCN-based models is their inability to effectively exploit category cues within global scenes, resulting in reduced segmentation accuracy and insufficient contextual integration. By implementing the PSPNet, the authors effectively aggregate context from various regions, providing the model with improved global context understanding. PSPNet uses different pool sizes to expand the receptive field, fostering a more comprehensive understanding of global contextual information. The pyramid pooling module gathers hierarchical information more efficiently than global pooling. Computationally, PSPNet imposes minimal overhead on the original atrous convolution FCN network, allowing simultaneous training of global pyramid pooling modules and local FCN features during end-to-end learning.

Fig. 9
figure 9

The architecture of PSPNet (Zhao et al. 2017)
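The pyramid pooling module can be sketched roughly as follows; the bin sizes (1, 2, 3, 6) follow the paper's description, while the feature-map size and channel reduction are assumptions for illustration.

```python
# A minimal pyramid-pooling-module sketch (PSPNet-style): pool to several grid
# sizes, reduce channels, up-sample, and concatenate with the original map.
import torch
import torch.nn as nn
import torch.nn.functional as F

feat = torch.randn(1, 2048, 60, 60)               # backbone output (assumed size)
reduce = nn.ModuleList(nn.Conv2d(2048, 512, 1) for _ in range(4))

branches = [feat]
for conv, bins in zip(reduce, (1, 2, 3, 6)):      # the four pyramid scales
    pooled = F.adaptive_avg_pool2d(feat, bins)    # 1x1, 2x2, 3x3, 6x6 grids
    pooled = conv(pooled)                         # channel reduction
    branches.append(F.interpolate(pooled, size=feat.shape[-2:],
                                  mode="bilinear", align_corners=False))

context = torch.cat(branches, dim=1)              # (1, 2048 + 4*512, 60, 60)
```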

Ghiasi and Fowlkes (2016) described a multi-resolution reconstruction architecture based on Laplacian pyramids, which refines segmentation boundaries reconstructed from low-resolution maps using skip connections and multiplicative gating from high-resolution feature maps. In recent years, studies have demonstrated that the use of contextual features can significantly improve the performance of deep semantic segmentation networks. Contemporary semantics-based approaches differ notably in how they construct semantic structures. He et al. (2019a, b) proposed the Adaptive Pyramid Context Network (APCNet) specifically for semantic segmentation tasks. The APCNet adaptively assembles multi-scale context representations using a series of well-designed Adaptive Context Modules (ACMs). Each ACM estimates local affinity coefficients for individual sub-regions under the guidance of global image information and subsequently computes context vectors based on these affinities. APCNet achieves state-of-the-art results on a variety of semantic segmentation benchmark datasets.

Additional models utilize multi-scale analysis for segmentation, encompassing the likes of Unified Perceptual Parsing Network (UPerNet) (Xiao et al. 2018a, b), Context Contrasted Network with gated multiscale aggregation (CCN) (Ding et al. 2018), Multi-Scale Context Intertwining (MSCI) (Lin et al. 2018), Dynamic Multiscale Filters Network (DMNet) (He et al. 2019a, b), Enhanced Feature Pyramid Network (EFPN) (Wang et al. 2021a, b), and Feature Pyramid Aggregation Network (FPAN) (Wu et al. 2022a, b, c).

3.3 Atrous convolutional models

Atrous convolution, alternatively known as dilated convolution, has a hyper-parameter referred to as the atrous rate, which defines the spacing between the values sampled by the convolution kernel. For example, a 3 × 3 kernel with an atrous rate of 2 has a receptive field the size of a 5 × 5 kernel while using only nine parameters, and with an atrous rate of 3 its receptive field matches that of a 7 × 7 kernel. This allows the receptive field to be expanded without incurring additional computational cost. Moreover, by combining different atrous rates, a wide range of receptive fields can be obtained, effectively capturing multi-scale information.
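The relationship between atrous rate and receptive field can be verified with a short PyTorch snippet; the tensor sizes are arbitrary assumptions, and the padding is set to the dilation rate so the spatial size is preserved.

```python
# A minimal dilated-convolution sketch: the same 3x3 kernel (nine parameters)
# covers a span of dilation * (kernel_size - 1) + 1 as the atrous rate grows.
import torch
import torch.nn as nn

x = torch.randn(1, 64, 65, 65)
for rate in (1, 2, 3):
    conv = nn.Conv2d(64, 64, kernel_size=3, dilation=rate, padding=rate)
    span = rate * (3 - 1) + 1                      # 3, 5, 7 for rates 1, 2, 3
    print(rate, span, conv(x).shape)               # spatial size preserved by padding=rate
```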

Chen et al. (2014, 2016) proposed DeepLab-v1, an innovative approach that first employs atrous convolutions to mitigate the information loss stemming from pooling operations, followed by Conditional Random Fields (CRFs) to further improve segmentation accuracy. In the subsequent iteration, DeepLab-v2 (Chen et al. 2018a, b, c, d), the more powerful and expressive ResNet-101 replaces VGG16 (Fig. 10). Within this refined version, Chen et al. exploit atrous convolutions and propose the Atrous Spatial Pyramid Pooling (ASPP) module, as shown in the accompanying figure, while preserving the fully connected CRF components.

Fig. 10
figure 10

Atrous Spatial Pyramid Pooling (ASPP). To classify the center pixel (orange), ASPP exploits multi-scale features by employing multiple parallel filters with different rates. From (L.-C. Chen et al. 2018a, b, c, d)
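A rough ASPP sketch is given below; the rates (6, 12, 18), channel widths, and the omission of the image-level pooling branch are simplifying assumptions for illustration, not the DeepLab reference implementation.

```python
# A minimal ASPP sketch: parallel atrous convolutions with different rates plus a
# 1x1 branch are applied to the same feature map and concatenated.
import torch
import torch.nn as nn

in_ch, out_ch = 2048, 256
branches = nn.ModuleList([nn.Conv2d(in_ch, out_ch, 1)] + [
    nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in (6, 12, 18)
])
project = nn.Conv2d(4 * out_ch, out_ch, 1)        # fuse the parallel branches

feat = torch.randn(1, in_ch, 33, 33)              # backbone feature map (assumed size)
aspp_out = project(torch.cat([b(feat) for b in branches], dim=1))   # (1, 256, 33, 33)
```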

Chen et al. (2017) advanced DeepLab-v2 by introducing DeepLab-v3, a notable improvement that eliminates the use of fully connected CRFs. They developed a deeper network architecture by replicating the last block of the ResNet (i.e., block4) multiple times (i.e., block5-block7) and appending these duplicates in a cascading fashion to the network's backend. Each block encompasses three 3 × 3 convolutional layers, where all but the last block's final convolutional layer adopts a stride of 2. Moreover, they proposed a technique termed Multi-grid, which applies atrous convolutions at different rates in blocks 4 to 7. Within the ASPP module, batch normalization was incorporated, the atrous convolution with rate = 24 was replaced by a 1 × 1 convolution, and image-level features were integrated.

Subsequently, Chen et al. (2018a) presented DeepLab-v3+. In contrast to its predecessors, v3+ exhibits significant architectural changes, including the addition of a simple but effective decoding module, as shown in Fig. 11. The v3+ model uses the v3 network for encoding, substituting ResNet-101 with the deeper Xception (Chollet 2017) architecture. In addition, depthwise separable convolution is introduced in both the ASPP module and the decoding module to reduce network parameters, with atrous separable convolutions replacing the standard atrous convolutions within the framework.

Fig. 11
figure 11

The architecture of DeepLabv3+ (Chen et al. 2018b)

Dilated convolution expands the comprehension capabilities of CNNs regarding input images, enhancing their performance in image processing tasks and garnering significant attention within the image segmentation domain. Apart from the DeepLab family, numerous related models have been developed, such as multiscale context aggregation (Yu and Koltun 2016), Locality-Sensitive Deconvolution Networks (LS-DeconvNet) (Cheng et al. 2017), Dense Upsampling Convolution and Hybrid Dilated Convolution (DUC-HDC) (Wang et al. 2018), and densely connected Atrous Spatial Pyramid Pooling (DenseASPP) (Yang et al. 2018).

3.4 Graph convolutional neural network

Contrasting with traditional grid-structured CNNs, Graph Convolutional Networks (GCNs) can process data of arbitrary geometries, encompassing unordered point clouds, complex surfaces, and intricate polygons. GCNs facilitate local feature extraction for individual nodes within a graph, while simultaneously accounting for interrelations between a node and its neighboring counterparts. This capability enables GCNs to adeptly handle multi-scale information in diverse contexts.

In the field of deep learning, advanced feature extraction often overlooks the significance of local positional information, which is crucial for semantic segmentation. To address this limitation, Lu et al. (2019) introduced a graph model initialized by an FCN, aptly named Graph-FCN (Fig. 12), explicitly designed for semantic segmentation tasks. The authors ingeniously constructed a graph network model using the intermediate feature layer derived from a semantic segmentation network, where each pixel location within the feature layer acts as a graph node connected to its neighboring nodes. The graph network model was introduced to the GCN, facilitating classification on individual nodes, thereby transforming the semantic segmentation challenge of classifying discrete pixels into a graph node classification task. This pioneering approach harnessed the prowess of Graph Convolutional Networks to tackle the graph node classification puzzle, marking a groundbreaking application of GCNs to image semantic segmentation.

Fig. 12
figure 12

The structure of the Graph-FCN. There are two outputs of the model, and two losses L1 and L2. They share the weights of the feature extracted by the convolutional layer. L1 is calculated by output1 and L2 is calculated by the output2. By minimizing L1 and L2, the FCN-16s can improve performance (Lu et al. 2019)

Contextual inference using image regions beyond local convolutions has demonstrated prodigious potential in scene parsing and has attracted considerable attention. In this pursuit, Wu et al. (2020) incorporated linguistic knowledge to facilitate contextual reasoning within image regions, formulating a Graph Interaction Unit (GI Unit) and a Semantic Context Loss (SC-loss). GI Units augment high-level semantic features in convolutional networks and adaptively learn semantic coherence for individual samples. Specifically, dataset-based linguistic knowledge is initially embedded within GI Units to encourage contextual reasoning on visual graphs. This is followed by the mapping of evolved visual graph representations onto each local representation to strengthen discriminative capabilities in scene parsing. SC-loss further refines GI Units, enhancing semantic representation on sample-based semantic graphs.

GCNs have made significant progress in graph-centric tasks, such as image segmentation, resulting in the creation of numerous high-performing models. Notable works include Spatial Pyramid Based Graph Reasoning (Li et al. 2020), Graph Matching Network (GMNet) (Michieli et al. 2020), Boundary-Aware Semi-Supervised Segmentation Network (Graph-BAS3Net) (Huang et al. 2021a, b, c), Boundary-aware Graph Convolution (BGC) (Hu et al. 2021a, b), Exploit Visual Dependency Relations (EVDR) (Liu et al. 2021a, b, c), and weakly supervised image semantic segmentation predicated on image-level class labels (Pan et al. 2021). Despite Graph Convolutional Networks showcasing formidable performance across segmentation domains, they may confront computational and storage constraints when processing large-scale images and necessitate copious training data to attain superior prediction results. Moreover, GCN's susceptibility to graph topology structure may precipitate suboptimal performance under particular circumstances.

3.5 Instance segmentation network

Instance segmentation is intimately intertwined with other computer vision tasks, involving not only the per-pixel classification intricacies of semantic segmentation but also attributes of object detection, such as identifying unique instances within an image and assigning individual masks.

Hariharan et al. (2014) developed the Simultaneous Detection and Segmentation (SDS) model, whose originality and capability expanded the possibilities for instance segmentation research. The model leverages the MCG algorithm to extract candidate regions for each image while simultaneously obtaining feature vectors for both detection bounding boxes and regional foreground via dual pathways. Region classification is performed using a Support Vector Machine (SVM) (Chang and Lin 2011), and overlapping regions are further refined with Non-Maximum Suppression (NMS) (Neubeck and Van Gool 2006). Ultimately, CNN-generated features are leveraged for mask prediction, culminating in meticulously segmented images.

Distinctively, the Faster R-CNN (Ren et al. 2016) architecture (Fig. 13) includes a region proposal network (RPN) for generating bounding box candidates. The RPN obtains regions of interest (RoIs), and the RoIPool layer computes features from these proposals to infer object bounding box coordinates and categories. Expanding upon Faster R-CNN's foundation, He et al. (2017a) developed the Mask R-CNN model, establishing it as the benchmark for instance segmentation tasks. This model adds a segmentation subnetwork to the existing object detection framework (Fig. 14). It first uses the RPN to isolate RoIs of objects within input images, conducts RoI Align operations on the generated RoIs, and then predicts detection boxes, class labels, and segmentation masks for all RoIs.

Fig. 13
figure 13

The architecture of Faster R-CNN (Ren et al. 2016)

Fig. 14
figure 14

The architecture of Mask R-CNN (He et al. 2017b)
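The RoI Align step can be illustrated with torchvision's roi_align operator; the feature map, proposal boxes, and the stride-16 spatial scale are made-up values for this sketch rather than Mask R-CNN's actual configuration.

```python
# A minimal RoI Align usage sketch: fixed-size region features are extracted from a
# feature map for each proposal, as done before the box, class, and mask heads.
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)                # feature map, stride 16 assumed
# Each row: (batch_index, x1, y1, x2, y2) in input-image coordinates.
boxes = torch.tensor([[0, 32.0, 48.0, 256.0, 320.0],
                      [0, 100.0, 60.0, 220.0, 200.0]])

roi_feats = roi_align(feat, boxes, output_size=(7, 7),
                      spatial_scale=1 / 16, aligned=True)
print(roi_feats.shape)                            # torch.Size([2, 256, 7, 7])
```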

Dai et al. (2016a, b) proposed a more effective approach to instance segmentation by implementing Multi-task Network Cascades (MNC) that feature cascaded structure sharing convolutional characteristics. The MNC model dissects instance segmentation into three distinct tasks. Initially, an RPN is engineered through FCN to predict bounding box positions and object scores. Then, mask estimation is executed, projecting pixel-level masks separately for each instance object. Ultimately, the process of object classification combines convolutional features from the first two stages and assigns appropriate class labels to each mask, resulting in the categorization of objects.

Liu et al. (2018) introduced the Path Aggregation Network (PANet) based on the Mask R-CNN and FPN models (Fig. 15). The network's feature extractor utilizes an FPN backbone and includes a new bottom-up path to enhance the propagation of lower-level features. Each stage in the third path takes the feature map of the preceding stage as input, which is then processed by a 3 × 3 convolutional layer. A lateral connection supplements the output by directing it to the same-level feature map on the top-down pathway, thus preparing it for the following stage.

Fig. 15
figure 15

The architecture of PANet. (a) FPN backbone. (b) Bottom-up path augmentation. (c) Adaptive feature pooling. (d) Box branch. (e) Fully-connected fusion. (Liu et al. 2018)

Huang et al. (2019a, b) introduced the Mask Scoring R-CNN (MS R-CNN) model, which expands on Mask R-CNN by including a Mask IoU evaluation branch that calculates scores using features produced by RoI Align and the predicted masks. Wang et al. (2019a, b, c, 2020) analyzed the relationship between the object detection and instance segmentation tasks and put forward RDSNet. In this approach, images are divided into object and segmentation branches after moving through an FPN backbone network; the segmentation branch conducts pixel clustering using target embeddings to produce target masks, while the object branch distinguishes instance object categories and location information. This design guarantees complete feature information with minimal loss. Chen et al. (2019) proposed the instance segmentation model MaskLab, which improves object detection by utilizing semantic and directional features predicated on Faster R-CNN. Moreover, direct mask generation instance segmentation models encompass SISDLF (De Brabandere et al. 2017), DeepMask (Chen et al. 2018a, b, c, d), and CenterMask (Lee and Park 2020). PolarMask (Xie et al. 2020) employs polar coordinates to model contour-encoded masks. Instance segmentation models predicated on positional information include InstanceFCN (Dai et al. 2016a, b), FCIS (Li et al. 2017), and SOLO (Wang et al. 2020a, b).

3.6 Generative models and adversarial training

GANs improve the accuracy and robustness of image segmentation while effectively addressing data imbalance and limited samples. By concurrently training a generator and a discriminator, GANs can produce more authentic and accurate segmented images, which can increase diagnostic and decision-making accuracy in various domains such as medical imaging, autonomous driving, and disease detection. Moreover, GANs can augment datasets, providing better assurance of a model's ability to generalize. Consequently, GANs hold substantial significance in the field of image segmentation.

Luc et al. (2016) proposed a method for semantic segmentation utilizing GANs in 2016. Their approach employed an FCN as the generator model (Fig. 16). The training process involved feeding the generated semantic segmentation images and ground truth images into the discriminator model to facilitate adversarial training. Using this paradigm, the generator model learns to create precise semantic segmentation images in increments. Meanwhile, the discriminator model improves its ability to discriminate between generated semantic segmentation images and ground truth images. This methodology has been shown to yield high-caliber semantic segmentation results, surpassing traditional unsupervised learning techniques in numerous tasks.

Fig. 16
figure 16

Overview of the proposed approach. Left: segmentation net takes RGB image as input, and produces per-pixel class predictions. Right: Adversarial net takes label map as input and produces class label (1 = ground truth, or 0 = synthetic). Adversarial optionally also takes RGB images as input (Luc et al. 2016)

Souly et al. (2017) proposed a semi-weakly supervised semantic segmentation paradigm based on GANs. This study offers a new perspective by regarding the segmentation network as a discriminator and using the GAN generator to augment the training dataset to optimize training efficacy. The model is classified into two categories: semi-supervised and weakly-supervised, based on the absence or presence of classification labels in the supplementary data. In the case of the weakly-supervised model, which employs classification labels, a conditional GAN is used by the GAN generator, with the image's classification label functioning as input.

Xue et al. (2018) presented an innovative medical image segmentation technique that utilizes a distinctive adversarial network structure, called SegAN. This architecture comprises dual components: a generator and a discriminator. The generator is trained with a multiscale L1 loss, which produces more accurate segmentation results, while the discriminator uses an adversarial loss during training to discriminate between real and generated images. Dai et al. (2018) presented the Structure Correcting Adversarial Network (SCAN), which includes a critic network that imposes the structural regularities emerging from chest X-ray (CXR) imagery onto convolutional segmentation networks. This approach has undergone testing across multiple datasets and has achieved state-of-the-art outcomes in numerous instances.

Li et al. (2021) introduced a semisupervised learning and robust out-of-domain generalization methodology for semantic segmentation that uses a generative model to determine representations of unlabeled images. The model first extracts representations from labeled images and then applies them to unlabeled counterparts. This method supports better performance on unlabeled data and superior generalization to novel domains.

Alimanov and Islam (2023) proposed a new Denoising Diffusion Probabilistic Model (DDPM), which is superior to GANs in image synthesis. The authors developed a Retina Tree (ReTree) dataset consisting of retinal images and corresponding blood vessel trees, and trained a DDPM-based segmentation network using images from the ReTree dataset. In the first stage, the blood vessel tree is generated from standard normally distributed random numbers; the model is then guided to generate fundus images based on a given blood vessel tree and a random distribution.

GANs offer several advantages for image segmentation: they generate high-quality images, facilitate data augmentation, enable unsupervised and semi-supervised learning, improve feature learning, suppress overfitting, and enhance contextual understanding. These methods have shown remarkable success in domains such as medical imaging (Xun et al. 2022), agriculture (Lu et al. 2022), and autonomous driving (Liu et al. 2021a, b, c) by improving segmentation accuracy and robustness. Despite these advancements, issues related to model generalization capabilities and dataset-specific problems remain significant challenges within this field. Therefore, researchers have focused on techniques such as unsupervised domain adaptation (Ma et al. 2024) and semi-supervised learning (Peláez-Vegas et al. 2023) to explore these areas further.

3.7 Attention-based models

The attention mechanism has become a pivotal and sophisticated technique used extensively in various deep learning tasks. Attention mechanisms were inspired by human visual cognition, in which the brain rapidly discerns the focus of attention from incoming visual signals. When humans perceive an image, they determine where to concentrate their attention, allocate less attention to peripheral regions rather than processing every pixel at once, and adjust the focal point over time. In 2014, the Google DeepMind team (Mnih et al. 2014) pioneered the incorporation of attention mechanisms into RNN models for image classification, an approach that subsequently gained traction. Bahdanau et al. (2014) were the first to apply attention mechanisms in NLP, using attention components to enhance the original encoder-decoder architecture and achieving remarkable improvements in English-French translation tasks.

Chen et al. (2014, 2016) introduced an innovative scale attention module capable of soft-weighting multi-scale features at each pixel location (Fig. 17). This adaptive technique allows for the identification of significant regions of an image, leading to improved model performance. This study stands as one of the earliest endeavors to utilize attention mechanisms in the domain of image segmentation.

Fig. 17
figure 17

The attention model makes use of features from FCNs and produces weight maps, reflecting how to do a weighted merge of the FCN-produced score maps at different scales and different positions (Chen et al. 2016)

The restriction of convolutional filters in CNNs to local areas hinders a comprehensive understanding of complex scenes. To address this issue, Zhao et al. (2018) devised the Point-wise Spatial Attention Network (PSANet) to alleviate the constraints of the local neighborhood. Using adaptively learned attention masks, each position on the feature map establishes connections with all others, fostering the bidirectional information propagation essential for scene parsing. Gathering information from other locations improves predictions for the current position, while disseminating information from the current position bolsters predictions for other locations.

CNNs extract features using local receptive fields, which can result in different representations of identical pixel labels in the final feature map. This can cause class inconsistency and inaccurate segmentation. To address this issue, RNN-based models (and their LSTM variants) capture long-range dependencies within feature maps, improving model precision. However, the effectiveness of these methods hinges on the outcomes derived from long-term memory learning, which can be rather generic. Fu et al. (2019a, b) presented the Dual Attention Network (DANet) as a solution. DANet utilizes a self-attention mechanism to capture feature dependencies across spatial and channel dimensions independently. Additionally, it adaptively combines local features with their global dependencies.
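A spatial self-attention block in the spirit of DANet's position attention module can be sketched as follows; the channel sizes and the C/8 query/key reduction are assumptions, and the sketch omits the channel attention branch and the learnable residual weight.

```python
# A minimal spatial self-attention sketch: each position attends to all others,
# so distant pixels with similar content reinforce each other's features.
import torch
import torch.nn as nn
import torch.nn.functional as F

C, H, W = 64, 32, 32
query = nn.Conv2d(C, C // 8, 1)
key = nn.Conv2d(C, C // 8, 1)
value = nn.Conv2d(C, C, 1)

x = torch.randn(1, C, H, W)
q = query(x).flatten(2).transpose(1, 2)           # (1, HW, C/8)
k = key(x).flatten(2)                             # (1, C/8, HW)
v = value(x).flatten(2)                           # (1, C, HW)

attn = F.softmax(q @ k, dim=-1)                   # (1, HW, HW) pairwise affinities
out = (v @ attn.transpose(1, 2)).view(1, C, H, W) # aggregate features over all positions
out = out + x                                     # residual connection
```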

Huang et al. (2019a, b) proposed the novel CCNet, positing that capturing long-range dependencies in visual understanding problems can provide informative context. CCNet is built around a Criss-Cross Attention (CCA) module applied recurrently. For every pixel, the CCA module extracts contextual information from the pixels along its criss-cross path; with further iterations, every pixel eventually captures long-range dependencies from all other pixels. Compared with non-local modules, the recurrent criss-cross attention design demonstrates superior computational efficiency and reduced GPU memory consumption.

Zhong et al. (2020) introduced the Squeeze-and-Attention Network (SANet), an innovative approach to semantic segmentation tasks. They postulated that semantic segmentation involves two distinct sub-tasks: pixel prediction and pixel grouping. To address the pixel grouping challenge, they developed the SA module, which improves prediction accuracy. Drawing inspiration from Squeeze-Excitation Networks (SENet), SANet expands upon this concept by mitigating local constraints imposed by convolutional kernels and integrating attention-based convolution channels. This approach effectively supplements traditional convolutional layers with attention-focused processing for pixel groupings, thereby accounting for spatial-channel dependencies.

Additional applications of attention mechanisms within semantic segmentation include EncNet (Zhang et al. 2018a, b), which features the Context Encoding Module; the real-time Bilateral Segmentation Network (BiSeNet) (Yu et al. 2018a, b); the Expectation-Maximization Attention Network (EMANet) (Li et al. 2019a, b); the Deep Feature Aggregation Network (DFANet) (Li et al. 2019a, b); Asymmetric Non-Local Neural Networks (ANNNet) (Zhu et al. 2019); the object context-aware OCNet (Yuan et al. 2021); and SegNeXt (Guo et al. 2022). These models focus on object recognition, feature aggregation, and attention networks to improve the quality of semantic segmentation.

3.8 Transformer-based models

Zheng et al. (2021) applied Transformers to semantic segmentation by introducing the SETR (Segmentation Transformer) model. SETR reformulates semantic segmentation as a sequence-to-sequence prediction task, removing the need to learn local-to-global features by progressively lowering the resolution, and exclusively employs a Transformer-based architecture. Initially, the ViT (Vision Transformer) backbone dissects images into fixed-size patches that undergo a linear transformation. Pixel vectors and positional encodings for each patch are then combined and fed to the encoder. After 24 layers of Transformer learning, the global features of the image are extracted, and a decoder restores the original image resolution (Fig. 18).

Fig. 18
figure 18

The architecture of SETR. (a) The images are split into fixed-size patches, each patch is linearly embedded, positional embeddings are added, and the resulting vector sequence is fed to a standard Transformer encoder. (b) Progressive upsampling. (c) Multi-level feature aggregation. (Zheng et al. 2021)

Strudel et al. (2021) introduced Segmenter, a customized Transformer model designed for semantic segmentation tasks. Since individual patches in image segmentation often present ambiguity, contextual information is essential to reach a label consensus. During the encoding phase, Segmenter utilizes a ViT architecture to split images into patches, applies a linear mapping, and generates an embedded sequence after encoder processing. During the decoding phase, learned class embeddings are fed into the decoder alongside the encoder output. Class labels are obtained using either a linear decoder or a masked transformer decoder, and the final pixel segmentation map results after operations such as softmax and up-sampling.

Xie et al. (2021) proposed SegFormer, which combines Transformers with lightweight Multi-Layer Perceptron (MLP) decoders. SegFormer embraces hierarchical feature representation, reducing output feature dimensions at each Transformer layer during the encoding process to capture multi-scale feature information. The position embeddings found in ViT are omitted, preventing performance degradation caused by differences between training and testing image dimensions. The proposed MLP decoder utilizes a simple MLP framework to combine features across disparate encoder layer scales, integrating local and global attention mechanisms. This research substantiated that such basic and lightweight designs are crucial for achieving efficient segmentation on Transformers.

SCTNet (Xu et al. 2024) is a new state-of-the-art real-time semantic segmentation network that combines the features of Transformer semantic information with single-branch CNN. The network not only maintains the efficiency of a lightweight single-branch CNN but also possesses rich semantic representation capabilities, enabling it to achieve the best balance between performance and speed on multiple semantic segmentation datasets. On the Cityscapes dataset, SCTNet achieved an excellent performance of 80.5% mIoU and 62.8 FPS.

Transformers exhibit impressive feature learning capabilities, and their strong long-range modeling capabilities and dynamic responsiveness make their efficiency and capacity in segmentation an important area for future research. A variety of Transformer-based segmentation approaches have been developed, including the pioneering end-to-end panoptic segmentation model MaX-DeepLab (Wang et al. 2021a, b), TransUNet (Chen et al. 2021a, b), which incorporates ViT into U-Net for medical image segmentation, the Video Instance Segmentation Transformer (VisTR) (Wu et al. 2022a, b, c), the panoptic segmentation benchmark Panoptic SegFormer (Li et al. 2022a, b, c), distortion-aware transformers (Zhang et al. 2022), weakly supervised semantic segmentation (Xu et al. 2022), end-to-end weakly supervised semantic segmentation with transformers (Ru et al. 2022), and one-stage camouflaged instance segmentation with transformers (OSFormer) (Pei et al. 2022).

4 Dataset and performance comparison

In this section, we review the most prevalent plant image datasets used for training and testing DL-based image segmentation models, provide an overview of typical metrics employed for evaluating segmentation model performance, and present evaluation results of DL-based segmentation models on established benchmark datasets.

4.1 Plant datasets

In contrast to prominent image segmentation benchmarks such as PASCAL VOC 2012 (Everingham et al. 2010), PASCAL-Context (Mottaghi et al. 2014), CamVid, Cityscapes (Cordts et al. 2016), and ADE20K, the domain of plant image segmentation lacks a comprehensive and unified dataset. Plant image segmentation datasets are typically obtained through custom data collection or online retrieval. We have compiled various publicly available datasets containing plant images (Table 3). These datasets are commonly utilized for training and evaluating image segmentation algorithms. A detailed analysis of each dataset's unique characteristics is provided, along with the associated parameters.

Table 3 A summary of parameters related to the plant public dataset collected

LeafSnap (Kumar et al. 2012) is a substantial image repository tailored for plant identification, showcasing an extensive variety of leaf images from a multitude of tree species (Fig. 19). The dataset encompasses 30,866 leaf images distributed across 185 distinct species, primarily located along the eastern seaboard of the United States, incorporating abundant urban park, arboreal, and botanical garden specimens. Each tree species is represented by a large corpus of leaf images, with individual images meticulously annotated with pertinent species information and supplementary metadata. Beyond the visual data, the dataset also supplies exhaustive descriptions correlated with each species, encapsulating facets such as foliar characteristics and growth habitats. The LeafSnap dataset has been widely harnessed in research spanning machine learning, computer vision, and artificial intelligence, fostering a deeper comprehension and preservation of plant species in the natural world.

Fig. 19
figure 19

An example of LeafSnap (Kumar et al. 2012)

The Leaf Segmentation Challenge (LSC) dataset (Minervini et al. 2014, 2016) is a specialized collection designed for leaf image segmentation (Fig. 20). It includes four separate directories (A1, A2, A3, and A4) and a total of 810 digital images, which consist of 783 top-view Arabidopsis thaliana plant images and 27 high-resolution Nicotiana tabacum (tobacco) plant images. The visual data covers a wide range of growth stages and presents several challenges, including complex backgrounds, varying resolutions, and overlapping plants. Each image in the LSC dataset comes with a binary mask file that marks the foreground region of individual leaves, which is crucial for validating and benchmarking leaf image segmentation algorithms. The dataset is divided into four directories. Directory A1 contains 128 images, each with dimensions of 500 × 530 pixels, featuring complex and varied backgrounds. Directory A2 contains 31 images, with uniform dimensions of 530 × 565 pixels, and featuring homogeneous and simple backgrounds. Directory A3 consists of 27 high-resolution images of the Nicotiana tabacum plant, with dimensions of 2448 × 2048 pixels. Finally, Directory A4 comprises 624 images of the Arabidopsis thaliana plant, each with dimensions of 441 × 441 pixels.

Fig. 20
figure 20

Sample images from the LSC dataset with their corresponding ground truths (Minervini et al. 2014, 2016)

GrowliFlower (Kierdorf et al. 2022) is a georeferenced time-series dataset captured by UAVs over two cauliflower fields (0.39 and 0.60 hectares) monitored in 2020 and 2021 (Fig. 21). The dataset comprises orthoimages with RGB and multispectral bands and the coordinates of about 14,000 plants, allowing complete and partial image patches to be extracted. Phenotypic traits of 740 plants, such as plant size and developmental stage, are also included. GrowliFlower provides four subsets that can be used for distinct machine learning tasks. The GrowliFlowerL subset consists of manually annotated pixel-level labels on image patches of 368 × 448 pixels. It is divided into training, validation, and test sets and contains 2,198 image patches in total, of which 1,972 contain plants and 226 do not.

Fig. 21
figure 21

An example of GrowliFlowerL (Kierdorf et al. 2022)

The MSU-PID dataset (Cruz et al. 2016) comprises 576 annotated images of Arabidopsis thaliana and 175 images of broad bean plants. The images were captured indoors with a Hitachi KP-F145GV CCD camera over 9 days (15 images per day) and 5 days (13 images per day), respectively. The imaging setup integrates fluorescence, infrared, RGB color, and depth sensors, resulting in a different spatial resolution for each modality. The Arabidopsis images measure 240 × 240 pixels for fluorescence and IR, 120 × 120 for RGB, and 25 × 25 for depth, while the broad bean images measure 1000 × 640 for fluorescence and IR, 380 × 720 for RGB, and 90 × 190 for depth.

The Fig dataset (Fuentes-Pacheco et al. 2019) comprises aerial RGB images of fig shrubs captured by an RGB camera mounted on an unmanned aerial vehicle (UAV). It features 10 precisely labeled RGB images with resolutions up to 2000 × 1500 pixels (Fig. 22); the corresponding ground-truth labels were generated manually with image annotation tools. Because the images were taken during the dry season, the leaves of most fig shrubs overlap and are covered with dust. The images pose various computer vision challenges, such as variations in lighting and shading, the presence of weeds, diverse shades of soil, camouflaged vegetation, and an array of debris (e.g., rocks, dried branches, and tools used by agricultural workers).

Fig. 22
figure 22

An example of Fig dataset (Fuentes-Pacheco et al. 2019)

The Komatsuna dataset (Uchiyama et al. 2017) consists of two segments (Fig. 23). The first comprises 300 images captured with an RGB-D camera at a resolution of 166 × 190 pixels. The second is a multi-view collection acquired from several RGB cameras, comprising 600 images at a resolution of 480 × 480 pixels.

Fig. 23
figure 23

Sample images from the Komatsuna dataset with their corresponding ground truths (Uchiyama et al. 2017)

4.2 Metrics for image segmentation models

In image segmentation tasks, intersection over union (IoU) is a key evaluation metric. It is computed as the ratio between the intersection and the union of the predicted and ground-truth regions, quantifying, for each class, how well the predicted pixels overlap with the annotated pixels. The formula is:

$${\text{IoU}}= \frac{TP}{TP+FN+FP}$$

Within this framework, TP (True Positive) is the number of pixels assigned to the class in both the ground truth and the prediction. FN (False Negative) is the number of pixels that belong to the class in the ground truth but are predicted as a different class. FP (False Positive) is the number of pixels predicted as the class that belong to a different class in the ground truth.
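To make the definitions concrete, the following minimal sketch (NumPy only; the function name and toy masks are illustrative) computes IoU for a single class directly from the TP, FP, and FN pixel counts:

```python
import numpy as np

def binary_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """IoU for a single class given boolean prediction and ground-truth masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.logical_and(pred, target).sum()          # pixels correct in both
    fp = np.logical_and(pred, ~target).sum()         # predicted but absent from ground truth
    fn = np.logical_and(~pred, target).sum()         # in ground truth but missed
    union = tp + fp + fn
    return float(tp) / union if union > 0 else 1.0   # convention: two empty masks give IoU 1

# Toy 2 x 2 example
pred = np.array([[1, 0], [1, 1]])
gt = np.array([[1, 1], [0, 1]])
print(binary_iou(pred, gt))  # 2 / (2 + 1 + 1) = 0.5
```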

Pixel accuracy is computed as the ratio of correctly classified pixels to the total number of pixels. This metric works well when class distributions are balanced, but it can be misleading under class imbalance. The formula is:

$$\mathrm{Pixel}\;\mathrm{Accuracy}=\frac{\sum_{i=1}^NT_i}{\sum_{i=1}^N\left(T_i+F_i\right)}$$

In this context, \({T}_{i}\) represents the number of correctly predicted pixels for the i-th class, \({F}_{i}\) denotes the number of incorrectly predicted pixels for the i-th class, and N signifies the total number of classes.

Mean accuracy is a metric obtained by dividing the number of correctly classified pixels for each class by the total pixel count of that class and then averaging the results across all classes. It is more robust than pixel accuracy in situations with class imbalance. The specific formula is as follows:

$$\mathrm{Mean}\;\mathrm{Accuracy}=\frac1N\sum_{i=1}^N\frac{T_i}{T_i+F_i}$$

The meanings of \({T}_{i}\) and \({F}_{i}\) in mean accuracy are identical to those in pixel accuracy.

Mean Intersection over Union (mIoU) is obtained by averaging the Intersection over Union (IoU) for each class and can be used to evaluate the overall performance of a model. The specific formula is as follows:

$${\text{mIoU}}= \frac{1}{N}\sum_{i=1}^{N}\frac{{TP}_{i}}{{TP}_{i}+{FN}_{i}+{FP}_{i}}$$

The meanings of \({TP}_{i}\), \({FN}_{i}\), and \({FP}_{i}\) in mIoU are consistent with their definitions in IoU.

F1 Score is an evaluation metric that combines precision and recall, suitable for assessing binary classification problems (e.g., background/foreground). It provides a comprehensive evaluation of accuracy and completeness. The specific formula is as follows:

$${{\text{F}}}_{1}\mathrm{ Score}= 2\times \frac{Precision\times Recall}{Precision+Recall}$$

In this context, Precision represents the proportion of true positive samples among all samples predicted as positive, while Recall denotes the proportion of true positive samples among all samples that are positive.
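All of the metrics above can be derived from a single class-by-class confusion matrix. The sketch below is a minimal NumPy illustration (the helper names are ours, and \({F}_{i}\) is interpreted as the false negatives of class i, matching the formulas as written):

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """num_classes x num_classes matrix: rows are ground truth, columns are predictions."""
    valid = (target >= 0) & (target < num_classes)
    idx = num_classes * target[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(conf):
    """Pixel accuracy, mean accuracy, mIoU, and per-class F1 from a confusion matrix."""
    tp = np.diag(conf).astype(float)            # correctly classified pixels per class
    fn = conf.sum(axis=1) - tp                  # ground-truth pixels missed per class
    fp = conf.sum(axis=0) - tp                  # pixels wrongly assigned to each class
    with np.errstate(divide="ignore", invalid="ignore"):
        pixel_acc = tp.sum() / conf.sum()
        mean_acc = np.nanmean(tp / (tp + fn))   # per-class recall, averaged over classes
        miou = np.nanmean(tp / (tp + fn + fp))
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
    return {"pixel_acc": pixel_acc, "mean_acc": mean_acc, "mIoU": miou, "F1_per_class": f1}

# Toy example: 3-class prediction on a flattened label map
gt = np.array([0, 0, 1, 1, 2, 2])
pr = np.array([0, 1, 1, 1, 2, 0])
print(segmentation_metrics(confusion_matrix(pr, gt, 3)))
```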

4.3 Quantitative performance of DL-based models

In this section, we list the performance of some of the algorithms discussed previously on popular segmentation benchmarks in Table 4. Although most published models report their performance on standard datasets using standard metrics, some do not, making a comprehensive comparison challenging. Furthermore, only a few publications report additional information, such as execution time and memory consumption, in a reproducible manner. These figures are important for industrial applications that run on embedded systems (e.g., drones, autonomous vehicles, and robots), where limited computational capacity and storage space call for lightweight models.

Table 4 A summary of papers for semantic segmentation of natural images applied to Public dataset

5 Agricultural application

This section presents a summary review of image segmentation application studies in agriculture. These applications are loosely organized into four areas: plant disease identification (Table 5), weed identification (Table 6), crop growth monitoring (Table 7), and crop yield estimation and counting (Table 8).

Table 5 Summary of DL approaches in plant disease detection
Table 6 Summary of DL approaches in weed identification
Table 7 Summary of DL approaches in crop growth monitoring
Table 8 Summary of DL approaches in crop yield estimation and counting

5.1 Plant disease identification

Plant diseases and pests have a significant impact on crop yield and quality: they limit plant growth and ultimately decrease the quantity and quality of crops, thus diminishing farmers' productivity and income (Hasan et al. 2022). Consequently, rapid and accurate detection of plant diseases is critical (Chouhan et al. 2019a, b). Image segmentation has emerged as an efficient detection tool, surpassing traditional manual inspection by extracting valuable information from digital images. Although traditional image segmentation techniques remain relevant (Abdu et al. 2019; Zhang et al. 2018a, b), deep learning has made remarkable advances in digital image processing and significantly outperforms traditional approaches (Liu and Wang 2021; Wang et al. 2022a, b). Plant disease and pest identification with deep learning has therefore become a central research area. Segmentation networks address plant disease and pest detection by performing semantic or instance segmentation on affected and healthy regions. This enables fine-grained delineation of diseased areas while capturing location, category, and associated geometric attributes, including length, width, area, contour, and centroid. However, detecting plant diseases and pests in complex natural environments presents numerous challenges, including subtle distinctions between diseased regions and backgrounds, low contrast, considerable variation in the size and type of affected areas, and extensive image noise. Traditional methods often fail to provide accurate detection results in these situations.

CNNs demonstrate exceptional image feature extraction capabilities, making them well suited to plant disease recognition and segmentation tasks. By training binary or multi-class segmentation models, CNNs can accurately differentiate between healthy and infected plant tissues in input images, allowing effective disease localization. Wang et al. (2018) proposed a method for segmenting corn leaf diseases with Fully Convolutional Neural Networks (FCNNs) to address the limitations of traditional computer vision methods, which are susceptible to variable illumination and complex backgrounds; the proposed method achieved a segmentation accuracy of 96.26%. Chouhan et al. (2019a, b) proposed an optimization method based on the bacterial foraging optimization algorithm, used to initialize the weights of an artificial neural network for image segmentation, obtaining a Dice similarity coefficient of 86.79%. Saleem et al. (2021) presented a full-resolution convolutional network (FrCNnet) model based on CNNs for accurately segmenting mango leaf damage in computer-aided systems; FrCNnet learns the features of each pixel in the input data after specific preprocessing. Nagaraju et al. (2022) proposed two learning algorithms, an image preprocessing and transformation algorithm and an image masking and REC-based hybrid segmentation algorithm (IMHSA), to address the limited dataset size and the overfitting of convolutional neural network models during classification. Yao et al. (2022) implemented Mask R-CNN and Mask Scoring R-CNN for the segmentation and identification of peach diseases. Instance segmentation models enable the extraction of disease names, locations, and segmentations, with the foreground area serving as the fundamental feature for subsequent segmentation. Focal Loss, which addresses difficult and imbalanced samples, is employed on this dataset to improve segmentation precision.
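To illustrate the general setup, and not any specific model reviewed above, the following PyTorch sketch trains a toy encoder-decoder on a dummy batch for binary healthy/lesion segmentation; the architecture, sizes, and hyperparameters are placeholders:

```python
import torch
import torch.nn as nn

class TinySegNet(nn.Module):
    """Toy encoder-decoder for binary (background vs. lesion) segmentation."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # 1/2 resolution
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                               # 1/4 resolution
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 2, stride=2),        # one logit per pixel
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinySegNet()
criterion = nn.BCEWithLogitsLoss()                         # pixel-wise binary loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(2, 3, 128, 128)                       # dummy RGB leaf patches
masks = torch.randint(0, 2, (2, 1, 128, 128)).float()      # dummy lesion masks
optimizer.zero_grad()
loss = criterion(model(images), masks)
loss.backward()
optimizer.step()
```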

By combining the image generation capability of the generator with the discrimination ability of the discriminator, GANs can synthesize high-quality images that support plant disease segmentation. Douarre et al. (2019) employed Deep Convolutional Generative Adversarial Networks (DCGANs) to generate images for the segmentation of apple rust disease, a condition characterized by spots on leaves and fruits; infrared (IR) images of vegetation were partitioned into 64 × 64 pixel sub-images and then fed into a SegNet model for pixel-level segmentation. CycleGAN has also been evaluated for its ability to synthesize and augment image data for plant health detection tasks. Nerkar and Talbar (2021) implemented CycleGAN together with U-Net as an image synthesis generator to rebalance a dataset containing nine classes of tomato diseases. The combination of CycleGAN and U-Net exhibited superior performance on perceptual image quality metrics by capturing low-level details and realistic textures; however, the study did not investigate the effectiveness of the synthetic images in plant disease identification tasks. Aiming to enhance the diversity of generated images, Cap et al. (2022) integrated a leaf segmentation module, comprising a weakly supervised segmentation network, into CycleGAN, resulting in a model called LeafGAN that transforms regions of interest within plant disease images. On a dataset containing five classes of cucumber leaf diseases, LeafGAN contributed a 7.4% increase in diagnostic performance, whereas the unmodified CycleGAN yielded only a 0.7% improvement.

In plant disease segmentation tasks, Transformers are proficient at extracting both local and global features, exploiting self-attention to achieve precise disease localization and segmentation. Wu et al. (2022a, b, c) proposed a Disease Segmentation Detection Transformer (DS-DETR) based on the DETR network for segmenting early blight and late blight of tomatoes. DS-DETR first performs unsupervised pre-training on the Plant Disease Classification Dataset (PDCD), which effectively alleviates the long training cycles and slow convergence of DETR. Pre-training the Transformer structure allows leaf disease features to be learned in advance, and these pre-trained weights are then used to accelerate convergence in DS-DETR. Next, Spatially Modulated Co-Attention (SMCA) is used to assign Gaussian-like spatial weights to the DS-DETR queries, so that differently weighted queries attend to different regions of the image and improve the model's accuracy. In addition, the Transformer structure of DS-DETR introduces improved relative position encoding to better capture the sequence order of input tokens, further strengthening the spatial position features. Finally, the DS-DETR model was tested on the self-constructed Tomato Leaf Disease Segmentation Dataset (TDSD). The experimental results indicate that DS-DETR outperforms all other models on APmask, with improvements of 12.87%, 8.25%, 3.67%, 1.95%, 10.27%, and 9.52% over Mask R-CNN, BlendMask, CondInst, SOLOv2, ISTR (Hu et al. 2021a, b), and DETR, respectively. It also achieved a disease classification accuracy of 0.9640. However, the method still needs improvement for segmenting small light spots.

Identifying a disease without being able to quantify its severity still falls short of the requirements of precision agriculture. Disease symptoms can look similar at different stages of infection, and symptoms of several stages may overlap on the same leaf, so the wide qualitative and quantitative variety of symptom characteristics makes collecting disease samples very challenging (Hasan et al. 2023). Li et al. (2022a, b, c) proposed a lightweight network based on copy-paste augmentation and semantic segmentation for accurate disease region segmentation and severity assessment. The RSegformer model was trained with a lightweight Segformer semantic segmentation network featuring an attention mechanism and an up-sampling operator, enabling the model to balance local and global information, accelerate training, and reduce overfitting. The results show that the proposed copy-paste augmentation (RLDCP) improves the accuracy and generalization of the semantic segmentation model compared with traditional data augmentation, raising mIoU by about 5% while only doubling the dataset size, and RSegformer achieves 85.38% mIoU with a model size of 14.36 M.
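Once leaf and lesion regions have been segmented, severity is commonly expressed as the diseased fraction of the leaf area. A minimal sketch follows (the three-class label convention is an assumption, not taken from the cited papers):

```python
import numpy as np

# Assumed label convention: 0 = background, 1 = healthy leaf, 2 = lesion
def disease_severity(pred: np.ndarray) -> float:
    """Fraction of leaf pixels that are classified as diseased."""
    leaf_pixels = np.isin(pred, (1, 2)).sum()     # all pixels belonging to the leaf
    lesion_pixels = (pred == 2).sum()             # diseased subset of the leaf
    return lesion_pixels / leaf_pixels if leaf_pixels else 0.0

pred = np.array([[0, 1, 1, 2],
                 [0, 1, 2, 2]])
print(disease_severity(pred))  # 3 lesion pixels / 6 leaf pixels = 0.5
```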

5.2 Weed identification

Weeds are one of the principal factors undermining crop yields. They sporadically appear in fields and compete with crops for water, nutrients, and light, resulting in a negative impact on crop productivity and quality. This ultimately leads to a significant decline in global crop production (Wang et al. 2019a, b, c, 2020). At present, herbicide application represents the most prevalent method of weed control worldwide (Christensen et al. 2009). Conventional weed eradication involves indiscriminately spraying herbicides across entire fields, regardless of weed density, leading to excessive herbicide use in areas without weeds. This approach results in herbicide waste and contamination of agricultural ecosystems (Zou et al. 2021a, b). Image segmentation methods offer an efficient way to attain accurate weed detection and density assessment. The primary aim for image segmentation in weed detection is to distinguish plants from the background, including soil and residue. The key task for successful weed detection is efficient vegetation segmentation.

Kamal et al. (2022) evaluated deep learning algorithms for differentiating weeds from crop plants using an open carrot field image database. Das and Bais (2021) introduced DeepVeg, which focuses on the smallest (damage) class without degrading performance on other classes to address class imbalance. Mishra et al. (2022) proposed an approach based on the Inception V4 deep convolutional neural network architecture, using RGB weed and crop images; it cleans the data to remove the background and uses segmentation masks to isolate foreground vegetation. An early-stage rapeseed field image dataset was used to train and test the models. The evaluation results show that the DeepVeg model performs well, with a mean intersection over union greater than 0.76 and an accuracy greater than 0.97 across the four segmentation categories. The model is also robust in detecting unlabeled, newly grown weeds and rapeseed, and can learn to distinguish rapeseed from weeds with similar circular structures from a small amount of data, making it suitable for early damage and weed segmentation. However, there is room for further improvement on such complex and highly imbalanced datasets. To exploit the correlation between crop and weed classes, Kim and Park (2022) proposed a multi-task semantic segmentation convolutional neural network for detecting crops and weeds (MTS-CNN) trained in a single stage. The method incorporates crop, weed, and combined (crop and weed) losses to strengthen the association between crop and weed classes and to train the object (crop and weed) regions intensively. In experiments on three open databases—the BoniRob dataset, a crop/weed field image dataset (CWFID), and a rice seedling and weed dataset—the crop and weed segmentation mIoU values of MTS-CNN are 0.9164, 0.8372, and 0.8260, respectively.

Unmanned aerial vehicles (UAVs) are frequently used for crop monitoring and weed mapping on farmland. They are preferred over ground vehicles for their flexibility, cost-effectiveness, ease of operation, and absence of soil compaction (Nong et al. 2022). Aiming at real-time identification of rice weeds via low-altitude UAV remote sensing, Lan et al. (2021) presented two refined recognition models, MobileNetV2-UNet and FFB-BiSeNetV2, built on the U-Net and BiSeNetV2 semantic segmentation models. The MobileNetV2-UNet model reduces the computation of the original model's parameters, whereas the FFB-BiSeNetV2 model enhances segmentation accuracy. The real-time segmentation performance of the two improved models on rice weeds was confirmed on low-altitude remote sensing video data. The results indicate that, compared with the U-Net model, MobileNetV2-UNet reduces network parameters, model size, and floating point operations, and improves inference speed by almost three times. The FFB-BiSeNetV2 model improves segmentation accuracy over BiSeNetV2 and achieves the highest pixel accuracy and mean intersection over union. The optimized models meet the performance requirements for real-time recognition on embedded hardware platforms, serving as a reference for real-time rice weed recognition and precise spraying by plant protection UAVs. Nong et al. (2022) proposed SemiWeedNet, a semi-supervised segmentation method for weeds and crops in drone imagery that accurately identifies weeds of varying sizes in complex environments while minimizing the need for extensive labeled data. SemiWeedNet integrates labeled and unlabeled images in a unified semi-supervised architecture based on semantic segmentation models. By combining encoded features with selective kernel attention, the authors built a multi-scale enhancement module that emphasizes salient weed and crop traits while mitigating interference from cluttered backgrounds. To address challenges arising from crop-weed similarity and overlap, they introduced Online Hard Example Mining (OHEM) to optimize training on the labeled data. Results show that SemiWeedNet outperforms existing methods, demonstrating the contribution of its components to segmentation performance. Zou et al. (2021a, b) proposed a new approach to estimate field weed density and create weed maps. They captured field images with UAVs and applied the excess green minus excess red index combined with minimum-error threshold segmentation to separate green plants from bare ground. They then segmented crops with an enhanced U-Net and obtained weed images by removing bare ground and crops from the field imagery. This approach effectively estimates field weed density from drone imagery, offering crucial information for precision weeding.
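The vegetation-index step used by Zou et al. can be sketched as follows; this version applies the classical fixed threshold at zero rather than the minimum-error threshold of the original work, and assumes an RGB image normalized to [0, 1]:

```python
import numpy as np

def vegetation_mask(rgb: np.ndarray) -> np.ndarray:
    """Separate green vegetation from soil with the ExG - ExR index.

    rgb: H x W x 3 float array in [0, 1]. Returns a boolean vegetation mask.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    total = r + g + b + 1e-8                         # avoid division by zero
    rn, gn, bn = r / total, g / total, b / total     # chromatic coordinates
    exg = 2 * gn - rn - bn                           # excess green index
    exr = 1.4 * rn - gn                              # excess red index
    return (exg - exr) > 0                           # fixed threshold at zero
```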

Zou et al. (2021a, b) tackled the complexity of image annotation in weed semantic segmentation by introducing an image enhancement technique. The semantic segmentation network underwent a two-stage training process comprising pre-training and fine-tuning, yielding an intersection over union (IoU) of 92.91% and an average segmentation time (ST) of 51.71 ms per image. The results indicated that the refined U-Net effectively distinguished weeds in images containing many other plants, and the proposed weed segmentation method achieved remarkable accuracy in intricate field environments, demonstrating wide applicability. U-Net, known for training robustly with limited samples and for its simple architecture, is extensively employed in weed identification tasks. Nasiri et al. (2022) used the U-Net architecture, a deep encoder-decoder convolutional neural network (CNN), for pixel-level semantic segmentation of sugar beets, weeds, and soil. On 1385 RGB images collected under various conditions and heights, they trained a U-Net with ResNet50 as the encoder. To address data imbalance and small-region segmentation, they combined Dice loss and focal loss into a custom linear loss function. Ullah et al. (2021) used Maximum Likelihood Classification (MLC) and image processing techniques to label field images into three categories: background, crops, and weeds. Sodjinou et al. (2022) presented a segmentation approach that combines semantic segmentation and the K-means algorithm for crop and weed segmentation in color images; thresholding was used to remove all elements other than plants.
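The combined loss used by Nasiri et al. can be sketched in PyTorch as a weighted sum of Dice and focal terms; the binary formulation and the weights below are illustrative rather than the exact configuration of the cited study:

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, eps=1e-6):
    """Soft Dice loss for binary segmentation; logits and targets are (N, 1, H, W)."""
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + targets.sum(dim=(1, 2, 3))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss that down-weights easy pixels."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    pt = torch.exp(-bce)                             # model confidence for the true class
    return (alpha * (1 - pt) ** gamma * bce).mean()

def combined_loss(logits, targets, w_dice=0.5, w_focal=0.5):
    # Linear combination; the weights here are placeholders, not those of the cited paper.
    return w_dice * dice_loss(logits, targets) + w_focal * focal_loss(logits, targets)
```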

5.3 Crop growth monitoring

Plant growth monitoring plays a pivotal role in modern agricultural production, significantly contributing to the precise evaluation of crop growth conditions, enhancement of crop yield and quality, and prevention and control of pests and diseases. Conventionally, crop surveillance predominantly entailed manual visual inspections on a field scale. However, with technological advancements, the adoption of sensor technology and automation systems has become increasingly prevalent (Krishnaswamy Rangarajan and Purushothaman 2020). These sophisticated automated systems employ computer vision techniques to monitor plant growth dynamics, overall health, and performance metrics. To evaluate plant behavior under experimental field conditions and gauge physical responses and symptoms to external stimuli, quantitative methodologies and algorithms are essential. The integration of these methods establishes relationships between physical plant attributes and sensor-derived data. Image segmentation techniques, pivotal in the realm of computer vision, facilitate the periodic acquisition and analysis of plant imagery. This process yields vital information about crop growth status, encompassing variables such as growth rate and leaf coloration. Consequently, these insights empower automated systems to make informed management decisions and optimize agricultural strategies.

In Zheng et al. (2009) and Zheng et al. (2010), the mean shift algorithm was used to segment and extract soybeans and non-green vegetation. Krishnaswamy Rangarajan and Purushothaman (2020) investigated leaf count estimation and chromatic attributes for nine lab-cultivated eggplant (Solanum melongena) saplings. Leaf segmentation combined particle swarm optimization with a contour growth method, achieving an accuracy of 89%, and the saplings were ranked by their defect percentages. Automated equipment and Foldscope (a paper-based microscope) were used to obtain images for linear regression analysis of the estimated Normalized Green Red Difference Index (NGRDI), ultimately validating healthy and defective regions. The R-squared value and least mean square error (LMSE) were 0.86 and 0.1, respectively.

Unmanned aerial vehicles (UAVs), as efficient data collection devices, are widely used in crop monitoring. Akiva et al. (2021) developed a system that combines UAV-based field data and ground-based sky data, capturing video at multiple time points for crop health analysis. Evaluation on the collected dataset showed accurate prediction of the internal temperature of exposed fruits, with solar irradiance prediction errors of 8.41–20.36% MAPE at 5–20 min intervals, a segmentation accuracy of 62.54% mIoU, and an exposed-fruit counting accuracy of 13.46 MAE, providing informed feedback for growers. Temperature and illuminance changes during UAV flights inevitably cause color cast in the imagery, which can mislead crop monitoring assessments, so color calibration is essential. Current methods use semantic correspondences for color transfer but often overlook the integration of semantic segmentation and style transfer, leading to semantic mismatch. To address this problem, Huang et al. (2021a, b, c) proposed a multi-decoder architecture that integrates semantic segmentation and style transfer for end-to-end color transfer. Additionally, an adaptive instance normalization (AdaIN) method tailored to crops estimates the color bias in crop regions and uses this information to calibrate colors across the entire image through a local-to-global attention mechanism. The objective of this study is to establish a general framework for removing color cast from UAV imagery in crop monitoring, providing a solid foundation for subsequent data interpretation.
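The AdaIN operation at the core of such color-calibration pipelines is a simple per-channel alignment of feature statistics; the following is a minimal PyTorch sketch of generic AdaIN, not the crop-specific variant of Huang et al.:

```python
import torch

def adaptive_instance_norm(content, style, eps=1e-5):
    """Standard AdaIN: align the per-channel mean/std of `content` features to `style`.

    content, style: (N, C, H, W) feature maps. In the color-calibration setting,
    `style` would come from a reference (color-correct) crop region.
    """
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean
```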

Crop growth monitoring enables early estimation of final yields and prediction of optimal harvest times to improve production efficiency. Zhang et al. (2020) employed a U-Net model for pixel-level segmentation of purple rapeseed leaves at the seedling stage using UAV RGB images. Considering the limited spatial resolution and small target size of UAV-acquired rapeseed images, the input image block size was chosen carefully; experiments demonstrated that the U-Net model with 256 × 256-pixel blocks achieved better and more stable results, with an F-measure of 90.29% and IoU of 82.41%. To cope with lighting conditions that change throughout the day in RGB images, Fukuda et al. (2021) introduced CROP (Central Round Object Painting), a deep learning segmentation technique built on an enhanced U-Net. CROP identifies various central round fruit types in RGB images under diverse illumination conditions and generates corresponding masks; by counting mask pixels, the fruits' relative two-dimensional dimensions can be obtained, offering a non-contact approach for automatically tracking fruit growth in time-series imagery. Weyler et al. (2022) proposed a vision-based method for the concurrent instance segmentation of crop plants and leaves within breeding plots: a specialized convolutional neural network locates plant-specific key points and pixel groups, enabling the detection of individual leaf and plant instances and supporting extensive, automated, and regular assessment of plant growth status. Yu et al. (2022) evaluated the effectiveness of the U-Net model, particularly with Vgg16 as the feature extraction network, in accurately segmenting maize tassels from near-ground RGB and UAV images, addressing the labor-intensive and error-prone manual monitoring approach. The results indicate that the U-Net model with Vgg16 performs better than with MobileNet, demonstrating good segmentation accuracy across different tasseling stages, maize varieties, and image resolutions; the changes in segmented area derived from the images align well with manual measurements, and even at a resolution of 3.06 mm the UAV RGB images showed satisfactory segmentation accuracy. Thus, the U-Net model proves efficient for accurate segmentation of maize tassels in various complex scenarios, signaling potential for future use in crop phenotyping experiments.
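Large UAV images are typically processed in fixed-size patches, as in the 256 × 256 blocks above. A minimal tiling sketch follows (non-overlapping tiles, image dimensions assumed divisible by the tile size; real pipelines pad borders and often use overlapping tiles):

```python
import torch

def tiled_predict(model, image, tile=256):
    """Run a segmentation model over a large image in non-overlapping tiles.

    image: (C, H, W) tensor; the model is assumed to return class logits of shape
    (1, num_classes, tile, tile) for a (1, C, tile, tile) input.
    """
    c, h, w = image.shape
    out = torch.zeros(h, w, dtype=torch.long)
    model.eval()
    with torch.no_grad():
        for y in range(0, h, tile):
            for x in range(0, w, tile):
                patch = image[:, y:y + tile, x:x + tile].unsqueeze(0)
                logits = model(patch)
                out[y:y + tile, x:x + tile] = logits.argmax(dim=1)[0]
    return out
```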

5.4 Crop yield estimation and counting

Since Seguí et al. (2015) initiated research on CNN-based counting, there has been an increasing interest in deep learning-driven counting methodologies for the automated enumeration of agricultural products. These techniques offer valuable support to agribusinesses, enhancing the optimization and streamlining of harvest yields. Furthermore, they provide essential yield assessments and production efficiency estimates for agricultural producers. Such data is ultimately employed to forecast storage requirements, profitability, and production capability. In dense agricultural contexts, counting and yield estimation share similarities, particularly as yield estimation predominantly relies upon deep-learning-based counting techniques. Image segmentation technologies isolate crops from background elements within field images, yielding distinct crop boundaries to facilitate accurate counting and yield computation.

Wheat, rice, and corn constitute the world's three primary staple crops. Wang et al. (2019a, b, c, 2020) proposed a method to accurately count wheat ears in field conditions using an FCN and Harris corner detection. The process includes constructing a dataset of wheat-ear images, training an FCN as the segmentation model, testing the model, binarizing the results with the Otsu algorithm, and applying Harris corner detection. This technique offers high segmentation accuracy (0.984 on average), quick computation (0.033 s for a 256 × 256-pixel image), and improved performance under challenging conditions such as wheat-ear occlusion and soil disturbance. The counting accuracy is also commendable, with an average score of 0.974, R2 of 0.983, and RMSE of 14.043, a roughly 10% improvement over previous methods, which is crucial for efficient wheat phenotyping studies. Alkhudaydi et al. (2022) introduced Spike Count, a density estimation method related to human crowd counting and tailored to enumerating wheat spikelets. The method is rooted in a deep learning architecture for its capacity for automatic feature recognition and employs transfer learning for both segmentation and counting. Experiments showed that segmentation is advantageous, as concentrating on regions of interest improves counting precision in most scenarios, and transfer learning from analogous images yielded favorable counting results throughout most stages of wheat development. Shao et al. (2021) generated a dataset of 3,300 rice panicle samples covering intricate situations, including diverse lighting, complex backgrounds, overlapping rice plants, and overlapping leaves. They also proposed a hybrid approach combining a location-based counting fully convolutional network (LC-FCN) founded on transfer learning with the watershed algorithm for identifying dense rice images; the method furnishes reliable baseline data for rice yield estimation and provides researchers with a valuable dataset. Tan et al. (2019) proposed an adhesion separation counting algorithm for under-segmented regions, suitable for counting rice grains; it consists of a watershed algorithm, an improved corner detection algorithm, and a neural network classification algorithm. Using a similar FCN and Harris corner detection pipeline with RGB wheat-ear images, Wang et al. (2022a, b) likewise reported accurate and efficient ear counting under challenging field conditions, providing a reliable technique for wheat phenotyping studies.
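The post-processing stage of the FCN-plus-Harris pipeline can be sketched with OpenCV as follows; thresholds and Harris parameters are illustrative, and grouping corner responses into connected components is a simplified stand-in for the corner-counting step of the cited work:

```python
import cv2
import numpy as np

def count_ears(prob_map: np.ndarray) -> int:
    """Rough wheat-ear count from an FCN probability map with values in [0, 1]."""
    gray = (prob_map * 255).astype(np.uint8)
    # Binarize the segmentation output with Otsu's threshold.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Harris corner response on the binary map.
    harris = cv2.cornerHarris(np.float32(binary), blockSize=5, ksize=3, k=0.04)
    corners = harris > 0.1 * harris.max()            # keep strong responses only
    # Count clusters of corner responses as individual ears.
    n_labels, _ = cv2.connectedComponents(corners.astype(np.uint8))
    return n_labels - 1                              # subtract the background label
```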

Leveraging UAV or satellite remote sensing imagery for large-scale crop enumeration and yield estimation is an expeditious way to acquire comprehensive data, effectively capturing crop growth conditions and production levels across regions. Malambo et al. (2019) presented an image analysis method using the SegNet deep learning semantic segmentation model to estimate sorghum panicle counts from unmanned aerial system (UAS) images; these counts are crucial for sorghum crop improvement. The model was trained with 462 labeled images (250 × 250) to segment UAS images into sorghum panicles, foliage, and exposed ground, and was then applied to field orthomosaics to generate field-level semantic segmentation. Individual panicle locations were identified by post-processing the segmentation output. Comparison of model estimates with manually digitized panicle locations in 60 selected plots revealed a detection accuracy of 94%; the Spearman correlation between estimated and reference panicle counts was high at 0.88, while the mean bias was 0.65. The primary sources of panicle detection error were misclassifications during semantic segmentation and mosaicking inaccuracies in the field orthomosaic. Despite these issues, the method demonstrated promising potential, which could be further enhanced by collecting more data and comprehensive hyper-parameter tuning.

6 Challenges and prospects

6.1 Agricultural datasets are scarce

Current plant image segmentation datasets are predominantly drawn from single or limited plant varieties, resulting in a lack of diversity; models derived from them struggle to generalize to images of unseen plant species. Compared with datasets from other domains, plant image segmentation datasets are also smaller in scale, producing models that may be less robust and prone to overfitting. Agricultural image data present further challenges, such as complex backgrounds, varying lighting conditions, and diverse plant postures, which complicate accurate labeling. Crops often grow in intricate environments where other objects or crop types may be present, and agricultural image segmentation must adapt to different application domains, imposing higher requirements on the associated datasets. The limitations of current agricultural semantic segmentation datasets therefore remain pronounced, and further research is needed to meet practical application demands. Possible solutions include advanced deep learning techniques such as transfer learning, GANs, improved model architectures, physics-informed neural networks, and deep synthetic minority oversampling techniques (Alzubaidi et al. 2023). In addition, semi-supervised and unsupervised learning (Sect. 6.2) can improve performance on small-scale agricultural datasets. Such techniques offer a practical route to overcoming current limitations and meeting the application needs of agricultural image segmentation.

6.2 Semi-supervised and unsupervised learning

For semantic segmentation, annotations must be pixel-wise, which is costly to obtain. An alternative to supervised learning is unsupervised learning, which exploits the large amount of available unlabeled visual data. Semi-supervised learning (SSL) lies between supervised and unsupervised learning and adds limited supervision to otherwise unlabeled data, e.g. by labeling a subset of samples (Souly et al. 2017). Unsupervised learning builds models without labeled data, capitalizing on inherent data structures and features. Collecting large-scale datasets in agriculture is difficult and expensive, especially for rare diseases, and data labeling requires a significant investment of expert time; semi-supervised and unsupervised learning provide an effective way to address this problem (Li and Chao 2021). A related strategy is transfer learning: a general image segmentation model is trained on a large number of labeled samples, often from a common benchmark, and then fine-tuned on a few samples from the specific target application. Transfer learning can also improve the generalization ability of the model, making it better suited to different types of agricultural images.
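A typical transfer-learning recipe is to load a segmentation model pretrained on a generic benchmark, replace its classification head, and fine-tune only the new layers on the small agricultural dataset. The PyTorch/torchvision sketch below assumes torchvision ≥ 0.13; the head index follows the current DeepLabV3 layout and may differ across versions:

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

num_classes = 3                                   # e.g. soil, crop, weed
# Load a model pretrained on a large generic benchmark.
model = deeplabv3_resnet50(weights="DEFAULT")
# Replace the final 1x1 classifier so the head predicts the agricultural classes.
model.classifier[4] = nn.Conv2d(256, num_classes, kernel_size=1)

# Freeze the backbone so only the new head is trained on the few labeled samples.
for param in model.backbone.parameters():
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
# (The auxiliary classifier still outputs the original classes; it is ignored in this sketch.)
```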

6.3 Lightweight network

Lightweight neural networks reduce the number of parameters and the computational complexity while maintaining strong performance, helping to integrate deep learning techniques into resource- or computationally constrained agricultural equipment such as agricultural robots, UAV terminals, mobile devices, and embedded systems, which contributes to the proliferation and expansion of intelligent technologies. With the growing ubiquity of and demand for smart terminals, lightweight networks will gain increasing significance. On mobile devices, lightweight networks speed up model inference and optimize energy efficiency, thereby enhancing the user experience; in embedded systems, such networks underpin intelligent perception, control, and decision-making, promoting advances in the Internet of Things (IoT) and Industrial IoT. While primarily serving smart agriculture applications, lightweight networks are versatile enough for broader domains, including healthcare, financial services, and urban transportation, ultimately bolstering the development of smart cities and fostering digital transformation. Currently, there are five main approaches to lightweighting networks: network pruning, knowledge distillation (Gou et al. 2021), tensor decomposition (Kim et al. 2015), quantization (Wang et al. 2019a, b, c, 2020), and compact convolutional filters (Chen et al. 2023). These approaches reduce computational complexity and memory requirements without significantly impacting performance, making them useful for deploying image segmentation on devices with limited computing resources, reducing inference time, and increasing efficiency in real-time applications.
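As a small illustration of the compact-convolutional-filter idea, a depthwise separable convolution (the building block of MobileNet-style networks) replaces a standard 3 × 3 convolution with far fewer parameters; the layer sizes below are arbitrary:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Compact replacement for a standard 3x3 convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch)       # one filter per channel
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # channel mixing

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a standard convolution
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)
compact = DepthwiseSeparableConv(64, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(compact))   # roughly 73.9k vs. 9.0k parameters
```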

6.4 Pretrained foundation model

Pretrained Foundation Models (PFMs) (Zhou et al. 2023) are regarded as the foundation for various downstream tasks across data modalities. A PFM is trained on large amounts of data, providing a reasonable parameter initialization for a wide range of downstream applications. In contrast to earlier approaches that use convolutional and recurrent modules to extract features, BERT learns bidirectional encoder representations from Transformers trained on large datasets as a contextual language model. Similarly, the Generative Pretrained Transformer (GPT) (Bagal et al. 2022) uses Transformers as feature extractors and is trained on large datasets with an autoregressive paradigm. Recently, ChatGPT has shown promising success for large language models, using an autoregressive language model with zero- or few-shot prompting. The remarkable achievements of PFMs have led to significant breakthroughs in various fields of AI in recent years. In segmentation, the Segment Anything model (Kirillov et al. 2023) has been a major success, sometimes described as the GPT of computer vision. To date, pretrained foundation models have seen little application in agriculture. Their large network capacity and massive parameter counts make them well suited to the complex patterns and data distributions found in agricultural imagery, and their application to agricultural images is a promising direction for future development.
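For orientation, prompt-based inference with the Segment Anything model follows the pattern below, based on the publicly released segment_anything package; the checkpoint file name, image path, and prompt coordinates are placeholders:

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Checkpoint file name and image path are placeholders.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("leaf.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt the model with a single foreground point placed on the plant of interest.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),          # 1 marks a foreground point
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]     # keep the highest-scoring proposal
```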

6.5 Multimodal image fusion

Most agricultural image segmentation works utilize visible (red–green–blue, RGB) images to perceive scene content. Visible light cameras cannot handle changes in scene lighting and lack the ability to penetrate complex environments. The imaging mechanism of visible light cameras makes it challenging to capture sufficient and effective scene information in poor lighting conditions and adverse weather. Additionally, visible light cameras cannot effectively deal with complex scenes featuring similar target appearances, multiple scene areas, and significant changes. Depth cameras provide the physical distance of objects from the camera's photo center in the imaging scene, while thermal infrared imagers reflect the thermal radiation characteristics of objects with temperatures above absolute zero in various lighting and weather conditions, providing accurate target contour and semantic information. Multi-modal image semantic segmentation leverages the complementary characteristics between different modal images through fusion, enhancing the segmentation model's learning and reasoning abilities in complex scenes (Zhang et al. 2021). This method has been successfully applied in remote sensing (Jin et al. 2023), automatic driving (Huang et al. 2021a, b, c), medical imaging (Zhu et al. 2023), and other fields. In agriculture, identifying plant diseases, weed infestation, and yield estimation can be done more effectively through multimodal image fusion. By combining different types of images, models can better understand complex agricultural scenarios, thus improving decision-making in crop management, pest control, and yield forecasting.
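A minimal sketch of feature-level multimodal fusion is shown below: two shallow encoders process RGB and an auxiliary modality (thermal or depth) separately, and their features are concatenated before the segmentation head. The architecture is a toy illustration, not any published fusion model:

```python
import torch
import torch.nn as nn

class TwoStreamFusionSeg(nn.Module):
    """Toy two-stream segmentation network fusing RGB and a thermal/depth channel."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.aux_enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(32, 16, kernel_size=1)     # mix the two modalities
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)

    def forward(self, rgb, aux):
        feats = torch.cat([self.rgb_enc(rgb), self.aux_enc(aux)], dim=1)
        return self.head(self.fuse(feats))

model = TwoStreamFusionSeg()
rgb = torch.randn(1, 3, 64, 64)       # visible image
thermal = torch.randn(1, 1, 64, 64)   # e.g. thermal-infrared or depth channel
logits = model(rgb, thermal)          # (1, num_classes, 64, 64)
```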

7 Conclusion

Image segmentation is a crucial technology for practitioners to comprehensively analyze and understand vegetation and diseases in farmland. This technology helps to improve the quality of crop growth, increase yields, reduce costs, and ultimately achieve sustainable agricultural production. The applications of image segmentation technology in agriculture are comprehensively reviewed in this paper. We categorize image segmentation solutions based on deep learning into eight categories and discuss their specific applications in agriculture, including disease detection, weed identification, crop growth monitoring, and yield estimation. Despite the wide development of image segmentation, the latest advances in this field have found limited application in agriculture. Most research continues to focus on earlier achievements such as U-Net, Deeplab, and other established methods. This study aims to assist researchers in selecting appropriate approaches for specific agricultural applications by offering insights and applications regarding the effectiveness of various models. Additionally, this paper presents the first comprehensive summary and analysis of the most widely used plant image segmentation datasets, providing direction and reference for related fields. By analyzing public plant image segmentation datasets, researchers can better understand the commonalities and disparities between different datasets, thereby facilitating the sharing of resources and experience among researchers. However, the current public datasets have limitations, such as a lack of diversity and a limited number of images, only containing plant images of a few species. Therefore, there is an urgent need for a larger dataset with a wider range of species in plant image segmentation. Finally, this paper discusses some open challenges and promising research directions for deep learning-based plant image segmentation in the forthcoming years.