1 Introduction

Over the past decade, fruit detection has been the subject of extensive research and application. Traditional fruit detection methods include single-feature analysis methods [1] (such as those based on color, geometric shape, and texture features); multi-feature fusion methods (such as tomato recognition algorithms that fuse geometric shape features with color features) [2]; fruit recognition algorithms based on the fusion of color, intensity, edge, and direction features [3]; and citrus fruit recognition algorithms based on color threshold segmentation and edge perimeter [4]. Unfortunately, both single-feature analysis approaches and multi-feature fusion methods suffer from low robustness and long processing times, and they remain vulnerable to a variety of environmental influences in complex environments with poor visual appearance [5].


In recent years, the application of deep learning to fruit detection has significantly improved detection accuracy and robustness and reduced time consumption compared to traditional algorithms. For instance, Wang et al. [6] introduced an enhanced Faster R-CNN model leveraging the attention mechanism. This method overcomes difficulties such as fruit overlap, stem and leaf shadowing, and fluctuating light levels to accurately identify early tomato fruits. Wang et al. [7] developed an advanced Faster R-CNN model for recognizing and detecting tomato ripeness. The integration of high-dimensional semantic and low-dimensional information, enabled by the Path Aggregation Network (PANet), facilitates tomato ripeness detection in complex contexts. Afonso et al. [8] proposed a tomato detection method utilizing the Mask R-CNN model. The experimental results demonstrate that this method attains 95% accuracy for ripe tomatoes and 94% accuracy for unripe tomatoes. Fruit detection using two-stage detectors [9] is characterized by high accuracy and robustness. However, the consecutive feature localization and pooling operations over regions of interest lead to slower detection speeds and higher processing demands [10, 11]. Conversely, single-stage detectors built on the YOLO framework offer an advanced solution, balancing accuracy with inference time.

In terms of real-time object detection, the YOLO series has undergone multiple revisions and improvements. YOLOv1-v3 [12, 13] established a foundational single-stage detection framework for objects of various sizes. YOLOv4 [14] incorporates the Mish activation function, the PANet network, and innovative data augmentation techniques. YOLOv5 [15] introduced additional data augmentation strategies and model variants, building on the enhancements made in YOLOv4 [14]. YOLOX [16] employs a novel design featuring anchor-free detection, multiple positives, and a decoupled head. YOLOv6 [17] introduced a reparameterization method, an efficient Rep backbone network, and the Rep-PAN neck. YOLOv7 [18] concentrates on optimizing gradient paths and introduces the E-ELAN structure. YOLOv8 [19] achieves state-of-the-art (SOTA) performance within the YOLO family by combining the strongest features of earlier models.

The YOLO-Tomato network was presented by Liu et al. [20]. It uses circular bounding boxes instead of the more common rectangular bounding boxes and implements a sensing architecture to improve the tomato recognition accuracy of YOLOv3. Appe et al. [21] improved the feature fusion capability of the network by integrating the convolutional block attention module [22] into the neck of YOLOv5. However, instead of enhancing network performance by increasing computational and parametric capacity, a growing number of studies advocate lightweight model designs to better adapt to real-time detection tasks [23, 24]. Zeng et al. [25] replaced the focus layer in YOLOv5s with a downsampling convolutional layer and designed a lightweight backbone network in combination with MobileNetV3 to further minimize the size of the model. The enhanced network attains an average precision of 96.9% for tomatoes at various ripening stages, with a 64.88% improvement in inference time compared to the original YOLOv5s. Mbouembe et al. [26] introduced a lightweight YOLOv4-tiny network for tomato fruit detection. Specifically, they utilized the bottleneck module and a simplified BottleneckCSP module in place of the Cross Stage Partial (CSP) networks in the backbone; additionally, content-aware reassembly of features replaced the upsampling operation in the neck. This approach enhances detection accuracy while reducing the parameter count and computational complexity of the model. Although the previously listed methods successfully identify and detect tomatoes, they do not account for situations such as long-range detection, where the target occupies fewer pixels in the image, which can result in missed or incorrect detections of small-target tomatoes. To enhance the accuracy of YOLO networks for small-target detection, Wang et al. [27] developed SM-YOLOv5, which incorporates a small-target detection layer to improve the detection of small-target tomatoes.

Building on research findings from both domestic and international researchers, this paper presents a small-target detection method for tomatoes based on the YOLOv8 network, taking into account the significance of the low-dimensional information contained in larger features for small-target detection [28]. The specifics of the work are as follows: (1) PgConv is built to achieve real-time detection performance, and the IFN module is designed to achieve effective extraction of tomato features. (2) By incorporating the GD mechanism, the method merges multi-level features globally and injects global information into higher layers, thereby facilitating efficient information exchange and improving the detection precision of small-target tomatoes. (3) The adoption of Repulsion Loss enhances the ability of the network to accurately detect tomatoes in complex settings.

2 Materials and methods

2.1 Data acquisition and preparation

The tomato data collection was conducted at the tomato cultivation base of Shenyang Agricultural University (latitude 41.8° N). The dataset includes cracked tomatoes, which are frequently found among prematurely picked ripe tomatoes and hold both edible and economic value. Consequently, the algorithm could be applied in smart picking equipment to separate cracked from healthy ripe tomatoes, potentially reducing farmers' cultivation costs. The photographs were captured with an iPhone 13 and saved in JPG format at a resolution of 1280 × 720. Images were captured at varying shooting distances, with long-distance shots taken from at least 1.5 m and normal-distance shots from at most 0.5 m. A total of 1896 tomato photographs were collected: 1290 were taken at longer distances and treated as small-target tomato images, and the remaining 606 were shot at normal distances and treated as regular tomato images. The tomatoes in this study were classified into five maturity stages: ripe, half ripe, immature, young, and crack. The photographs captured at long and normal distances were each divided between the training and validation sets at a 7:3 ratio. Figure 1 displays tomato images captured at normal (1-a) and longer distances (1-b, 1-c). The tomatoes in the dataset were labeled using LabelImg software [29], with label boxes added and saved as illustrated in 1-d.

Fig. 1
figure 1

Dataset and image annotation a normal distance capture b, c longer distance capture d image annotation

2.2 Data enhancement

The three channels of the RGB color space contain both luminance information and color information, hindering separate luminance processing in images [30]. In this study, the original images underwent two key processing steps: first, conversion from RGB to HSV color space; second, enhancement of luminance distribution using Contrast Limited Adaptive Histogram Equalization (CLAHE) to achieve more balanced and clear overall contrast. CLAHE is a more efficient method than global histogram equalization. Initially, the image is divided into small tiles, and then histogram equalization is performed separately for each tile. This localized approach enables CLAHE to boost the local contrast of the entire image without excessively enhancing specific areas, as illustrated in Fig. 2. After adjusting the brightness and contrast, the distinction between the foreground and background became more pronounced, aiding in tomato identification and detection.
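As an illustration of this preprocessing pipeline, the following is a minimal sketch assuming OpenCV (cv2); the clip limit and tile size shown are illustrative defaults rather than the settings used in this study.

```python
# A minimal sketch of the described preprocessing: convert to HSV and apply CLAHE to the
# brightness (V) channel; clipLimit and tileGridSize are illustrative values.
import cv2

def clahe_hsv(image_bgr, clip_limit=2.0, tile_grid_size=(8, 8)):
    """Convert a BGR image to HSV, equalize the V channel with CLAHE, convert back."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid_size)
    v_eq = clahe.apply(v)                      # local histogram equalization per tile
    hsv_eq = cv2.merge((h, s, v_eq))
    return cv2.cvtColor(hsv_eq, cv2.COLOR_HSV2BGR)

# Usage: enhanced = clahe_hsv(cv2.imread("tomato.jpg"))
```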

Fig. 2
figure 2

Histogram equalization results

To reduce the overfitting of the model during training and enhance its robustness and generalizability, this study expands the dataset by introducing zero-mean Gaussian noise, random rotation, and mirroring. The introduction of zero-mean Gaussian noise enhances the adaptability of the model to real-world noise by adding statistically balanced random disturbances, simulating environmental noise without altering the overall brightness of the image. Additionally, the dataset is augmented by randomly rotating images by 30°, 45°, and 90° and applying mirror transformations. This data augmentation process is depicted in Fig. 3. The structure of the augmented dataset is shown in Table 1.
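The augmentations described above can be sketched as follows; this is a minimal illustration using NumPy and OpenCV, and the noise standard deviation is an assumed value, not one reported in this paper.

```python
# Sketch of the described augmentations: zero-mean Gaussian noise, rotation by a fixed
# angle, and horizontal mirroring. sigma is illustrative only.
import cv2
import numpy as np

def add_gaussian_noise(image, sigma=10.0):
    noise = np.random.normal(0.0, sigma, image.shape)      # zero-mean Gaussian noise
    return np.clip(image.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def rotate(image, angle_deg):
    h, w = image.shape[:2]
    m = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, m, (w, h))

def mirror(image):
    return cv2.flip(image, 1)                               # horizontal mirroring

# Usage: augmented = [add_gaussian_noise(img), rotate(img, 30), rotate(img, 45),
#                     rotate(img, 90), mirror(img)]
```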

Fig. 3
figure 3

Data augmentation. a CLAHE b noise addition c image mirroring d rotation by 30°

Table 1 Dataset structure

2.3 YOLOv8

YOLOv8 offers five model scales (n, s, m, l, and x), with increasing network depth and width at each scale. The network structure of the YOLOv8 primarily comprises Backbone, Neck, and Head modules, as illustrated in Fig. 4.

Fig. 4
figure 4

YOLOv8 network structure

2.3.1 Backbone

YOLOv8 employs a modified CSPDarknet53 [31] as its backbone network, yielding five distinct feature scales through five downsampling steps. The structure of the backbone network is depicted in Fig. 4-1. The backbone replaces the CSP module with the C2f module, which is designed with additional skip connections and split operations, enhancing gradient flow during backpropagation and thus improving model performance. The structure of the C2f module is illustrated in Fig. 4-6 (n denotes the number of bottlenecks). The CBS module processes the input through convolution followed by batch normalization, and SiLU activation is used to facilitate information flow, yielding the output, as illustrated in Fig. 4-7. YOLOv8 utilizes the Spatial Pyramid Pooling Fast (SPPF) structure, which accelerates network operation by connecting three 5 × 5 pooling kernels in sequence, as depicted in Fig. 4-4.
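For clarity, the CBS and SPPF blocks described above can be sketched in PyTorch as follows; the channel choices and exact layer arrangement are illustrative and may differ in detail from the official YOLOv8 implementation.

```python
# Hedged sketch of CBS (Conv + BN + SiLU) and SPPF (three chained 5x5 max-pooling
# layers whose outputs are concatenated and fused by a CBS block).
import torch
import torch.nn as nn

class CBS(nn.Module):
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = CBS(c_in, c_hidden)
        self.cv2 = CBS(c_hidden * 4, c_out)
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)          # first 5x5 pooling
        y2 = self.pool(y1)         # second pooling reuses the same kernel
        y3 = self.pool(y2)         # third pooling
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```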

2.3.2 Neck and head

Integrating PANet [32] with the Feature Pyramid Network (FPN) [33], YOLOv8 incorporates the PAN-FPN structure in its neck component, as depicted in Fig. 4-2. In contrast to YOLOv5 [15] and YOLOv6 [34], YOLOv8 omits the 1 × 1 convolution prior to upsampling and directly merges features from various stages of the backbone network for feature integration. The PAN-FPN establishes both top-down and bottom-up paths, achieving complementarity between shallow positional and deep semantic information through feature fusion, thus ensuring feature diversity and completeness. The head component employs the widely used decoupled head structure, as illustrated in Fig. 4-5. This decoupled head comprises two independent branches: one for object classification and the other for bounding box regression. Distinct loss functions are applied to these tasks: binary cross-entropy (BCE) loss for object classification, and distribution focal loss (DFL) [35] together with CIoU [36] for bounding box regression. The robustness of the model is further increased by the use of the task-aligned assigner of TOOD [37] for dynamic sample allocation in YOLOv8, as opposed to the static sample allocation technique of YOLOv5.

2.4 RSR-YOLO

In this work, the YOLOv8 network was carefully optimized to detect tomatoes in the plantation accurately and at scale. Initially, YOLOv8n, known for its smaller parameter scale and higher computational efficiency, was selected as the base framework, aligning with practical application scenarios and embedded system requirements. To facilitate real-time detection, PgConv is introduced, reducing the computational load by replacing the standard convolution in PConv with grouped convolution. Additionally, the innovative FasterNet module was developed to replace C2f in the backbone network, enhancing both detection accuracy and computational efficiency. Second, tomatoes photographed from a distance occupy fewer pixels in the image, presenting challenges for feature extraction and fusion. Thus, by incorporating the GD mechanism, YOLOv8n achieves efficient information exchange through the global fusion of multilevel features and upward information injection. Additionally, recognizing the importance of low-dimensional features in detecting small targets, the redesigned C2f module replaces the reparameterized convolution (RepConv) blocks in the low-stage information fusion module (Low-IFM) and the information injection module (Inject), enhancing the ability of the model to detect small targets. Finally, to address fruit overlap and leaf occlusion, this study introduces repulsion loss, boosting model accuracy in complex environments. The integration of these optimization strategies not only bolsters the model's ability to detect both regular and small tomatoes but also maintains computational efficiency and real-time performance, meeting the demands of practical applications. The structure of the RSR-YOLOv8 network is depicted in Fig. 5.

Fig. 5
figure 5

The structure of RSR-YOLOv8. In this diagram, 'x_local' denotes local feature injection information (produced by the current level) and 'x_global' denotes global feature injection information (generated by the IFM module)

2.4.1 Lightweighting of the backbone network

To efficiently extract spatial features while minimizing redundant computation and memory usage, Chen et al. [38] introduced a streamlined approach, PConv, as illustrated in Fig. 6. PConv applies regular convolution to a contiguous subset of the input channels, while the remaining channels are passed through an identity mapping, preserving the integrity of the channel dimension. In regular convolution, a convolution kernel spans all input channels to create an output channel, leading to dense interactions between the input and output channels that increase the computational load and parameter count. This paper introduces grouped convolution in place of the standard convolution in PConv to further reduce the computational load of the backbone network. This modification leads to an enhanced version of PConv, termed Partial group Convolution (PgConv), illustrated in Fig. 6. The formula for channel allocation is given in Eq. (1). Grouped convolution divides the input channels into several groups, with each group independently performing its own convolution. This approach reduces both computational demands and the number of parameters while preserving the effectiveness of the network.

$$C_{group} = \frac{{C_{in} }}{4}$$
(1)

where \(C_{in}\) is the total number of input channels and \(C_{group}\) is the number of channels processed by the grouped convolution after partitioning.
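A minimal PyTorch sketch of PgConv under the channel allocation of Eq. (1) is shown below; the number of groups and the 3 × 3 kernel size are assumptions for illustration, and the channel count is assumed to be divisible by the group size.

```python
# Hedged sketch of PgConv: only C_in / 4 channels are convolved (with a grouped 3x3
# convolution); the remaining channels pass through unchanged (identity mapping).
# Assumes `channels` is divisible by 4 * groups.
import torch
import torch.nn as nn

class PgConv(nn.Module):
    def __init__(self, channels, groups=4):
        super().__init__()
        self.c_group = channels // 4                 # channels handled by convolution, Eq. (1)
        self.conv = nn.Conv2d(self.c_group, self.c_group, kernel_size=3,
                              padding=1, groups=groups, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.c_group, x.size(1) - self.c_group], dim=1)
        x1 = self.conv(x1)                           # grouped conv on the partial channels
        return torch.cat((x1, x2), dim=1)            # identity mapping for the rest
```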

Fig. 6
figure 6

PgConv and PConv

The original FasterNet module comprises a PConv layer followed by two sequential 1 × 1 convolutional layers. It employs normalization and activation layers only after the second convolutional layer, effectively preventing the potential reduction in feature diversity caused by excessive use of normalization and activation layers, as illustrated in Fig. 7a. However, the 1 × 1 convolution has a relatively small receptive field and cannot capture global features, whereas a larger receptive field covers a broader area of the feature map and captures targets more effectively. Additionally, increasing the model depth with this structure could lead to model degradation and feature loss. Consequently, we redesigned the FasterNet module based on PgConv by replacing the two 1 × 1 regular convolutions with 3 × 3 PgConv and adding a residual connection across the final two convolutional layers. This approach not only maintains computational speed and efficiency but also enlarges the receptive field, minimizes the loss of input features, and enhances detection performance. The innovative FasterNet (IFN) module is depicted in Fig. 7b.
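The following is a hedged PyTorch sketch of the IFN block described above, reusing the PgConv sketch given earlier; the exact placement of normalization and activation and the layer widths are assumptions rather than the authors' exact implementation.

```python
# Hedged sketch of the IFN block: a PgConv layer followed by two further 3x3 PgConv
# layers (in place of the original two 1x1 convolutions), with a residual connection
# spanning the final two convolutions. Normalization/activation placement is assumed.
import torch.nn as nn

class IFN(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.pg1 = PgConv(channels)                  # PgConv as sketched above
        self.pg2 = PgConv(channels)
        self.pg3 = PgConv(channels)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.SiLU()

    def forward(self, x):
        x = self.pg1(x)
        y = self.pg3(self.pg2(x))                    # two 3x3 PgConv layers
        return self.act(self.bn(x + y))              # residual connection over the last two convs
```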

Fig. 7
figure 7

Structure of the FasterNet module and innovative FasterNet module a FasterNet b improved FasterNet

2.4.2 Improved GD mechanism for enhancing the accuracy of small target tomato detection

Traditionally, feature levels correspond to positional information for varying object sizes. Low-dimensional features encompass texture details and positional information of smaller objects, while high-dimensional features contain semantic information and positional attributes of larger objects. In this study, images of tomatoes captured from a longer distance possessed limited feature information. Consequently, the low-dimensional information within larger features is more conducive for the network to extract and fuse features of small target objects. To prevent significant information loss during computation and to fully utilize global information for feature fusion, the GD mechanism was incorporated [39]. The GD mechanism employs a unified module to gather and integrate information from each level, subsequently distributing it across various levels, thus enhancing the feature fusion capability of the neck without notably increasing latency. The GD mechanism is composed of two branches, a low-level distribution branch and a high-level distribution branch, which extract and fuse feature information via convolution-based and attention-based blocks, respectively. The low-stage gather-and-distribute branch includes the Low-stage Feature Alignment Module (Low-FAM), Low-IFM, and Information Injection Module (Inject); the high-stage branch comprises the High-stage Feature Alignment Module (High-FAM), High-stage Information Fusion Module (High-IFM), and Information Injection Module (Inject).

To enhance the extraction and fusion of smaller features and improve tomato fruit detection accuracy, this paper presents improvements to the low-stage gather-and-distribute branch, as depicted in Fig. 5-7. The specific improvements include the following steps: First, the C2f structure is redesigned, as illustrated in Fig. 5-4. This configuration enables the model to capture more detailed information on small targets. Second, the enhanced C2f module replaces the RepBlock module in Low-IFM, boosting its capacity to extract low-dimensional information from larger features. Finally, the improved C2f module replaces the RepBlock module in the Inject, enhancing its feature fusion capability and thereby improving the detection performance of YOLOv8n. The RepBlock module is presented in Fig. 8.
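To make the low-stage gather-and-distribute branch more concrete, the sketch below illustrates the feature alignment (Low-FAM) and gated information injection (Inject) steps in PyTorch; the gating formulation and the 1 × 1 convolutions are illustrative assumptions, and in RSR-YOLO the fused global feature fed into Inject would be produced by the improved C2f module, which is not reproduced here.

```python
# Hedged sketch of the low-stage gather-and-distribute idea: align multi-level features
# to a common size, concatenate them (Low-FAM), then inject the fused global feature
# into a local feature map through a learned sigmoid gate (Inject).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowFAM(nn.Module):
    """Low-stage feature alignment: pool multi-level features to one size and concatenate."""
    def forward(self, feats, target_size):
        aligned = [F.adaptive_avg_pool2d(f, target_size) for f in feats]
        return torch.cat(aligned, dim=1)

class Inject(nn.Module):
    """Inject global information into a local feature map via a learned gate."""
    def __init__(self, c_local, c_global):
        super().__init__()
        self.gate = nn.Conv2d(c_global, c_local, kernel_size=1)   # produces the injection gate
        self.embed = nn.Conv2d(c_global, c_local, kernel_size=1)  # embeds the global feature

    def forward(self, x_local, x_global):
        x_global = F.interpolate(x_global, size=x_local.shape[2:], mode="nearest")
        return x_local * torch.sigmoid(self.gate(x_global)) + self.embed(x_global)
```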

Fig. 8
figure 8

RepBlock module

2.4.3 Repulsion loss

In this experiment, tomatoes were planted closely, with a spacing of 0.2 m between seedlings. In addition to fruit overlap and leaf shading within individual plants, mutual shading among plants presents a significant challenge for precise tomato detection. To address this challenge, this study employed repulsion loss [40, 41] to enhance the positioning accuracy of the bounding box. This was achieved by integrating repulsion loss with the loss function of YOLOv8, as detailed in Eqs. (2)-(7). This enhancement is vital for addressing reduced detection accuracy issues caused by leaf occlusion, fruit overlap, and mutual occlusion among seedlings. Utilizing this method, we effectively enhance the ability of the model to identify tomatoes in complex environments, thereby improving its detection performance.

$$L = \lambda_{CIoU} L_{CIoU} + \lambda_{DFL} L_{DFL} + \lambda_{BCE} L_{BCE}$$
(2)

where \(\lambda_{{CI{\text{oU}}}}\), \(\lambda_{DFL}\) and \(\lambda_{BCE}\) are the weighting coefficients assigned to the CIoU, DFL and BCE, respectively.

$$L_{R} = L + \alpha L_{RepGT} + \beta L_{RepBox}$$
(3)

where \(L\) is the original loss function of YOLOv8; \(L_{RepGT}\) is the loss generated between the predicted bounding box and the adjacent ground-truth bounding boxes of other objects; and \(L_{RepBox}\) is the loss generated between the predicted bounding box of an object and the adjacent predicted bounding boxes of other objects. The weighting coefficients \(\alpha\) and \(\beta\) are set to 0.5 and 2.0, respectively.

$$L_{RepGT} = \frac{\sum\nolimits_{P \in P_{1}} Smooth_{\ln}\left(I_{IoG}(B^{P}, G_{Rep}^{P})\right)}{\left| P_{1} \right|}$$
(4)
$$I_{IoG} (B,G) = \frac{{A_{area} (B \cap G)}}{{A_{area} (G)}}$$
(5)
$$Smooth_{\ln }(x) = \begin{cases} -\ln (1 - x), & x \le \delta \\ \dfrac{x - \delta }{1 - \delta } - \ln (1 - \delta ), & x > \delta \end{cases}$$
(6)

where \(P_{1}\) is the set of positive examples for which the overlap between the predicted and ground-truth bounding boxes reaches a predetermined threshold; \(B^{P}\) is the predicted bounding box derived through regression adjustment; \(G_{Rep}^{P}\) represents the ground-truth bounding box that has the largest IoU with the predicted bounding box, excluding the designated target object; and \(Smooth_{\ln}\) is a smoothing function employed to modulate the sensitivity of the loss to the intersection size (in this study, \(\delta = 0.5\)). \(A_{area}(\cdot)\) is an area calculation function.

$$L_{RepBox} = \frac{\sum\nolimits_{i \ne j} Smooth_{\ln}\left(S_{IoU}(B_{P_{i}}, B_{P_{j}})\right)}{\sum\nolimits_{i \ne j} \mathbb{1}\left[S_{IoU}(B_{P_{i}}, B_{P_{j}}) > 0\right] + \varepsilon}$$
(7)

where \(P_{i}\) and \(P_{j}\) are the predicted bounding boxes of different objects; \(B_{P_{i}}\) and \(B_{P_{j}}\) are the bounding boxes regressed from \(P_{i}\) and \(P_{j}\), respectively; \(S_{IoU}(B_{P_{i}}, B_{P_{j}})\) is the IoU between \(B_{P_{i}}\) and \(B_{P_{j}}\); and \(\varepsilon\) is a very small value used to prevent division by zero.
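For reference, the core quantities in Eqs. (4)-(7) can be sketched as follows; this is a minimal PyTorch illustration assuming boxes in (x1, y1, x2, y2) format, not the authors' training code.

```python
# Sketch of the repulsion-loss building blocks: the Smooth_ln penalty of Eq. (6) and
# the intersection-over-ground-truth (IoG) ratio of Eq. (5); delta follows the text.
import math
import torch

def smooth_ln(x, delta=0.5):
    """Piecewise smooth-ln penalty from Eq. (6)."""
    x = x.clamp(max=1.0 - 1e-6)                    # keep log(1 - x) finite
    return torch.where(x <= delta,
                       -torch.log(1.0 - x),
                       (x - delta) / (1.0 - delta) - math.log(1.0 - delta))

def iog(pred, gt, eps=1e-7):
    """Intersection over ground-truth area, Eq. (5); pred and gt are (N, 4) tensors."""
    x1 = torch.max(pred[:, 0], gt[:, 0])
    y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2])
    y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    gt_area = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (gt_area + eps)

def rep_gt_loss(pred, rep_gt, delta=0.5):
    """RepGT term of Eq. (4): penalize overlap with neighbouring ground-truth boxes."""
    return smooth_ln(iog(pred, rep_gt), delta).mean()
```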

2.5 Performance evaluation of the models

The model was evaluated using five evaluation metrics: precision (P), recall (R), mAP, F1 score and frames per second (FPS) [42]. Equations (8)-(11) can be used to determine the values of precision, recall, F1 score and mAP; the higher the mAP is, the better the performance of the model. FPS indicates the number of image frames that can be processed per second. A higher value indicates that the model processes images faster.

$$P = \frac{TP}{{(TP + FP)}}$$
(8)
$$R = \frac{TP}{{(TP + FN)}}$$
(9)
$$F1 = \frac{2 \times P \times R}{{(P + R)}}$$
(10)
$$mAP = \frac{{\sum\nolimits_{i = 1}^{C} {AP_{i} } }}{C}$$
(11)

where true positive (\(TP\)) indicates that the actual sample is positive, and the prediction is also positive; false positive (\(FP\)) indicates that the actual sample is negative, but the prediction is positive; false negative (\(FN\)) indicates that the actual sample is positive, but the prediction is negative. The \(AP\) indicates the average precision; \(mAP\) is the average \(AP\) of the five categories; \(C\) indicates the number of categories.
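These metrics follow directly from the counts above; a small sketch of Eqs. (8)-(11) is given below, where the per-class AP values are assumed to have been computed elsewhere.

```python
# Minimal sketch of Eqs. (8)-(11) from per-class counts; mAP simply averages per-class
# AP values computed by a separate evaluation routine.
def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

def mean_ap(per_class_ap):
    return sum(per_class_ap) / len(per_class_ap)
```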

3 Results and discussion

3.1 Experimental configuration

The experiments were performed on an Ubuntu 20.04 computer with an Intel® Xeon® Gold 6240 CPU @ 2.60 GHz and a TITAN RTX graphics card. The deep learning framework was PyTorch 1.9.0 with CUDA 11.1, accelerated by cuDNN 8.0.5. All models were tested in the same environment and under the same hyperparameter settings: the learning rate, batch size, momentum, weight decay, and number of iterations were 0.01, 16, 0.937, 0.0005, and 500 epochs, respectively; the input image size was set to 640 pixels.
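For reproducibility, a hedged example of launching training with these hyperparameters through the Ultralytics YOLOv8 Python API is shown below; the dataset configuration file name is hypothetical, and this is not the authors' training script.

```python
# Example training launch with the stated hyperparameters, assuming the Ultralytics
# YOLOv8 API; 'tomato.yaml' is a hypothetical dataset configuration file.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")
model.train(
    data="tomato.yaml",      # hypothetical dataset configuration
    epochs=500,
    imgsz=640,
    batch=16,
    lr0=0.01,                # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
```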

3.2 Tomato detection results based on the RSR-YOLO network

This experiment evaluates the detection and deployment performance of the RSR-YOLO model on the validation set. The RSR-YOLO model achieves an mAP@0.5 of 90.7% and an FPS of 76, ensuring real-time detection capability along with high accuracy. Detection results of the RSR-YOLO model for tomatoes, including small-target ones, are illustrated in Fig. 9. Images 9-b, 9-d, 9-n, and 9-p showcase the ability of the model to detect tomato ripeness, while 9-f, 9-h, 9-j, and 9-l demonstrate the detection of small-target tomatoes. The detection ability of the model for small-target tomatoes obscured by leaves and overlapping fruits is shown in images 9-e, 9-g, 9-i, and 9-k. Additionally, images 9-a, 9-c, 9-m, and 9-o exhibit the model's capability to negate surface color effects and classify tomatoes of varying ripeness levels effectively. Table 2 presents the average precision for tomatoes at various stages of maturity and reveals that the enhanced network adeptly detects tomatoes across ripeness levels, exhibiting minimal variation in average precision across categories. The findings suggest that the RSR-YOLO model effectively mitigates the impact of surface color and features on detection outcomes, ensuring precise identification of tomatoes at varying ripeness levels.

Fig. 9
figure 9

Detection results

Table 2 Detection results for tomatoes at different maturities

3.3 Lightweight study of the RSR-YOLO backbone network

The lightweight design of the network substantially lowers computational complexity, thereby decreasing reliance on high-performance hardware. The feature extraction capability of the backbone is crucial for the final detection results. Thus, creating a lightweight yet powerful backbone network is essential to minimize computational complexity and boost detection speed. This study achieves a balance between detection and deployment performance by replacing the C2f module with the IFN module, significantly reducing computational complexity while enhancing accuracy. Comparative experiments with various numbers of IFN modules are detailed in Table 3. The RSR-YOLO model, utilizing IFN in its backbone network, achieves a 90.7% mAP@0.5, marking a 0.3% increase in mAP@0.5 and a 12.3% reduction in computational complexity compared to using C2f. The experimental results show that the lightweight PgConv significantly decreases computational complexity, thereby enhancing the deployment performance of the network. The 3 × 3 convolutional receptive field in the IFN structure and the residual connection across the last two convolutional layers enhance the ability of the backbone network to capture tomato features and reduce the loss of input features. Consequently, the use of IFN modules improves the computational efficiency of the model while maintaining detection accuracy.

Table 3 Comparative experiments of IFN replacing C2f

3.4 Exploring the effect of the improved C2f module on the feature fusion capability of the Neck module

To enhance the feature extraction and fusion capabilities of the Neck component for small-target tomatoes, this study uses the enhanced C2f to replace the RepBlock module in the Low-IFM and Inject modules. According to Table 4, the adoption of the C2f module in the RSR-YOLO network, compared to the original GD mechanism, significantly enhances the extraction and fusion of small-target tomato features. The improved GD mechanism yields a 1.1% increase in the mAP@0.5 score and a 21.8% reduction in computational complexity. The experimental results show that, in comparison to the RepBlock module formed by stacking RepConv, the C2f module, with its skip connections and split operations, more efficiently captures gradient flow information for small-target tomatoes. This strengthens the ability of the GD mechanism to extract and integrate the features of small-target tomatoes. PgConv uses grouped convolution to convolve a contiguous subset of the input channels while applying an identity mapping to the remaining features. This efficiently decreases the computational complexity of the enhanced C2f module, optimizing the deployment performance of the network. In summary, the enhanced GD mechanism optimizes the detection performance of the model without incurring additional delay.

Table 4 Performance comparison

3.5 Comparative experiment to verify the effectiveness of the RSR-YOLO network structure

This study conducted a series of comparative experiments against the YOLOv8n model to validate the effectiveness and feasibility of the RSR-YOLO model for detecting tomatoes and small-target tomatoes. As indicated in Table 5, YOLOv8n + IFN achieves an mAP@0.5 of 88.0%, a 0.9% improvement over the original model. The results demonstrate that the lightweight IFN module enhances the feature extraction capacity of the backbone and the computational efficiency while maintaining detection accuracy. Compared to YOLOv8n + IFN, the enhanced GD mechanism at the neck significantly improves the detection performance of the network, increasing the precision, recall, F1 score, and mAP@0.5 by 1.4%, 4.1%, 2.8%, and 2.2%, respectively. The results indicate that the improved GD mechanism effectively boosts the detection performance of the model for small-target tomatoes. RSR-YOLO achieves a precision, recall, F1 score, and mAP@0.5 of 91.6%, 85.9%, 88.7%, and 90.7%, respectively, surpassing YOLOv8n + IFN + GD by 1.3%, 1.0%, 1.2%, and 0.5%, respectively. The increased precision highlights the effectiveness of repulsion loss in improving the detection of tomatoes obscured by leaves and overlapping fruits. In conclusion, the network structure of the RSR-YOLO model is reasonable and effective. Figure 10 displays a graph comparing the mAP@0.5 values with the sequential addition of each module. Figure 11 shows the results of the feature map visualization.

Table 5 Comparison experiment
Fig. 10
figure 10

Performance comparison

Fig. 11
figure 11

Feature map visualization

This paper uses the GradCAM method with a confidence setting of 0.6. After several training and testing sessions, heatmaps were generated from the output of the last C2f layer of YOLOv8n, YOLOv8n + IFN, YOLOv8n + IFN + GD, and RSR-YOLO. In Fig. 11, (a), (f), and (k) are the original images; (b), (g), and (l) are the outputs of YOLOv8n; (c), (h), and (m) are the outputs of YOLOv8n + IFN; (d), (i), and (n) are the outputs of YOLOv8n + IFN + GD; and (e), (j), and (o) are the outputs of RSR-YOLO. The feature map visualizations improve progressively with each model enhancement, and the performance of RSR-YOLO surpasses that of YOLOv8n, particularly in the detection of small-target tomatoes.

3.6 Comparison of the RSR-YOLO results against those of other models

Target detection requires striking a balance between detection and deployment performance when assessing the effectiveness of an algorithm [43]. Precision, recall, F1 score, and mAP were used in this work to evaluate the detection performance of RSR-YOLO, while FPS and computational complexity were used to analyze its deployment performance. Comparative data for the ten algorithms are presented in Table 6. RSR-YOLO achieves an mAP@0.5 of 90.7%, an improvement of 3.6%, 1.1%, 0.1%, 3.9%, 2.3%, 2.2%, 0.6%, and 0.3% over YOLOv8n, YOLOv8s, YOLOv8m, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv6s, and YOLOv6m, respectively. Compared to YOLOv7, the FPS of the improved model increases by 50% and its computational complexity decreases by 83.8%, although its mAP@0.5 is 0.5% lower. RSR-YOLO thus significantly enhances deployment performance and reduces hardware dependency with a minimal sacrifice in accuracy relative to YOLOv7. In conclusion, RSR-YOLO adeptly balances detection and deployment performance, outperforming the other methods in the real-time and accurate detection of both regular and small-target tomatoes. Figure 12 displays the mAP@0.5 comparison graphs for the ten algorithms, while Fig. 13 shows their FPS and inference times.

Table 6 Performance comparison
Fig. 12
figure 12

Comparison of detection results

Fig. 13
figure 13

Comparative chart of deployment performance

The figure shows the mAP@0.5 scores of the 10 algorithms at epochs 0, 50, 100, and 499. The best mAP@0.5 scores for the 10 algorithms are as follows: YOLOv5l, 0.885 at epoch 356; YOLOv5s, 0.868 at epoch 377; YOLOv5m, 0.884 at epoch 385; YOLOv8s, 0.896 at epoch 412; YOLOv8m, 0.906 at epoch 421; YOLOv6m, 0.904 at epoch 456; YOLOv6s, 0.901 at epoch 437; YOLOv7, 0.912 at epoch 407; YOLOv8n, 0.871 at epoch 355; and RSR-YOLO, 0.907 at epoch 480. To better analyze the fluctuations in the mAP@0.5 values of RSR-YOLO, an area graph is plotted in Fig. 12. The graph shows minimal fluctuation during convergence, indicating a trend toward stabilization. In conclusion, RSR-YOLO exhibits robust detection capabilities.

In the chart, FPS and inference time exhibit an inverse relationship: higher FPS values denote shorter inference times, reflecting faster model operation on the hardware. The figure shows that although RSR-YOLO is slightly inferior to YOLOv8n, YOLOv5s, and YOLOv8s in FPS and inference time (algorithms whose mAP differs significantly from that of RSR-YOLO), its deployment performance is superior to that of YOLOv8m, YOLOv6m, and YOLOv7 (whose mAP scores are relatively close to that of RSR-YOLO). In conclusion, RSR-YOLO excels in deployment performance while maintaining high accuracy.

3.7 Applications of RSR-YOLO

To better evaluate the performance of the model, this paper developed a GUI visualization interface and conducted field tests at a tomato plantation. The GUI is shown in Fig. 14. Figure 14a displays the general layout of the interface. Figure 14b shows the functional area of the interface, with the Local button on the left for detecting local files or videos and the Camera button for connecting a visual sensor. On the right, the Model button selects weight files for different models, the Conf button sets the confidence threshold, and the IoU button manually adjusts the intersection-over-union threshold. Figure 14c illustrates the process of selecting a local file for detection. Figure 14d presents the tomato detection results, including statistics on tomato categories, counts, and FPS displayed at the top of the area.

Fig. 14
figure 14

GUI visualization interface

3.8 Limitations of the RSR-YOLO

This research proposes the RSR-YOLO model to achieve long-range detection of small-target tomatoes. The RSR-YOLO model successfully balances detection performance with deployment performance, demonstrating excellent real-time performance while accurately detecting tomato fruit. RSR-YOLO does, however, have two drawbacks. First, because small-target tomatoes provide limited features and small anchor boxes, the model still experiences missed and incorrect detections. Second, the detection capability of the model in real-world applications is easily affected by changes in illumination, which makes it difficult to accurately locate and identify tomato features. Based on these two limitations, we will further design and optimize the network structure of the RSR-YOLO model to improve the detection of small-target tomatoes and enhance the generalization ability of the network in future work as follows: (1) Design primary and secondary feature extraction networks, where the secondary network focuses on extracting the texture and detail features of small-target tomatoes contained in the low-dimensional feature layers, and its features are then fused with those of the primary feature extraction network. (2) Further expand the diversity of the dataset to improve the generalization ability of the model in environments with poor visual appearance. In addition, borrowing the idea of codebook priors for image reconstruction, we will design an adaptive feature recovery network branch: when the IoU falls below a certain threshold, the model will retrieve pre-trained high-quality feature information from the codebook priors based on the contextual information of the environmentally affected features, thus further overcoming the influence of environmental factors on the generalization ability of the model.

4 Conclusions

This research proposes the RSR-YOLO model to achieve precise tomato detection. First, YOLOv8n, which has a smaller parameter size and higher computational efficiency, is chosen as the fundamental framework for this experiment in consideration of real-world application scenarios and embedded system needs. This study designs PgConv, which replaces the regular convolution in PConv with grouped convolution, thereby reducing the computational cost of the model and enabling real-time detection. In addition, the IFN module replaces the C2f module of the backbone network, guaranteeing detection accuracy while enhancing processing efficiency. Second, this research reconstructs the YOLOv8n neck module in conjunction with the Gather-and-Distribute (GD) mechanism, taking into account the critical role of low-dimensional features for small-target tomatoes. To efficiently extract and merge tomato features at different levels, this work employs the new C2f module, which keeps the model lightweight while obtaining deeper information about small targets. Finally, Repulsion Loss is added to increase the accuracy of the model in detecting tomatoes affected by leaf shading and fruit overlap. RSR-YOLO achieves an mAP@0.5 of 90.7%, an improvement of 3.6%, 1.1%, 0.1%, 3.9%, 2.3%, 2.2%, 0.6%, and 0.3% over YOLOv8n, YOLOv8s, YOLOv8m, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv6s, and YOLOv6m, respectively. Despite a slight decrease of 0.5% in mAP@0.5 compared to YOLOv7, the improved model demonstrates a 50% increase in FPS and an 83.8% reduction in computational complexity. RSR-YOLO significantly improves deployment performance and minimizes hardware dependency with only a slight reduction in accuracy compared to YOLOv7. To sum up, the RSR-YOLO network exhibits outstanding real-time detection performance and reliably detects tomato fruit by skillfully balancing detection and deployment performance. Future work will concentrate on enhancing the detection of small-target tomatoes with low-dimensional information, designing a network better suited for small-target detection, and further optimizing deployment performance while increasing accuracy.